Article

Evaluating L2 Training Methods in Neural Language Models

Jaemin Lee1, Jeong-Ah Shin1
1Dongguk University
Corresponding author: Jeong-Ah Shin, Professor, Department of English Language and Literature, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea. E-mail: jashin@dongguk.edu

ⓒ Copyright 2024 Language Education Institute, Seoul National University. This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: Nov 14, 2024 ; Revised: Dec 15, 2024 ; Accepted: Dec 16, 2024

Published Online: Dec 31, 2024

ABSTRACT

Recent advancements in language models (LMs) have significantly improved language processing capabilities; however, these models remain less efficient than human learners, especially when trained on developmentally plausible data volumes similar to those encountered by children (Warstadt & Bowman, 2022; Linzen, 2020). This inefficiency is even more pronounced in second language (L2) acquisition contexts, where cross-linguistic transfer is a key phenomenon (Papadimitriou & Jurafsky, 2020; Yadavalli et al., 2023). This study evaluates L2 training methods in neural language models by examining mutual L1-L2 influences during learning with developmentally plausible data volumes. We propose two approaches to mitigating catastrophic forgetting: the One-Stage Training (OST) method, which integrates L1 and L2 learning into a single stage, and the One-Stage Mixed Training (OSMT) method, which refines OST by incorporating L1 data into the L2 stage for a more realistic simulation of bilingual learning. Through syntactic evaluations conducted continuously throughout training, we analyzed how L1 performance changes during L2 acquisition and how cross-linguistic transfer emerges between Korean and English. The results indicate that OST and OSMT effectively mitigated catastrophic forgetting and supported more stable learning than the conventional Two-Stage Training method. OSMT achieved superior integration of L1 and L2 structures while revealing negative transfer effects from Korean (L1) to English (L2). These findings provide valuable insights into both neural model training and human-like L2 acquisition processes.

Keywords: developmentally plausible data; cross-linguistic transfer; second language acquisition; neural language models; L2 language models; catastrophic forgetting
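The abstract contrasts three ways of scheduling L1 (Korean) and L2 (English) data during training. The sketch below is a rough, hypothetical illustration of that contrast, not the authors' training code: the function names, the phase boundaries, and the mix_ratio hyperparameter are assumptions made for exposition.

```python
# Hypothetical sketch of the three data schedules described in the abstract.
# L1 and L2 corpora are stood in for by lists of sentence strings.
import random


def two_stage_schedule(l1_data, l2_data):
    """Conventional Two-Stage Training: all L1 batches first, then all L2 batches.
    L1 is never revisited during the L2 stage, which invites catastrophic forgetting."""
    return list(l1_data) + list(l2_data)


def one_stage_schedule(l1_data, l2_data, seed=0):
    """One-Stage Training (OST), as read here: L1 and L2 examples are pooled and
    shuffled so both languages are learned in a single stage."""
    pooled = list(l1_data) + list(l2_data)
    random.Random(seed).shuffle(pooled)
    return pooled


def mixed_schedule(l1_data, l2_data, mix_ratio=0.2, seed=0):
    """One-Stage Mixed Training (OSMT), as read here: the L2 phase re-injects a
    proportion of L1 data (mix_ratio is an assumed hyperparameter)."""
    rng = random.Random(seed)
    replayed_l1 = rng.sample(list(l1_data), k=int(mix_ratio * len(l1_data)))
    l2_phase = list(l2_data) + replayed_l1
    rng.shuffle(l2_phase)
    return list(l1_data) + l2_phase


if __name__ == "__main__":
    korean = [f"ko_sent_{i}" for i in range(10)]   # stand-in for the Korean (L1) corpus
    english = [f"en_sent_{i}" for i in range(10)]  # stand-in for the English (L2) corpus
    print(mixed_schedule(korean, english)[:5])
```

Replaying a slice of L1 data inside the L2 phase is a standard rehearsal-style remedy for catastrophic forgetting, which is the mechanism the OSMT condition is described as targeting.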

References

1.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901.

2.

Chiswick, B. R., & Miller, P. W. (2005). Linguistic distance: A quantitative measure of the distance between English and other languages. Journal of Multilingual and Multicultural Development, 26(1), 1-11.

3.

Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., & Stoyanov, V. (2018). XNLI: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053.

4.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.

5.

Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, 1-8.

6.

Hawkins, R., & Chan, C. Y. (1997). The partial availability of Universal Grammar in second language acquisition: The 'failed functional features hypothesis'. Second Language Research, 13(3), 187-226.

7.

Haznedar, B., & Schwartz, B. D. (1997). Are there optional infinitives in child L2 acquisition? Proceedings of the 21st Annual Boston University Conference on Language Development, 21, 257-268.

8.

Huebner, P. A., Sulem, E., Fisher, C., & Roth, D. (2021). BabyBERTa: Learning more grammar with small-scale child-directed language. Proceedings of the 25th Conference on Computational Natural Language Learning, 624-646.

9.

Jarvis, S. (2000). Methodological rigor in the study of transfer: Identifying L1 influence in the interlanguage lexicon. Language Learning, 50(2), 245-309.

10.

Jarvis, S., & Pavlenko, A. (2007). Crosslinguistic influence in language and cognition. Routledge.

11.

Johnson, J. S., & Newport, E. L. (1989). Critical period effects in second language learning: The influence of maturational state on the acquisition of English as a second language. Cognitive Psychology, 21(1), 60-99.

12.

Johnson, J. S., & Newport, E. L. (1991). Critical period effects on universal properties of language: The status of subjacency in the acquisition of a second language. Cognition, 39(3), 215-258.

13.

Kaushik, P., Gain, A., Kortylewski, A., & Yuille, A. (2021). Understanding catastrophic forgetting and remembering in continual learning with optimal relevance mapping. arXiv preprint arXiv:2102.11343.

14.

Kemker, R., McClure, M., Abitino, A., Hayes, T. L., & Kanan, C. (2018). Measuring catastrophic forgetting in neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).

15.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521-3526.

16.

Koo, K. W., Lee, J. M., & Park, M. K. (2024). Investigating syntactic interference effects in neural language models for second language acquisition. English Language and Linguistics, 30(1), 69-88.

17.

Li, Z., & Hoiem, D. (2017). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2935-2947.

18.

Linzen, T. (2020). How can we accelerate progress towards human-like linguistic generalization? arXiv preprint arXiv:2005.00955.

19.

Lopez-Paz, D., & Ranzato, M. A. (2017). Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30.

20.

McManus, K. (2021). Crosslinguistic influence and second language learning. Routledge.

21.

National Institute of Korean Language. (2020). Modu corpus: Open Korean language corpus. National Institute of Korean Language. https://corpus.korean.go.kr/

22.

Oba, M., Kuribayashi, T., Ouchi, H., & Watanabe, T. (2023). Second language acquisition of neural language models. arXiv preprint arXiv:2306.02920.

23.

Papadimitriou, I., & Jurafsky, D. (2020). Learning music helps you read: Using transfer to study linguistic structure in language models. arXiv preprint arXiv:2004.14601.

24.

Prévost, P., & White, L. (2000). Missing surface inflection or impairment in second language acquisition? Evidence from tense and agreement. Second Language Research, 16(2), 103-133.

25.

Reali, F., & Christiansen, M. H. (2005). Uncovering the richness of the stimulus: Structure dependence and indirect statistical evidence. Cognitive Science, 29(6), 1007-1028.

26.

Van Schijndel, M., Mueller, A., & Linzen, T. (2019). Quantity doesn't buy quality syntax with neural language models. arXiv preprint arXiv:1909.00111.

27.

Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S. F., & Bowman, S. R. (2020). BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8, 377-392.

28.

Warstadt, A., & Bowman, S. R. (2022). What artificial neural networks can tell us about human language acquisition. In Algebraic Structures in Natural Language (pp. 17-60). CRC Press.

29.

Warstadt, A., Williams, A., Liu, H., Warstadt, H., Fish, J., & Bowman, S. R. (2023). Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning.

30.

Yadavalli, A., Yadavalli, A., & Tobin, V. (2023). SLABERT talk pretty one day: Modeling second language acquisition with BERT. arXiv preprint arXiv:2305.19589.

31.

Zhang, Y., Warstadt, A., Li, H. S., & Bowman, S. R. (2020). When do you need billions of words of pretraining data? arXiv preprint arXiv:2011.04946.