The birth of Romanian BERT




Authors: Stefan Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo
Editors: Trevor Cohn, Yulan He, Yang Liu
Conference: Empirical Methods in Natural Language Processing (EMNLP)
Year: 2020
Publisher: Association for Computational Linguistics
Proceedings: Findings of the Association for Computational Linguistics: EMNLP 2020
Pages: 4324–4328
ISBN: 978-1-952148-90-3
DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.387
URL: https://www.aclweb.org/anthology/2020.findings-emnlp.387/
arXiv: https://arxiv.org/abs/2009.08712



Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, and an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository containing information on how to obtain the corpus, how to fine-tune and use the model in production (with practical examples), and how to fully replicate the evaluation process.
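As a rough illustration of the production usage the abstract refers to, the sketch below loads a Romanian BERT checkpoint with the Hugging Face transformers library and encodes a Romanian sentence. The model identifier used here is an assumption (the paper's repository documents the exact released names), and the example is a minimal sketch rather than the authors' own usage code.

# Minimal sketch, assuming the released checkpoint is published on the
# Hugging Face Hub under the identifier below (an assumption; consult the
# paper's repository for the exact name).
from transformers import AutoModel, AutoTokenizer
import torch

model_name = "dumitrescustefan/bert-base-romanian-cased-v1"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a Romanian sentence and run it through the encoder.
inputs = tokenizer("Acesta este un exemplu de propoziție.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: shape (1, num_tokens, hidden_size).
print(outputs.last_hidden_state.shape)

The same pattern extends to fine-tuning by swapping AutoModel for a task-specific head (for example, a sequence-classification model) and training on a labeled Romanian dataset.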
