A4 Refereed article in a conference publication
The birth of Romanian BERT
Authors: Stefan Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo
Editors: Trevor Cohn, Yulan He, Yang Liu
Conference name: Conference on Empirical Methods in Natural Language Processing
Publication year: 2020
Publisher: Association for Computational Linguistics
Book title: Findings of the Association for Computational Linguistics: EMNLP 2020
First page: 4324
Last page: 4328
ISBN: 978-1-952148-90-3
DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.387
Web address: https://www.aclweb.org/anthology/2020.findings-emnlp.387/
Self-archived copy’s web address: https://arxiv.org/abs/2009.08712
Abstract: Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, and an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, how to fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.
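The abstract notes that the accompanying repository shows how to fine-tune and use the model in production. As a rough illustration only, a minimal sketch of loading the released model through the Hugging Face transformers library might look like the following; the model identifier "dumitrescustefan/bert-base-romanian-cased-v1" is taken from the authors' public release and is an assumption here, not part of this record.

```python
# Minimal sketch: load Romanian BERT via the Hugging Face `transformers` library.
# The model identifier below is assumed from the authors' public release;
# check the paper's repository for the exact names of the cased/uncased variants.
from transformers import AutoTokenizer, AutoModel

model_name = "dumitrescustefan/bert-base-romanian-cased-v1"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a Romanian sentence and obtain contextual token embeddings.
inputs = tokenizer("Acesta este un test.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, num_tokens, hidden_size)
```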
Downloadable publication: This is an electronic reprint of the original article.