A4 Peer-reviewed article in conference proceedings
The birth of Romanian BERT
Authors: Stefan Dumitrescu, Andrei-Marius Avram, Sampo Pyysalo
Editors: Trevor Cohn, Yulan He, Yang Liu
Conference: Empirical Methods in Natural Language Processing
Year of publication: 2020
Publisher: Association for Computational Linguistics
Proceedings title: Findings of the Association for Computational Linguistics: EMNLP 2020
First page: 4324
Last page: 4328
ISBN: 978-1-952148-90-3
DOI: 10.18653/v1/2020.findings-emnlp.387
URL: https://www.aclweb.org/anthology/2020.findings-emnlp.387/
Self-archived version: https://arxiv.org/abs/2009.08712
Abstract: Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.
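The authors' repository documents how to load and use the model. As a minimal sketch of such usage with the Hugging Face transformers library (assuming the model is published on the Hugging Face Hub under the identifier shown, which is the one associated with the authors' repository; verify against the repository linked above):

    from transformers import AutoTokenizer, AutoModel

    # Model identifier assumed from the authors' public repository; verify before use.
    MODEL_ID = "dumitrescustefan/bert-base-romanian-cased-v1"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)

    # Encode a Romanian sentence and obtain contextual embeddings.
    inputs = tokenizer("Acesta este un exemplu de propoziție.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)

The same loaded model can then be fine-tuned on downstream Romanian tasks, as described in the repository.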
Downloadable publication: This is an electronic reprint of the original article.