A4 Peer-reviewed article in conference proceedings

Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification




Authors: Rönnqvist Samuel, Skantsi Valtteri, Oinonen Miika, Laippala Veronika

Editors: Simon Dobnik, Lilja Øvrelid

Established conference name: Nordic Conference on Computational Linguistics

Publication year: 2021

Journal: Linköping Electronic Conference Proceedings

Title of parent publication: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Series name: Linköping Electronic Conference Proceedings

Number in series: 178

Start page: 157

End page: 165

ISBN: 978-91-7929-614-8

ISSN: 1650-3686

Web address: https://ep.liu.se/en/conference-article.aspx?series=ecp&issue=178&Article_No=16

Self-archived copy's address: https://research.utu.fi/converis/portal/detail/Publication/56911747


Abstract

This article studies register classification of documents from the unrestricted web, such as news articles or opinion blogs, in a multilingual setting, exploring both the benefit of training on multiple languages and the capabilities for zero-shot cross-lingual transfer. While the wide range of linguistic variation found on the web poses challenges for register classification, recent studies have shown that good levels of cross-lingual transfer from the extensive English CORE corpus to other languages can be achieved. In this study, we show that training on multiple languages 1) benefits languages with limited amounts of register-annotated data, 2) on average achieves performance on par with monolingual models, and 3) greatly improves upon previous zero-shot results in Finnish, French and Swedish. The best results are achieved with the multilingual XLM-R model. As data, we use the CORE corpus series featuring register-annotated data from the unrestricted web.


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.





Last updated on 2024-11-26 at 21:46