A4 Peer-reviewed article in conference proceedings

Poro 34B and the Blessing of Multilinguality

Authors: Luukkonen, Risto; Burdge, Jonathan; Zosa, Elaine; Talman, Aarne; Komulainen, Ville; Hatanpää, Väinö; Sarlin, Peter; Pyysalo, Sampo

Editors: Johansson, Richard; Stymne, Sara

Established conference name: Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies

Publication year: 2025

Journal: NEALT Proceedings Series

Book title: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Volume: 57

First page: 367

Last page: 382

ISBN: 978-9908-53-109-0

ISSN: 1736-8197

eISSN: 1736-6305

Open access status at time of registration: Openly available

Openness of the publication channel: Fully open publication channel

URL: https://aclanthology.org/2025.nodalida-1.40/

Self-archived copy URL: https://research.utu.fi/converis/portal/detail/Publication/506554658


Abstract

The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than is available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
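
The abstract points to the open model release on Hugging Face. As a minimal illustrative sketch (not taken from the paper itself), the checkpoint named there can be loaded with the standard Hugging Face transformers API; the example prompt and generation settings below are assumptions, and running a 34-billion-parameter model requires substantial memory (the accelerate package is needed for device_map="auto").

# Minimal sketch: loading the openly released Poro 34B checkpoint with the
# standard transformers API. The repository ID comes from the abstract; the
# prompt and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard across available devices (requires accelerate)
)

# Smoke test: generate a short Finnish continuation.
inputs = tokenizer("Suomen pääkaupunki on", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))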


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.


Funding information in the publication
This project has received funding from the European Union’s Horizon Europe research and innovation programme under Grant agreement No 101070350.

