An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT) - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)

Authors: Burchell, Laurie; De Gibert Bonet, Ona; Arefyev, Nikolay; Aulamo, Mikko; Bañón, Marta; Chen, Pinzhen; Fedorova, Mariia; Guillou, Liane; Haddow, Barry; Hajič, Jan; Helcl, Jindřich; Henriksson, Erik; Klimaszewski, Mateusz; Komulainen, Ville; Kutuzov, Andrey; Kytöniemi, Joona; Laippala, Veronika; Mæhlum, Petter; Malik, Bhavitvya; Mehryary, Farrokh; Mikhailov, Vladislav; Moghe, Nikita; Myntti, Amanda; O’Brien, Dayyán; Oepen, Stephan; Pal, Proyag; Piha, Jousia; Pyysalo, Sampo; Ramírez-Sánchez, Gema; Samuel, David; Stepachev, Pavel; Tiedemann, Jörg; Variš, Dušan; Vojtěchová, Tereza; Zaragoza-Bernabeu, Jaume

Editors: Che, Wanxiang; Nabende, Joyce; Shutova, Ekaterina; Pilehvar, Mohammad Taher

Established conference name: Annual Meeting of the Association for Computational Linguistics

Publisher: Association for Computational Linguistics

Publication year: 2025

Journal: Annual Meeting of the Association for Computational Linguistics

Title of the compilation work: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

First page: 17452

Last page: 17485

ISSN: 0736-587X

DOI: https://doi.org/10.18653/v1/2025.acl-long.854

Open access status of the publication at the time of recording: Openly available

Open access status of the publication channel: Fully open access publication channel

Web address: https://doi.org/10.18653/v1/2025.acl-long.854

Address of the self-archived copy: https://research.utu.fi/converis/portal/detail/Publication/505515303

License of the self-archived copy: CC BY

Version of the self-archived publication: Publisher's version

Abstract

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

2025.acl-long.854.pdf

Funding information in the publication:
This project has received funding from the European Union’s Horizon Europe research and innovation programme under Grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].