A4 Refereed article in a conference publication

A New Massive Multilingual Dataset for High-Performance Language Technologies

Authors: de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg

Editors: Calzolari, Nicoletta; Kan, Min-Yen; Hoste, Veronique; Lenci, Alessandro; Sakti, Sakriani; Xue, Nianwen

Conference name: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)

Publisher: European Language Resources Association (ELRA)

Publication year: 2024

Journal: LREC Proceedings

Book title: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

First page: 1116

Last page: 1128

ISBN: 978-2-493814-10-4

ISSN: 2522-2686

Publication's open availability at the time of reporting: Open Access

Publication channel's open availability: Open Access publication channel

Web address: https://aclanthology.org/2024.lrec-main.100

Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/457541413

Abstract

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

Funding information in the publication:
This project has received funding from the European Union’s Horizon Europe research and innovation programme under Grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546].