A4 Peer-reviewed article in conference proceedings

Creating a parallel Finnish–Easy Finnish dataset from news articles




Authors: Dmitrieva Anna, Konovalova Aleksandra

Editors: Miquel Esplà-Gomis (Universitat d’Alacant, Spain), Mikel L. Forcada (Universitat d’Alacant, Spain), Taja Kuzman (Jožef Stefan Institute, Slovenia), Nikola Ljubešić (University of Ljubljana, Slovenia), Rik van Noord (University of Groningen, The Netherlands), Gema Ramírez-Sánchez (Prompsit Language Engineering, Spain), Jörg Tiedemann (University of Helsinki, Finland), Antonio Toral (University of Groningen, The Netherlands)

Established conference name: Workshop on Open Community-Driven Machine Translation

Publication year: 2023

Title of the parent publication: Proceedings of the 1st Workshop on Open Community-Driven Machine Translation

ISBN: 978-84-1302-228-4

Web address: https://macocu.eu/static/media/proceedings.37b7e88ce3dbab99adf9.pdf#page=27

Self-archived copy's web address: https://research.utu.fi/converis/portal/detail/Publication/180195017


Abstract

Modern natural language processing tasks such as text simplification or summarization are typically formulated as monolingual machine translation tasks. This requires appropriate datasets to train, tune, and evaluate the models. This paper describes the creation of a parallel Finnish–Easy Finnish dataset from the Yle News archives. The dataset contains 1919 manually verified pairs of articles, each containing an article in Easy Finnish (selkosuomi) and a corresponding article from Standard Finnish news. Standard Finnish texts total 687555 words, and Easy Finnish texts have 106733 words. This new aligned resource was created automatically based on the Yle News archives from the Language Bank of Finland (Kielipankki) and manually checked by a human expert. The dataset is available for download from Kielipankki. This resource will allow for more effective Easy Language research and for creating applications for automatic simplification and/or summarization of Finnish texts.


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.





Last updated on 2025-27-02 at 13:56