A4 Peer-reviewed article in conference proceedings

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora




Authors: Myntti, Amanda; Repo, Liina; Freyermuth, Elian; Kanner, Antti; Laippala, Veronika; Henriksson, Erik

Editors: Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni

Established conference name: International Conference on Natural Language Processing for Digital Humanities

Publication year: 2024

Title of the proceedings: Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

ISBN: 979-8-89176-181-0

DOI: https://doi.org/10.18653/v1/2024.nlp4dh-1.38

Address of the self-archived copy: https://research.utu.fi/converis/portal/detail/Publication/477956278


Abstract

Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Funding information in the publication
This project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant numbers 358720 and 331297.


Last updated on 2025-05-02 at 14:50