A4 Refereed article in a conference publication

Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora




Authors: Myntti, Amanda; Repo, Liina; Freyermuth, Elian; Kanner, Antti; Laippala, Veronika; Henriksson, Erik

Editors: Mika Hämäläinen, Emily Öhman, So Miyagawa, Khalid Alnajjar, Yuri Bizzoni

Conference name: International Conference on Natural Language Processing for Digital Humanities

Publication year: 2024

Book title: Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

ISBN: 979-8-89176-181-0

DOI: https://doi.org/10.18653/v1/2024.nlp4dh-1.38

Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/477956278


Abstract

Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.
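The genre-register intersection described in the abstract can be illustrated with a minimal sketch. It assumes each document already carries one predicted register label and one predicted genre label. The `predictions` list, the function names `cross_tabulate` and `genre_shares`, and the counts are hypothetical toy values for illustration, not the authors' code or results.

```python
from collections import Counter

# Hypothetical (register, genre) predictions for five documents.
# Toy values for illustration only, not results from the publication.
predictions = [
    ("Interactive Discussion", "Politics & Social Sciences"),
    ("Interactive Discussion", "Engineering & Transportation"),
    ("Lyrical", "Literature & Fiction"),
    ("Lyrical", "Literature & Fiction"),
    ("Interactive Discussion", "Politics & Social Sciences"),
]


def cross_tabulate(pairs):
    """Count how many documents fall into each (register, genre) pair."""
    return Counter(pairs)


def genre_shares(pairs, register):
    """Return each genre's share among documents with the given register."""
    counts = Counter(genre for reg, genre in pairs if reg == register)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genre: n / total for genre, n in counts.items()}


table = cross_tabulate(predictions)
shares = genre_shares(predictions, "Interactive Discussion")
```

In this toy input, the Lyrical register maps to a single genre while Interactive Discussion is split across two, which mirrors the kind of pattern the abstract reports.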


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Funding information in the publication
This project has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant numbers 358720 and 331297.


Last updated on 2025-05-02 at 14:50