A4 Peer-reviewed article in conference proceedings

Analyzing register variation in web texts through automatic segmentation




Authors: Henriksson, Erik; Hellström, Saara; Laippala, Veronika

Editors: Hämäläinen, Mika; Öhman, Emily; Bizzoni, Yuri; Miyagawa, So; Alnajjar, Khalid

Established conference name: International Conference on Natural Language Processing for Digital Humanities

Publication year: 2025

Title of parent publication: Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Start page: 7

End page: 19

ISBN: 979-8-89176-234-3

DOI: https://doi.org/10.18653/v1/2025.nlp4dh-1.2

Open access status of the publication at the time of recording: Openly available

Open access status of the publication channel: Fully open access publication channel

Web address: https://doi.org/10.18653/v1/2025.nlp4dh-1.2

Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/508751027

Self-archived copy license: CC BY

Self-archived publication version: Publisher's version


Abstract

This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.
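
The abstract does not spell out the segmentation procedure, so the following is only a minimal sketch of what a classification-based recursive binary segmentation could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the keyword-based `toy_classifier` stands in for the fine-tuned ModernBERT model, the register labels `news` and `opinion` are hypothetical, and the split criterion (mean confidence gain over the unsplit text, controlled by the hypothetical parameters `min_units` and `min_gain`) is one plausible choice among many.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Scores = Dict[str, float]
Classifier = Callable[[str], Scores]
Segment = Tuple[str, List[str]]


def toy_classifier(text: str) -> Scores:
    """Hypothetical stand-in for a document-level register classifier.

    Scores each register by counting cue words; a real system would
    return probabilities from a fine-tuned model instead.
    """
    cues = {
        "news": {"reported", "officials", "announced"},
        "opinion": {"i", "think", "believe"},
    }
    words = [w.strip(".,!?") for w in text.lower().split()]
    counts = {label: sum(w in vocab for w in words) for label, vocab in cues.items()}
    total = sum(counts.values())
    if total == 0:
        # No cue words at all: fall back to a uniform distribution.
        return {label: 1.0 / len(cues) for label in cues}
    return {label: count / total for label, count in counts.items()}


def top(scores: Scores) -> Tuple[str, float]:
    """Return the highest-scoring label and its score."""
    label = max(scores, key=scores.get)
    return label, scores[label]


def segment(
    units: Sequence[str],
    classify: Classifier,
    min_units: int = 2,   # assumed minimum segment length, in text units
    min_gain: float = 0.1,  # assumed minimum confidence gain to accept a split
) -> List[Segment]:
    """Recursively split `units` in two where a split improves confidence.

    A split is accepted only if the two halves receive different labels and
    their mean top score exceeds the top score of the unsplit text by at
    least `min_gain`. No segment-level labels are needed.
    """
    whole_label, whole_conf = top(classify(" ".join(units)))
    best_gain, best_cut = 0.0, None
    for cut in range(min_units, len(units) - min_units + 1):
        left_label, left_conf = top(classify(" ".join(units[:cut])))
        right_label, right_conf = top(classify(" ".join(units[cut:])))
        gain = (left_conf + right_conf) / 2 - whole_conf
        if left_label != right_label and gain > best_gain:
            best_gain, best_cut = gain, cut
    if best_cut is None or best_gain < min_gain:
        return [(whole_label, list(units))]
    return (
        segment(units[:best_cut], classify, min_units, min_gain)
        + segment(units[best_cut:], classify, min_units, min_gain)
    )


doc = [
    "Officials announced the plan.",
    "The agency reported delays.",
    "I think this is wrong.",
    "I believe we deserve better.",
]
segments = segment(doc, toy_classifier)
```

On this toy document the sketch yields two segments, the first labeled `news` and the second `opinion`, because splitting after the second sentence raises the classifier's confidence on both halves. Swapping `toy_classifier` for a function that wraps a real document-level model would leave `segment` unchanged, which is the point of keeping the classifier behind a plain callable.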


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




