A4 Refereed article in a conference publication
Analyzing register variation in web texts through automatic segmentation
Authors: Henriksson, Erik; Hellström, Saara; Laippala, Veronika
Editors: Hämäläinen, Mika; Öhman, Emily; Bizzoni, Yuri; Miyagawa, So; Alnajjar, Khalid
Conference name: International Conference on Natural Language Processing for Digital Humanities
Publication year: 2025
Book title: Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
First page: 7
Last page: 19
ISBN: 979-8-89176-234-3
DOI: https://doi.org/10.18653/v1/2025.nlp4dh-1.2
Publication's open availability at the time of reporting: Open Access
Publication channel's open availability: Open Access publication channel
Web address: https://doi.org/10.18653/v1/2025.nlp4dh-1.2
Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/508751027
Self-archived copy's licence: CC BY
Self-archived copy's version: Publisher's PDF
This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.