A4 Refereed article in a conference publication

Analyzing register variation in web texts through automatic segmentation




Authors Henriksson, Erik; Hellström, Saara; Laippala, Veronika

Editors Hämäläinen, Mika; Öhman, Emily; Bizzoni, Yuri; Miyagawa, So; Alnajjar, Khalid

Conference name International Conference on Natural Language Processing for Digital Humanities

Publication year 2025

Book title Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

First page 7

Last page 19

ISBN 979-8-89176-234-3

DOI https://doi.org/10.18653/v1/2025.nlp4dh-1.2

Publication's open availability at the time of reporting Open Access

Publication channel's open availability Open Access publication channel

Web address https://doi.org/10.18653/v1/2025.nlp4dh-1.2

Self-archived copy's web address https://research.utu.fi/converis/portal/detail/Publication/508751027

Self-archived copy's licence CC BY

Self-archived copy's version Publisher's PDF


Abstract

This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.
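
The idea of recursive binary segmentation driven by a document-level classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the split criterion, so the rule used here (split where the two halves receive different labels and are classified more confidently than the whole) is an assumption, as are the names `segment`, `toy_score`, `min_len`, and `min_gain`. The keyword-based `toy_score` is a hypothetical stand-in for the fine-tuned ModernBERT classifier named in the abstract.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps a span of sentences to (predicted register, confidence in [0, 1]).
Scorer = Callable[[Sequence[str]], Tuple[str, float]]
Segment = Tuple[str, List[str]]

# Hypothetical keyword lists standing in for a trained register classifier.
KEYWORDS = {
    "news": {"announced", "reported"},
    "opinion": {"think", "believe"},
}


def toy_score(sentences: Sequence[str]) -> Tuple[str, float]:
    """Toy classifier: label by keyword counts, confidence = winning share."""
    words = [w.strip(".,!?") for w in " ".join(sentences).lower().split()]
    counts = {label: sum(w in kws for w in words) for label, kws in KEYWORDS.items()}
    total = sum(counts.values())
    if total == 0:
        return "unknown", 0.5
    label = max(counts, key=counts.get)
    return label, counts[label] / total


def segment(
    sentences: Sequence[str],
    score: Scorer,
    min_len: int = 2,       # assumed minimum segment length, in sentences
    min_gain: float = 0.05,  # assumed minimum confidence gain to accept a split
) -> List[Segment]:
    """Recursively split a document at the most beneficial register shift."""
    label, conf = score(sentences)
    best_gain, best_i = None, None
    # Try every split point that leaves at least min_len sentences on each side.
    for i in range(min_len, len(sentences) - min_len + 1):
        l_label, l_conf = score(sentences[:i])
        r_label, r_conf = score(sentences[i:])
        if l_label == r_label:
            continue  # no register shift at this boundary
        gain = (l_conf + r_conf) / 2 - conf
        if gain >= min_gain and (best_gain is None or gain > best_gain):
            best_gain, best_i = gain, i
    if best_i is None:
        return [(label, list(sentences))]  # keep the span as a single unit
    # Binary split accepted: recurse into both halves.
    return (
        segment(sentences[:best_i], score, min_len, min_gain)
        + segment(sentences[best_i:], score, min_len, min_gain)
    )


doc = [
    "The minister announced the budget.",
    "Officials reported a deficit.",
    "I think this is terrible.",
    "I believe we deserve better.",
]
segments = segment(doc, toy_score)
```

In this toy document the whole text is scored ambiguously, while the two halves are each scored confidently with different labels, so a single split is accepted and neither half is divided further. Replacing `toy_score` with a real document classifier would only require a function that returns a label and a confidence for a span of sentences.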


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.





Last updated on 12/03/2026 01:39:41 PM