G5 Doctoral dissertation (article-based)

Machine learning in modeling historical registers – A new perspective to text linguistics




Authors: Repo, Liina

Place of publication: Turku

Publication year: 2026

Series name: Turun yliopiston julkaisuja - Annales Universitatis B: Humaniora

Number in series: 759

ISBN: 978-952-02-0514-0

eISBN: 978-952-02-0515-7

ISSN: 0082-6987

eISSN: 2343-3191

Open access status at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://urn.fi/URN:ISBN:978-952-02-0515-7


Abstract

This dissertation explores the insights into historical linguistic variation that can be gained through automatically identifying registers in large historical corpora, as well as the role of register variation in shaping these insights. Registers, i.e., situationally defined text varieties, are central to interpreting linguistic variation. This thesis investigates how existing annotated resources can be leveraged to enrich unannotated datasets, how variation between and within registers affects prediction reliability, and how linguistically interpretable features can deepen our understanding of register-specific language use.

Across three studies, this thesis integrates supervised machine learning with qualitative feature analysis, training models on the manually annotated Corpus of Founding Era American English (COFEA) and applying them to the large, heterogeneous Eighteenth Century Collections Online (ECCO). Study I models register variation within COFEA and demonstrates the feasibility of automatic register classification for historical texts, with feature analyses confirming that the model acquires meaningful, register-specific patterns (e.g., verbal and interpersonal features in letters). Study II extends the classification to ECCO, showing that models trained on COFEA generalize to ECCO for well-defined registers (e.g., letters, cases) but face challenges with hybrid categories, corpus-specific differences, and OCR-induced noise. The model explainability method Integrated Gradients highlights shared situational and linguistic cues behind both correct predictions and systematic misclassifications. Study III shifts focus to intra-document variation, demonstrating that text beginnings are most reliable for register prediction and that models capture stable, meaningful linguistic patterns across text segments. Keyword analyses confirm stable, linguistically motivated cues (e.g., interpersonal and informational features in letters) that persist across text parts.

Together, the studies offer new methods for enriching historical corpora with register information. Moreover, the results clarify how register variation shapes model behavior and deliver interpretable linguistic insights that strengthen corpus usability for research in historical linguistics, legal history, and digital humanities.


