A1 Refereed original research article in a scientific journal

From keywords to key embeddings – contrasting French and Swedish web registers using multilingual deep learning




Authors: Hellstrom, Saara; Skantsi, Valtteri; Salmela, Anna; Laippala, Veronika

Publisher: Walter de Gruyter GmbH

Publishing place: BERLIN

Publication year: 2025

Journal: Corpus Linguistics and Linguistic Theory

Journal name in source: Corpus Linguistics and Linguistic Theory

Journal acronym: CORPUS LINGUIST LING

Number of pages: 33

ISSN: 1613-7027

eISSN: 1613-7035

DOI: https://doi.org/10.1515/cllt-2024-0070

Web address: https://doi.org/10.1515/cllt-2024-0070


Abstract
The pervasiveness of the internet has given web language use a central role in society. However, the lack of multilingual corpora and scalable methods has led web language research to focus on English. To address this gap, the present paper situates itself in the register research tradition and explores French and Swedish web registers from a cross-linguistic angle. Methodologically, we combine keyword analysis with multilingual deep learning, suggesting an approach that enables computational comparisons across languages. Specifically, we extract keywords for French and Swedish web registers, associate the keywords with fastText word embeddings, and finally cluster these key embeddings. The findings reveal topical and functional clusters that are linguistically motivated and multilingual. The same clusters occur within the same registers in both languages, pointing to shared topical and functional characteristics: the registers are strikingly similar. The dissimilarities, in contrast, indicate that certain registers, such as Narrative blogs, differ to some extent between French and Swedish. Moreover, grammatical specificities such as the position of adjectives explain some dissimilarities.
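
The three steps named in the abstract (keyword extraction, embedding lookup, clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the keyness measure (a smoothed log-ratio), the greedy cosine-threshold clustering, the miniature corpora, and the hand-made `TOY_VECTORS` table standing in for cross-lingually aligned fastText vectors are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def keywords(target_docs, reference_docs, top_n=2):
    """Rank target words by a smoothed log-ratio of relative frequencies.

    Assumption: the abstract does not name the keyness measure; log-ratio
    is used here only for illustration.
    """
    tgt = Counter(w for d in target_docs for w in d.lower().split())
    ref = Counter(w for d in reference_docs for w in d.lower().split())
    n_tgt, n_ref = sum(tgt.values()), sum(ref.values())
    vocab = len(set(tgt) | set(ref))

    def score(word):
        p_tgt = (tgt[word] + 1) / (n_tgt + vocab)  # add-one smoothing
        p_ref = (ref[word] + 1) / (n_ref + vocab)
        return math.log2(p_tgt / p_ref)

    # Highest keyness first; ties are broken alphabetically.
    return sorted(tgt, key=lambda w: (-score(w), w))[:top_n]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(words, vectors, threshold=0.8):
    """Greedy clustering: a word joins the first cluster whose seed word is
    at least `threshold` cosine-similar, otherwise it starts a new cluster.
    Assumption: a stand-in for whatever clustering the paper uses.
    """
    clusters = []
    for word in words:
        if word not in vectors:
            continue  # skip words without an embedding
        for members in clusters:
            if cosine(vectors[word], vectors[members[0]]) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


# Hypothetical 2-d vectors imitating a shared French-Swedish embedding space.
TOY_VECTORS = {
    "farine": (0.90, 0.10), "mjöl": (0.88, 0.12),
    "recette": (0.10, 0.90), "recept": (0.12, 0.88),
}

# Invented miniature "register" corpora (recipe-like texts vs. other texts).
fr_recipe = ["ajouter la farine dans la recette", "la recette demande de la farine"]
fr_other = ["acheter la voiture", "la ville et la voiture"]
sv_recipe = ["blanda mjöl enligt recept", "recept med mjöl och smör"]
sv_other = ["köpa en bil", "en stad och en bil"]

fr_keys = keywords(fr_recipe, fr_other)
sv_keys = keywords(sv_recipe, sv_other)
key_clusters = cluster(fr_keys + sv_keys, TOY_VECTORS)
print(key_clusters)
```

In this toy setting each cluster ends up containing a French keyword and its Swedish counterpart, which is the kind of multilingual cluster the abstract describes. With real data, the toy table would be replaced by pretrained vectors mapped into a common space.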



Last updated on 2025-03-18 at 14:48