A1 Refereed original research article in a scientific journal
From keywords to key embeddings – contrasting French and Swedish web registers using multilingual deep learning
Authors: Hellstrom, Saara; Skantsi, Valtteri; Salmela, Anna; Laippala, Veronika
Publisher: Walter de Gruyter GmbH
Publishing place: Berlin
Publication year: 2025
Journal: Corpus Linguistics and Linguistic Theory
Journal name in source: Corpus Linguistics and Linguistic Theory
Journal acronym: CORPUS LINGUIST LING
Number of pages: 33
ISSN: 1613-7027
eISSN: 1613-7035
DOI: https://doi.org/10.1515/cllt-2024-0070
Web address: https://doi.org/10.1515/cllt-2024-0070
Abstract
The pervasiveness of the internet has given web language use a central role in society. However, the lack of multilingual corpora and scalable methods has led web language research to focus on English. To address this gap, the present paper situates itself in the register research tradition and explores French and Swedish web registers from a cross-linguistic angle. Methodologically, we combine keyword analysis with multilingual deep learning, suggesting an approach that enables computational comparisons across languages. Specifically, we extract keywords for French and Swedish web registers, associate the keywords with fastText word embeddings, and finally cluster these key embeddings. The findings indicate that there are topical and functional clusters, and that they are linguistically motivated and multilingual. The same clusters occur within the same registers in both languages, pointing to shared topical and functional similarities: the registers are strikingly similar. The dissimilarities, in contrast, indicate that certain registers, such as Narrative blogs, differ to some extent between French and Swedish. Moreover, grammatical specificities such as the position of adjectives explain some dissimilarities.
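The three steps named in the abstract (keyword extraction, mapping keywords to embeddings, clustering the key embeddings) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the abstract does not state the keyness measure or the clustering algorithm, so a smoothed log-ratio and a small k-means are used; the hand-written 2-D `TOY_VECTORS` stand in for fastText vectors in a shared multilingual space; and the tiny corpora and the register names `howto` and `blog` are hypothetical.

```python
import math
from collections import Counter

def keywords(target_tokens, reference_tokens, top_n=2):
    """Rank target words by a smoothed log-ratio of relative frequencies.
    Assumed keyness measure; the abstract does not name one."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = len(target_tokens), len(reference_tokens)
    def score(word):
        return math.log2(((t[word] + 0.5) / nt) / ((r[word] + 0.5) / nr))
    return sorted(t, key=score, reverse=True)[:top_n]

def kmeans(points, k, iters=10):
    """Small deterministic k-means with farthest-point initialisation.
    Assumed clustering method; the abstract does not name one."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

# Hypothetical toy corpora: two registers per language.
corpora = {
    ("fr", "howto"): "recette four recette four le le".split(),
    ("fr", "blog"): "je hier je hier le le".split(),
    ("sv", "howto"): "recept ugn recept ugn och och".split(),
    ("sv", "blog"): "jag idag jag idag och och".split(),
}

# Hand-written 2-D stand-ins for multilingual word vectors (not real fastText output).
TOY_VECTORS = {
    "recette": (0.90, 0.10), "four": (0.85, 0.20),
    "recept": (0.88, 0.12), "ugn": (0.86, 0.18),
    "je": (0.10, 0.90), "hier": (0.20, 0.85),
    "jag": (0.12, 0.88), "idag": (0.18, 0.86),
}

# Step 1: keywords of each register against the other register of the same language.
key_words = []
for (lang, register), tokens in corpora.items():
    other = "blog" if register == "howto" else "howto"
    key_words.extend(keywords(tokens, corpora[(lang, other)]))

# Step 2: look up the "key embeddings". Step 3: cluster them.
labels = kmeans([TOY_VECTORS[w] for w in key_words], k=2)

clusters = {}
for word, label in zip(key_words, labels):
    clusters.setdefault(label, set()).add(word)

for label in sorted(clusters):
    print(label, sorted(clusters[label]))
```

In this toy setting each cluster mixes French and Swedish keywords from the same register, which is the kind of multilingual cluster the abstract describes. With real data, the toy vectors would be replaced by pretrained embeddings and the corpora by register-annotated web texts.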