TCBLex - A lexical database of Finnish literary texts for children
Authors: Nojonen, Tapio; Korsu, Kiia; Ginter, Filip; Laippala, Veronika; Kanerva, Jenna
Publisher: Springer Science and Business Media LLC
Year: 2025
Journal: Behavior Research Methods
Article number: 312
Volume: 57
ISSN: 1554-351X
eISSN: 1554-3528
DOI: https://doi.org/10.3758/s13428-025-02832-x
Research portal record: https://research.utu.fi/converis/portal/detail/Publication/504652992
This work introduces TCBLex, a lexical database of Finnish literary works read by children between the ages of 7 and 15. We explain in detail the work done to build the corpus TCBLex is based on, including how books were sampled and collected, turned into text files, and finally processed. We also touch on legal considerations and how it is possible to build such a corpus in the EU. TCBLex contains over 11 million tokens that are lemmatized and annotated with part-of-speech tags. We provide 14 different sub-lexicons in total, covering individual intended reading ages, age groups, and different genres. We also provide versions with additional morphological information, such as the cases and tenses of words. TCBLex provides various psycholinguistically interesting lexical statistics for both word types and lemmas, such as different frequency metrics, distributions, word lengths, numbers of syllables, morphological paradigm sizes, and, for the first time in a Finnish lexicon, the ages at which words and lemmas are first encountered in books. TCBLex is freely available at https://doi.org/10.5281/zenodo.15655580.
Funding information:
Open Access funding provided by University of Turku (including Turku University Central Hospital). The present study is a part of the EDUCA Flagship funded by the Research Council of Finland (#358924, #358947) and the EDUCA-Doc Doctoral Education pilot funded by the Ministry of Education and Culture (Doctoral school pilot #VN/3137/2024-OKM-4).