Från dialektinspelning till talspråkskorpus – beskrivning av ett korpusbygge





Lisa Södergård, Therese Leinonen

J.-O. Östman et al.

Helsinki

2017

Ideologi, identitet, intervention. Tionde nordiska dialektologkonferensen

Nordica Helsingiensia

48

978-951-51-2996-3

1795-4428

https://research.utu.fi/converis/portal/detail/Publication/2315952



The Talko corpus of Swedish spoken in Finland is a new research tool consisting of audio files linked to annotation, i.e., transcriptions on two parallel levels and part-of-speech tagging. The corpus is searchable through a web-based interface. The recordings were made in 2005–2008 in all parts of Swedish-language Finland. They have been transcribed in a broad phonetic transcription as well as in a standard orthographic transcription. The part-of-speech tagging is done with TreeTagger, trained on the Stockholm-Umeå Corpus of written Swedish. The automatically produced part-of-speech tags are manually corrected for subsets of the data, and the manually corrected data are subsequently added to the training data. This will gradually improve the result of the automatic tagging and compensate for differences between spoken and written Swedish and between Finland-Swedish and Sweden-Swedish.
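
The correct-and-retrain cycle described in the abstract can be illustrated with a minimal sketch. It does not use TreeTagger: a toy most-frequent-tag lexicon stands in for the real tagger, and the function names (`train`, `tag`, `retrain_with_corrections`), the example tokens, and the fallback tag are assumptions made purely for illustration, not taken from the Talko project.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Build a most-frequent-tag lexicon from sentences of (token, tag) pairs.

    Toy stand-in for training a real tagger such as TreeTagger.
    """
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, pos in sentence:
            counts[token.lower()][pos] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


def tag(lexicon, tokens, default="NN"):
    """Tag tokens with the lexicon; unknown tokens get the fallback tag."""
    return [(token, lexicon.get(token.lower(), default)) for token in tokens]


def retrain_with_corrections(training_data, corrected_sentences):
    """Add manually corrected sentences to the training data and retrain."""
    updated = training_data + corrected_sentences
    return updated, train(updated)


# Hypothetical written-language training data (invented for this sketch).
written = [[("jag", "PN"), ("har", "VB"), ("inte", "AB"), ("sett", "VB"), ("den", "PN")]]
lexicon = train(written)

# Invented spoken forms that the written-language lexicon does not cover.
spoken = ["ja", "ha", "int", "sett", "an"]
automatic = tag(lexicon, spoken)

# A human annotator corrects the automatic tags for this subset.
corrected = [[("ja", "PN"), ("ha", "VB"), ("int", "AB"), ("sett", "VB"), ("an", "PN")]]

# The corrected subset is fed back into the training data.
training, lexicon = retrain_with_corrections(written, corrected)
retagged = tag(lexicon, spoken)
```

In this toy run the first pass assigns the fallback tag to every spoken form the written-language lexicon has not seen, while the second pass reproduces the corrected tags. A real tagger generalises beyond a lookup table, but the feedback loop has the same shape.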

