A broad-coverage corpus for Finnish named entity recognition
Authors: Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo
Editors: Calzolari N., Bechet F., Blache P., Choukri K., Cieri C., Declerck T., Goggi S., Isahara H., Maegaard B., Mariani J., Mazo H., Moreno A., Odijk J., Piperidis S.
Conference series: International Conference on Language Resources and Evaluation
Publisher: European Language Resources Association (ELRA)
Year: 2020
Conference: 12th International Conference on Language Resources and Evaluation
Proceedings: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
Pages: 4615-4624
ISBN: 979-10-95546-34-4
URL: https://www.aclweb.org/anthology/2020.lrec-1.567/
We present a new manually annotated corpus for broad-coverage named entity recognition (NER) for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies over 10,000 mentions in total. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges, such as the identification of names in blog posts and transcribed speech, are also discussed. The Turku NER corpus and related resources introduced in this work are released under open licenses via https://turkunlp.org/turku-ner-corpus.
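
The exact-match F-score reported for inter-annotator agreement can be illustrated with a minimal sketch. It assumes each mention is represented as a (start token, end token, entity type) triple and that one annotator is treated as the reference; the function name, the type labels and the toy data are hypothetical, and the evaluation scripts actually used for the corpus may differ.

```python
from typing import Set, Tuple

# A mention as (start_token, end_token, entity_type).
# This representation is an assumption made for illustration.
Mention = Tuple[int, int, str]


def exact_match_f1(reference: Set[Mention], other: Set[Mention]) -> float:
    """F-score where a mention counts only if span and type both match exactly."""
    true_pos = len(reference & other)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(other)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy annotations, invented for illustration (not taken from the corpus).
annotator_a = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")}
annotator_b = {(0, 2, "PER"), (5, 7, "LOC"), (9, 10, "ORG")}

# The LOC mentions differ in their end boundary, so they do not count as a match.
score = exact_match_f1(annotator_a, annotator_b)
```

Under exact matching, a mention whose boundary is off by a single token counts as a miss for both annotators, which makes this a stricter criterion than overlap-based matching.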