Refereed article in conference proceedings (A4)

A broad-coverage corpus for Finnish named entity recognition




List of Authors: Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo

Conference name: International Conference on Language Resources and Evaluation

Publisher: European Language Resources Association (ELRA)

Publication year: 2020

Book title: 12th International Conference on Language Resources and Evaluation

Journal name in source: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

ISBN: 979-10-95546-34-4

URL: https://www.aclweb.org/anthology/2020.lrec-1.567/


Abstract

We present a new manually annotated corpus for broad-coverage named entity recognition (NER) for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies over 10,000 mentions in total. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges, such as the identification of names in blog posts and transcribed speech, are also identified. The Turku NER corpus and related resources introduced in this work are released under open licenses via https://turkunlp.org/turku-ner-corpus.


Last updated on 2022-06-13 at 16:21