A broad-coverage corpus for Finnish named entity recognition - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

A broad-coverage corpus for Finnish named entity recognition

Authors: Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo

Editors: Calzolari N., Bechet F., Blache P., Choukri K., Cieri C., Declerck T., Goggi S., Isahara H., Maegaard B., Mariani J., Mazo H., Moreno A., Odijk J., Piperidis S.

Established conference name: International Conference on Language Resources and Evaluation

Publisher: European Language Resources Association (ELRA)

Publication year: 2020

Title of the collected work: 12th International Conference on Language Resources and Evaluation

Journal name in the database: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

First page: 4615

Last page: 4624

ISBN: 979-10-95546-34-4

Web address: https://www.aclweb.org/anthology/2020.lrec-1.567/

Abstract

We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies over 10,000 mentions in total. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges, such as the recognition of names in blog posts and transcribed speech, are also identified. The newly introduced Turku NER corpus and related resources are released under open licenses via https://turkunlp.org/turku-ner-corpus.
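
As an illustration of the exact-match F-score mentioned in the abstract, the following is a minimal sketch of how agreement between two annotators could be computed over entity spans. It is not the authors' evaluation code: the function name `exact_match_f1`, the `(start, end, label)` span representation, the label strings, and the toy data are all assumptions made for this example, as is the choice to require the entity type to match as well as the boundaries.

```python
def exact_match_f1(spans_a, spans_b):
    """Exact-match F-score between two sets of entity spans.

    Each span is a hypothetical (start_token, end_token, label) tuple.
    Annotator A is treated as the reference and annotator B as the
    prediction; a span counts as matched only if boundaries and label
    are both identical (an assumption of this sketch).
    """
    a, b = set(spans_a), set(spans_b)
    matched = len(a & b)
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Toy data: the annotators disagree on the type of the second mention.
annotator_a = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (14, 15, "DATE")]
annotator_b = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (14, 15, "DATE")]

score = exact_match_f1(annotator_a, annotator_b)
print(f"{score:.3f}")  # 3 of 4 spans match on both sides, so F = 0.750
```

Because the F-score is symmetric in precision and recall, swapping which annotator is treated as the reference does not change the result.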