A4 Peer-reviewed article in conference proceedings

A broad-coverage corpus for Finnish named entity recognition




Authors: Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo

Editors: Calzolari N., Bechet F., Blache P., Choukri K., Cieri C., Declerck T., Goggi S., Isahara H., Maegaard B., Mariani J., Mazo H., Moreno A., Odijk J., Piperidis S.

Established conference name: International Conference on Language Resources and Evaluation

Publisher: European Language Resources Association (ELRA)

Publication year: 2020

Book title: 12th International Conference on Language Resources and Evaluation

Journal name in source database: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings

First page: 4615

Last page: 4624

ISBN: 979-10-95546-34-4

Web address: https://www.aclweb.org/anthology/2020.lrec-1.567/


Abstract

We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies over 10,000 mentions in total. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges, such as the identification of names in blog posts and transcribed speech, are also noted. The Turku NER corpus and related resources introduced in this work are released under open licenses via https://turkunlp.org/turku-ner-corpus.



Last updated on 2024-11-26 at 22:43