A4 Peer-reviewed article in conference proceedings
A broad-coverage corpus for Finnish named entity recognition
Authors: Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo
Editors: Calzolari N., Bechet F., Blache P., Choukri K., Cieri C., Declerck T., Goggi S., Isahara H., Maegaard B., Mariani J., Mazo H., Moreno A., Odijk J., Piperidis S.
Established conference name: International Conference on Language Resources and Evaluation
Publisher: European Language Resources Association (ELRA)
Publication year: 2020
Book title: 12th International Conference on Language Resources and Evaluation
Journal name in source database: LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
First page: 4615
Last page: 4624
ISBN: 979-10-95546-34-4
URL: https://www.aclweb.org/anthology/2020.lrec-1.567/
We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish. Building on the original Universal Dependencies Finnish corpus
of 754 documents (200,000 tokens) representing ten different genres of
text, we introduce annotation marking person, organization, location,
product and event names as well as dates. The new annotation identifies
over 10,000 mentions in total. An evaluation of inter-annotator
agreement indicates that the quality and consistency of annotation are
high, at 94.5% F-score for exact match. A
comprehensive evaluation using state-of-the-art machine learning
methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity
mentions in texts drawn from most domains at precision and recall
approaching or exceeding 90%. Remaining challenges such as the
identification of names in blog posts and transcribed speech are also
identified. The Turku NER corpus and related resources introduced in this work are released under open licenses via https://turkunlp.org/turku-ner-corpus.
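
The exact-match F-score reported above for inter-annotator agreement can be illustrated with a minimal sketch. It assumes each annotator's output is a set of (start token, end token, entity type) spans and counts a mention as agreed only when all three fields match. The span representation, the label strings, the function name exact_match_f1 and the toy data are hypothetical and are not taken from the paper or the corpus tooling.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_token, end_token, entity_type).
Span = Tuple[int, int, str]


def exact_match_f1(reference: Set[Span], other: Set[Span]) -> float:
    """F-score between two annotation sets under exact matching.

    A mention counts as agreed only if its boundaries and its type
    are identical in both sets.
    """
    if not reference and not other:
        return 1.0  # two empty annotations agree trivially
    matched = len(reference & other)
    precision = matched / len(other) if other else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (invented): the annotators disagree on one type and one mention.
annotator_a = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
annotator_b = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG"), (14, 15, "DATE")}

score = exact_match_f1(annotator_a, annotator_b)
print(f"exact-match F-score: {score:.3f}")
```

Because F-score is symmetric in precision and recall, swapping which annotator is treated as the reference gives the same value.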