S1000: a better taxonomic name corpus for biomedical information extraction - UTU Research Information System

A1 Refereed original research article in a scientific journal

S1000: a better taxonomic name corpus for biomedical information extraction

Authors: Luoma Jouni, Nastou Katerina, Ohta Tomoko, Toivonen Harttu, Pafilis Evangelos, Jensen Lars Juhl, Pyysalo Sampo

Publisher: OXFORD UNIV PRESS

Publication year: 2023

Journal: Bioinformatics

Journal name in the database: BIOINFORMATICS

Journal acronym: BIOINFORMATICS

Article number: btad369

Volume: 39

Issue: 6

Number of pages: 8

ISSN: 1367-4803

eISSN: 1367-4811

DOI: https://doi.org/10.1093/bioinformatics/btad369

Open access status at the time of recording: Openly available

Openness of the publication channel: Fully open publication channel

Web address: https://doi.org/10.1093/bioinformatics/btad369

Address of the self-archived copy: https://research.utu.fi/converis/portal/detail/Publication/180376416

License of the self-archived copy: CC BY

Version of the self-archived publication: Publisher's version

Abstract

Motivation

The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.

Results

We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods.
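
For readers unfamiliar with the metric, the F-score reported above is conventionally the harmonic mean of precision and recall over recognized mentions. The following is a minimal sketch of that calculation, assuming exact-match comparison of mention spans; this record does not describe the study's actual evaluation protocol. The function name `f_score`, the span representation, and the toy data are illustrative assumptions, not taken from S1000.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F1) for exact-match spans.

    Each span is a hypothetical (doc_id, start, end) tuple; exact-match
    scoring is an assumption, not necessarily the paper's protocol.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations for illustration only (not S1000 data).
gold = [("doc1", 0, 12), ("doc1", 30, 42), ("doc2", 5, 17)]
predicted = [("doc1", 0, 12), ("doc2", 5, 17), ("doc2", 40, 48)]
precision, recall, f1 = f_score(gold, predicted)
```

In this toy case two of the three predicted spans match gold spans, and two of the three gold spans are found, so precision, recall, and F1 all equal 2/3.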

Availability and implementation

All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

btad369.pdf