A1 Refereed original research article in a scientific journal
S1000: a better taxonomic name corpus for biomedical information extraction
Authors: Luoma Jouni, Nastou Katerina, Ohta Tomoko, Toivonen Harttu, Pafilis Evangelos, Jensen Lars Juhl, Pyysalo Sampo
Publisher: OXFORD UNIV PRESS
Publication year: 2023
Journal: Bioinformatics
Journal name in source: BIOINFORMATICS
Journal acronym: BIOINFORMATICS
Article number: btad369
Volume: 39
Issue: 6
Number of pages: 8
ISSN: 1367-4803
eISSN: 1367-4811
DOI: https://doi.org/10.1093/bioinformatics/btad369
Web address: https://doi.org/10.1093/bioinformatics/btad369
Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/180376416
Motivation
The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.
Results
We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods.
Availability and implementation
All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.
Downloadable publication
This is an electronic reprint of the original article.