A1 Peer-reviewed original article in a scientific journal
RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature
Authors: Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl
Publisher: OXFORD UNIV PRESS
Place of publication: OXFORD
Publication year: 2024
Journal: Database: The Journal of Biological Databases and Curation
Journal name in the database: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
Journal acronym: DATABASE-OXFORD
Article number: baae095
Volume: 2024
Number of pages: 7
ISSN: 1758-0463
DOI: https://doi.org/10.1093/database/baae095
Web address: https://doi.org/10.1093/database/baae095
Self-archived copy's address: https://research.utu.fi/converis/portal/detail/Publication/458222413
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
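To make the reported F1-score concrete, the following is a minimal sketch of how a typed, directed relation can be represented and how an F1-score over such relations is commonly computed. It is an illustration only: the `Relation` fields, the entity names, and the relation labels are hypothetical and are not taken from the RegulaTome annotation schema, and the exact-match scoring shown here is an assumption, not necessarily the evaluation protocol used in the article.

```python
from typing import NamedTuple, Set


class Relation(NamedTuple):
    """A hypothetical typed, directed relation between two entity mentions."""
    doc_id: str
    source: str    # entity the relation starts from
    target: str    # entity the relation points to
    rel_type: str  # illustrative label; the sign is encoded in the label


def f1_score(gold: Set[Relation], predicted: Set[Relation]) -> float:
    """Exact-match F1: a prediction counts only if document, direction,
    and type all agree with a gold relation."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only (not from the corpus).
gold = {
    Relation("doc1", "ProteinA", "ProteinB", "positive_regulation"),
    Relation("doc1", "ProteinB", "ProteinC", "negative_regulation"),
    Relation("doc2", "ChemicalX", "ProteinA", "binding"),
}
predicted = {
    Relation("doc1", "ProteinA", "ProteinB", "positive_regulation"),
    # Reversed direction, so it does not match the gold relation.
    Relation("doc1", "ProteinC", "ProteinB", "negative_regulation"),
    Relation("doc2", "ChemicalX", "ProteinA", "binding"),
}

score = f1_score(gold, predicted)
print(f"F1 = {score:.3f}")
```

In this sketch, a prediction with the right entities and type but the wrong direction is counted as an error, which is one reason why scoring directed and signed relations is harder than scoring untyped pairs.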
Downloadable publication: This is an electronic reprint of the original article.
Funding information in the publication:
This project has received funding from the Novo Nordisk Foundation (grant no.: NNF14CC0001) and from the Academy of Finland (grant no.: 332844). K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie (grant no.: 101023676).