A1 Peer-reviewed original article in a scientific journal

RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature




Authors: Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl

Publisher: OXFORD UNIV PRESS

Place of publication: OXFORD

Publication year: 2024

Journal: Database: The Journal of Biological Databases and Curation

Journal name in database: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION

Journal acronym: DATABASE-OXFORD

Article number: baae095

Volume: 2024

Number of pages: 7

ISSN: 1758-0463

DOI: https://doi.org/10.1093/database/baae095

Web address: https://doi.org/10.1093/database/baae095

Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/458222413


Abstract
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Funding information in the publication
This project has received funding from the Novo Nordisk Foundation (grant no.: NNF14CC0001) and from the Academy of Finland (grant no.: 332844). K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under a Marie Sklodowska-Curie grant (grant no.: 101023676).


Last updated on 2025-01-27 at 19:15