RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature
Authors: Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl
Publisher: OXFORD UNIV PRESS
Place of publication: OXFORD
Year: 2024
Journal: Database: The Journal of Biological Databases and Curation
: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
: DATABASE-OXFORD
Article number: baae095
: 2024
: 7
ISSN: 1758-0463
DOI: https://doi.org/10.1093/database/baae095
: https://research.utu.fi/converis/portal/detail/Publication/458222413
In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
Funding:
This project has received funding from the Novo Nordisk Foundation (grant no.: NNF14CC0001) and from the Academy of Finland (grant no.: 332844). K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under a Marie Sklodowska-Curie grant (grant no.: 101023676).