RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature - UTU Research Information System

A1 Peer-reviewed original article in a scientific journal

RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature

Authors: Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl

Publisher: OXFORD UNIV PRESS

Place of publication: OXFORD

Publication year: 2024

Journal: Database: The Journal of Biological Databases and Curation

Journal name in the database: DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION

Journal acronym: DATABASE-OXFORD

Article number: baae095

Volume: 2024

Number of pages: 7

ISSN: 1758-0463

DOI: https://doi.org/10.1093/database/baae095

Open access status of the publication at the time of recording: Openly available

Open access status of the publication channel: Fully open access publication channel

Web address: https://doi.org/10.1093/database/baae095

Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/458222413

Abstract

In the field of biomedical text mining, the ability to extract relations from the literature is crucial for advancing both theoretical research and practical applications. There is a notable shortage of corpora designed to enhance the extraction of multiple types of relations, particularly focusing on proteins and protein-containing entities such as complexes and families, as well as chemicals. In this work, we present RegulaTome, a corpus that overcomes the limitations of several existing biomedical relation extraction (RE) corpora, many of which concentrate on single-type relations at the sentence level. RegulaTome stands out by offering 16 961 relations annotated in >2500 documents, making it the most extensive dataset of its kind to date. This corpus is specifically designed to cover a broader spectrum of >40 relation types beyond those traditionally explored, setting a new benchmark in the complexity and depth of biomedical RE tasks. Our corpus both broadens the scope of detected relations and allows for achieving noteworthy accuracy in RE. A transformer-based model trained on this corpus has demonstrated a promising F1-score (66.6%) for a task of this complexity, underscoring the effectiveness of our approach in accurately identifying and categorizing a wide array of biological relations. This achievement highlights RegulaTome's potential to significantly contribute to the development of more sophisticated, efficient, and accurate RE systems to tackle biomedical tasks. Finally, a run of the trained RE system on all PubMed abstracts and PMC Open Access full-text documents resulted in >18 million relations, extracted from the entire biomedical literature.
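As an illustration of what an F1-score over typed, directed, and signed relations measures, the following minimal sketch scores a set of predicted relations against a gold set. It is not the authors' evaluation code: the exact-match criterion, the `Relation` fields, the entity names, and the relation labels (`regulates`, `binds`) are all assumptions made for this example.

```python
from typing import NamedTuple, Set, Tuple


class Relation(NamedTuple):
    """Hypothetical record for one typed, directed, signed relation."""
    doc_id: str
    source: str    # entity the relation points from
    target: str    # entity the relation points to
    rel_type: str  # illustrative label, not taken from the corpus schema
    sign: str      # "+", "-", or "" when the relation carries no sign


def precision_recall_f1(
    gold: Set[Relation], predicted: Set[Relation]
) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if every field agrees."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    Relation("doc1", "A", "B", "regulates", "+"),
    Relation("doc1", "B", "C", "regulates", "-"),
    Relation("doc1", "C", "D", "binds", ""),
}
predicted = {
    Relation("doc1", "A", "B", "regulates", "+"),  # matches gold
    Relation("doc1", "C", "B", "regulates", "-"),  # direction reversed: no match
    Relation("doc1", "C", "D", "binds", ""),       # matches gold
    Relation("doc1", "A", "D", "regulates", "+"),  # spurious prediction
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this strict matching, a relation with the correct entities and type but the wrong direction or sign counts as both a false positive and a false negative, which is one reason scores on a task with many relation types tend to be lower than on single-type tasks.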

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

baae095.pdf

Funding information in the publication:
This project has received funding from the Novo Nordisk Foundation (Grant no.: NNF14CC0001) and from the Academy of Finland (grant no.: 332844). K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie (grant no.: 101023676).