A1 Refereed original research article in a scientific journal

STRING-ing together protein complexes: Corpus and methods for extracting physical protein interactions from the biomedical literature




Authors: Mehryary, Farrokh; Nastou, Katerina; Ohta, Tomoko; Jensen, Lars Juhl; Pyysalo, Sampo

Publisher: Oxford University Press

Publication year: 2024

Journal: Bioinformatics

Journal name in source: Bioinformatics (Oxford, England)

Journal acronym: Bioinformatics

Article number: btae552

Volume: 40

Issue: 9

ISSN: 1367-4803

eISSN: 1367-4811

DOI: https://doi.org/10.1093/bioinformatics/btae552

Web address: https://doi.org/10.1093/bioinformatics/btae552

Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/457893544


Abstract

MOTIVATION: Understanding biological processes relies heavily on curated knowledge of physical interactions between proteins. Yet, a notable gap remains between the information stored in databases of curated knowledge and the plethora of interactions documented in the scientific literature.

RESULTS: To bridge this gap, we introduce ComplexTome, a manually annotated corpus designed to facilitate the development of text-mining methods for extracting complex formation relationships among biomedical entities, targeting the downstream semantics of the physical interaction sub-network of the STRING database. This corpus comprises 1,287 documents with ∼3,500 relationships. We train a novel relation extraction model on this corpus and find that it identifies physical protein interactions with high reliability (F1-score = 82.8%). We additionally enhance the model's capabilities through unsupervised trigger word detection and apply it to extract relations and trigger words for these relations from all open publications in the domain literature. This information has been fully integrated into the latest version of the STRING database.

AVAILABILITY AND IMPLEMENTATION: We provide the corpus, code, and all results produced by the large-scale runs of our systems on biomedical literature via Zenodo https://doi.org/10.5281/zenodo.8139716, GitHub https://github.com/farmeh/ComplexTome_extraction, and the latest version of the STRING database https://string-db.org/.

SUPPLEMENTARY INFORMATION: Supplementary information is available at Bioinformatics online.






Funding information in the publication
This project has received funding from the Novo Nordisk Foundation (Grant no.: NNF14CC0001) and from the Academy of Finland (Grant no.: 332844). K.N. has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement (Grant no.: 101023676).


Last updated on 2025-04-23 at 09:32