A1 Refereed original research article in a scientific journal

LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations




Authors: Nourani, Esmaeil; Makri, Evangelia-Mantelena; Mao, Xiqing; Pyysalo, Sampo; Brunak, Søren; Nastou, Katerina; Jensen, Lars Juhl

Publisher: Oxford University Press (OUP)

Publication year: 2025

Journal: Database: The Journal of Biological Databases and Curation

Journal name in source: Database

Journal acronym: Database (Oxford)

Article number: baae129

Volume: 2025

ISSN: 1758-0463

eISSN: 1758-0463

DOI: https://doi.org/10.1093/database/baae129

Web address: https://doi.org/10.1093/database/baae129

Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/477997547


Abstract
Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract and classify these relations into specific relation types. However, no comprehensive LSF-disease RE system existed, nor a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF-disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 disease and 6930 LSF entities. We evaluated LSD600's quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by applying the trained model to the Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Given these results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications. Database URL: https://zenodo.org/records/13952449.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Funding information in the publication
This work was supported by the Novo Nordisk Foundation (NNF14CC0001 and NNF17OC0027594). K.N. has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 101023676.


Last updated on 2025-04-03 at 09:45