CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes - UTU Research Portal

A1 Refereed original research article in a scientific journal

Authors: Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Juhl

Editors: Zhu Shanfeng

Publisher: Oxford University Press (OUP)

Publishing place: OXFORD

Publication year: 2024

Journal: Bioinformatics Advances

Journal name in source: Bioinformatics Advances

Journal acronym: BIOINFORM ADV

Article number: vbae116

Volume: 4

Issue: 1

Number of pages: 7

eISSN: 2635-0041

DOI: https://doi.org/10.1093/bioadv/vbae116

Publication's open availability at the time of reporting: Open Access

Publication channel's open availability: Open Access publication channel

Web address: https://doi.org/10.1093/bioadv/vbae116

Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/458834899

Abstract

Motivation

Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.

Results

We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and a dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature.
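For readers unfamiliar with how an F-score for a tagger is typically computed, the following is a minimal sketch, not the authors' evaluation code. It assumes exact-match scoring of entity spans, which the abstract does not specify; the `Span` representation, the function name `precision_recall_f1`, and the toy data are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span representation: (document id, start offset, end offset, entity type).
Span = Tuple[str, int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entity spans.

    Assumption: a prediction counts as correct only if document, boundaries,
    and type all match a gold annotation exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (invented for illustration): 4 gold spans, 4 predictions, 3 in common.
gold_spans = {
    ("doc1", 0, 10, "Complex"),
    ("doc1", 20, 30, "Complex"),
    ("doc2", 5, 12, "Complex"),
    ("doc2", 40, 52, "Complex"),
}
predicted_spans = {
    ("doc1", 0, 10, "Complex"),
    ("doc1", 20, 30, "Complex"),
    ("doc2", 5, 12, "Complex"),
    ("doc2", 60, 70, "Complex"),
}

precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under this assumption, a reported F-score such as 73.7% is the harmonic mean of the tagger's precision and recall on the test set; relaxed (overlap-based) matching would generally yield different values.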

Availability and implementation

All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.

Funding information in the publication:
This work was supported by the Novo Nordisk Foundation [NNF14CC0001, NNF20SA0035590 to M.K.], the Academy of Finland [332844], and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement [101023676 to K.N.].