Sampo Pyysalo - UTU Research Portal

Sampo Pyysalo

- University Research Fellow, Data Analytics (Department of Computing)

sampo.pyysalo@utu.fi

ORCID identifier: https://orcid.org/0000-0002-6279-5000

curriculum_vitae.pdf

Publications (Google Scholar)

Areas of expertise

natural language processing; machine learning; scientific text mining

Biography

I am a researcher in the TurkuNLP group (https://turkunlp.org/) and Research Fellow at the Department of Computing, University of Turku. My work focuses on machine learning for natural language processing, with particular application domains including scientific text mining, Finnish language technology, and large language models.

After defending my PhD thesis in computer science at the University of Turku, I held researcher positions at the University of Tokyo, University of Manchester and University of Cambridge before returning to the University of Turku in 2019.

Research

The primary focus of my research is natural language processing using machine learning approaches, with recent emphasis on deep learning methods and large language models. I have worked on scientific text mining as an application area for nearly 20 years, with a specific focus on the English biomedical literature, and in recent years have also addressed a variety of tasks in the processing of Finnish text as well as multi- and cross-lingual applications. My work covers the full range of natural language processing development, from initial task design to building practical applications and organizing community challenges, and also includes running manual annotation efforts and developing annotation tools and machine learning methods for various natural language processing tasks.

Teaching

My current teaching focuses on the natural language processing study module shared between the departments of Languages and Computing, with courses ranging from introductory level to a course on deep learning for natural language processing.

Publications


An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT) (2025)
- Annual Meeting of the Association for Computational Linguistics
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Burchell, Laurie; De Gibert Bonet, Ona; Arefyev, Nikolay; Aulamo, Mikko; Bañón, Marta; Chen, Pinzhen; Fedorova, Mariia; Guillou, Liane; Haddow, Barry; Hajič, Jan; Helcl, Jindřich; Henriksson, Erik; Klimaszewski, Mateusz; Komulainen, Ville; Kutuzov, Andrey; Kytöniemi, Joona; Laippala, Veronika; Mæhlum, Petter; Malik, Bhavitvya; Mehryary, Farrokh; Mikhailov, Vladislav; Moghe, Nikita; Myntti, Amanda; O’Brien, Dayyán; Oepen, Stephan; Pal, Proyag; Piha, Jousia; Pyysalo, Sampo; Ramírez-Sánchez, Gema; Samuel, David; Stepachev, Pavel; Tiedemann, Jörg; Variš, Dušan; Vojtěchová, Tereza; Zaragoza-Bernabeu, Jaume
(A4 Refereed article in a conference publication)
Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature (2025)
- Scientific Data
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
(A1 Refereed data article in a scientific journal)
LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations (2025)
- Database: The Journal of Biological Databases and Curation
Nourani, Esmaeil; Makri, Evangelia-Mantelena; Mao, Xiqing; Pyysalo, Sampo; Brunak, Søren; Nastou, Katerina; Jensen, Lars Juhl
(A1 Refereed original research article in a scientific journal)
Scaling Data-Constrained Language Models (2025)
- Journal of Machine Learning Research
Muennighoff, Niklas; Rush, Alexander M.; Barak, Boaz; Le Scao, Teven; Piktus, Aleksandra; Tazi, Nouamane; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin
(A1 Refereed original research article in a scientific journal)
The STRING database in 2025: protein networks with directionality of regulation (2025)
- Nucleic Acids Research
Szklarczyk, Damian; Nastou, Katerina; Koutrouli, Mikaela; Kirsch, Rebecca; Mehryary, Farrokh; Hachilif, Radja; Hu, Dewei; Peluso, Matteo E.; Huang, Qingyao; Fang, Tao; Doncheva, Nadezhda T.; Pyysalo, Sampo; Bork, Peer; Jensen, Lars J.; von Mering, Christian
(A1 Refereed original research article in a scientific journal)
A New Massive Multilingual Dataset for High-Performance Language Technologies (2024)
- LREC Proceedings
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
(A4 Refereed article in a conference publication)
Application of the Question Answering method to extract information from materials science literature (2024)
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
(Abstract)
Building Question-Answer Data Using Web Register Identification (2024)
- LREC Proceedings
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Eskelinen Anni, Myntti Amanda, Henriksson Erik, Pyysalo Sampo, Laippala Veronika
(A4 Refereed article in a conference publication)
CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes (2024)
- Bioinformatics Advances
Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Juhl
(A1 Refereed original research article in a scientific journal)
Improving dictionary-based named entity recognition with deep learning (2024)
- Bioinformatics
Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Juhl
(A1 Refereed original research article in a scientific journal)
Lifestyle factors in the biomedical literature: An ontology and comprehensive resources for named entity recognition (2024)
- Bioinformatics
Nourani, Esmaeil; Koutrouli, Mikaela; Xie, Yijia; Vagiaki, Danai; Pyysalo, Sampo; Nastou, Katerina; Brunak, Søren; Jensen, Lars Juhl
(A1 Refereed original research article in a scientific journal)
Linguistic variation beyond the Indo-European web: Analyzing Turkish web registers in TurCORE (2024)
- Register studies
Erten-Johansson, Selcen; Skantsi, Valtteri; Pyysalo, Sampo; Laippala, Veronika
(A1 Refereed original research article in a scientific journal)
Question Answering models for information extraction from perovskite materials science literature (2024)
2024 MRS Fall Meeting and Exhibit
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
(Abstract)
RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature (2024)
- Database: The Journal of Biological Databases and Curation
Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl
(A1 Refereed original research article in a scientific journal)
STRING-ing together protein complexes: Corpus and methods for extracting physical protein interactions from the biomedical literature (2024)
- Bioinformatics
Mehryary, Farrokh; Nastou, Katerina; Ohta, Tomoko; Jensen, Lars Juhl; Pyysalo, Sampo
(A1 Refereed original research article in a scientific journal)
FinGPT: Large Generative Models for a Small Language (2023)
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Luukkonen Risto, Komulainen Ville, Luoma Jouni, Eskelinen Anni, Kanerva Jenna, Kupari Hanna-Mari, Ginter Filip, Laippala Veronika, Muennighoff Niklas, Piktus Aleksandra, Wang Thomas, Tazi Nouamane, Le Scao Teven, Wolf Thomas, Suominen Osma, Sairanen Samuli, Merioksa Mikko, Heinonen Jyrki, Vahtola Aija, Antao Samuel, Pyysalo Sampo
(A4 Refereed article in a conference publication)
Kohti suomenkielisiä keskustelumalleja: tule kehittämään tekoälyä (2023)
- Hiiskuttua: Turun yliopiston humanistisen tiedekunnan verkkolehti
Kytöniemi Joona, Saarni Jenna, Kupari Hanna-Mari, Pyysalo Sampo
(D1 Article in a professional journal)
Multi-CrossRE: A Multi-Lingual Multi-Domain Dataset for Relation Extraction (2023)
- NEALT proceedings series
Proceedings of The 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Bassignana Elisa, Ginter Filip, Pyysalo Sampo, van der Goot Rob, Plank Barbara
(A4 Refereed article in a conference publication)
Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations (2023)
- Database: The Journal of Biological Databases and Curation
Miranda-Escalada Antonio, Mehryary Farrokh, Luoma Jouni, Estrada-Zavala Darryl, Gasco Luis, Pyysalo Sampo, Valencia Alfonso, Krallinger Martin
(A1 Refereed original research article in a scientific journal)
S1000: a better taxonomic name corpus for biomedical information extraction (2023)
- Bioinformatics
Luoma Jouni, Nastou Katerina, Ohta Tomoko, Toivonen Harttu, Pafilis Evangelos, Jensen Lars Juhl, Pyysalo Sampo
(A1 Refereed original research article in a scientific journal)