Sampo Pyysalo
sampo.pyysalo@utu.fi
natural language processing; machine learning; scientific text mining
I am a researcher in the TurkuNLP group (https://turkunlp.org/) and Research Fellow at the Department of Computing, University of Turku. My work focuses on machine learning for natural language processing, with particular application domains including scientific text mining, Finnish language technology, and large language models.
After defending my PhD thesis in computer science at the University of Turku, I held researcher positions at the University of Tokyo, University of Manchester and University of Cambridge before returning to the University of Turku in 2019.
My research centers on natural language processing using machine learning, with recent emphasis on deep learning methods and large language models. I have worked on scientific text mining for nearly 20 years, focusing on the English biomedical literature, and in recent years have also addressed a variety of tasks in processing Finnish text as well as multi- and cross-lingual applications. My work spans the full range of natural language processing development: initial task design, running manual annotation efforts, developing annotation tools and machine learning methods, building practical applications, and organizing community challenges.
My current teaching focuses on the natural language processing study module shared between the departments of Languages and Computing, with courses ranging from an introduction to the field to deep learning for natural language processing.
- The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest (2023)
- Nucleic Acids Research
- Toxicity Detection in Finnish Using Machine Translation (2023)
- NEALT proceedings series
- Register identification from the unrestricted open Web using the Corpus of Online Registers of English (2022)
- Language Resources and Evaluation
- Towards better structured and less noisy Web data: Oscar with Register annotations (2022)
- International Conference on Computational Linguistics
- Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers (2021) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop Repo Liina, Skantsi Valtteri, Rönnqvist Samuel, Hellström Saara, Oinonen Miika, Salmela Anna, Biber Douglas, Egbert Jesse, Pyysalo Sampo, Laippala Veronika
- Correction to 'The STRING database in 2021: customizable protein-protein networks, and functional characterization of user-uploaded gene/measurement sets' (vol 49, pg D605, 2021) (2021)
- Nucleic Acids Research
- Deep learning for sentence clustering in essay grading support (2021) Proceedings of the 14th International Conference on Educational Data Mining (EDM 2021) Chang Li-Hsin, Rastas Iiro, Pyysalo Sampo, Ginter Filip
- Fine-grained Named Entity Annotation for Finnish (2021)
- Linköping Electronic Conference Proceedings
- Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations (2021) Proceedings of the BioCreative VII Challenge Evaluation Workshop Miranda Antonio, Mehryary Farrokh, Luoma Jouni, Pyysalo Sampo, Valencia Alfonso, Krallinger Martin
- Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases (2021) Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age Chang Li-Hsin, Pyysalo Sampo, Kanerva Jenna, Ginter Filip
- The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets (2021)
- Nucleic Acids Research
- WikiBERT Models: Deep Transfer Learning for Many Languages (2021)
- Linköping Electronic Conference Proceedings
- A broad-coverage corpus for Finnish named entity recognition (2020) 12th International Conference on Language Resources and Evaluation Jouni Luoma, Miika Oinonen, Maria Pyykönen, Veronika Laippala, Sampo Pyysalo
- Dependency parsing of biomedical text with BERT (2020)
- BMC Bioinformatics
- Exploring Cross-sentence Contexts for Named Entity Recognition with BERT (2020)
- Proceedings of COLING: International Conference on Computational Linguistics
- From Web Crawl to Clean Register-Annotated Corpora (2020) Proceedings of the 12th Web as Corpus Workshop Laippala Veronika, Rönnqvist Samuel, Hellström Saara, Luotolahti Juhani, Repo Liina, Salmela Anna, Skantsi Valtteri, Pyysalo Sampo
- The birth of Romanian BERT (2020)
- Annual Meeting of the Association for Computational Linguistics
- Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task (2020)
- Annual Meeting of the Association for Computational Linguistics
- Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection (2020) Proceedings of the 12th Language Resources and Evaluation Conference Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman
- Biomedical Named Entity Recognition with Multilingual BERT (2019) Proceedings of The 5th Workshop on BioNLP Open Shared Tasks Hakala Kai, Pyysalo Sampo