Sampo Pyysalo
 


sampo.pyysalo@utu.fi




ORCID iD: https://orcid.org/0000-0002-6279-5000

Publications (Google Scholar)




Areas of expertise
natural language processing; machine learning; scientific text mining

Biography

I am a researcher in the TurkuNLP group (https://turkunlp.org/) and Research Fellow at the Department of Computing, University of Turku. My work focuses on machine learning for natural language processing, with particular application domains including scientific text mining, Finnish language technology, and large language models.

After defending my PhD thesis in computer science at the University of Turku, I held researcher positions at the University of Tokyo, University of Manchester and University of Cambridge before returning to the University of Turku in 2019.



Research

The primary focus of my research is natural language processing using machine learning approaches, with recent emphasis on deep learning methods and large language models. I have worked on scientific text mining as an application area for nearly 20 years, with a specific focus on the English biomedical literature, and have in recent years also addressed a variety of tasks in the processing of Finnish text as well as multilingual and cross-lingual applications. My work covers the full range of natural language processing development, from initial task design to the development of practical applications and the organization of community challenges. It also includes running manual annotation efforts and developing annotation tools and machine learning methods for various natural language processing tasks.



Teaching

My current teaching focuses on the natural language processing study module shared between the departments of Languages and Computing, with courses ranging from an introductory course to a course on deep learning for natural language processing.



Publications

  • An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)  (2025)  
    • Annual Meeting of the Association for Computational Linguistics
    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Burchell, Laurie; De Gibert Bonet, Ona; Arefyev, Nikolay; Aulamo, Mikko; Bañón, Marta; Chen, Pinzhen; Fedorova, Mariia; Guillou, Liane; Haddow, Barry; Hajič, Jan; Helcl, Jindřich; Henriksson, Erik; Klimaszewski, Mateusz; Komulainen, Ville; Kutuzov, Andrey; Kytöniemi, Joona; Laippala, Veronika; Mæhlum, Petter; Malik, Bhavitvya; Mehryary, Farrokh; Mikhailov, Vladislav; Moghe, Nikita; Myntti, Amanda; O’Brien, Dayyán; Oepen, Stephan; Pal, Proyag; Piha, Jousia; Pyysalo, Sampo; Ramírez-Sánchez, Gema; Samuel, David; Stepachev, Pavel; Tiedemann, Jörg; Variš, Dušan; Vojtěchová, Tereza; Zaragoza-Bernabeu, Jaume
    (A4 Peer-reviewed article in conference proceedings)


  •   (2025)  
    • Scientific Data
    Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
    (A1 Peer-reviewed data article in a scientific journal)


  •   (2025)  Proceedings of the 31st International Conference on Computational Linguistics: Industry Track Nakamura, Taishi; Mishra, Mayank; Tedeschi, Simone; Chai, Yekun; Stillerman, Jason T.; Friedrich, Felix; Yadav, Prateek; Laud, Tanmay; Chien, Vu Minh; Zhuo, Terry Yue; Misra, Diganta; Bogin, Ben; Vu, Xuan-Son; Karpinska, Marzena; Dantuluri, Arnav Varma; Kusa, Wojciech; Furlanello, Tommaso; Yokota, Rio; Muennighoff, Niklas; Pai, Suhas; Adewumi, Tosin; Laippala, Veronika; Yao, Xiaozhe; Junior, Adalberto Barbosa; Drozd, Aleksandr; Clive, Jordan; Gupta, Kshitij; Chen, Liangyu; Sun, Qi; Tsui, Ken; Moustafa-Fahmy, Nour; Monti, Nicolo; Dang, Tai; Luo, Ziyang; Bui, Tien-Tung; Navigli, Roberto; Mehta, Virendra; Blumberg, Matthew; May, Victor; Nguyen, Hiep; Pyysalo, Sampo
    (A4 Peer-reviewed article in conference proceedings)


  • Got Compute, but No Data: Lessons From Post-training a Finnish LLM  (2025)  
    • NEALT proceedings series
    Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025) Zosa, Elaine; Komulainen, Ville; Pyysalo, Sampo
    (A4 Peer-reviewed article in conference proceedings)


  •   (2025)  
    • Database: The Journal of Biological Databases and Curation
     Nourani, Esmaeil; Makri, Evangelia-Mantelena; Mao, Xiqing; Pyysalo, Sampo; Brunak, Søren; Nastou, Katerina; Jensen, Lars Juhl
    (A1 Peer-reviewed original journal article)


  •   (2025)  
    • NEALT proceedings series
    Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025) Luukkonen, Risto; Burdge, Jonathan; Zosa, Elaine; Talman, Aarne; Komulainen, Ville; Hatanpää, Väinö; Sarlin, Peter; Pyysalo, Sampo
    (A4 Peer-reviewed article in conference proceedings)


  •   (2025)  
    • Communications materials
     Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
    (A1 Peer-reviewed original journal article)


  • Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation  (2025)  Proceedings of the Second Conference on Language Modeling, COLM 2025 Myntti, Amanda; Henriksson, Erik; Laippala, Veronika; Pyysalo, Sampo
    (D3 Article in professional conference proceedings)


  • Scaling Data-Constrained Language Models  (2025)  
    • Journal of Machine Learning Research
     Muennighoff, Niklas; Rush, Alexander M.; Barak, Boaz; Le Scao, Teven; Piktus, Aleksandra; Tazi, Nouamane; Pyysalo, Sampo; Wolf, Thomas; Raffel, Colin
    (A1 Peer-reviewed original journal article)


  • The STRING database in 2025: protein networks with directionality of regulation  (2025)  
    • Nucleic Acids Research
     Szklarczyk, Damian; Nastou, Katerina; Koutrouli, Mikaela; Kirsch, Rebecca; Mehryary, Farrokh; Hachilif, Radja; Hu, Dewei; Peluso, Matteo E.; Huang, Qingyao; Fang, Tao; Doncheva, Nadezhda T.; Pyysalo, Sampo; Bork, Peer; Jensen, Lars J.; von Mering, Christian
    (A1 Peer-reviewed original journal article)


  • A New Massive Multilingual Dataset for High-Performance Language Technologies  (2024)  
    • LREC Proceedings
    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) de Gibert, Ona; Nail, Graeme; Arefyev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrey; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg
    (A4 Peer-reviewed article in conference proceedings)


  • Application of the Question Answering method to extract information from materials science literature  (2024)  Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
    (Abstract)


  • Building Question-Answer Data Using Web Register Identification  (2024)  
    Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) Eskelinen, Anni; Myntti, Amanda; Henriksson, Erik; Pyysalo, Sampo; Laippala, Veronika
    (A4 Peer-reviewed article in conference proceedings)


  • CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes  (2024)  
    • Bioinformatics Advances
    Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Juhl
    (A1 Peer-reviewed original journal article)


  • Improving dictionary-based named entity recognition with deep learning  (2024)  
    Nastou, Katerina; Koutrouli, Mikaela; Pyysalo, Sampo; Jensen, Lars Juhl
    (A1 Peer-reviewed original journal article)


  • Lifestyle factors in the biomedical literature: An ontology and comprehensive resources for named entity recognition  (2024)  
    • Bioinformatics
    Nourani, Esmaeil; Koutrouli, Mikaela; Xie, Yijia; Vagiaki, Danai; Pyysalo, Sampo; Nastou, Katerina; Brunak, Søren; Jensen, Lars Juhl
    (A1 Peer-reviewed original journal article)


  • Linguistic variation beyond the Indo-European web: Analyzing Turkish web registers in TurCORE  (2024)  
    Erten-Johansson, Selcen; Skantsi, Valtteri; Pyysalo, Sampo; Laippala, Veronika
    (A1 Peer-reviewed original journal article)


  • Question Answering models for information extraction from perovskite materials science literature  (2024)  2024 MRS Fall Meeting and Exhibit Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
    (Abstract)


  • RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature  (2024)  
    Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl
    (A1 Peer-reviewed original journal article)


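  • STRING-ing together protein complexes: Corpus and methods for extracting physical protein interactions from the biomedical literature  (2024)  
    Mehryary, Farrokh; Nastou, Katerina; Ohta, Tomoko; Jensen, Lars Juhl; Pyysalo, Sampo
    (A1 Peer-reviewed original journal article)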


