PhD in Computer Science Farrokh Mehryary - UTU Research Portal

Farrokh Mehryary
PhD in Computer Science

- Senior Researcher
- Project Researcher
Data Analytics (Department of Computing)

farmeh@utu.fi

: 451A

Development and optimization of deep learning and LLM-based methods for building large-scale and trustworthy NLP/text mining applications. Specialty in Biomedical Natural Language Processing (BioNLP) and text mining, low-resource setups (where no or minimal training data exists), and in bioinformatics application development (protein function/structure prediction).

Natural Language Processing, Text mining, Deep learning, BioNLP, Bioinformatics

As an NLP and text mining specialist, I have been an active member of the TurkuNLP lab since 2013 and a Silo AI employee since 2020. I have over a decade of experience in academic research and publication (3000+ citations), university teaching, and international collaboration, and over 20 years of experience in software engineering (project management, system analysis and design, software development). I specialize in the development and optimization of deep learning and LLM-based methods for large-scale NLP and text mining applications, with a particular focus on (1) low-resource setups (where no or minimal training data exists) and (2) the biomedical domain (BioNLP). I am also highly capable in bioinformatics, with a specialty in protein function/structure prediction.

As a senior researcher at the university, and as part of an international collaboration between TurkuNLP and various research groups across Europe, I have worked for the last four years on the “Deep learning for next-generation biomedical text mining” project, designing, optimizing, and running an information extraction pipeline for the STRING database that extracts information from millions of PubMed abstracts and PubMed Central full-text articles. As a result, I am highly capable in working with very large datasets and in fine-tuning, optimizing, and running LLMs on hundreds of GPUs simultaneously.

As a Senior AI Scientist (LLM, NLP, and text mining specialist) at Silo AI, I have worked on several LLM projects, including building RAG systems, building information extraction systems, synthetic text generation, optimizing LLM-based systems with DSPy, and extracting information and tables from multilingual PDF documents (MS Document Intelligence, prompt engineering, and GPT models). In addition, I have helped design a GenAI/LLM course that Silo will offer and teach to the employees of an industrial corporation. Finally, I do a lot of sales support at Silo AI, attending pre-sales client meetings as an LLM/NLP/text mining expert to understand clients' business requirements and translate them into practical AI solutions.

Whenever possible, I have worked simultaneously in academia and industry, carrying state-of-the-art knowledge and experience back and forth between the university and the company and applying it in both academic and company/client projects. Personally, I love this approach, since it has allowed me to get the best of both worlds and grow rapidly in the field.

With a strong publication track record, high ranks in several international text mining and machine learning competitions, and state-of-the-art results on several important datasets, Farrokh specializes in deep learning-based methods for Biomedical Natural Language Processing (BioNLP) and text mining. His research has focused on low-resource setups, where minimal training data is available.

During 2021, Farrokh worked as an AI scientist for Silo AI, developing text mining systems for clients, and as a researcher for AI academy, helping develop Massive Open Online Courses (MOOCs). In 2022, Farrokh received his PhD in Computer Science from the University of Turku, with his thesis on ‘Optimizing Text Mining Methods for Biomedical Natural Language Processing’. Currently, Farrokh holds a senior researcher position in the TurkuNLP group, working on biomedical natural language processing and text mining.

I was the responsible teacher for the course Algorithms in Bioinformatics at the University of Turku from 2015 to 2020. I have also helped teach other NLP courses, including Text Mining and Deep Learning in Language Technology, at the Department of Computing, University of Turku.


An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT) (2025)
- Annual Meeting of the Association for Computational Linguistics
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Burchell, Laurie; De Gibert Bonet, Ona; Arefyev, Nikolay; Aulamo, Mikko; Bañón, Marta; Chen, Pinzhen; Fedorova, Mariia; Guillou, Liane; Haddow, Barry; Hajič, Jan; Helcl, Jindřich; Henriksson, Erik; Klimaszewski, Mateusz; Komulainen, Ville; Kutuzov, Andrey; Kytöniemi, Joona; Laippala, Veronika; Mæhlum, Petter; Malik, Bhavitvya; Mehryary, Farrokh; Mikhailov, Vladislav; Moghe, Nikita; Myntti, Amanda; O’Brien, Dayyán; Oepen, Stephan; Pal, Proyag; Piha, Jousia; Pyysalo, Sampo; Ramírez-Sánchez, Gema; Samuel, David; Stepachev, Pavel; Tiedemann, Jörg; Variš, Dušan; Vojtěchová, Tereza; Zaragoza-Bernabeu, Jaume
Annotated textual dataset PV600 of perovskite bandgaps for information extraction from literature (2025)
- Scientific Data
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
Question Answering models for information extraction from perovskite materials science literature (2025)
- Communications Materials
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
The STRING database in 2025: protein networks with directionality of regulation (2025)
- Nucleic Acids Research
Szklarczyk, Damian; Nastou, Katerina; Koutrouli, Mikaela; Kirsch, Rebecca; Mehryary, Farrokh; Hachilif, Radja; Hu, Dewei; Peluso, Matteo E.; Huang, Qingyao; Fang, Tao; Doncheva, Nadezhda T.; Pyysalo, Sampo; Bork, Peer; Jensen, Lars J.; von Mering, Christian
Application of the Question Answering method to extract information from materials science literature (2024)
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
Question Answering models for information extraction from perovskite materials science literature (2024)
2024 MRS Fall Meeting and Exhibit
Sipilä, Matilda; Mehryary, Farrokh; Pyysalo, Sampo; Ginter, Filip; Todorović, Milica
RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature (2024)
- Database: The Journal of Biological Databases and Curation
Nastou, Katerina; Mehryary, Farrokh; Ohta, Tomoko; Luoma, Jouni; Pyysalo, Sampo; Jensen, Lars Juhl
STRING-ing together protein complexes: Corpus and methods for extracting physical protein interactions from the biomedical literature (2024)
- Bioinformatics
Mehryary, Farrokh; Nastou, Katerina; Ohta, Tomoko; Jensen, Lars Juhl; Pyysalo, Sampo
Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations (2023)
- Database: The Journal of Biological Databases and Curation
Miranda-Escalada Antonio, Mehryary Farrokh, Luoma Jouni, Estrada-Zavala Darryl, Gasco Luis, Pyysalo Sampo, Valencia Alfonso, Krallinger Martin
The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest (2023)
- Nucleic Acids Research
Szklarczyk Damian, Kirsch Rebecca, Koutrouli Mikaela, Nastou Katerina, Mehryary Farrokh, Hachilif Radja, Gable Annika L, Fang Tao, Doncheva Nadezhda T, Pyysalo Sampo, Bork Peer, Jensen Lars J, von Mering Christian
Neural Network and Random Forest Models in Protein Function Prediction (2022)
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
Hakala Kai, Kaewphan Suwisa, Björne Jari, Mehryary Farrokh, Moen Hans, Tolvanen Martti, Salakoski Tapio, Ginter Filip
Optimizing text mining methods for improving biomedical natural language processing (2022)
Mehryary Farrokh
Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations (2021)
Proceedings of the BioCreative VII Challenge Evaluation Workshop
Miranda Antonio, Mehryary Farrokh, Luoma Jouni, Pyysalo Sampo, Valencia Alfonso, Krallinger Martin
Entity-pair embeddings for improving relation extraction in the biomedical domain (2020)
- European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
ESANN 2020 - Proceedings, 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
Mehryary F., Moen H., Salakoski T., Ginter F.
The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens (2019)
- Genome Biology
Naihui Zhou, Yuxiang Jiang, Timothy R. Bergquist, Alexandra J. Lee, Balint Z. Kacsoh, Alex W. Crocker, Kimberley A. Lewis, George Georghiou, Huy N. Nguyen, Md Nafiz Hamid, Larry Davis, Tunca Dogan, Volkan Atalay, Ahmet S. Rifaioglu, Alperen Dalkıran, Rengul Cetin Atalay, Chengxin Zhang, Rebecca L. Hurto, Peter L. Freddolino, Yang Zhang, Prajwal Bhat, Fran Supek, José M. Fernández, Branislava Gemovic, Vladimir R. Perovic, Radoslav S. Davidović, Neven Sumonja, Nevena Veljkovic, Ehsaneddin Asgari, Mohammad R.K. Mofrad, Giuseppe Profiti, Castrense Savojardo, Pier Luigi Martelli, Rita Casadio, Florian Boecker, Heiko Schoof, Indika Kahanda, Natalie Thurlby, Alice C. McHardy, Alexandre Renaux, Rabie Saidi, Julian Gough, Alex A. Freitas, Magdalena Antczak, Fabio Fabris, Mark N. Wass, Jie Hou, Jianlin Cheng, Zheng Wang, Alfonso E. Romero, Alberto Paccanaro, Haixuan Yang, Tatyana Goldberg, Chenguang Zhao, Liisa Holm, Petri Törönen, Alan J. Medlar, Elaine Zosa, Itamar Borukhov, Ilya Novikov, Angela Wilkins, Olivier Lichtarge, Po-Han Chi, Wei-Cheng Tseng, Michal Linial, Peter W. Rose, Christophe Dessimoz, Vedrana Vidulin, Saso Dzeroski, Ian Sillitoe, Sayoni Das, Jonathan Gill Lees, David T. Jones, Cen Wan, Domenico Cozzetto, Rui Fa, Mateo Torres, Alex Warwick Vesztrocy, Jose Manuel Rodriguez, Michael L. Tress, Marco Frasca, Marco Notaro, Giuliano Grossi, Alessandro Petrini, Matteo Re, Giorgio Valentini, Marco Mesiti, Daniel B. Roche, Jonas Reeb, David W. Ritchie, Sabeur Aridhi, Seyed Ziaeddin Alborzi, Marie-Dominique Devignes, Da Chen Emily Koo, Richard Bonneau, Vladimir Gligorijević, Meet Barot, Hai Fang, Stefano Toppo, Enrico Lavezzo, Marco Falda, Michele Berselli, Silvio C.E. Tosatto, Marco Carraro, Damiano Piovesan, Hafeez Ur Rehman, Qizhong Mao, Shanshan Zhang, Slobodan Vucetic, Gage S. Black, Dane Jo, Erica Suh, Jonathan B. Dayton, Dallas J. Larsen, Ashton R. Omdahl, Liam J. McGuffin, Danielle A. Brackenridge, Patricia C. Babbitt, Jeffrey M. Yunes, Paolo Fontana, Feng Zhang, Shanfeng Zhu, Ronghui You, Zihan Zhang, Suyang Dai, Shuwei Yao, Weidong Tian, Renzhi Cao, Caleb Chandler, Miguel Amezola, Devon Johnson, Jia-Ming Chang, Wen-Hung Liao, Yi-Wei Liu, Stefano Pascarelli, Yotam Frank, Robert Hoehndorf, Maxat Kulmanov, Imane Boudellioua, Gianfranco Politano, Stefano Di Carlo, Alfredo Benso, Kai Hakala, Filip Ginter, Farrokh Mehryary, Suwisa Kaewphan, Jari Björne, Hans Moen, Martti E.E. Tolvanen, Tapio Salakoski, Daisuke Kihara, Aashish Jain, Tomislav Šmuc, Adrian Altenhoff, Asa Ben-Hur, Burkhard Rost, Steven E. Brenner, Christine A. Orengo, Constance J. Jeffery, Giovanni Bosco, Deborah A. Hogan, Maria J. Martin, Claire O’Donovan, Sean D. Mooney, Casey S. Greene, Predrag Radivojac, Iddo Friedberg
Combining support vector machines and LSTM networks for chemical-protein relation extraction (2018)
Proceedings of the BioCreative VI Workshop
Farrokh Mehryary, Jari Björne, Tapio Salakoski, Filip Ginter
Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task (2018)
- Journal of the American Medical Informatics Association
Abeed Sarker, Maksim Belousov, Jasper Friedrichs, Kai Hakala, Svetlana Kiritchenko, Farrokh Mehryary, Sifei Han, Tung Tran, Anthony Rios, Ramakanth Kavuluru, Berry de Bruijn, Filip Ginter, Debanjan Mahata, Saif M. Mohammad, Goran Nenadic, Graciela Gonzalez-Hernandez
Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction (2018)
- Database: The Journal of Biological Databases and Curation
Farrokh Mehryary, Jari Björne, Tapio Salakoski, Filip Ginter
TurkuNLP Entry for Interactive Bio-ID Assignment (2018)
Proceedings of the BioCreative VI Workshop
Suwisa Kaewphan, Farrokh Mehryary, Kai Hakala, Tapio Salakoski, Filip Ginter
Detecting mentions of pain and acute confusion in Finnish clinical text (2017)
SIGBioMed Workshop on Biomedical Natural Language: Proceedings of the 16th BioNLP Workshop
Hans Moen, Kai Hakala, Farrokh Mehryary, Laura-Maria Peltonen, Tapio Salakoski, Filip Ginter, Sanna Salanterä