A4 Peer-reviewed article in conference proceedings
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
Authors: Luo, Hengyu; Li, Zihao; Attieh, Joseph; Devkota, Sawal; de Gibert, Ona; Huang, Xu; Ji, Shaoxiong; Lin, Peiqin; Mantina, Bhavani Sai Praneeth Varma; Sreenidhi, Ananda; Vázquez, Raúl; Wang, Mengjie; Yusofi, Samea; Yuan, Fei; Tiedemann, Jörg
Editors: Habernal, Ivan; Schulam, Peter; Tiedemann, Jörg
Established conference name: Empirical Methods in Natural Language Processing
Year of publication: 2025
Title of the proceedings: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
First page: 602
Last page: 614
ISBN: 979-8-89176-334-0
DOI: https://doi.org/10.18653/v1/2025.emnlp-demos.43
Open access status at the time of registration: Openly available
Openness of the publication channel: Fully open publication channel
Web address: https://aclanthology.org/2025.emnlp-demos.43/
Address of the self-archived copy: https://research.utu.fi/converis/portal/detail/Publication/506505289
License of the self-archived copy: CC BY
Version of the self-archived publication: Publisher's version
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are inconsistent across benchmarks and disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this critical challenge of fragmented and inconsistent multilingual evaluation, we introduce GlotEval, a unified and lightweight framework that systematically integrates 27 benchmarks under a standardized ISO 639-3 language identifier system, allowing for seamless incorporation of new benchmarks. Supporting nine key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, intrinsic evaluation, instruction following, and reasoning), each spanning dozens to hundreds of languages, GlotEval uniquely enables language-specific, cross-benchmark analysis and non-English-centric evaluation at a scale previously impractical for many researchers. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.
Downloadable publication: This is an electronic reprint of the original article.
Funding information in the publication:
This project is funded by the AI-DOC program hosted by the Finnish Center for Artificial Intelligence (decision number VN/3137/2024-OKM-6). The work has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350, from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546], and from the Digital Europe Programme under grant agreement No 101195233.
Sawal Devkota, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Mengjie Wang, and Samea Yusofi contributed to this project as part of the “Data Analysis Software Project for Natural Language” course at TU Darmstadt, under the guidance of Shaoxiong Ji. This teaching activity was funded by LOEWE Center DYNAMIC as part of the Hessian program for the promotion of cutting-edge research LOEWE, under grant number LOEWE1/16/519/03/09.001(0009)/98.