A4 Peer-reviewed article in conference proceedings
TULUN: Transparent and Adaptable Low-resource Machine Translation
Authors: Merx, Raphael; Suominen, Hanna; Hong, Lois Yinghui; Thieberger, Nick; Cohn, Trevor; Vylomova, Ekaterina
Editors: Mishra, Pushkar; Muresan, Smaranda; Yu, Tao
Established conference name: Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics (ACL)
Publication year: 2025
Journal: Annual Meeting of the Association for Computational Linguistics
Title of the compilation: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Volume: 63
First page: 129
Last page: 139
ISBN: 979-8-89176-253-4
ISSN: 0736-587X
Openness of the publication at the time of registration: Openly available
Openness of the publication channel: Fully open publication channel
Web address: https://aclanthology.org/2025.acl-demo.13/
Self-archived copy's web address: https://research.utu.fi/converis/portal/detail/Publication/506057677
Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose TULUN, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, TULUN outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF++ points over NLLB-54B. TULUN is publicly accessible at bislama-trans.rapha.dev.
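The abstract describes a two-step pipeline: a baseline MT draft is revised by an LLM whose prompt is augmented with matching glossary entries. Below is a minimal Python sketch of that idea, assuming hypothetical function names, prompt wording, and placeholder glossary values; it is an illustration of the described technique, not TULUN's actual code or API.

# Hypothetical sketch of terminology-aware LLM post-editing: a draft from a
# baseline MT system is revised using the glossary entries that match the
# source sentence. All names and example strings are illustrative assumptions,
# not TULUN's implementation.

def match_glossary(source: str, glossary: dict[str, str]) -> dict[str, str]:
    # Keep only glossary entries whose source-language term occurs in the input.
    src_lower = source.lower()
    return {s: t for s, t in glossary.items() if s.lower() in src_lower}

def build_postedit_prompt(source: str, draft: str, terms: dict[str, str]) -> str:
    # Compose the instruction that would be sent to the LLM post-editor.
    term_lines = "\n".join(f'- "{s}" must be translated as "{t}"' for s, t in terms.items())
    return (
        "Revise the draft translation for fluency and adequacy, "
        "using the required terminology.\n"
        f"Source: {source}\n"
        f"Draft translation: {draft}\n"
        f"Required terminology:\n{term_lines}\n"
        "Revised translation:"
    )

# Toy example; target-language strings are placeholders, not verified Tetun.
glossary = {"blood pressure": "<glossary target term>"}
source = "Measure the patient's blood pressure twice a day."
draft = "<draft output from the baseline MT system>"
prompt = build_postedit_prompt(source, draft, match_glossary(source, glossary))
print(prompt)  # this prompt would then be passed to the LLM post-editor

The same prompt structure could also carry translation-memory matches as additional in-context examples, mirroring the abstract's mention of glossaries and translation memories as the two terminology resources.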
Downloadable publication: This is an electronic reprint of the original article.
Funding information in the publication:
This research was supported by The University of Melbourne’s Research Computing Services and the Petascale Campus Initiative.