A4 Peer-reviewed article in conference proceedings
OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
Authors: Kanerva, Jenna; Ledins, Cassadra; Käpyaho, Siiri; Ginter, Filip
Editors: Holdt, Špela Arhar; Ilinykh, Nikolai; Scalvini, Barbara; Bruton, Micaella; Debess, Iben Nyholm; Tudor, Crina Madalina
Established conference name: Resources and Representations for Under-Resourced Languages and Domains
Publisher: University of Tartu Library, Estonia
Publication year: 2025
Title of the proceedings: Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
First page: 38
Last page: 47
ISBN: 978-9908-53-121-2
Open access status of the publication at the time of recording: Openly available
Open access status of the publication channel: Fully open access publication channel
Web address: https://aclanthology.org/2025.resourceful-1.8/
Address of the self-archived copy: https://research.utu.fi/converis/portal/detail/Publication/506501669
Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
Downloadable publication: This is an electronic reprint of the original article.
Funding information in the publication:
This work was carried out in the Human Diversity University profilation programme (PROFI-7) of the Research Council of Finland, as well as in the context of several other research projects supported by the Research Council of Finland.