A4 Peer-reviewed article in conference proceedings

OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches




Authors: Kanerva, Jenna; Ledins, Cassadra; Käpyaho, Siiri; Ginter, Filip

Editors: Holdt, Špela Arhar; Ilinykh, Nikolai; Scalvini, Barbara; Bruton, Micaella; Debess, Iben Nyholm; Tudor, Crina Madalina

Established conference name: Resources and Representations for Under-Resourced Languages and Domains

Publisher: University of Tartu Library, Estonia

Publication year: 2025

Title of parent publication: Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

First page: 38

Last page: 47

ISBN: 978-9908-53-121-2

Open access status at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://aclanthology.org/2025.resourceful-1.8/

Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/506501669


Abstract

Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.
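
The abstract reports results as character error rate (CER). As a minimal sketch, assuming the common definition of CER as the character-level edit distance between the OCR output and a reference transcription, divided by the reference length, the metric could be computed as below. The function names and sample strings are illustrative only and are not taken from the paper, whose exact evaluation setup is not described in this record.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical strings, for illustration only.
reference = "historical"
ocr_output = "hist0rica1"   # two substituted characters
corrected = "historical"    # what an ideal post-correction would return

print(cer(reference, ocr_output))  # 2 edits over 10 characters
print(cer(reference, corrected))   # no remaining errors
```

Under this definition, post-correction helps only if the CER of the corrected text is lower than the CER of the raw OCR output against the same reference.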


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Funding information in the publication
This work was carried out in the Human Diversity University profilation programme (PROFI-7) of the Research Council of Finland, as well as in the context of several other research projects supported by the Research Council of Finland.

