A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora




Aleksi Vesanto, Asko Nivala, Tapio Salakoski, Hannu Salmi, Filip Ginter

Jörg Tiedemann

Nordic Conference of Computational Linguistics

Gothenburg

2017

Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

NEALT Proceedings Series

29

330

333

978-91-7685-601-7

1650-3686

http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf

https://research.utu.fi/converis/portal/detail/Publication/20563854









We present software for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for easily querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.
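
To make the idea of clustering duplicated passages in noisy OCR text concrete, here is a minimal sketch in Python. It is not the system described above: it swaps in the standard library's `difflib.SequenceMatcher` as a small-scale stand-in for BLAST alignment and omits Solr indexing entirely. The function names, the 0.8 threshold, and the sample sentences are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; tolerant of isolated OCR errors."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_passages(passages: list[str], threshold: float = 0.8) -> list[list[int]]:
    """Group passage indices whose pairwise similarity reaches the threshold.

    Assumption: a simple union-find over all pairs, which is quadratic and
    only suitable for toy inputs, unlike a real alignment tool.
    """
    parent = list(range(len(passages)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(passages)):
        for j in range(i + 1, len(passages)):
            if similarity(passages[i], passages[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(passages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample: the second passage imitates OCR noise in the first.
passages = [
    "The steamship arrived in Turku harbour on Monday morning.",
    "The steamfhip arriued in Turku harbonr on Monday morning.",
    "Grain prices rose sharply at the autumn market in Viipuri.",
]
clusters = cluster_passages(passages)
print(clusters)
```

In this toy setting the two noisy copies of the same sentence end up in one cluster while the unrelated sentence stays alone. A corpus-scale system would need an alignment tool and a search index, as the abstract describes, rather than all-pairs comparison.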




