A4 Peer-reviewed article in conference proceedings
A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora
Authors: Aleksi Vesanto, Asko Nivala, Tapio Salakoski, Hannu Salmi, Filip Ginter
Editor: Jörg Tiedemann
Established conference name: Nordic Conference of Computational Linguistics
Place of publication: Gothenburg
Publication year: 2017
Title of proceedings: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
Series name: NEALT Proceedings Series
Number in series: 29
First page: 330
Last page: 333
ISBN: 978-91-7685-601-7
ISSN: 1650-3686
Web address: http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf
Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/20563854
We present software for retrieving and exploring duplicated text passages in historical text corpora with low-quality OCR. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from 1771 to 1910.
Downloadable publication: This is an electronic reprint of the original article.