A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora

Authors: Aleksi Vesanto, Asko Nivala, Tapio Salakoski, Hannu Salmi, Filip Ginter

Editor: Jörg Tiedemann

Established conference name: Nordic Conference of Computational Linguistics

Place of publication: Gothenburg

Publication year: 2017

Title of host publication: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden

Series name: NEALT Proceedings Series

Number in series: 29

First page: 330

Last page: 333

ISBN: 978-91-7685-601-7

ISSN: 1650-3686

Web address: http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf

Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/20563854

Abstract

We present a software system for retrieving and exploring duplicated text passages in low-quality OCR historical text corpora. The system combines NCBI BLAST, a tool created for comparing and aligning biological sequences, with the Solr search and indexing engine, and provides a web interface for querying and browsing the clusters of duplicated texts. We demonstrate the system on a corpus of scanned and OCR-recognized Finnish newspapers and journals from the years 1771 to 1910.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

ecp17131049.pdf