A4 Refereed article in a conference publication

Quality assessment of the Reuters vol. 2 Multilingual Corpus




Authors Eriksson Robin

Editors Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis

Conference name International Conference on Language Resources and Evaluation (LREC)

Publisher European Language Resources Association (ELRA)

Publication year 2016

Book title Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)

Journal name in source Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

First page 1813

Last page 1819

Number of pages 7

ISBN 978-2-9517408-9-1

Web address http://www.lrec-conf.org/proceedings/lrec2016/index.html

Self-archived copy’s web address https://research.utu.fi/converis/portal/detail/Publication/29505864


Abstract

We introduce a framework for quality assurance of corpora and apply it to the Reuters Multilingual Corpus (RCV2). The assessment of this standard newsprint corpus reveals a significant duplication problem and, to a lesser extent, a problem with corrupted articles. Of the raw collection of some 487,000 articles, almost one tenth are trivial duplicates. A smaller fraction of the articles appear to be corrupted and should be excluded for that reason. The detailed results are being made available as on-line appendices to this article. This effort also demonstrates the beginnings of a constraint-based methodological framework for quality assessment and quality assurance of corpora. As a first implementation of this framework, we have investigated constraints that verify sample integrity and that diagnose sample duplication, entropy aberrations, and tagging inconsistencies. To help identify near-duplicates in the corpus, we have employed both entropy measurements and a simple byte bigram incidence digest.
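The abstract does not spell out how the entropy measurements or the byte bigram incidence digest are computed, so the following Python fragment is only a minimal sketch under stated assumptions, not the article's implementation. It assumes that "entropy" means the Shannon entropy of the byte distribution, that the "incidence digest" records which adjacent byte pairs occur (presence only, no counts), and that two digests are compared by Jaccard similarity. All function names and the 0.9 threshold are hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (assumed measure)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of p * log2(1/p) over the bytes that occur.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def bigram_incidence(data: bytes) -> frozenset:
    """Set of distinct adjacent byte pairs: records presence only, not frequency."""
    return frozenset(zip(data, data[1:]))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two digests; two empty digests count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_near_duplicate(x: bytes, y: bytes, threshold: float = 0.9) -> bool:
    """Hypothetical near-duplicate test; the threshold is an arbitrary assumption."""
    return jaccard(bigram_incidence(x), bigram_incidence(y)) >= threshold
```

Under these assumptions, trivial duplicates are byte-identical articles (`x == y`), while near-duplicates are pairs whose digests overlap heavily. An unusually low or high `byte_entropy` value relative to the rest of the corpus could serve as one signal of a corrupted article.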


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.





Last updated on 2024-26-11 at 22:06