A4 Article in conference proceedings
Quality assessment of the Reuters vol. 2 Multilingual Corpus




List of Authors: Eriksson Robin
Publisher: European Language Resources Association (ELRA)
Publication year: 2016
Book title: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)
Journal name in source: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
ISBN: 978-2-9517408-9-1

Abstract

We introduce a framework for quality assurance of corpora and apply it to the Reuters Multilingual Corpus (RCV2). Our assessment of this standard newsprint corpus reveals a significant duplication problem and, to a lesser extent, a problem with corrupted articles. Of the roughly 487,000 articles in the raw collection, almost one tenth are trivial duplicates. A smaller fraction of the articles appear to be corrupted and should be excluded for that reason. The detailed results are being made available as online appendices to this article. This effort also demonstrates the beginnings of a constraint-based methodological framework for quality assessment and quality assurance of corpora. As a first implementation of this framework, we have investigated constraints that verify sample integrity and that diagnose sample duplication, entropy aberrations, and tagging inconsistencies. To help identify near-duplicates in the corpus, we have employed both entropy measurements and a simple byte bigram incidence digest.
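
The abstract names the two near-duplicate signals but does not specify how they are computed. The following is only a minimal sketch under stated assumptions, not the authors' implementation: entropy is taken to be the Shannon entropy of an article's byte distribution, the "byte bigram incidence digest" is taken to be the set of distinct byte bigrams that occur in the article (presence only, no counts), and two digests are compared with Jaccard similarity. The function names, the use of SHA-256 for trivial duplicates, and the choice of Jaccard similarity are all hypothetical choices made for illustration.

```python
import hashlib
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte.

    Assumption: the abstract does not define its entropy measure.
    """
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def bigram_incidence(data: bytes) -> frozenset:
    """Set of distinct byte bigrams in data (incidence only, no counts).

    Assumption: this is one plausible reading of "incidence digest".
    """
    return frozenset(data[i:i + 2] for i in range(len(data) - 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two bigram sets; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_digest(data: bytes) -> str:
    """Hash of the raw bytes; equal hashes flag trivial (exact) duplicates."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    # Two hypothetical article bodies that differ in a single word.
    first = "Oil prices rose on Monday.".encode("utf-8")
    second = "Oil prices fell on Monday.".encode("utf-8")

    print("exact duplicate:", exact_digest(first) == exact_digest(second))
    print("entropy difference:", abs(byte_entropy(first) - byte_entropy(second)))
    print("bigram similarity:",
          jaccard(bigram_incidence(first), bigram_incidence(second)))
```

In a sketch like this, articles with equal hashes would be treated as trivial duplicates, while pairs with close entropy values and a high bigram similarity would be candidates for manual review as near-duplicates. Any similarity threshold would have to be chosen empirically; the abstract does not state one.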



Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Last updated on 2019-01-29 at 11:01