A4 Refereed article in a conference publication
An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora
Authors: Hans Moen, Laura-Maria Peltonen, Henry Suhonen, Hanna-Maria Matinolli, Riitta Mieronkoski, Kirsi Telen, Kirsi Terho, Tapio Salakoski, Sanna Salanterä
Editors: Mareike Hartmann, Barbara Plank
Conference name: Nordic Conference on Computational Linguistics
Publication year: 2019
Journal: Linköping Electronic Conference Proceedings
Book title: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa)
Series title: NEALT Proceedings Series
Number in series: 42
First page: 131
Last page: 139
ISBN: 978-91-7929-995-8
ISSN: 1650-3686
Web address: https://www.aclweb.org/anthology/W19-6114/
Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/44203057
We present our work towards developing a system that should find, in a large text corpus, contiguous phrases that express a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and combines a semantic word n-gram model, a statistical language model, and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different orders. As data, we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that distributional semantic models trained with uni-, bi-, and trigrams seem to work better than a more traditional unigram model.
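As a rough illustration of the idea of placing n-grams of different orders in one shared vector space, the following minimal Python sketch builds raw co-occurrence count vectors for uni-, bi-, and trigrams and compares them with cosine similarity. It is not the authors' implementation: the article describes a distributional semantic model with n-gram embeddings combined with a statistical language model and a document search engine, none of which is reproduced here. The function names (`ngrams`, `cooccurrence_vectors`, `cosine`, `most_similar`), the context window size, and the two toy sentences are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngrams(tokens, max_n=3):
    """Yield (start, end, text) for every n-gram of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def cooccurrence_vectors(sentences, max_n=3, window=2):
    """Map each n-gram to counts of the unigrams found within `window`
    tokens to its left and right (hypothetical toy context definition)."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for start, end, gram in ngrams(tokens, max_n):
            left = tokens[max(0, start - window):start]
            right = tokens[end:end + window]
            vectors[gram].update(left + right)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, top_k=3):
    """Return the top_k n-grams (of any order) closest to the query n-gram."""
    query_vec = vectors.get(query)
    if not query_vec:
        return []
    scored = [(cosine(query_vec, vec), gram)
              for gram, vec in vectors.items() if gram != query]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(gram, score) for score, gram in scored[:top_k]]


# Toy corpus (invented for this example, not taken from PubMed).
corpus = [
    "patients reported severe pain after surgery",
    "patients reported intense discomfort after surgery",
]
vectors = cooccurrence_vectors(corpus)
print(most_similar("severe pain", vectors, top_k=3))
```

Because unigrams, bigrams, and trigrams all receive vectors over the same context vocabulary, a bigram query can be matched against candidates of any order. A realistic system would likely replace the raw counts with trained embeddings and would need the language model and search engine components to assemble and verify longer phrases.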
Downloadable publication This is an electronic reprint of the original article.