An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Authors: Hans Moen, Laura-Maria Peltonen, Henry Suhonen, Hanna-Maria Matinolli, Riitta Mieronkoski, Kirsi Telen, Kirsi Terho, Tapio Salakoski, Sanna Salanterä

Editors: Mareike Hartmann, Barbara Plank

Established conference name: Nordic Conference on Computational Linguistics

Publication year: 2019

Journal: Linköping Electronic Conference Proceedings

Book title: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa)

Series name: NEALT Proceedings Series

Number in series: 42

First page: 131

Last page: 139

ISBN: 978-91-7929-995-8

ISSN: 1650-3686

Web address: https://www.aclweb.org/anthology/W19-6114/

Self-archived copy's address: https://research.utu.fi/converis/portal/detail/Publication/44203057

Abstract

We present our work towards developing a system that should find, in a large text corpus, contiguous phrases expressing a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised and uses a combination of a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different order. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that the use of distributional semantic models trained with uni-, bi- and trigrams seems to work better than a more traditional unigram model.
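
To illustrate the central idea of comparing n-grams of different order in one semantic space, the following is a minimal sketch and not the authors' system. It assumes raw context-word counts in place of trained embeddings, and it leaves out the statistical language model and the document search engine. The toy corpus and the names `build_vectors`, `cosine` and `most_similar` are hypothetical, chosen only for this example.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngrams(tokens, max_n=3):
    """Yield (start, end, text) for every word n-gram of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def build_vectors(sentences, max_n=3, window=2):
    """Map each n-gram to counts of the words within `window` tokens of it.

    Assumption: plain co-occurrence counts stand in for learned embeddings.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for start, end, gram in ngrams(tokens, max_n):
            left = tokens[max(0, start - window):start]
            right = tokens[end:end + window]
            vectors[gram].update(left + right)
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, vectors, top_k=3):
    """Rank all other n-grams, of any order, by similarity to the query n-gram."""
    q = vectors.get(query, Counter())
    scored = [(gram, cosine(q, vec)) for gram, vec in vectors.items() if gram != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy corpus (invented for illustration, not taken from PubMed).
corpus = [
    "patients with heart attack were given aspirin",
    "patients with myocardial infarction were given aspirin",
    "the broken leg was put in a cast",
]

vectors = build_vectors(corpus)
for gram, score in most_similar("heart attack", vectors):
    print(f"{score:.3f}  {gram}")
```

Because unigrams, bigrams and trigrams share one vector space here, a bigram query can retrieve phrases of any length. Overlapping n-grams such as "with heart attack" also score highly in this sketch, which suggests why a full system would need further filtering or ranking steps.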

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

W19-6114.pdf