A1 Vertaisarvioitu alkuperäisartikkeli tieteellisessä lehdessä

Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding




TekijätArto Niemi, Jukka Teuhola

KustantajaJohn Wiley & Sons Ltd

Julkaisuvuosi2020

JournalSoftware: Practice and Experience

Lehden akronyymiSPE

Vuosikerta50

Numero9

Aloitussivu1858

Lopetussivu1874

Sivujen määrä17

ISSN0038-0644

eISSN1097-024X

DOIhttps://doi.org/10.1002/spe.2873

Verkko-osoitehttps://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2873

Rinnakkaistallenteen osoitehttps://research.utu.fi/converis/portal/detail/Publication/48563979


Tiivistelmä

Lossless compression methods based on the Burrows‐Wheeler transform
(BWT) are regarded as an excellent compromise between speed and
compression efficiency: they provide compression rates close to the PPM
algorithms, with the speed of dictionary‐based methods. Instead of the
laborious statistics‐gathering process used in PPM, the BWT reversibly
sorts the input symbols, using as the sort key as many following
characters as necessary to make the sort unique. Characters occurring in
similar contexts are sorted close together, resulting in a clustered
symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding,
combined with a statistical Huffman or arithmetic coder, is then
typically used to exploit the clustering. A drawback of the MTF recoding
is that knowledge of the character that produced the MTF number is
lost. In this paper, we present a new, competitive Burrows‐Wheeler
posttransform stage that takes advantage of interpolative coding—a fast
binary encoding method for integer sequences, being able to exploit
clusters without requiring explicit statistics. We introduce a fast and
simple way to retain knowledge of the run characters during the MTF
recoding and use this to improve the clustering of MTF numbers and
run‐lengths by applying reversible, stable sorting, with the run
characters as sort keys, achieving significant improvement in the
compression rate, as shown here by experiments on common text corpora.


Ladattava julkaisu

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.





Last updated on 2024-26-11 at 16:46