A1 Refereed original research article in a scientific journal

Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding




AuthorsArto Niemi, Jukka Teuhola

PublisherJohn Wiley & Sons Ltd

Publication year2020

JournalSoftware: Practice and Experience

Journal acronymSPE

Volume50

Issue9

First page 1858

Last page1874

Number of pages17

ISSN0038-0644

eISSN1097-024X

DOIhttps://doi.org/10.1002/spe.2873

Web address https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2873

Self-archived copy’s web addresshttps://research.utu.fi/converis/portal/detail/Publication/48563979


Abstract

Lossless compression methods based on the Burrows‐Wheeler transform
(BWT) are regarded as an excellent compromise between speed and
compression efficiency: they provide compression rates close to the PPM
algorithms, with the speed of dictionary‐based methods. Instead of the
laborious statistics‐gathering process used in PPM, the BWT reversibly
sorts the input symbols, using as the sort key as many following
characters as necessary to make the sort unique. Characters occurring in
similar contexts are sorted close together, resulting in a clustered
symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding,
combined with a statistical Huffman or arithmetic coder, is then
typically used to exploit the clustering. A drawback of the MTF recoding
is that knowledge of the character that produced the MTF number is
lost. In this paper, we present a new, competitive Burrows‐Wheeler
posttransform stage that takes advantage of interpolative coding—a fast
binary encoding method for integer sequences, being able to exploit
clusters without requiring explicit statistics. We introduce a fast and
simple way to retain knowledge of the run characters during the MTF
recoding and use this to improve the clustering of MTF numbers and
run‐lengths by applying reversible, stable sorting, with the run
characters as sort keys, achieving significant improvement in the
compression rate, as shown here by experiments on common text corpora.


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.





Last updated on 2024-26-11 at 16:46