A1 Refereed original research article in a scientific journal
Burrows‐Wheeler post‐transformation with effective clustering and interpolative coding
Authors: Arto Niemi, Jukka Teuhola
Publisher: John Wiley & Sons Ltd
Publication year: 2020
Journal:: Software: Practice and Experience
Journal acronym: SPE
Volume: 50
Issue: 9
First page : 1858
Last page: 1874
Number of pages: 17
ISSN: 0038-0644
eISSN: 1097-024X
DOI: https://doi.org/10.1002/spe.2873
Web address : https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.2873
Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/48563979
Lossless compression methods based on the Burrows‐Wheeler transform
(BWT) are regarded as an excellent compromise between speed and
compression efficiency: they provide compression rates close to the PPM
algorithms, with the speed of dictionary‐based methods. Instead of the
laborious statistics‐gathering process used in PPM, the BWT reversibly
sorts the input symbols, using as the sort key as many following
characters as necessary to make the sort unique. Characters occurring in
similar contexts are sorted close together, resulting in a clustered
symbol sequence. Run‐length encoding and Move‐to‐Front (MTF) recoding,
combined with a statistical Huffman or arithmetic coder, is then
typically used to exploit the clustering. A drawback of the MTF recoding
is that knowledge of the character that produced the MTF number is
lost. In this paper, we present a new, competitive Burrows‐Wheeler
posttransform stage that takes advantage of interpolative coding—a fast
binary encoding method for integer sequences, being able to exploit
clusters without requiring explicit statistics. We introduce a fast and
simple way to retain knowledge of the run characters during the MTF
recoding and use this to improve the clustering of MTF numbers and
run‐lengths by applying reversible, stable sorting, with the run
characters as sort keys, achieving significant improvement in the
compression rate, as shown here by experiments on common text corpora.
Downloadable publication This is an electronic reprint of the original article. |