Jump to content
 







Main menu
   


Navigation  



Main page
Contents
Current events
Random article
About Wikipedia
Contact us
Donate
 




Contribute  



Help
Learn to edit
Community portal
Recent changes
Upload file
 








Search  

































Create account

Log in
 









Create account
 Log in
 




Pages for logged out editors learn more  



Contributions
Talk
 



















Contents

   



(Top)
 


1 How it works  





2 Data structures  



2.1  Encoding the grammar  







3 Versions  





4 See also  





5 References  














Re-Pair







Add links
 









Article
Talk
 

















Read
Edit
View history
 








Tools
   


Actions  



Read
Edit
View history
 




General  



What links here
Related changes
Upload file
Special pages
Permanent link
Page information
Cite this page
Get shortened URL
Download QR code
Wikidata item
 




Print/export  



Download as PDF
Printable version
 
















Appearance
   

 






From Wikipedia, the free encyclopedia
 


Re-Pair (short for recursive pairing) is a grammar-based compression algorithm that, given an input text, builds a straight-line program, i.e. a context-free grammar generating a single string: the input text. In order to perform the compression in linear time, it consumes the amount of memory that is approximately five times the size of its input.

The grammar is built by recursively replacing the most frequent pair of characters occurring in the text. Once there is no pair of characters occurring twice, the resulting string is used as the axiom of the grammar. Therefore, the output grammar is such that all rules but the axiom have two symbols on the right-hand side.

How it works

[edit]
Construction of a Straight Line Program that generates the string w="xabcabcy123123zabc" by using Re-Pair

Re-Pair was first introduced by NJ. Larsson and A. Moffat[1] in 1999.

In their paper the algorithm is presented together with a detailed description of the data structures required to implement it with linear time and space complexity. The experiments showed that Re-Pair achieves high compression ratios and offers good performance for decompression. However, the major drawback of the algorithm is its memory consumption, which is approximately 5 times the size of the input. Such memory usage is required in order to perform the compression in linear time but makes the algorithm impractical for compressing large files.

The image on the right shows how the algorithm works compresses the string .

During the first iteration, the pair , which occurs three times in , is replaced by a new symbol . On the second iteration, the most frequent pair in the string , which is , is replaced by a new symbol . Thus, at the end of the second iteration, the remaining string is . In the next two iterations, the pairs and are replaced by symbols and respectively. Finally, the string contains no repeated pair and therefore it is used as the axiom of the output grammar.

Data structures

[edit]

In order to achieve linear time complexity, Re-Pair requires the following data structures

Since the hash table and the priority queue refer to the same elements (pairs), they can be implemented by a common data structure called PAIR with pointers for the hash table (h_next) and the priority queue (p_next and p_prev). Furthermore, each PAIR points to the beginning of the first (f_pos) and the last (b_pos) occurrences of the string represented by the PAIR in the sequence. The following picture shows an overview of this data structure.

Data structure to implement the Recursive Pairing algorithm with linear runtime and space usage.

The following two pictures show an example of how these data structures look after the initialization and after applying one step of the pairing process (pointers to NULL are not displayed):

State of the data structures used by the Recursive Pairing algorithm after going through the input text. State of the data structures used by the Recursive Pairing algorithm after performing the first pair replacement.

Encoding the grammar

[edit]

Once the grammar has been built for a given input string, in order to achieve effective compression, this grammar has to be encoded efficiently. One of the simplest methods for encoding the grammar is the implicit encoding, which consists on invoking function encodeCFG(X), described below, sequentially on all the axiom's symbols. Intuitively, rules are encoded as they are visited in a depth-first traversal of the grammar. The first time a rule is visited, its right hand side is encoded recursively and a new code is assigned to the rule. From that point, whenever the rule is reached, the assigned value is written.

num_rules_encoded = 256 // By default, the extended ASCII charset are the terminals of the grammar.

writeSymbol(symbol s) {
  bitslen = log(num_rules_encoded); // Initially 8, to describe any extended ASCII character
  write s in binary using bitslen bits
}

void encodeCFG_rec(symbol s) {
  if (s is non-terminal and this is the first time symbol s appears) {
   take rule s  X Y;
    write bit 1;
    encodeCFG_rec(X);
    encodeCFG_rec(Y);
    assign to symbol s value ++num_rules_encoded;
  } else {
    write bit 0;
    writeSymbol(terminal/value assigned)
  }
}

void encodeCFG(symbol s) {
  encodeCFG_rec(s);
  write bit 1;
}

Another possibility is to separate the rules of the grammar into generations such that a rule belongs to generation iff at least one of or belongs to generation and the other belongs to generation with . Then these generations are encoded subsequently starting from generation . This was the method proposed originally when Re-Pair was first introduced. However, most implementations of Re-Pair use the implicit encoding method due to its simplicity and good performance. Furthermore, it allows on-the-fly decompression.

Versions

[edit]

There exists a number of different implementations of Re-Pair. Each of these versions aims at improving one specific aspect of the algorithm, such as reducing the runtime, reducing the space consumption or increasing the compression ratio.

Improvement Year Implementation Description
Phrase Browsing[2] 2003 [1] Instead of manipulating the input string as a sequence of characters, this tool first groups the characters into phrases (for instance, words). The compression algorithm works as Re-Pair but considering the identified phrases as the terminals of the grammar. The tool accepts different options to decide which sort of phrases are considered and encodes the resulting grammar into separated files: one containing the axiom and one with the rest of the rules.
Original 2011 [2] This is one of the most popular implementations of Re-Pair. It uses the data structures described here (the ones proposed when it was originally published[1]) and encodes the resulting grammar using the implicit encoding method. Most later versions of Re-Pair are implemented starting from this one.
Encoding[3] 2013 [3] Instead of the implicit encoding method, this implementation uses a Variable Length to Fixed Length method, in which each rule (represented with a string of Variable Length) is encoded using a code of Fixed Length.
Memory usage[4] 2017 [4] The algorithm performs in two phases. During the first phase, it considers the high frequency pairs, i.e. those occurring more than times, while the low frequency pairs are considered in the second. The main difference between the two phases is the implementation of the corresponding priority queues.
Compression[5] 2017 [5] This version modifies the way the next pair to be replaced is chosen. Instead of simply considering the most frequent pair, it employs a heuristic which penalizes pairs that are not coherent with a Lempel–Ziv factorisation of the input string.
Compression[6] 2018 [6] This algorithm reduces the size of the grammar generated by Re-Pair by replacing maximal repeats first. When a pair is identified as the most frequent pair, i.e. the one to be replaced in the current step of the algorithm, MR-repair extends the pair to find the longest string that occurs the same number of times as the pair to be replaced. The provided implementation encodes the grammar by simply listing the rules in text, hence this tool is purely for research purposes and cannot be used for compression as it is.

See also

[edit]

References

[edit]
  1. ^ a b Larsson, N. J., & Moffat, A. (2000). Off-line dictionary-based compression. Proceedings of the IEEE, 88(11), 1722–1732.
  • ^ R. Wan. "Browsing and Searching Compressed Documents". PhD thesis, University of Melbourne, Australia, December 2003.
  • ^ Satoshi Yoshida and Takuya Kida, Effective Variable-Length-to-Fixed-Length Coding via a Re-Pair Algorithm, In Proc. of Data Compression Conference 2013 (DCC 2013), p. 532, Snowbird, Utah, USA, March 2013.
  • ^ Bille, P., Gørtz, I. L., & Prezza, N. (2017, April). Space-efficient re-pair compression. In 2017 (DCC) (pp. 171-180). IEEE.
  • ^ Gańczorz, M., & Jeż, A. (2017, April). Improvements on Re-Pair grammar compressor. In 2017 Data Compression Conference (DCC) (pp. 181–190). IEEE.
  • ^ Furuya, I., Takagi, T., Nakashima, Y., Inenaga, S., Bannai, H., & Kida, T. (2018). MR-RePair: Grammar Compression based on Maximal Repeats. arXiv preprint arXiv:1811.04596.

  • Retrieved from "https://en.wikipedia.org/w/index.php?title=Re-Pair&oldid=1168728675"

    Categories: 
    Data compression
    Compression algorithms
    Lossless compression algorithms
    Hidden categories: 
    Articles with short description
    Short description is different from Wikidata
    Wikipedia introduction cleanup from September 2019
    All pages needing cleanup
    Articles covered by WikiProject Wikify from September 2019
    All articles covered by WikiProject Wikify
    Wikipedia external links cleanup from September 2019
    Articles with multiple maintenance issues
     



    This page was last edited on 4 August 2023, at 16:39 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.



    Privacy policy

    About Wikipedia

    Disclaimers

    Contact Wikipedia

    Code of Conduct

    Developers

    Statistics

    Cookie statement

    Mobile view



    Wikimedia Foundation
    Powered by MediaWiki