Jump to content
 







Main menu
   


Navigation  



Main page
Contents
Current events
Random article
About Wikipedia
Contact us
Donate
 




Contribute  



Help
Learn to edit
Community portal
Recent changes
Upload file
 








Search  

































Create account

Log in
 









Create account
 Log in
 




Pages for logged out editors learn more  



Contributions
Talk
 



















Contents

   



(Top)
 


1 Compression algorithm  





2 Reference implementation  





3 Advantages  





4 Disadvantages  





5 History  





6 Alternative implementations  



6.1  lrzip  





6.2  rzip64  





6.3  REP  





6.4  SREP  







7 See also  





8 References  





9 External links  














rzip






Deutsch
Français
 

Edit links
 









Article
Talk
 

















Read
Edit
View history
 








Tools
   


Actions  



Read
Edit
View history
 




General  



What links here
Related changes
Upload file
Special pages
Permanent link
Page information
Cite this page
Get shortened URL
Download QR code
Wikidata item
 




Print/export  



Download as PDF
Printable version
 
















Appearance
   

 






From Wikipedia, the free encyclopedia
 


rzip
Original author(s)Andrew Tridgell
Stable release

2.1 / 14 February 2006; 18 years ago (2006-02-14)

Written inC
Operating systemUnix-like
Size46K (source code tarball, gzipped)
Websiterzip.samba.org

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

Compression algorithm[edit]

rzip operates in two stages. The first stage finds and encodes large chunks of duplicated data over potentially very long distances (900 MB) in the input file. The second stage uses a standard compression algorithm (bzip2) to compress the output of the first stage.

It is quite common these days to need to compress files that contain long distance redundancies. For example, when compressing a set of home directories several users might have copies of the same file, or of quite similar files. It is also common to have a single file that contains large duplicated chunks over long distances, such as PDF files containing repeated copies of the same image. Most compression programs won't be able to take advantage of this redundancy, and thus might achieve a much lower compression ratio than rzip can achieve.

The intermediate interface between the two stages is made of a byte-aligned data stream of which there are two commands, a literal ("add") with length and data:

 type:8       = 0 => literal/add range of count bytes
 count:16     = 1..65535
 data:8..∞    = literal data to be inserted (n whole bytes)

and a match ("copy") with length and offset parameters:

 type:8       = 1 => match/copy range of count bytes
 count:16     = 31..65535
 offset:32    = offset to position to be copied from

Literal or match/copy lengths of greater than 65,535 bytes are split into multiple instructions. End-of-stream is indicated with a zero-length literal/add (type=0,count=0) command and immediately followed by a 32-bit CRC checksum.

Reference implementation[edit]

A rolling-checksum algorithm based on the one in rsync is used to locate potential matches from over such a large dataset. As the hash buckets fill up, previous hashes ("tags") are discarded based on twice.[clarification needed] The tags are discarded in such a manner as to provide fairly good coverage, with a gradually decreasing match granularity as the distance increases. This implementation does not search for match lengths of fewer than 31 consecutive bytes.

Advantages[edit]

The key difference between rzip and other well known compression algorithms is its ability to take advantage of very long distance redundancy. The well known deflate algorithm used in gzip uses a maximum history buffer of 32 KiB. The Burrows–Wheeler transform block sorting algorithm used in bzip2 is limited to 900 KiB of history. The history buffer in rzip can be up to 900 MiB long, several orders of magnitude larger than gzip or bzip2. Rzip is often much faster than bzip2, despite using the bzip2 library as a back end. This is because rzip feeds bzip2 with shrunken data, so that bzip2 has to do less work. Simple comparisons (although too small for it to be an authoritative benchmark) have been produced.[1][2]

Disadvantages[edit]

rzip is not suited for every purpose. The two biggest disadvantages of rzip are that it cannot be pipelined (so it cannot read from standard input or write to standard output), and that it uses a high amount of memory: a typical compression run on a large file might use hundreds of megabytes of RAM. If there is a lot of RAM to spare and a very high compression ratio is required, rzip should be used, but if these conditions are not satisfied, alternate compression methods such as gzip and bzip2, which are less memory-intensive, should be used instead of rzip. There is at least one patch to enable pipelining.[3]

History[edit]

rzip was originally written by Andrew Tridgell as part of his PhD research.

Alternative implementations[edit]

lrzip[edit]

lrzip
Original author(s)Con Kolivas, Peter Hyman, Andrew Tridgell
Initial releaseJanuary 2008; 16 years ago (2008-01)
Stable release

0.651 / 9 March 2022; 2 years ago (2022-03-09)

Written inC, C++ (libzpaq)
Operating systemUnix-like
Size246K (source code tarball, gzipped)
Websitegithub.com/ckolivas/lrzip

lrzip (Long Range ZIP) is an improved version of rzip. Its file format (.lrz) is incompatible with rzip's. It has the following improvements:

The lrzip distribution comes with a pair of programs to use it with tar, lrztar and lrzuntar.

rzip64[edit]

rzip64 is an extension of rzip for very large files that can utilize multiple CPU cores in parallel. There are benchmark results.[5] Most important, however, is the ability of rzip64 to be interrupted at any time. Thereby a running compression task (that may easily take several hours for large files) survives even a system maintenance reboot without losing already completed work and can be resumed later. The file format of rzip64 is identical to the original rzip.

REP[edit]

REP is an alternative implementation of rzip algorithm by Bulat Ziganshin used in his FreeArc archiver as preprocessor for LZMA/Tornado compression algorithms. In FreeArc, REP finds large-distance matches and then LZMA compress the remaining data. For example, on computer with 2 GB RAM, REP finds matches that is at least 512 bytes long at the distances up to 1 GB, and then LZMA finds any remaining matches at the distances up to 128 MB. So, working together, they provide the best compression possible on 2 GB RAM budget.

Being optimized for stream decompression and collaborative work with LZMA, REP has some differences from the original RZIP implementation. First, by default it finds only matches that are 512+ byte long, since benchmarking proved that this is optimal setting for overall REP+LZMA compression. Second, it uses a sliding dictionary that's about 1/2 RAM long, so decompression doesn't need to reread data from decompressed file. REP's advantage is its multiplicative rolling hash that is both quick to compute and has near-ideal distribution.

Larger minimal match length (512 bytes compared to 32 bytes in rzip) allowed for additional speed optimizations, so that REP provides very fast compression (about 200 MB/s on Intel i3-2100).

SREP[edit]

SREP (SuperREP) is an implementation of Tridgell's idea of LZ compressor that doesn't store its dictionary in RAM, using instead SHA1 hashes of processed blocks to compare their contents. It allows the program to compress files that are about 10x larger than RAM available. Decompression performed either by reading data from decompressed part of file, or by storing in the memory future matches (future-LZ compression algorithm). Of course, future-LZ compression requires 2 passes over input file but decompression needs tiny memory.[citation needed] In one experiment, 22 GB file compressed with minimum match length of 512 bytes and full 22 GB dictionary required just 2 GB of RAM for decompression.[citation needed]

See also[edit]

References[edit]

  • ^ "rzip".
  • ^ "Nicolas Rachinsky: Links".
  • ^ Kolivas, Kon. "lrzip README". GitHub. Retrieved 27 January 2017.
  • ^ "GHSi - Benchmarking rzip64".
  • External links[edit]


    Retrieved from "https://en.wikipedia.org/w/index.php?title=Rzip&oldid=1178950155"

    Category: 
    Free data compression software
    Hidden categories: 
    Articles with short description
    Short description is different from Wikidata
    Articles needing additional references from November 2010
    All articles needing additional references
    Wikipedia articles needing clarification from January 2013
    All articles with unsourced statements
    Articles with unsourced statements from December 2020
    Webarchive template wayback links
     



    This page was last edited on 6 October 2023, at 21:54 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.



    Privacy policy

    About Wikipedia

    Disclaimers

    Contact Wikipedia

    Code of Conduct

    Developers

    Statistics

    Cookie statement

    Mobile view



    Wikimedia Foundation
    Powered by MediaWiki