Calgary corpus
From Wikipedia, the free encyclopedia
 


The Calgary corpus is a collection of text and binary data files, commonly used for comparing data compression algorithms. It was created by Ian Witten, Tim Bell and John Cleary at the University of Calgary in 1987 and was widely used in the 1990s. In 1997 it was superseded by the Canterbury corpus,[1] based on concerns about how representative the Calgary corpus was,[2] but the Calgary corpus remains available and is still useful for its originally intended purpose.

Contents


In its most commonly used form, the corpus consists of 14 files totaling 3,141,622 bytes as follows.

Size (bytes)   File name   Description
111,261        BIB         ASCII text in UNIX "refer" format – 725 bibliographic references.
768,771        BOOK1       unformatted ASCII text – Thomas Hardy: Far from the Madding Crowd.
610,856        BOOK2       ASCII text in UNIX "troff" format – Witten: Principles of Computer Speech.
102,400        GEO         32 bit numbers in IBM floating point format – seismic data.
377,109        NEWS        ASCII text – USENET batch file on a variety of topics.
21,504         OBJ1        VAX executable program – compilation of PROGP.
246,814        OBJ2        Macintosh executable program – "Knowledge Support System" of B.R. Gaines.
53,161         PAPER1      UNIX "troff" format – Witten, Neal, Cleary: Arithmetic Coding for Data Compression.
82,199         PAPER2      UNIX "troff" format – Witten: Computer (in)security.
513,216        PIC         1728 x 2376 bitmap image (MSB first): text in French and line diagrams.
39,611         PROGC       Source code in C – UNIX compress v4.0.
71,646         PROGL       Source code in Lisp – system software.
49,379         PROGP       Source code in Pascal – program to evaluate PPM compression.
93,695         TRANS       ASCII and control characters – transcript of a terminal session.

There is also a less commonly used 18 file version, which includes 4 additional text files in UNIX "troff" format, PAPER3 through PAPER6. The maintainers of the Canterbury corpus website note that "they don't add to the evaluation".[3]

Benchmarks


The Calgary corpus was a commonly used benchmark for data compression in the 1990s. Results were most commonly listed in bits per byte (bpb) for each file and then summarized by averaging. More recently, it has been common to just add the compressed sizes of all of the files. This is called a weighted average because it is equivalent to weighting the compression ratios by the original file sizes. The UCLC benchmark[4] by Johan de Bock uses this method.
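The difference between the two summaries can be sketched as follows. This is a minimal illustration using made-up (original size, compressed size) pairs, not real Calgary corpus results:

```python
# Two ways of summarizing a compression benchmark, shown on
# hypothetical (original_size, compressed_size) pairs.
files = {
    "BOOK1": (768_771, 250_000),
    "GEO":   (102_400, 60_000),
}

# Per-file bits per byte (bpb), then a simple average:
# every file counts equally, regardless of its size.
bpb = {name: 8 * comp / orig for name, (orig, comp) in files.items()}
simple_avg = sum(bpb.values()) / len(bpb)

# "Weighted average": total compressed bits over total original bytes.
# This is what adding up the compressed sizes amounts to, and it
# weights each file's ratio by its original size.
total_orig = sum(orig for orig, _ in files.values())
total_comp = sum(comp for _, comp in files.values())
weighted_avg = 8 * total_comp / total_orig
```

Because GEO is small but compresses poorly in this sketch, it pulls the simple average up much more than the weighted one.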

For some data compressors it is possible to compress the corpus smaller by combining the inputs into an uncompressed archive (such as a tar file) before compression because of mutual information between the text files. In other cases, the compression is worse because the compressor handles nonuniform statistics poorly. This method was used in a benchmark in the online book Data Compression Explained by Matt Mahoney.[5]
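The mutual-information effect can be demonstrated with any window-based compressor. The sketch below uses zlib as a stand-in for the compressors above and two synthetic strings with shared vocabulary as stand-ins for the corpus text files; concatenation plays the role of the tar archive:

```python
import zlib

# Two synthetic "files" that share most of their vocabulary, much as
# the Calgary text files share statistics.
a = ("Arithmetic coding assigns shorter codes to frequent symbols. " * 40).encode()
b = ("Frequent symbols get shorter codes under arithmetic coding. " * 40).encode()

# Compress each file on its own...
separate = len(zlib.compress(a, 9)) + len(zlib.compress(b, 9))

# ...versus compressing the concatenation, which lets the second
# file's data reuse matches against the first file's content.
combined = len(zlib.compress(a + b, 9))
```

With shared statistics (and only one stream's worth of header overhead), the combined stream comes out smaller than the two separate ones.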

The table below shows the compressed sizes of the 14 file Calgary corpus using both methods for some popular compression programs. Options, when used, select best compression. For a more complete list, see the above benchmarks.

Compressor      Options      As 14 separate files   As a tar file
Uncompressed                 3,141,622              3,152,896
compress                     1,272,772              1,319,521
Info-ZIP 2.32   -9           1,020,781              1,023,042
gzip 1.3.5      -9           1,017,624              1,022,810
bzip2 1.0.3     -9             828,347                860,097
7-zip 9.12b                    848,687                824,573
bzip3 1.1.8                    765,939                779,795
ppmd Jr1        -m256 -o16     740,737                754,243
ppmonstrJ                      675,485                669,497
ZPAQ v7.15      -method 5      659,709                659,853

Compression challenge


The "Calgary corpus Compression and SHA-1 crack Challenge"[6] is a contest started by Leonid A. Broukhis on May 21, 1996 to compress the 14 file version of the Calgary corpus. The contest offers a small cash prize which has varied over time. Currently the prize is US $1 per 111 byte improvement over the previous result.

According to the rules of the contest, an entry must consist of both the compressed data and the decompression program packed into one of several standard archive formats. Time and memory limits, archive formats, and decompression languages have been relaxed over time. Currently the program must run within 24 hours on a 2000 MIPS machine under Windows or Linux and use less than 800 MB memory. An SHA-1 challenge was later added. It allows the decompression program to output files different from the Calgary corpus as long as they hash to the same values as the original files. So far, that part of the challenge has not been met.
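The SHA-1 acceptance rule amounts to a hash comparison rather than a byte-for-byte comparison. A minimal sketch of such a check (the function name is illustrative, not part of the contest's actual tooling):

```python
import hashlib

def sha1_matches(original: bytes, reconstructed: bytes) -> bool:
    """Accept the reconstructed data if its SHA-1 digest equals the
    original's, even when the bytes themselves differ."""
    return (hashlib.sha1(original).hexdigest()
            == hashlib.sha1(reconstructed).hexdigest())
```

Identical data passes trivially; passing with *different* data would require producing a SHA-1 collision against the fixed corpus files, which is why that part of the challenge remains unmet.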

The first entry received was 759,881 bytes in September, 1997 by Malcolm Taylor, author of RK and WinRK. The most recent entry was 580,170 bytes by Alexander Ratushnyak on July 2, 2010. The entry consists of a compressed file of size 572,465 bytes and a decompression program written in C++ and compressed to 7700 bytes as a PPMd var. I archive, plus 5 bytes for the compressed file name and size. The history is as follows.

Size (bytes) Month/year Author
759,881 09/1997 Malcolm Taylor
692,154 08/2001 Maxim Smirnov
680,558 09/2001 Maxim Smirnov
653,720 11/2002 Serge Voskoboynikov
645,667 01/2004 Matt Mahoney
637,116 04/2004 Alexander Ratushnyak
608,980 12/2004 Alexander Ratushnyak
603,416 04/2005 Przemysław Skibiński
596,314 10/2005 Alexander Ratushnyak
593,620 12/2005 Alexander Ratushnyak
589,863 05/2006 Alexander Ratushnyak
580,170 07/2010 Alexander Ratushnyak

References

  1. Ian H. Witten; Alistair Moffat; Timothy C. Bell (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. p. 92. ISBN 9781558605701.
  2. Salomon, David (2007). Data Compression: The Complete Reference (Fourth ed.). Springer. p. 12. ISBN 9781846286032.
  3. "The Canterbury Corpus". corpus.canterbury.ac.nz.
  4. "UC Learning Center". 6 January 2023.
  5. "Data Compression Explained". mattmahoney.net.
  6. "The Compression/SHA-1 Challenge". mailcom.com.
    Retrieved from "https://en.wikipedia.org/w/index.php?title=Calgary_corpus&oldid=1160913791"
    This page was last edited on 19 June 2023, at 13:48 (UTC).