Introduction#
History#
- Legal Issues
- The Rise of Deflate
Current Archival Software#
Future Developments#
Compression Techniques#
- Run-Length Encoding
- Burrows-Wheeler Transform
- Entropy Encoding
- Shannon-Fano Coding
- Huffman Coding
- Arithmetic Coding
Compression Algorithms#
- Sliding Window Algorithms
- LZ77
- LZR
- DEFLATE
- DEFLATE64
- LZSS
- LZH
- LZB
- ROLZ
- LZP
- LZRW1
- LZJB
- LZS
- LZX
- LZO
- LZMA
- LZMA2
- Statistical Lempel-Ziv
Dictionary Algorithms#
- LZ78
- LZW
- LZC
- LZT
- LZMW
- LZAP
- LZWL
- LZJ
Non-dictionary Algorithms#
- PPM
- bzip2
- PAQ
References#
1 Introduction#
Compression algorithms fall mainly into two categories: lossy and lossless. Lossy data compression algorithms typically reduce file size by discarding small details that require a great deal of data to store at full fidelity. Because that data is permanently removed, the original file cannot be recovered exactly. Lossy compression is commonly used to store image and audio data, and it can achieve very high compression ratios through this selective removal of data, but it is not covered in this article. Lossless data compression reduces the size of a file in such a way that a decompression function can restore the original file exactly, with no loss of data. Lossless data compression is ubiquitous in computing, from saving space on personal computers to sending data over networks, communicating over secure shell, or viewing PNG or GIF images.
Lossless data compression works because any non-random file contains redundant information; statistical modeling techniques can estimate the probability of a given character or phrase occurring, and those probabilities can then be used to generate codes, assigning the shortest codes to the most common data. Such techniques include entropy encoding, run-length encoding, and dictionary-based compression. Using them, an 8-bit character, or a string of such characters, can be represented with just a few bits, eliminating a large amount of redundant data.
2 History#
Data compression has played an important role in computing since the 1970s, when the internet and online storage were becoming popular and the Lempel-Ziv algorithms were invented, but its history outside of computing is much older. Morse code, invented in 1838, is one of the earliest examples of data compression: the most common letters in English, such as 'e' and 't', were assigned the shortest Morse code representations. Later, in 1949, as mainframe computers were beginning to take hold, Claude Shannon and Robert Fano invented Shannon-Fano coding. Their algorithm assigns codes to the symbols in a block of data based on the probability of each symbol occurring: the more probable a symbol, the shorter its code, so the data as a whole is represented more compactly.
Two years later, David Huffman was studying information theory at MIT and took a class taught by Robert Fano, who gave his students the choice of writing a term paper or taking a final exam. Huffman chose the term paper, whose topic was finding the most efficient method of binary coding. After working for months without success, Huffman was ready to throw away his work and start studying for the final instead. At that point he had a flash of inspiration and devised a coding method very similar to Shannon-Fano coding but more efficient. The key difference between the two is that Shannon-Fano coding builds its probability tree from the top down, which can produce a suboptimal result, while Huffman coding builds it from the bottom up and always produces an optimal code.
Early implementations of Shannon-Fano and Huffman coding were done in hardware, with the codes hard-coded. It wasn't until the 1970s, with the rise of the internet and online storage, that software compression became practical, with Huffman codes generated dynamically from the input data. Later, in 1977, Abraham Lempel and Jacob Ziv published their groundbreaking LZ77 algorithm, the first algorithm to use a dictionary to compress data. More specifically, LZ77 uses a dynamic dictionary often called a sliding window. In 1978, the same pair published their LZ78 algorithm, which also uses a dictionary, but unlike LZ77 it parses the input data and generates a static dictionary rather than a dynamic one.
2.1 Legal Issues#
Both the LZ77 and LZ78 algorithms quickly gained popularity and spawned many variants.
Most of these variants have since become obsolete, and today only a handful of algorithms see wide use, including DEFLATE, LZMA, and LZX. Most of the commonly used algorithms derive from LZ77. This is not due to technical superiority; rather, LZ78's most popular derivative, LZW, was patented by Sperry in 1984, and the company (later Unisys) began suing software vendors, server administrators, and even end users for using the GIF format without a license, leaving the algorithm patent-encumbered.
At the time, the UNIX compress utility used a very slightly modified version of the LZW algorithm called LZC, which was later discontinued because of the patent issues. Other UNIX developers also began moving away from LZW toward open algorithms, which led the UNIX community to adopt the DEFLATE-based gzip and the Burrows-Wheeler-transform-based bzip2. In the long run this benefited the community, since both formats almost always achieve higher compression ratios than LZW. The patent issues surrounding LZW ended when the patent expired in 2003. Even so, LZW has largely been replaced and today is commonly used only in GIF compression. A few LZW derivatives have appeared since then, but they have not seen widespread use, and LZ77-based algorithms remain dominant.
Another legal battle erupted in 1993 over the LZS algorithm, which Stac Electronics had developed for disk compression software such as Stacker. Microsoft used LZS in the disk compression software it shipped with MS-DOS 6.0, which was claimed to double disk capacity. When Stac Electronics discovered that its intellectual property was being used, it sued Microsoft. Microsoft was found guilty of patent infringement and ordered to pay Stac Electronics roughly $120 million in damages, minus $13.6 million awarded in a countersuit finding that Microsoft's infringement had not been willful. Although the ruling was significant, the dispute did not hinder the development of Lempel-Ziv algorithms the way the LZW patent dispute did; the only apparent consequence is that LZS has not been spun off into any new algorithms.
2.2 The Rise of Deflate#
Businesses and other large entities had been using data compression since the publication of the Lempel-Ziv algorithms, since it helped them meet ever-growing storage needs. However, data compression did not see truly widespread use until the internet began to take off in the late 1980s, when bandwidth was either limited or expensive and compression helped relieve those bottlenecks. Compression became especially important once the World Wide Web was developed and people began sharing images and other formats far larger than plain text. To meet the demand, several new file formats incorporating compression were developed, including ZIP, GIF, and PNG.
In 1985, Thom Henderson released the first commercially successful archive format, ARC, through his company System Enhancement Associates. ARC was one of the first programs capable of both bundling and compressing files, which made it especially popular in the BBS community, and it was also open source. The ARC format used a modified LZW algorithm to compress data. Phil Katz noticed ARC's popularity and decided to improve it by rewriting the compression and decompression routines in assembly language. He released his PKARC program as shareware in 1987 and was sued by Henderson for copyright infringement. Katz was found guilty because PKARC was an obvious copy of ARC; in some places even the misspellings in the comments were identical. As part of a cross-licensing agreement, he was forced to pay royalties and other fines.
Because of the restrictions of the cross-licensing agreement, Katz could no longer sell PKARC after 1988. He therefore created a tweaked version in 1989, which produced what is now known as the ZIP format. Because it still used LZW, the format was considered patent-encumbered, and Katz later switched to the new IMPLODE algorithm. In 1993, Katz released PKZIP 2.0, which implemented the DEFLATE algorithm along with other features such as split volumes. Despite its age, nearly every .zip file today follows the PKZIP 2.0 format, making this version of the ZIP format ubiquitous.
GIF, the Graphics Interchange Format, was developed by CompuServe in 1987 to allow bitmaps to be exchanged without data loss (although the format limits each frame to 256 colors) while greatly reducing file sizes for transmission over dial-up modems. Like the ZIP format, however, GIF is based on the LZW algorithm. Despite the patent encumbrance, Unisys was unable to enforce its patent well enough to stop the format's spread, and even more than twenty years later GIF remains in wide use, especially for its ability to encode animations.
Although GIF could not be stopped, CompuServe still sought a format free of patent restrictions, and in 1994 the Portable Network Graphics (PNG) format was introduced. Like ZIP, the PNG standard uses the DEFLATE algorithm for compression. Although parts of DEFLATE were patented by Katz, the patent was never enforced, allowing PNG and other DEFLATE-based formats to avoid infringement. While LZW was widely adopted in the early days of compression, Unisys's litigious behavior saw it fade into history in favor of the faster and more efficient DEFLATE algorithm, which is currently the most commonly used data compression algorithm, a kind of Swiss army knife of compression.
In addition to being used in PNG and ZIP formats, DEFLATE is also very common in other areas of computing. For example, the gzip (.gz) file format uses DEFLATE, as it is essentially an open-source version of ZIP. Other uses of DEFLATE include HTTP, SSL, and other technologies aimed at achieving efficient data compression over networks.
Unfortunately, Phil Katz did not live to see his DEFLATE algorithm conquer the computing world. He struggled with alcoholism for years, and his life began to unravel in the late 1990s, with several arrests for drunk driving and other offenses. On April 14, 2000, the 37-year-old Katz was found dead in a hotel room. The cause of death was acute pancreatic bleeding brought on by chronic alcoholism, and many empty liquor bottles were found beside his body.
2.3 Current Archival Software#
By the mid-1990s, newer and better formats began to emerge, and the ZIP format and other DEFLATE-based formats no longer went unchallenged. In 1993, Eugene Roshal released his archiver WinRAR, which uses the proprietary RAR format. The latest versions of RAR combine the PPM and LZSS algorithms, but details of earlier implementations are not well known. RAR has become a standard format for sharing files over the internet, particularly for distributing pirated media. bzip2, an open-source implementation of the Burrows-Wheeler transform, was released in 1996 and quickly became significant competition for the DEFLATE-based gzip format on UNIX platforms. The open-source 7-Zip program, with its .7z format, followed in 1999. Thanks to its generally higher compression ratios and the modular, open nature of the format, 7-Zip may be the first format to seriously challenge the dominance of ZIP and RAR; it is not limited to a single compression algorithm but lets the user choose among algorithms such as bzip2, LZMA, LZMA2, and PPMd. Finally, at the cutting edge of archival software is the PAQ* family. The first PAQ format, PAQ1, was released by Matt Mahoney in 2002. PAQ significantly improves on the PPM approach by using a technique called context mixing, which combines two or more statistical models to produce a better prediction of the next symbol than any single model could.
3 Future Developments#
The future is always uncertain, but based on current trends, some predictions can be made about the future of data compression. Context-mixing algorithms such as PAQ and its variants are becoming popular and typically achieve the highest compression ratios, though they tend to be slow. With hardware speeds growing exponentially, roughly following Moore's Law, context-mixing algorithms are likely to shine in scenarios where compression ratio matters most, since faster hardware will overcome the speed penalty. New variants of the PPM algorithm that improve on it may also emerge. Finally, the Lempel-Ziv Markov chain algorithm (LZMA) has consistently shown an excellent balance between speed and compression ratio and is likely to spawn more variants; it may even become the "winner", since it has already been adopted by many competing compression formats since its introduction in the 7-Zip format. Another potential development is compression via substring enumeration (CSE), an emerging technique that has yet to see many software implementations. In its naive form it performs roughly on par with bzip2 and PPM, and researchers are working to improve its efficiency.
4 Compression Techniques#
Many different techniques are used to compress data. Most compression techniques cannot exist independently and must be combined to form a compression algorithm. Those that can exist independently often become more effective when combined with other compression techniques. Most of these techniques belong to entropy encoders, but there are also some other commonly used techniques, such as run-length encoding and the Burrows-Wheeler transform.
4.1 Run-Length Encoding#
Run-length encoding (RLE) is a very simple compression technique that replaces a run of consecutive occurrences of the same character with a count followed by that character; a single character is encoded as a run of length one. RLE is very useful for highly redundant data, such as indexed images with many rows of identically colored pixels, or in combination with other compression techniques such as the Burrows-Wheeler transform.
Here is a quick example of RLE:
Input: AAABBCCCCDEEEEEEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Output: 3A2B4C1D6E38A
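A minimal Python sketch of the encoder (not from the original article; it uses a shorter illustrative input so the counts are easy to check):

```python
def rle_encode(data: str) -> str:
    """Run-length encoding: emit a count followed by the character for each run."""
    if not data:
        return ""
    out, run_char, run_len = [], data[0], 1
    for char in data[1:]:
        if char == run_char:
            run_len += 1                      # extend the current run
        else:
            out.append(f"{run_len}{run_char}")
            run_char, run_len = char, 1       # start a new run
    out.append(f"{run_len}{run_char}")        # flush the final run
    return "".join(out)

print(rle_encode("AAABBCCCCDEEEEEE"))  # -> "3A2B4C1D6E"
```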
4.2 Burrows-Wheeler Transform#
The Burrows-Wheeler transform is a compression technique invented in 1994, designed to reversibly transform input data blocks to maximize the number of repeated occurrences of the same character. BWT itself does not perform any compression operations; it merely transforms the input into a form that can be encoded more efficiently by a run-length encoder or other secondary compression techniques.
The BWT algorithm is quite simple:
- Create an array of strings.
- Generate all possible rotations of the input string and store each rotation in the array.
- Sort the array in lexicographical order.
- Return the last column of the array.
BWT typically performs best on long inputs with many alternating occurrences of the same character, such as the ideal input "HAHAHA&", where & is the end-of-file character. Because identical characters alternate throughout this input, BWT groups them into long runs, and a secondary algorithm such as RLE can then compress the transformed output down to "3H&3A". Although an ideal input like this produces an optimal result, most real data does not compress nearly as well.
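A minimal Python sketch of the four steps above (the naive rotate-and-sort version; production implementations use suffix arrays instead). Note that the exact output depends on where the end-of-file marker sorts relative to the other characters, so it may differ slightly from the simplified example in the text:

```python
def bwt(data: str, eof: str = "&") -> str:
    """Naive Burrows-Wheeler transform: build all rotations, sort them, return the last column."""
    assert eof not in data, "the end-of-file marker must not appear in the input"
    s = data + eof                                          # append the end-of-file marker
    rotations = [s[i:] + s[:i] for i in range(len(s))]      # every cyclic rotation of the input
    rotations.sort()                                        # lexicographic sort
    return "".join(rotation[-1] for rotation in rotations)  # last column of the sorted table

print(bwt("HAHAHA"))  # the repeated characters end up grouped into runs
```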
4.3 Entropy Encoding#
In data compression, entropy refers to the average number of bits needed to represent a symbol or character. A basic entropy encoder combines statistical models with encoders. The input file is parsed and used to generate a statistical model consisting of the probabilities of given symbols occurring. The encoder then uses the statistical model to determine what bit or byte code to assign to each symbol, making the most common symbols have the shortest codes and the least common symbols have the longest codes.
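For reference, the standard formula behind this notion (not stated in the original) is Shannon entropy, H = -sum(p(x) * log2 p(x)), measured in bits per symbol. A minimal Python sketch of computing it for a string:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data: str) -> float:
    """Shannon entropy: the average number of bits needed per symbol,
    based on the symbol probabilities observed in the data."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy_bits_per_symbol("AAAAABBC"))  # low entropy: 'A' dominates, so fewer bits per symbol
```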
4.3.1 Shannon-Fano Coding#
This is one of the earliest compression techniques, invented by Claude Shannon and Robert Fano in 1949. This technique involves generating a binary tree to represent the probabilities of each symbol occurring. These symbols are sorted so that the most common symbols appear at the top of the tree, and the least likely symbols appear at the bottom.
The encoding for a given symbol is obtained by searching the Shannon-Fano tree for that symbol and appending a 0 or 1 for each left or right branch taken. For example, if "A" is reached by two left branches and one right branch, its encoding is "001". Because the tree is built from the top down by repeatedly splitting the symbol set, Shannon-Fano coding does not always generate optimal codes. For that reason, Huffman coding, which produces an optimal code for any given input, is usually preferred.
The algorithm for generating Shannon-Fano coding is relatively simple:
- Parse the input and count the occurrences of each symbol.
- Use the symbol counts to determine the probability of each symbol.
- Sort the symbols by probability, with the most likely symbols at the front.
- Generate leaf nodes for each symbol.
- Split the list into two parts while ensuring that the probability of the left branch is roughly equal to that of the right branch.
- Add 0 and 1 as prefixes to the left and right nodes' codes, respectively.
- Recursively apply steps 5 and 6 to the left and right subtrees until each node is a leaf in the tree.
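A minimal Python sketch of this splitting procedure, using an illustrative set of symbol counts (not taken from the article):

```python
def shannon_fano(frequencies):
    """Assign codes by recursively splitting the sorted symbol list into two halves
    whose total counts are as close to equal as possible (left half -> '0', right -> '1')."""
    symbols = sorted(frequencies, key=frequencies.get, reverse=True)  # most likely first

    def split(syms, prefix):
        if len(syms) == 1:
            return {syms[0]: prefix or "0"}          # a single symbol keeps the prefix as its code
        total = sum(frequencies[s] for s in syms)
        best_cut, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(syms)):                # try every cut point, keep the most balanced
            running += frequencies[syms[i - 1]]
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_cut, best_diff = i, diff
        codes = split(syms[:best_cut], prefix + "0")
        codes.update(split(syms[best_cut:], prefix + "1"))
        return codes

    return split(symbols, "")

print(shannon_fano({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5}))
# {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```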
4.3.2 Huffman Coding#
Huffman coding is another variant of entropy encoding that works in a very similar way to Shannon-Fano coding, but the binary tree is built from the bottom up, which always produces an optimal result.
The algorithm for generating Huffman coding shares its initial steps with Shannon-Fano:
- Parse the input and count the occurrences of each symbol.
- Use the symbol counts to determine the probability of each symbol, and sort the symbols by probability with the most likely first.
- Generate a leaf node for each symbol, including its probability P, and add it to a queue.
- While there is more than one node in the queue:
  - Remove the two nodes with the lowest probabilities from the queue.
  - Prefix the codes of the left and right nodes with 0 and 1, respectively.
  - Create a new node whose probability equals the sum of the probabilities of the two nodes.
  - Assign the first node to the left branch and the second node to the right branch.
  - Add the new node to the queue.
- The last node remaining in the queue is the root of the Huffman tree.
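A minimal Python sketch of the queue-based construction above, using heapq as the priority queue and the same illustrative counts as before:

```python
import heapq
import itertools

def huffman_codes(frequencies):
    """Build the Huffman tree bottom-up: repeatedly merge the two least likely nodes,
    prefixing codes in the left node with '0' and in the right node with '1'."""
    tie_breaker = itertools.count()   # keeps heapq from ever comparing the code dictionaries
    heap = [(freq, next(tie_breaker), {symbol: ""}) for symbol, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)     # the two lowest-probability nodes
        freq2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (freq1 + freq2, next(tie_breaker), merged))
    return heap[0][2]

codes = huffman_codes({"A": 15, "B": 7, "C": 6, "D": 6, "E": 5})
print(codes)  # the most common symbol ("A") gets the shortest code; exact codes depend on tie-breaking
```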
4.3.3 Arithmetic Coding#
This method was developed by IBM in 1979, while the company was researching data compression techniques for its mainframe computers. If the goal is the best possible compression ratio, arithmetic coding is arguably the best entropy encoding technique, as it typically yields better results than Huffman coding. It is, however, considerably more complicated than the other techniques.
Rather than splitting the probabilities of symbols into a tree structure, arithmetic coding transforms the input data into a single rational number between 0 and 1 by changing the base and assigning each unique symbol a value between 0 and the base. The value is then converted into a binary number with a fixed number of binary places, which is the encoded result. It can be decoded by converting the value from binary back to the original base and replacing the values with their corresponding symbols.
The general algorithm for computing arithmetic coding is as follows:
- Count the number of unique symbols in the input. This count is the base b of the arithmetic code (e.g., base 2 would be binary).
- Assign values from 0 to b-1 to each unique symbol in the order it appears.
- Using these values, replace each symbol in the input with its code, forming a fraction in base b.
- Convert the result from base b to a binary fraction with enough places to retain precision.
- Record the length of the input in the result, as this is needed for decoding.
Here is an example encoding operation, given the input "ABCDAABD":
- There are 4 unique symbols in the input, so the base is 4 and the length of the input is 8.
- Assign values to the symbols: A=0, B=1, C=2, D=3.
- Replace the input with the codes: "0.01230013" in base 4, where the leading 0 is not a symbol.
- Convert "0.01230013" from base 4 to base 2: "0.0001101100000111".
- Record the input length, 8, alongside the result, since it is needed for decoding.
Assuming 8-bit characters, the input is 64 bits long, while its arithmetic coding is just 16 bits, giving an excellent compression ratio of 25%. This example demonstrates how well arithmetic coding compresses when given a limited character set.
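The description above is a simplified, base-conversion view of arithmetic coding rather than a full interval-subdivision coder. As a rough sketch of the worked example (the function name and the use of Python's fractions module are illustrative assumptions, not part of any standard coder):

```python
from fractions import Fraction

def simple_base_code(data: str):
    """The simplified scheme from the text: map each unique symbol to a digit,
    read the input as a fraction in base b, then re-express that fraction in binary."""
    symbols = sorted(set(data), key=data.index)           # assign digits in order of appearance
    base = len(symbols)
    digits = [symbols.index(c) for c in data]
    value = sum(Fraction(d, base ** (i + 1)) for i, d in enumerate(digits))  # 0.d1d2... in base b
    bits, frac = "", value
    while frac and len(bits) < 2 * len(digits):           # peel off binary digits one at a time
        frac *= 2
        bits += str(int(frac))
        frac -= int(frac)
    return base, digits, bits, len(data)

print(simple_base_code("ABCDAABD"))
# (4, [0, 1, 2, 3, 0, 0, 1, 3], '0001101100000111', 8)
```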
5 Compression Algorithms#
5.1 Sliding Window Algorithms#
5.1.1 LZ77#
The LZ77 algorithm, published in 1977, was a pioneering algorithm. It introduced the concept of the "sliding window", which brought large improvements in compression ratio over more primitive algorithms. LZ77 maintains its dictionary using triples representing an offset, a run length, and a literal character. The offset is the distance from the beginning of the file at which a given phrase starts, and the run length is the number of characters to read from that offset. The literal character indicates that a new phrase has been found; that phrase is equal to the characters from offset to offset + length, plus the literal character. The dictionary changes dynamically based on the sliding window as the file is parsed: for example, with a 64 MB sliding window, the dictionary contains entries from the past 64 MB of input data.
Given the input "abbadabba", the output would look something like "abb(0,1,'d')(0,3,'a')".
While the size of this replacement is slightly larger than the input, it typically yields smaller results when the input data is longer.
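A minimal Python sketch of decoding such a stream of literals and (offset, length, character) triples, using the absolute-offset convention of the example above (a full LZ77 codec would also include the encoder and the sliding-window bookkeeping):

```python
def lz77_decode(tokens):
    """Decode a token stream: each token is either a literal character or an
    (offset, length, char) triple meaning 'copy `length` characters starting at
    absolute position `offset` of the output so far, then append `char`'."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            offset, length, char = token
            out.extend(out[offset:offset + length])   # copy the referenced phrase
            out.append(char)                           # then the literal character
        else:
            out.append(token)                          # plain literal
    return "".join(out)

# The example from the text: "abb" as literals, then (0,1,'d') and (0,3,'a')
print(lz77_decode(["a", "b", "b", (0, 1, "d"), (0, 3, "a")]))  # -> "abbadabba"
```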
5.1.2 LZR#
LZR is an improved version of the LZ77 algorithm developed by Michael Rodeh in 1981. The algorithm was designed to be a linear-time alternative to LZ77. However, the encoding pointers can point to any offset in the file, which means LZR consumes a lot of memory. Coupled with its poorer compression ratio (LZ77 is usually superior), this makes LZR an unfeasible variant.
5.1.3 DEFLATE#
DEFLATE, invented by Phil Katz in 1993, is the basis for the majority of compression tasks today. It simply combines an LZ77 or LZSS preprocessor with Huffman coding on the back end to achieve moderately good compression in a short time.
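Python's standard zlib module exposes DEFLATE-based compression, so a quick sanity check of the algorithm might look like this (the sample data and compression level are arbitrary):

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 20
compressed = zlib.compress(data, level=9)    # zlib's stream uses the DEFLATE algorithm internally
restored = zlib.decompress(compressed)
assert restored == data                       # lossless: the original comes back exactly
print(f"{len(data)} bytes -> {len(compressed)} bytes")
```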
5.1.4 DEFLATE64#
DEFLATE64 is a proprietary extension of the DEFLATE algorithm that increases the dictionary size to 64KB (hence the name) and allows for greater distances in the sliding window. Compared to DEFLATE, it improves performance and compression ratios. However, the proprietary nature of DEFLATE64 and its modest improvements over DEFLATE have limited its adoption. Instead, open-source algorithms like LZMA are typically used.
5.1.5 LZSS#
The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm was first released by James Storer and Thomas Szymanski in 1982. LZSS improves upon LZ77 by detecting whether a replacement will reduce the file size. If there is no size reduction, the input is retained as a literal in the output. Otherwise, a portion of the input is replaced with a (offset, length) pair, where the offset indicates how many bytes from the beginning of the input and the length indicates how many characters to read from that position. Another improvement over LZ77 is that LZSS eliminates the "next character," using only the offset and length pair.
Here is a brief example: the input "these theses" compresses to "these(0,6)s", which saves only one byte here, but the savings grow for larger inputs.
LZSS is still used by many popular archive formats, the most famous of which is RAR. It is sometimes also used for network data compression.
5.1.6 LZH#
LZH, developed in 1987, stands for "Lempel-Ziv Huffman." It is a variant of LZSS that utilizes Huffman coding to compress pointers, achieving slightly better compression. However, the improvements brought by using Huffman coding are negligible, and the compression results do not justify the performance cost incurred by using Huffman coding.
5.1.7 LZB#
LZB is an LZSS variant developed by Timothy Bell and others in 1987. Similar to LZH, LZB aims to reduce the size of compressed files by more efficiently encoding LZSS pointers. It achieves this by gradually increasing the size of the pointers, making them larger as the sliding window increases. It can achieve higher compression rates than LZSS and LZH, but it is still slower than LZSS due to the additional encoding steps for the pointers.
5.1.8 ROLZ#
ROLZ stands for "Reduced Offset Lempel-Ziv," which aims to improve the compression ratio of LZ77 by limiting the offset length to reduce the amount of data required for encoding the offset-length pairs. This derivative algorithm of LZ77 first appeared in Ross Williams's LZRW4 algorithm, with other implementations including BALZ, QUAD, and RZM. Highly optimized ROLZ can achieve compression ratios nearly equivalent to LZMA, but due to a lack of widespread application, ROLZ's popularity is lower.
5.1.9 LZP#
LZP stands for "Lempel-Ziv + Prediction." It is a special case of the ROLZ algorithm where the offset is reduced to 1. Several variants exist, using different techniques to achieve either faster operation or better compression ratios. LZP4 implements an arithmetic encoder to achieve the best compression ratio, at the cost of speed.
5.1.10 LZRW1#
The LZRW1 algorithm was created by Ross Williams in 1991, introducing the concept of reduced-offset Lempel-Ziv compression. LZRW1 can achieve high compression ratios while remaining fast and efficient. Williams also created several variants that improve on LZRW1, such as LZRW1-A, LZRW2, LZRW3, LZRW3-A, and LZRW4.
5.1.11 LZJB#
Jeff Bonwick created his Lempel-Ziv Jeff Bonwick algorithm in 1998 for the Solaris Z file system (ZFS). It is considered a variant of the LZRW algorithm, particularly the LZRW1 variant, aimed at achieving maximum compression speed. Since it is used in file systems, speed is particularly important to ensure that the compression algorithm does not become a bottleneck for disk operations.
5.1.12 LZS#
The Lempel-Ziv-Stac algorithm is an improved LZ77 algorithm developed by Stac Electronics in 1994 for disk compression software. It distinguishes between literal symbols and offset-length pairs in the output while omitting the next encountered symbol, making the LZS algorithm functionally similar to the LZSS algorithm.
5.1.13 LZX#
The LZX algorithm was developed by Jonathan Forbes and Tomi Poutanen in 1995 for the Amiga computer. The "X" in LZX has no special meaning. Forbes sold the algorithm to Microsoft in 1996 and worked for Microsoft to further improve it for use in Microsoft's cabinet (.CAB) format. The algorithm was also used by Microsoft to compress compiled HTML help (CHM) files, Windows Imaging Format (WIM) files, and Xbox Live Avatars.
5.1.14 LZO#
LZO was developed by Markus Oberhumer in 1996, with the goal of achieving fast compression and decompression. It allows for adjustable compression levels, with the highest compression level requiring only an additional 64KB of memory, while decompression only requires input and output buffers. LZO's functionality is very similar to the LZSS algorithm, but it is focused on speed rather than compression ratio.
5.1.15 LZMA#
The Lempel-Ziv Markov chain algorithm was first made public in 1998 with the release of the 7-Zip compression software for the .7z file format. It is generally able to achieve better compression results than bzip2, DEFLATE, and other algorithms. LZMA employs a series of compression techniques to achieve its output. First, it uses a modified version of the LZ77 algorithm to parse data at the bit level rather than the traditional byte level. Then, the output of the LZ77 algorithm is subjected to arithmetic coding. Depending on the specific LZMA implementation, additional techniques may be applied. The result is often better than the compression ratios of most other LZ variants, primarily due to the use of bit-level compression methods instead of byte-level compression.
5.1.16 LZMA2#
LZMA2 is a progressive improvement over the original LZMA algorithm, initially introduced in 2009 through an update to the 7-Zip archiving software. LZMA2 enhances the multithreading capabilities and performance of the LZMA algorithm and better handles data that cannot be compressed, resulting in slightly better compression.
5.1.17 Statistical Lempel-Ziv#
Statistical Lempel-Ziv is a concept proposed by Dr. Sam Kwong and Yu Fan Ho in 2001. Its fundamental principle is to combine statistical analysis of data with LZ77 variant algorithms to further optimize the encoding stored in the dictionary.
5.2 Dictionary Algorithms#
5.2.1 LZ78#
The LZ78 algorithm was created by Lempel and Ziv in 1978, hence the "78" in the abbreviation. Rather than generating the dictionary from a sliding window, the dictionary is either built by preprocessing the entire input or formed as the file is parsed; LZ78 takes the latter approach. The dictionary size is typically limited to a few megabytes, or all codes are capped at a certain number of bytes, such as 8, in order to reduce memory requirements. How they handle a full dictionary is where most LZ78-type algorithms differ.
While parsing the file, the LZ78 algorithm adds each newly encountered character or string to the dictionary. For each symbol in the input, a dictionary entry is generated in the form (dictionary index, unknown symbol). If a symbol is already in the dictionary, the dictionary is searched for substrings made up of the current symbol and the symbols that follow it, and the index of the longest match is used as the dictionary index; the phrase that index points to, extended by the final character of the unknown substring, becomes the new entry. If the current symbol is not yet in the dictionary, the index is set to 0 to indicate a single-character entry. The entries form a data structure similar to a linked list.
For example, the input "abbadabbaabaad" generates the output "(0,a)(0,b)(2,a)(0,d)(1,b)(3,a)(6,d)", as traced in the sketch below.
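A minimal Python sketch of this parsing loop, which reproduces the (index, character) pairs of the example above:

```python
def lz78_encode(data: str):
    """LZ78: emit (dictionary index, next char) pairs, growing the dictionary as we parse."""
    dictionary = {}              # phrase -> index; index 0 stands for the empty phrase
    output, phrase = [], ""
    for char in data:
        if phrase + char in dictionary:
            phrase += char                               # keep extending the current match
        else:
            output.append((dictionary.get(phrase, 0), char))
            dictionary[phrase + char] = len(dictionary) + 1
            phrase = ""
    if phrase:                                           # flush a trailing match, if any
        output.append((dictionary[phrase], ""))
    return output

print(lz78_encode("abbadabbaabaad"))
# [(0, 'a'), (0, 'b'), (2, 'a'), (0, 'd'), (1, 'b'), (3, 'a'), (6, 'd')]
```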
5.2.2 LZW#
LZW is the Lempel-Ziv-Welch algorithm, created by Terry Welch in 1984. Despite serious patent issues, LZW became the most widely used algorithm in the LZ78 family. It improves on LZ78 by removing redundant characters from the output, so that the output consists entirely of pointers. It also includes every character in the dictionary before compression begins, and employs other tricks to improve compression, such as encoding the last character of each new phrase as the first character of the next phrase. LZW is found in the Graphics Interchange Format (GIF), in early specifications of the ZIP format, and in other specialized applications. LZW is very fast but achieves poorer compression than most newer algorithms, and some of those algorithms are both faster and better at compressing.
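A minimal Python sketch of the LZW parsing loop; the demonstration string is a classic LZW illustration rather than an example from the article:

```python
def lzw_encode(data: str):
    """LZW: start with every single character in the dictionary and output only indices."""
    dictionary = {chr(i): i for i in range(256)}    # pre-load every single byte value
    next_code, phrase, output = 256, "", []
    for char in data:
        if phrase + char in dictionary:
            phrase += char                           # keep extending the current match
        else:
            output.append(dictionary[phrase])
            dictionary[phrase + char] = next_code    # new phrase = old phrase + next char
            next_code += 1
            phrase = char
    if phrase:
        output.append(dictionary[phrase])            # flush the final match
    return output

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))  # repeated phrases collapse into dictionary indices
```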
5.2.3 LZC#
LZC (Lempel-Ziv Compress) is a slight modification of the LZW algorithm used in the UNIX compress utility. The main difference between LZC and LZW is that LZC monitors the compression ratio of the output. Once the ratio exceeds a certain threshold, the dictionary is discarded and rebuilt.
5.2.4 LZT#
The Lempel-Ziv Tischer algorithm is an improvement on LZC that, when the dictionary is full, deletes the least recently used phrases and replaces them with new entries. There have been other incremental improvements, but neither LZC nor LZT is commonly used today.
5.2.5 LZMW#
The LZMW algorithm was invented by Victor Miller and Mark Wegman in 1984. Like LZT, it discards the least recently used phrases. However, rather than extending a dictionary entry by a single character, LZMW concatenates the last two encoded phrases and stores the result as a new entry. As a result, the dictionary can grow quickly, and least recently used entries must be discarded more often. LZMW generally compresses better than LZT, but it is another algorithm that sees little use today.
5.2.6 LZAP#
LZAP was created by James Storer in 1988 as a modification of the LZMW algorithm. The "AP" stands for "all prefixes": rather than storing a single new phrase at each step, the dictionary stores every prefix of the combined phrase. For example, if the previous phrase is "last" and the current phrase is "next", the dictionary stores "lastn", "lastne", "lastnex", and "lastnext".
5.2.7 LZWL#
LZWL is a modified LZW algorithm created in 2006 that uses syllables instead of single characters for compression. The design of LZWL aims to better handle specific datasets with many common syllables, such as XML data. Typically, such algorithms are used with preprocessors that break down input data into syllables.
5.2.8 LZJ#
Matti Jakobsson released the LZJ algorithm in 1985, which is one of the few LZ78 algorithms that differ from LZW. This algorithm works by storing each unique string (up to an arbitrary maximum length) processed in the input in the dictionary and assigning a code to each string. When the dictionary is full, all entries that appear only once are deleted.
5.3 Non-dictionary Algorithms#
5.3.1 PPM#
Prediction by Partial Matching (PPM) is a statistical modeling technique that uses a set of previous symbols in the input to predict the next symbol, reducing the entropy of the output data. This differs from dictionary-based methods in that PPM predicts the next symbol rather than looking it up in a dictionary. PPM is usually combined with an encoder such as arithmetic coding or adaptive Huffman coding. PPM, or its variant PPMd, is implemented in many archive formats, including 7-Zip and RAR.
5.3.2 bzip2#
bzip2 is an open-source implementation of the Burrows-Wheeler transform. Its operation is very straightforward, but it achieves a very good balance between speed and compression ratio, making the bzip2 format very popular in UNIX environments. First, run-length encoding is applied to the data. Next, the Burrows-Wheeler transform is applied. Then, a move-to-front transform is applied, designed to create a large number of identical symbols forming runs for another run-length encoder to use. Finally, the result is Huffman encoded and wrapped in a header.
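Of these stages, the move-to-front transform is the least familiar; here is a minimal illustrative Python sketch of it (not bzip2's actual implementation):

```python
def move_to_front(data: bytes):
    """Move-to-front: replace each byte with its index in a recency list,
    then move that byte to the front of the list."""
    alphabet = list(range(256))
    output = []
    for byte in data:
        index = alphabet.index(byte)
        output.append(index)
        alphabet.insert(0, alphabet.pop(index))   # move the symbol to the front
    return output

# Runs of repeated symbols (as produced by a BWT) turn into runs of zeros,
# which the following run-length and Huffman stages compress very well.
print(move_to_front(b"aaabbbaaa"))  # [97, 0, 0, 98, 0, 0, 1, 0, 0]
```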
5.3.3 PAQ#
PAQ was created by Matt Mahoney in 2002 as an improvement on the older PPM(d) algorithms. It uses a revolutionary technique called context mixing, which intelligently combines multiple statistical models (PPM being one of them) to predict the next symbol better than any single model could. PAQ is one of the most promising algorithms because of its extremely high compression ratios and very active development: more than 20 variants have been created since its inception, some of them setting compression-ratio records. PAQ's biggest drawback is speed, a consequence of running many statistical models at once, but as hardware continues to get faster it may become a future standard. PAQ is slowly being adopted; it can be found in the PeaZip program on Windows, which uses the PAQ8O variant with 64-bit support and significant speed improvements. Most other PAQ implementations are command-line only.
6 References#
- Wolfram, Stephen. A New Kind of Science. Champaign, IL: Wolfram Media, 2002. 1069. Print.
- Ken Huffman. Profile: David A. Huffman, Scientific American, September 1991, pp. 54-58.
- Ziv J., Lempel A., "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3 (1977), pp. 337-343.
- Ziv J., Lempel A., "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, Vol. 24, No. 5, pp. 530-536.
- USPTO Patent #4814746. See http://www.theregister.co.uk/1999/09/01/unisys_demands_5k_licence_fee
- blog [Fedora & Linux Tips]
- ARC Info
- comp.compression Frequently Asked Questions (part 1/3), Section [8]: What about patents on data compression algorithms?
- USPTO Patent #5051745
- Phil Katz' Death
- Iwata, K., Arimura, M., and Shima, Y., "An Improvement in Lossless Data Compression via Substring Enumeration", 2011 IEEE/ACIS 10th International Conference on Computer and Information Science (ICIS).
- Burrows, M., and Wheeler, D. J. 1994. A Block-Sorting Lossless Data Compression Algorithm. SRC Research Report 124, Digital Systems Research Center.
- http://www.cs.tau.ac.il/~dcor/Graphics/adv-slides/entropy.pdf
- Shannon, C.E. (July 1948). "A Mathematical Theory of Communication". Bell System Technical Journal 27: 379-423.
- Huffman, D. A. 1952. A method for the construction of minimum-redundancy codes. In Proceedings of the Institute of Radio Engineers 40, 9 (Sept.), pp. 1098-1101.
- Rissanen, J., and Langdon, G. G. 1979. Arithmetic coding. IBM J. Res. Dev. 23, 2 (Mar.), 149-162.
- Rodeh, M., Pratt, V. R., and Even, S. 1981. Linear algorithm for data compression via string matching. J. ACM 28, 1 (Jan.), 16-24.
- Bell, T., Witten, I., Cleary, J., "Modeling for Text Compression", ACM Computing Surveys, Vol. 21, No. 4 (1989).
- DEFLATE64 benchmarks
- Storer, J. A., and Szymanski, T. G. 1982. Data compression via textual substitution. J. ACM 29, 4 (Oct.), 928-951.
- Bloom, C., "LZP: a new data compression algorithm", Data Compression Conference, 1996. DCC '96. Proceedings, p. 425. doi:10.1109/DCC.1996.488353.
- Dr Ross's Compression Crypt
- "Data Compression Method - Adaptive Coding with Sliding Window for Information Interchange", American National Standard for Information Systems, August 30, 1994.
- LZX Sold to Microsoft
- LZO Info
- LZMA. Accessed on 12/10/2011.
- LZMA2 Release Date
- Kwong, S., Ho, Y.F., "A Statistical Lempel-Ziv Compression Algorithm for Personal Digital Assistant (PDA)", IEEE Transactions on Consumer Electronics, Vol. 47, No. 1, February 2001, pp. 154-162.
- David Salomon, Data Compression - The Complete Reference, 4th ed., page 212.
- Chernik, K., Lansky, J., Galambos, L., "Syllable-based Compression for XML Documents", Dateso 2006, pp. 21-31, ISBN 80-248-1025-5.
- Jakobsson, M., "Compression of Character Strings by an Adaptive Dictionary", BIT Computer Science and Numerical Mathematics, Vol. 25, No. 4 (1985). doi:10.1007/BF01936138.
- Cleary, J., Witten, I., "Data Compression Using Adaptive Coding and Partial String Matching", IEEE Transactions on Communications, Vol. COM-32, No. 4, April 1984, pp. 396-402.
- Seward, J., "bzip2 and libbzip2", bzip2 Manual, March 2000.
- Mahoney, M., "Adaptive Weighting of Context Models for Lossless Data Compression", 2002.