Compression Algorithms
Overview
Cabriolet supports multiple compression algorithms, each with different characteristics:
Algorithm Year Type Compression Speed Complexity
------------------------------------------------------------------------
LZSS 1982 Dictionary Medium Fast Low
MSZIP 1993 DEFLATE variant Good Medium Medium
Quantum 1993 Arithmetic Good Medium High
LZX 1995 Advanced Dict Excellent Slow Very High
------------------------------------------------------------------------LZSS Compression
Algorithm Overview
LZSS (Lempel-Ziv-Storer-Szymanski) uses a sliding window to find repeated sequences.
Key Concepts:
-
Sliding Window: Fixed-size buffer of previously seen data
-
Look-ahead Buffer: Upcoming data to compress
-
Match Finding: Search window for longest match
-
Length-Distance Pairs: Encode matches as (length, distance)
How It Works
Input: "the quick brown fox jumps over the lazy dog"
Window size: 4096 bytes
Look-ahead: 18 bytes
Step 1: No match for "the" → output literal 't', 'h', 'e', ' '
Step 2: Found "the" at distance 36 → output (length=3, distance=36)
Step 3: Continue scanning...
Compressed output:
't' 'h' 'e' ' ' 'q' 'u' 'i' 'c' 'k' ' ' 'b' 'r' 'o' 'w' 'n'
' ' 'f' 'o' 'x' ' ' 'j' 'u' 'm' 'p' 's' ' ' 'o' 'v' 'e' 'r'
' ' (3,36) ' ' 'l' 'a' 'z' 'y' ' ' 'd' 'o' 'g'
↑
Refers back to "the"Implementation Details
class LZSSDecompressor
WINDOW_SIZE = 4096
LOOK_AHEAD_SIZE = 18
def decompress(input)
window = CircularBuffer.new(WINDOW_SIZE)
output = []
input.each_token do |token|
if token.literal?
# Direct byte
byte = token.value
output << byte
window.add(byte)
else
# Match reference (length, distance)
length = token.length
distance = token.distance
# Copy from window
length.times do
byte = window.get_at(distance)
output << byte
window.add(byte)
end
end
end
output.pack('C*')
end
endMSZIP (DEFLATE)
Algorithm Overview
MSZIP is Microsoft’s implementation of DEFLATE, combining LZ77 and Huffman coding.
Process:
-
LZ77 Compression: Find repeated sequences
-
Huffman Encoding: Compress symbols statistically
-
Block Structure: Process data in blocks
Compression Pipeline
Input Data
↓
┌─────────────────┐
│ LZ77 Encoding │ Find matches, output literals & length-distance pairs
└────────┬────────┘
↓
┌─────────────────┐
│ Build Huffman │ Analyze symbol frequencies
│ Trees │
└────────┬────────┘
↓
┌─────────────────┐
│ Huffman Encode │ Encode using optimal codes
└────────┬────────┘
↓
Compressed OutputBlock Format
Each MSZIP block contains:
Block Structure:
┌──────────────────────────────────────┐
│ Block Header (3 bits) │
│ - Final block flag (1 bit) │
│ - Block type (2 bits) │
├──────────────────────────────────────┤
│ Compressed Length Tables (variable) │
│ - Literal/length code lengths │
│ - Distance code lengths │
├──────────────────────────────────────┤
│ Compressed Data (variable) │
│ - Literals and match codes │
│ - Distance codes │
└──────────────────────────────────────┘Decompression Process
def decompress_mszip_block(input)
# 1. Read block header
final_block = input.read_bits(1)
block_type = input.read_bits(2)
# 2. Build Huffman trees
literal_tree = build_tree_from_lengths(read_code_lengths(input))
distance_tree = build_tree_from_lengths(read_distance_lengths(input))
# 3. Decode compressed data
output = []
window = CircularBuffer.new(32768)
loop do
code = decode_symbol(input, literal_tree)
if code < 256
# Literal byte
output << code
window.add(code)
elsif code == 256
# End of block
break
else
# Length-distance pair
length = decode_length(code, input)
distance = decode_distance(input, distance_tree)
# Copy from window
length.times do
byte = window.get_at(distance)
output << byte
window.add(byte)
end
end
end
output
endQuantum Compression
Algorithm Overview
Quantum uses arithmetic coding combined with adaptive modeling.
Key Features:
-
Arithmetic Coding: Represents entire message as single number
-
Adaptive Model: Updates probabilities dynamically
-
Context Modeling: Uses previous bytes as context
How Arithmetic Coding Works
Traditional Huffman assigns integer bit codes. Arithmetic coding represents data as a fraction:
Example: Encoding "AAB"
Symbol probabilities:
A: 0.6 (60%)
B: 0.4 (40%)
Start with range [0.0, 1.0]
Encode 'A':
New range: [0.0, 0.6] (60% of current range)
Encode 'A':
New range: [0.0, 0.36] (60% of [0.0, 0.6])
Encode 'B':
New range: [0.36, 0.504] (40% of [0.36, 0.504])
Final output: Any number in [0.36, 0.504], e.g., 0.4
In binary: 0.011001... (approximately)Implementation
class QuantumDecompressor
def decompress(input, length)
model = AdaptiveModel.new
output = []
# Initialize arithmetic decoder range
decoder = ArithmeticDecoder.new(input)
length.times do
# Get symbol based on current range
symbol = decoder.decode_symbol(model)
output << symbol
# Update model with new symbol
model.update(symbol)
end
output.pack('C*')
end
end
class AdaptiveModel
def initialize
@counts = Array.new(256, 1) # Start with count of 1
@total = 256
end
def probability(symbol)
@counts[symbol].to_f / @total
end
def update(symbol)
@counts[symbol] += 1
@total += 1
# Prevent overflow by scaling
if @total > 16384
@counts.map! { |c| (c + 1) / 2 }
@total = @counts.sum
end
end
endLZX Compression
Algorithm Overview
LZX is the most advanced algorithm, designed for maximum compression.
Features:
-
Large Window: Up to 2 MB (2^21 bytes)
-
Multiple Match Lengths: Optimized encoding for various lengths
-
Aligned Offsets: Special handling for aligned positions
-
Pretree Encoding: Huffman trees for code lengths
Window Structure
LZX Window (2 MB):
┌────────────────────────────────────────┐
│ │
│ Previously Decoded Data │
│ (Sliding Window) │
│ │
├────────────────────────────────────────┤
│ Recent Matches │ ← Recent position list
│ (8 most recent positions) │
└────────────────────────────────────────┘
Match Types:
1. Repeat Match: Use recent position
2. New Match: Find in entire window
3. Aligned Match: Match at 8-byte boundary (faster)Decoding Process
class LZXDecompressor
def decompress_block(input)
# Read block type
block_type = input.read_bits(3)
# Read block size
block_size = read_block_size(input, block_type)
# Build Huffman trees for this block
main_tree = build_main_tree(input)
length_tree = build_length_tree(input)
aligned_tree = build_aligned_tree(input) if aligned_block?
# Decode matches and literals
output = []
window = LZXWindow.new(@window_size)
recent_offsets = [1, 1, 1] # R0, R1, R2
while output.size < block_size
# Decode main element
main_element = decode_symbol(input, main_tree)
if main_element < 256
# Literal
output << main_element
window.add(main_element)
else
# Match
length, offset = decode_match(
main_element, input,
length_tree, aligned_tree,
recent_offsets
)
# Copy from window
match_data = window.copy(offset, length)
output.concat(match_data)
window.add_multiple(match_data)
# Update recent offsets
recent_offsets.unshift(offset)
recent_offsets.pop
end
end
output
end
endMatch Encoding
LZX uses sophisticated match encoding:
Match Length Encoding:
Length 2: code 0-248 (direct)
Length 3-257: code 249 + extra bits
Distance Encoding:
Uses position slots with extra bits
Position Slot Table:
Slot Distance Range Extra Bits
----------------------------------------
0 0 0
1 1 0
2 2-3 1
3 4-7 2
4 8-15 3
...
50 2097152-... 21Huffman Coding
Principle
Huffman coding assigns shorter codes to frequent symbols.
Example:
Text: "MISSISSIPPI"
Symbol frequencies:
I: 4
S: 4
P: 2
M: 1
Huffman codes:
I: 0 (1 bit)
S: 10 (2 bits)
P: 110 (3 bits)
M: 111 (3 bits)
Encoded: 111 0 10 10 0 10 10 0 110 110 0
= 1110101001010011001 (19 bits vs 88 bits uncompressed)See Huffman Coding for details.
Comparison
Speed vs Compression
Compression Ratio Test (100 MB text file):
Algorithm Compressed Ratio Compress Decompress
Size (MB) Time (s) Time (s)
----------------------------------------------------------
None 100.0 1.00x 0.0 0.0
LZSS 45.2 2.21x 2.1 0.8
MSZIP 32.1 3.12x 3.5 1.2
Quantum 29.8 3.36x 5.8 2.1
LZX 27.3 3.66x 12.4 1.5
----------------------------------------------------------