Compression Algorithms

Purpose

Understanding the compression algorithms used in Microsoft Cabinet files and related formats.

Overview

Cabriolet supports multiple compression algorithms, each with different characteristics:

Algorithm   Year  Type              Compression  Speed    Complexity
------------------------------------------------------------------------
LZSS        1982  Dictionary        Medium       Fast     Low
MSZIP       1993  DEFLATE variant   Good         Medium   Medium
Quantum     1993  Arithmetic        Good         Medium   High
LZX         1995  Advanced Dict     Excellent    Slow     Very High
------------------------------------------------------------------------

LZSS Compression

Algorithm Overview

LZSS (Lempel-Ziv-Storer-Szymanski) uses a sliding window to find repeated sequences.

Key Concepts:

  • Sliding Window: Fixed-size buffer of previously seen data

  • Look-ahead Buffer: Upcoming data to compress

  • Match Finding: Search window for longest match

  • Length-Distance Pairs: Encode matches as (length, distance)

How It Works

Input: "the quick brown fox jumps over the lazy dog"
Window size: 4096 bytes
Look-ahead: 18 bytes

Step 1: No match for "the" → output literals 't', 'h', 'e', ' '
Step 2: "quick brown fox jumps over " has no earlier matches → output literals
Step 3: Found "the" again, 31 bytes back → output (length=3, distance=31)

Compressed output:
  't' 'h' 'e' ' ' 'q' 'u' 'i' 'c' 'k' ' ' 'b' 'r' 'o' 'w' 'n'
  ' ' 'f' 'o' 'x' ' ' 'j' 'u' 'm' 'p' 's' ' ' 'o' 'v' 'e' 'r'
  ' ' (3,31) ' ' 'l' 'a' 'z' 'y' ' ' 'd' 'o' 'g'
                ↑
          Refers back to "the"
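
The encoder side of this process is the match search. Below is a minimal, illustrative sketch of a brute-force search over the window; the function name and the linear scan are assumptions made for clarity, and a real encoder would index the window (for example with hash chains) rather than scan it.

MIN_MATCH = 3   # matches shorter than this cost more than plain literals

# Find the longest match for data[pos..] within the preceding window.
# Returns [length, distance] or nil when no worthwhile match exists.
def find_longest_match(data, pos, window_size = 4096, look_ahead = 18)
  window_start = [pos - window_size, 0].max
  best_length = 0
  best_distance = 0

  (window_start...pos).each do |candidate|
    max_len = [look_ahead, data.size - pos].min
    length = 0
    length += 1 while length < max_len &&
                      data[candidate + length] == data[pos + length]

    if length > best_length
      best_length = length
      best_distance = pos - candidate
    end
  end

  best_length >= MIN_MATCH ? [best_length, best_distance] : nil
end

Calling this at position 31 of the example input finds the earlier occurrence 31 bytes back (and, because the following space also matches, a greedy encoder would actually report length 4 rather than 3).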

Implementation Details

class LZSSDecompressor
  WINDOW_SIZE = 4096
  LOOK_AHEAD_SIZE = 18

  def decompress(input)
    window = CircularBuffer.new(WINDOW_SIZE)
    output = []

    input.each_token do |token|
      if token.literal?
        # Direct byte
        byte = token.value
        output << byte
        window.add(byte)
      else
        # Match reference (length, distance)
        length = token.length
        distance = token.distance

        # Copy from window
        length.times do
          byte = window.get_at(distance)
          output << byte
          window.add(byte)
        end
      end
    end

    output.pack('C*')
  end
end

MSZIP (DEFLATE)

Algorithm Overview

MSZIP is Microsoft’s implementation of DEFLATE, combining LZ77 and Huffman coding.

Process:

  1. LZ77 Compression: Find repeated sequences

  2. Huffman Encoding: Compress symbols statistically

  3. Block Structure: Process data in blocks

Compression Pipeline

Input Data
    ↓
┌─────────────────┐
│  LZ77 Encoding  │  Find matches, output literals & length-distance pairs
└────────┬────────┘
         ↓
┌─────────────────┐
│ Build Huffman   │  Analyze symbol frequencies
│     Trees       │
└────────┬────────┘
         ↓
┌─────────────────┐
│ Huffman Encode  │  Encode using optimal codes
└────────┬────────┘
         ↓
Compressed Output

Block Format

Each MSZIP block contains:

Block Structure:
┌──────────────────────────────────────┐
│ Block Header (3 bits)                │
│   - Final block flag (1 bit)         │
│   - Block type (2 bits)              │
├──────────────────────────────────────┤
│ Compressed Length Tables (variable)  │
│   - Literal/length code lengths      │
│   - Distance code lengths            │
├──────────────────────────────────────┤
│ Compressed Data (variable)           │
│   - Literals and match codes         │
│   - Distance codes                   │
└──────────────────────────────────────┘
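
In the compressed data, symbols 0-255 are literal bytes, 256 marks end of block, and 257-285 are length codes whose base values and extra bits are defined by RFC 1951. A sketch of mapping a length code back to a match length (read_bits is the same hypothetical bit-reader interface used in the code below):

# Base match lengths and extra-bit counts for DEFLATE length codes 257-285
# (values from RFC 1951, section 3.2.5)
LENGTH_BASE  = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
                35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0]

def decode_length(code, input)
  index = code - 257
  LENGTH_BASE[index] + input.read_bits(LENGTH_EXTRA[index])
end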

Decompression Process

def decompress_mszip_block(input)
  # 1. Read block header
  final_block = input.read_bits(1)   # 1 = last block in the stream
  block_type = input.read_bits(2)    # 0 = stored, 1 = fixed Huffman, 2 = dynamic Huffman

  # 2. Build Huffman trees (dynamic-Huffman case shown; fixed blocks use
  #    the predefined code lengths from RFC 1951 instead)
  literal_tree = build_tree_from_lengths(read_code_lengths(input))
  distance_tree = build_tree_from_lengths(read_distance_lengths(input))

  # 3. Decode compressed data
  output = []
  window = CircularBuffer.new(32768)

  loop do
    code = decode_symbol(input, literal_tree)

    if code < 256
      # Literal byte
      output << code
      window.add(code)
    elsif code == 256
      # End of block
      break
    else
      # Length-distance pair
      length = decode_length(code, input)
      distance = decode_distance(input, distance_tree)

      # Copy from window
      length.times do
        byte = window.get_at(distance)
        output << byte
        window.add(byte)
      end
    end
  end

  output
end
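
The helpers build_tree_from_lengths and decode_symbol are referenced above but not defined. A possible sketch, using the canonical code assignment from RFC 1951 (codes of the same length are consecutive, ordered by symbol) and representing the "tree" as a simple lookup table; this is illustrative, not the exact Cabriolet implementation:

# Assign canonical Huffman codes given only the code length of each symbol
# (RFC 1951, section 3.2.2) and return a { [code, length] => symbol } table.
def build_tree_from_lengths(lengths)
  max_length = lengths.max
  length_counts = Array.new(max_length + 1, 0)
  lengths.each { |len| length_counts[len] += 1 if len > 0 }

  # First code value for each code length
  code = 0
  next_code = Array.new(max_length + 1, 0)
  (1..max_length).each do |len|
    code = (code + length_counts[len - 1]) << 1
    next_code[len] = code
  end

  # Assign codes to symbols in increasing symbol order
  table = {}
  lengths.each_with_index do |len, symbol|
    next if len == 0
    table[[next_code[len], len]] = symbol
    next_code[len] += 1
  end
  table
end

# Read bits one at a time until the accumulated code matches an entry.
def decode_symbol(input, table)
  code = 0
  length = 0
  loop do
    code = (code << 1) | input.read_bits(1)
    length += 1
    symbol = table[[code, length]]
    return symbol if symbol
  end
end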

Quantum Compression

Algorithm Overview

Quantum uses arithmetic coding combined with adaptive modeling.

Key Features:

  • Arithmetic Coding: Represents entire message as single number

  • Adaptive Model: Updates probabilities dynamically

  • Context Modeling: Uses previous bytes as context

How Arithmetic Coding Works

Traditional Huffman coding must spend a whole number of bits on every symbol. Arithmetic coding instead represents the entire message as a single fraction inside an ever-narrowing interval:

Example: Encoding "AAB"

Symbol probabilities:
A: 0.6 (60%)
B: 0.4 (40%)

Start with range [0.0, 1.0]

Encode 'A':
  New range: [0.0, 0.6]  (60% of current range)

Encode 'A':
  New range: [0.0, 0.36]  (60% of [0.0, 0.6])

Encode 'B':
  New range: [0.216, 0.36]  (top 40% of [0.0, 0.36])

Final output: Any number in [0.216, 0.36), e.g., 0.25
In binary: 0.01 (exactly two fractional bits)
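
A minimal sketch of the range narrowing above; it uses floating point purely for illustration, whereas real coders (including Quantum's) work with scaled integers and renormalization to avoid precision loss:

# Cumulative interval for each symbol: symbol => [low, high)
INTERVALS = { 'A' => [0.0, 0.6], 'B' => [0.6, 1.0] }

def arithmetic_encode(message)
  low, high = 0.0, 1.0
  message.each_char do |sym|
    width = high - low
    sym_low, sym_high = INTERVALS[sym]
    high = low + width * sym_high   # uses the old value of low
    low  = low + width * sym_low
  end
  [low, high]   # any number inside [low, high) identifies the message
end

arithmetic_encode("AAB")   # => [0.216, 0.36]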

Implementation

class QuantumDecompressor
  def decompress(input, length)
    model = AdaptiveModel.new
    output = []

    # Initialize arithmetic decoder range
    decoder = ArithmeticDecoder.new(input)

    length.times do
      # Get symbol based on current range
      symbol = decoder.decode_symbol(model)
      output << symbol

      # Update model with new symbol
      model.update(symbol)
    end

    output.pack('C*')
  end
end

class AdaptiveModel
  def initialize
    @counts = Array.new(256, 1)  # Start with count of 1
    @total = 256
  end

  def probability(symbol)
    @counts[symbol].to_f / @total
  end

  def update(symbol)
    @counts[symbol] += 1
    @total += 1

    # Prevent overflow by scaling
    if @total > 16384
      @counts.map! { |c| (c + 1) / 2 }
      @total = @counts.sum
    end
  end
end
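
The decode_symbol call above is where the model's counts are used: the decoder scales the current coded value into [0, total) and finds the symbol whose cumulative-frequency interval contains it. A simplified sketch of that lookup (the arithmetic decoder's range maintenance and renormalization are omitted):

class AdaptiveModel
  # Locate the symbol whose cumulative interval [cum, cum + count)
  # contains 'target', where 0 <= target < @total.
  def symbol_for(target)
    cumulative = 0
    @counts.each_with_index do |count, symbol|
      return symbol if target < cumulative + count
      cumulative += count
    end
  end
end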

LZX Compression

Algorithm Overview

LZX is the most advanced algorithm, designed for maximum compression.

Features:

  • Large Window: Up to 2 MB (2^21 bytes)

  • Multiple Match Lengths: Optimized encoding for various lengths

  • Aligned Offsets: Separate Huffman tree for the low bits of large offsets

  • Pretree Encoding: Huffman trees for code lengths

Window Structure

LZX Window (2 MB):
┌────────────────────────────────────────┐
│                                        │
│         Previously Decoded Data        │
│              (Sliding Window)          │
│                                        │
├────────────────────────────────────────┤
│         Recent Match Offsets           │ ← Repeated offset list
│      (3 most recent: R0, R1, R2)       │
└────────────────────────────────────────┘

Match Types:
1. Repeat Match: Reuse one of the 3 most recent offsets (R0, R1, R2)
2. New Match: Offset anywhere in the window, encoded via a position slot
3. Aligned Offset: In aligned blocks, the low 3 bits of large offsets use a separate Huffman tree

Decoding Process

class LZXDecompressor
  def decompress_block(input)
    # Read block type
    block_type = input.read_bits(3)

    # Read block size
    block_size = read_block_size(input, block_type)

    # Build Huffman trees for this block
    main_tree = build_main_tree(input)
    length_tree = build_length_tree(input)
    aligned_tree = build_aligned_tree(input) if aligned_block?

    # Decode matches and literals
    output = []
    window = LZXWindow.new(@window_size)
    recent_offsets = [1, 1, 1]  # R0, R1, R2

    while output.size < block_size
      # Decode main element
      main_element = decode_symbol(input, main_tree)

      if main_element < 256
        # Literal
        output << main_element
        window.add(main_element)
      else
        # Match
        length, offset = decode_match(
          main_element, input,
          length_tree, aligned_tree,
          recent_offsets
        )

        # Copy from window
        match_data = window.copy(offset, length)
        output.concat(match_data)
        window.add_multiple(match_data)

        # Update recent offsets
        recent_offsets.unshift(offset)
        recent_offsets.pop
      end
    end

    output
  end
end

Match Encoding

LZX uses sophisticated match encoding:

Match Length Encoding:
  Lengths 2-8:   encoded directly in the main tree element
  Lengths 9-257: main element signals a long match, then a code
                 from the length tree (0-248) supplies length - 9

Distance Encoding:
  Uses position slots with extra bits

  Position Slot Table (first slots):
  Slot    Formatted Offset   Extra Bits
  ----------------------------------------
  0       0                  0
  1       1                  0
  2       2                  0
  3       3                  0
  4       4-5                1
  5       6-7                1
  6       8-11               2
  7       12-15              2
  ...
  (formatted offsets 0-2 select the recent offsets R0-R2; the largest
  slots needed for a 2 MB window use up to 17 extra bits)
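
A sketch of how an offset might be recovered from a position slot, assuming the slot tables above are available as POSITION_BASE and EXTRA_BITS; the function is hypothetical, and the aligned-offset refinement used in aligned blocks is omitted:

# First entries of the (assumed) slot tables shown above
POSITION_BASE = [0, 1, 2, 3, 4, 6, 8, 12, 16, 24, 32]   # ...
EXTRA_BITS    = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4]       # ...

def decode_offset(position_slot, input, recent_offsets)
  if position_slot < 3
    # Repeat match: formatted offsets 0-2 reuse a recent offset
    recent_offsets[position_slot]
  else
    # New match: slot base plus verbatim extra bits; formatted offsets
    # are real offsets + 2 (values 0-2 stand for R0-R2), so subtract 2
    formatted = POSITION_BASE[position_slot] +
                input.read_bits(EXTRA_BITS[position_slot])
    formatted - 2
  end
end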

Huffman Coding

Principle

Huffman coding assigns shorter codes to frequent symbols.

Example:

Text: "MISSISSIPPI"

Symbol frequencies:
  I: 4
  S: 4
  P: 2
  M: 1

Huffman codes:
  I: 0     (1 bit)
  S: 10    (2 bits)
  P: 110   (3 bits)
  M: 111   (3 bits)

Encoded: 111 0 10 10 0 10 10 0 110 110 0
         = 111010100101001101100 (21 bits vs 88 bits uncompressed)
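
A minimal sketch of building such a code by repeatedly merging the two least-frequent nodes (a simple sorted-array version of the Huffman construction; real implementations use a priority queue):

# Build a Huffman code (symbol => bit string) from symbol frequencies.
def huffman_codes(frequencies)
  # Each node: [weight, { symbol => code-so-far }]
  nodes = frequencies.map { |sym, freq| [freq, { sym => "" }] }

  while nodes.size > 1
    nodes.sort_by! { |weight, _| weight }
    (w1, left), (w2, right) = nodes.shift(2)

    # Prefix the codes of the two merged subtrees with 0 and 1
    left.transform_values!  { |code| "0" + code }
    right.transform_values! { |code| "1" + code }

    nodes << [w1 + w2, left.merge(right)]
  end

  nodes.first.last
end

huffman_codes('I' => 4, 'S' => 4, 'P' => 2, 'M' => 1)
# The exact bit patterns depend on how ties are broken, but any result
# encodes "MISSISSIPPI" in the optimal 21 bits.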

See Huffman Coding for details.

Comparison

Speed vs Compression

Compression Ratio Test (100 MB text file):

Algorithm   Compressed   Ratio   Compress   Decompress
            Size (MB)            Time (s)   Time (s)
----------------------------------------------------------
None        100.0        1.00x   0.0        0.0
LZSS        45.2         2.21x   2.1        0.8
MSZIP       32.1         3.12x   3.5        1.2
Quantum     29.8         3.36x   5.8        2.1
LZX         27.3         3.66x   12.4       1.5
----------------------------------------------------------

Use case recommendations

  • NONE: Already compressed data (video, images)

  • LZSS: Fast compression needed, moderate ratio acceptable

  • MSZIP: General purpose, good balance

  • Quantum: Text-heavy data, good compression needed

  • LZX: Maximum compression, time not critical

Bibliography

  • "The Data Compression Book" by Mark Nelson

  • RFC 1951 - DEFLATE Specification

  • Microsoft Cabinet SDK Documentation

  • "Introduction to Data Compression" by Khalid Sayood