Streaming large files

Purpose

This document explains how to use Cabriolet’s streaming API to process large archives (multi-GB) with minimal memory consumption.

Streaming is designed for:

  • Processing archives larger than available RAM

  • Memory-constrained environments

  • Continuous processing workflows

  • Reducing peak memory usage

After reading this guide, you will understand how to use streaming for memory-efficient archive processing.

Use this guide when working with large archives or optimizing memory usage.

Concepts

Streaming vs. Standard Processing

Standard Processing: Loads entire archive structure into memory, then processes files.

Streaming Processing: Loads files one at a time, releases memory after each file.

Lazy Evaluation

Files are loaded only when accessed (file.data), not when iterating the file list.
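
For example, iterating the file list touches only archive metadata. A minimal sketch, using the StreamParser API introduced below:

require 'cabriolet/streaming'

parser = Cabriolet::Streaming::StreamParser.new('archive.cab')

# Iteration reads only metadata; no file contents are loaded yet
parser.each_file do |file|
  puts "#{file.name}: #{file.size} bytes"  # metadata only
  # file.data here would trigger the actual (lazy) read
end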

Chunked Data Transfer

Large file contents are streamed in configurable chunks (default 64KB) rather than loaded entirely.

Basic Streaming

Stream Files One at a Time

require 'cabriolet/streaming'

# Create streaming parser
parser = Cabriolet::Streaming::StreamParser.new('large_archive.cab')

# Process files sequentially with minimal memory
parser.each_file do |file|
  puts "Processing: #{file.name} (#{file.size} bytes)"

  # File data is loaded only when accessed
  data = file.data

  # Process data
  process_data(data)

  # Memory released after block completes
end

puts "Completed with minimal memory usage"

Memory usage: ~20-50 MB, regardless of archive size.
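To verify this on your own archives, sample the process's resident set size while streaming. A minimal sketch; the rss_mb helper is our own and shells out to ps, which works on Linux and macOS:

def rss_mb
  # Resident set size of this process in MB, via ps (Linux/macOS)
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

parser = Cabriolet::Streaming::StreamParser.new('large_archive.cab')
peak = 0

parser.each_file do |file|
  process_data(file.data)
  peak = [peak, rss_mb].max
end

puts "Peak RSS: #{peak} MB"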

Stream with Custom Chunk Size

require 'fileutils'

# Use smaller chunks for very constrained environments
parser = Cabriolet::Streaming::StreamParser.new(
  'huge.cab',
  chunk_size: 32768  # 32KB chunks instead of the default 64KB
)

parser.each_file do |file|
  output_path = File.join('output', file.name)
  FileUtils.mkdir_p(File.dirname(output_path))  # ensure the output directory exists

  File.open(output_path, 'wb') do |out|
    # Stream file data in chunks
    parser.stream_file_data(file) do |chunk|
      out.write(chunk)
      # Only 32KB in memory at a time
    end
  end
end
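
Smaller chunks lower peak memory but cost more read calls and block invocations; larger chunks do the opposite. A rough sizing heuristic (the budget and divisor here are assumptions, not library defaults):

budget = 16 * 1024 * 1024                    # e.g. a 16MB memory budget (assumption)
chunk  = (budget / 64).clamp(4096, 1 << 20)  # keep chunks between 4KB and 1MB

parser = Cabriolet::Streaming::StreamParser.new('huge.cab', chunk_size: chunk)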

Advanced Streaming

Streaming Extraction

Extract entire archive using streaming:

parser = Cabriolet::Streaming::StreamParser.new('archive.cab')

# Stream-based extraction
stats = parser.extract_streaming(
  'output/',
  preserve_paths: true,
  overwrite: false
)

puts "Extracted: #{stats[:extracted]}"
puts "Bytes processed: #{stats[:bytes]}"
puts "Failed: #{stats[:failed]}"

Batch Processing with Streaming

Process multiple archives efficiently:

require 'fileutils'

processor = Cabriolet::Streaming::BatchProcessor.new(chunk_size: 65536)

# Process all archives
stats = processor.process_archives(Dir.glob('*.cab')) do |file, archive_path|
  # Process each file from each archive
  output_dir = "output/#{File.basename(archive_path, '.*')}"
  output_path = File.join(output_dir, file.name)

  FileUtils.mkdir_p(File.dirname(output_path))
  File.binwrite(output_path, file.data)
end

puts "Processed: #{stats[:processed]} files"
puts "Total bytes: #{stats[:bytes]}"
puts "Failed: #{stats[:failed]}"

Custom Processing Pipeline

Build streaming pipelines:

parser = Cabriolet::Streaming::StreamParser.new('archive.cab')

# Stream to different destinations based on file type
parser.each_file do |file|
  case File.extname(file.name)
  when '.txt', '.log'
    # Text files: stream to database
    stream_to_database(file)
  when '.jpg', '.png'
    # Images: stream to S3
    stream_to_s3(file)
  when '.xml', '.json'
    # Data files: feed chunks to an incremental parser
    # (a chunk is an arbitrary byte range, not a complete document)
    parser.stream_file_data(file) do |chunk|
      parse_and_process(chunk)
    end
  end
end
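
The sinks here (stream_to_database, stream_to_s3, parse_and_process) are application-specific and left undefined above. As one hedged sketch, stream_to_s3 could be built on aws-sdk-s3's upload_stream, which multipart-uploads as you write so only one chunk is buffered at a time. The extra parser argument and the bucket name are assumptions; the pipeline above would then call stream_to_s3(parser, file):

require 'aws-sdk-s3'

# Hypothetical sink: stream a file's chunks straight into S3
def stream_to_s3(parser, file, bucket_name = 'my-archive-bucket')
  object = Aws::S3::Resource.new.bucket(bucket_name).object(file.name)

  # upload_stream performs a multipart upload as data is written
  object.upload_stream do |write_stream|
    parser.stream_file_data(file) { |chunk| write_stream << chunk }
  end
end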

Performance Comparison

Method      Archive Size   Memory Usage   Time
---------   ------------   ------------   -----
Standard    1 GB           ~1.2 GB        12.3s
Streaming   1 GB           ~45 MB         13.1s
Standard    5 GB           ~5.3 GB        61.2s
Streaming   5 GB           ~48 MB         64.8s

Key Insight: Streaming uses 95%+ less memory at the cost of roughly 6% more wall-clock time.

Memory Usage Patterns

Standard Processing

Memory
  ↑
5GB |           ╱‾‾‾‾‾‾‾‾‾╲
    |          ╱           ╲
    |         ╱             ╲
    |        ╱               ╲___
50MB|_______╱
    |________________________________→ Time
        Parse   Process   Complete

Streaming Processing

Memory
  ↑
50MB|‾‾‾╱╲‾╱╲‾╱╲‾╱╲‾╱╲‾╱╲‾╱╲‾‾
    |   ╱  ╲ ╱  ╲ ╱  ╲ ╱  ╲ ╱  ╲
    |  ╱    ╲    ╲    ╲    ╲    ╲
    | ╱      ╲    ╲    ╲    ╲    ╲
20MB|╱________╲____╲____╲____╲____╲_
    |________________________________→ Time
       File1 File2 File3 ...  FileN

Best practices

  1. Use streaming for archives >500MB

  2. Adjust chunk size based on available memory

  3. Process files as they stream rather than collecting them

  4. Release references to processed data explicitly

  5. Consider parallel streaming for higher throughput (see the sketch after this list)
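
On point 5, a minimal sketch of parallel streaming with a small thread pool. The thread count is an assumption; Ruby threads overlap I/O well, but CPU-bound decompression is serialized by the GVL, so separate processes may suit CPU-heavy workloads better:

require 'cabriolet/streaming'

queue = Queue.new
Dir.glob('*.cab').each { |path| queue << path }

workers = Array.new(4) do  # 4 worker threads; tune to your machine
  Thread.new do
    loop do
      path = begin
        queue.pop(true)    # non-blocking pop; raises ThreadError when drained
      rescue ThreadError
        break
      end

      parser = Cabriolet::Streaming::StreamParser.new(path)
      parser.each_file { |file| process_data(file.data) }
      # file data goes out of scope each iteration, keeping memory flat
    end
  end
end

workers.each(&:join)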

Limitations

  • Slightly slower than standard processing (~6% overhead in the benchmarks above)

  • Cannot random-access files (sequential only)

  • Some metadata requires full parsing
