# Streaming large files

## Purpose
This document explains how to use Cabriolet’s streaming API to process large archives (multi-GB) with minimal memory consumption.
Streaming is designed for:

- Processing archives larger than available RAM
- Memory-constrained environments
- Continuous processing workflows
- Reducing peak memory usage
After reading this guide, you will understand how to use streaming for memory-efficient archive processing.
Use this guide when working with large archives or optimizing memory usage.
## Concepts

### Basic Streaming

#### Stream Files One at a Time
```ruby
require 'cabriolet/streaming'

# Create streaming parser
parser = Cabriolet::Streaming::StreamParser.new('large_archive.cab')

# Process files sequentially with minimal memory
parser.each_file do |file|
  puts "Processing: #{file.name} (#{file.size} bytes)"

  # File data is loaded only when accessed
  data = file.data

  # Process data
  process_data(data)

  # Memory released after block completes
end
puts "Completed with minimal memory usage"Memory usage: ~20-50MB regardless of archive size.
#### Stream with Custom Chunk Size
```ruby
# Use smaller chunks for very constrained environments
parser = Cabriolet::Streaming::StreamParser.new(
  'huge.cab',
  chunk_size: 32768 # 32KB chunks instead of default 64KB
)

parser.each_file do |file|
  File.open("output/#{file.name}", 'wb') do |out|
    # Stream file data in chunks
    parser.stream_file_data(file) do |chunk|
      out.write(chunk)
      # Only 32KB in memory at a time
    end
  end
end
```
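Chunk size is a memory-versus-overhead trade-off: smaller chunks lower peak memory but mean more read calls. If you are unsure what value to use, timing a few candidate sizes against a representative archive is a quick way to decide. Below is a rough sketch using Ruby's standard `benchmark` library and the `StreamParser` API shown above; the candidate sizes and the `huge.cab` name are placeholders.

```ruby
require 'benchmark'
require 'cabriolet/streaming'

archive = 'huge.cab' # placeholder: use an archive representative of your workload

[16_384, 32_768, 65_536, 131_072].each do |size|
  elapsed = Benchmark.realtime do
    parser = Cabriolet::Streaming::StreamParser.new(archive, chunk_size: size)
    parser.each_file do |file|
      # Drain every file in chunks without retaining any data
      parser.stream_file_data(file) { |_chunk| }
    end
  end
  puts format('chunk_size=%-7d %.2fs', size, elapsed)
end
```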
### Advanced Streaming

#### Streaming Extraction
Extract an entire archive using streaming:
```ruby
parser = Cabriolet::Streaming::StreamParser.new('archive.cab')

# Stream-based extraction
stats = parser.extract_streaming(
  'output/',
  preserve_paths: true,
  overwrite: false
)

puts "Extracted: #{stats[:extracted]}"
puts "Bytes processed: #{stats[:bytes]}"
puts "Failed: #{stats[:failed]}"Batch Processing with Streaming
#### Batch Processing with Streaming

Process multiple archives efficiently:
```ruby
require 'fileutils'

processor = Cabriolet::Streaming::BatchProcessor.new(chunk_size: 65536)

# Process all archives
stats = processor.process_archives(Dir.glob('*.cab')) do |file, archive_path|
  # Process each file from each archive
  output_dir = "output/#{File.basename(archive_path, '.*')}"
  output_path = File.join(output_dir, file.name)
  FileUtils.mkdir_p(File.dirname(output_path))
  File.write(output_path, file.data, mode: 'wb')
end

puts "Processed: #{stats[:processed]} files"
puts "Total bytes: #{stats[:bytes]}"
puts "Failed: #{stats[:failed]}"Custom Processing Pipeline
#### Custom Processing Pipeline

Build streaming pipelines:
```ruby
parser = Cabriolet::Streaming::StreamParser.new('archive.cab')

# Stream to different destinations based on file type
parser.each_file do |file|
  case File.extname(file.name)
  when '.txt', '.log'
    # Text files: stream to database
    stream_to_database(file)
  when '.jpg', '.png'
    # Images: stream to S3
    stream_to_s3(file)
  when '.xml', '.json'
    # Data files: parse and process
    parser.stream_file_data(file) do |chunk|
      parse_and_process(chunk)
    end
  end
end
```
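`stream_to_database`, `stream_to_s3`, and `parse_and_process` are user-supplied helpers, not part of Cabriolet. As one illustration, an S3 helper could forward chunks from `stream_file_data` straight into `upload_stream` from the `aws-sdk-s3` gem; the region and bucket below are placeholders, and this variant takes the parser as an extra argument so it can read in chunks.

```ruby
require 'aws-sdk-s3'

S3 = Aws::S3::Resource.new(region: 'us-east-1') # placeholder region

# Hypothetical helper: copy one archived file to S3 without buffering it in memory
def stream_to_s3(parser, file, bucket: 'my-archive-bucket') # placeholder bucket
  S3.bucket(bucket).object(file.name).upload_stream do |write_stream|
    parser.stream_file_data(file) do |chunk|
      write_stream << chunk
    end
  end
end
```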
## Performance Comparison

| Method    | Archive Size | Memory Usage | Time  |
|-----------|--------------|--------------|-------|
| Standard  | 1 GB         | ~1.2 GB      | 12.3s |
| Streaming | 1 GB         | ~45 MB       | 13.1s |
| Standard  | 5 GB         | ~5.3 GB      | 61.2s |
| Streaming | 5 GB         | ~48 MB       | 64.8s |
**Key insight:** Streaming uses 95%+ less memory at the cost of only about 6% more time.
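To reproduce the streaming column on your own hardware, you can wrap `extract_streaming` in `Benchmark.realtime` and pair it with an RSS probe like the one sketched earlier; the standard column comes from whichever non-streaming extraction path you normally use. A rough sketch (the RSS delta is only an approximation of peak usage):

```ruby
require 'benchmark'
require 'cabriolet/streaming'

# Linux-only probe, same idea as the earlier sketch
def rss_mb
  File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i / 1024
end

parser = Cabriolet::Streaming::StreamParser.new('archive.cab')

before = rss_mb
elapsed = Benchmark.realtime do
  parser.extract_streaming('output/', preserve_paths: true, overwrite: false)
end

puts format('streaming: %.1fs, RSS grew by ~%d MB', elapsed, rss_mb - before)
```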
## Memory Usage Patterns
## Best practices

- Use streaming for archives >500MB
- Adjust chunk size based on available memory
- Process files as they stream rather than collecting them
- Release references to processed data explicitly
- Consider parallel streaming for even better performance (see the sketch after this list)
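Parallel streaming can be as simple as giving each worker thread its own `StreamParser`. The sketch below assumes parser instances are independent of one another and that your per-file processing (`process_data` here, as in the first example) is thread-safe; the thread count is arbitrary.

```ruby
require 'cabriolet/streaming'

queue = Queue.new
Dir.glob('*.cab').each { |path| queue << path }

workers = 4.times.map do # 4 is arbitrary; tune to your I/O and CPU budget
  Thread.new do
    loop do
      path = begin
        queue.pop(true) # non-blocking pop; raises ThreadError when the queue is drained
      rescue ThreadError
        break
      end

      # One parser per archive, owned entirely by this thread
      parser = Cabriolet::Streaming::StreamParser.new(path)
      parser.each_file do |file|
        process_data(file.data) # your own, thread-safe processing
      end
    end
  end
end

workers.each(&:join)
```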