Format Auto-Detection
Purpose
This document explains Cabriolet’s automatic format detection capability, which allows you to work with archives without knowing their specific format in advance.
Format auto-detection is designed for users who:
- Work with archives of unknown or mixed formats
- Process files from diverse sources
- Need simplified APIs without format-specific code
- Build tools that handle multiple archive types
After reading this guide, you will understand how format detection works and when to use it versus format-specific APIs.
Use this guide when building applications that need to handle various archive formats dynamically.
Concepts
Magic Bytes
Magic bytes are signature sequences at the beginning of files that identify their format. Each archive format has a unique magic byte signature:
- CAB: MSCF (0x4D534346)
- CHM: ITSF (0x49545346)
- HLP: ?_ (0x3F5F) or LN (0x4C4E)
- KWAJ: KWAJ (0x4B57414A)
- SZDD: SZDD (0x535A4444)
- LIT: ITOLITLS
- OAB: Format-specific header
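
The signature table above can be turned into a small lookup. The following is a minimal sketch of the idea, not Cabriolet's actual detector: `MAGIC_SIGNATURES`, `HEADER_LENGTH`, and `sniff_format` are hypothetical names, and OAB is left out because the list gives no fixed signature for it.

```ruby
require 'stringio'

# Hypothetical table built from the signatures listed above.
# This is an illustration only, not Cabriolet's internal data structure.
MAGIC_SIGNATURES = {
  cab:  ["MSCF"],
  chm:  ["ITSF"],
  hlp:  ["?_", "LN"],
  kwaj: ["KWAJ"],
  szdd: ["SZDD"],
  lit:  ["ITOLITLS"]
}.freeze

# The longest signature decides how many bytes must be read (8 here).
HEADER_LENGTH = MAGIC_SIGNATURES.values.flatten.map(&:bytesize).max

# Reads the start of an IO and returns a format symbol,
# or nil when no known signature matches.
def sniff_format(io)
  header = io.read(HEADER_LENGTH).to_s.b
  MAGIC_SIGNATURES.each do |format, signatures|
    return format if signatures.any? { |sig| header.start_with?(sig.b) }
  end
  nil
end

# Usage with an in-memory stream; a real file would be opened with
# File.open(path, 'rb') { |f| sniff_format(f) }
sniff_format(StringIO.new("MSCF\x00\x00\x00\x00".b)) # => :cab
```

Two-byte signatures such as those for HLP are short enough to match unrelated files by accident, so a real detector would likely validate more of the header before committing to a format.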
Basic Auto-Detection
Auto-Open an Archive
The simplest way to use auto-detection:
require 'cabriolet'
# Open archive with automatic format detection
archive = Cabriolet.open('unknown.archive')
# Use archive normally
puts "Format: #{archive.class}"
puts "Files: #{archive.files.count}"
archive.files.each do |file|
  puts " #{file.name}: #{file.size} bytes"
end

This works with any supported format. The returned object is the appropriate archive type (Cabinet, CHMFile, etc.).
Auto-Extract
Extract without knowing the format:
require 'cabriolet'
# Extract with auto-detection
stats = Cabriolet.extract('archive.file', 'output/')
puts "Extracted #{stats[:extracted]} files"
puts "Failed: #{stats[:failed]}"
puts "Total bytes: #{stats[:bytes]}"

Expected output:
Extracted 145 files
Failed: 0
Total bytes: 52428800

Advanced Auto-Detection
With Options
Pass options to the underlying parser:
# Auto-detect with salvage mode enabled
archive = Cabriolet.open('possibly_corrupted.cab',
  salvage_mode: true,
  skip_checksum: true
)

Get Detailed Information
Retrieve comprehensive archive information:
info = Cabriolet.info('archive.cab')
# Full information hash
puts info[:format] # => :cab
puts info[:file_count] # => 145
puts info[:total_size] # => 52428800
puts info[:compressed_size] # => 23592960
puts info[:compression_ratio] # => 45.0
# Individual file information
info[:files].each do |file_info|
  puts "#{file_info[:name]}: #{file_info[:size]} bytes"
  puts " Attributes: 0x#{file_info[:attributes].to_s(16)}"
  puts " Date: #{file_info[:date]}"
end

Parallel Extraction with Auto-Detection
Combine auto-detection with parallel processing:
# Auto-detect and extract with 8 workers
stats = Cabriolet.extract(
  'large_archive.cab',
  'output/',
  parallel: true,
  workers: 8
)
puts "Extracted #{stats[:extracted]} files using parallel processing"

This automatically:
1. Detects the archive format
2. Selects the appropriate parser
3. Extracts using parallel workers
4. Returns statistics
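
To make the flow concrete, here is a rough sketch of steps 2 to 4, taking an already detected format as input. It assumes a thread-based worker pool; how Cabriolet actually schedules its workers is not described here. `HANDLERS` and `extract_all` are hypothetical names, and each handler is a stand-in that only reports how many bytes it would have written.

```ruby
# Hypothetical per-format handlers. Each takes one entry and returns the
# number of bytes it "extracted". Real parsers would write files instead.
HANDLERS = {
  cab: ->(entry) { entry[:data].bytesize },
  chm: ->(entry) { entry[:data].bytesize }
}.freeze

# Select a handler for the detected format, fan the entries out to a
# fixed number of worker threads, and collect statistics.
def extract_all(format, entries, workers: 4)
  handler = HANDLERS.fetch(format) do
    raise ArgumentError, "unsupported format: #{format}"
  end

  queue = Queue.new
  entries.each { |entry| queue << entry }
  queue.close # pop returns nil once the closed queue is drained

  stats = { extracted: 0, failed: 0, bytes: 0 }
  lock = Mutex.new

  threads = Array.new(workers) do
    Thread.new do
      while (entry = queue.pop)
        begin
          written = handler.call(entry)
          lock.synchronize do
            stats[:extracted] += 1
            stats[:bytes] += written
          end
        rescue StandardError
          # A failing entry is counted but does not stop the other workers.
          lock.synchronize { stats[:failed] += 1 }
        end
      end
    end
  end

  threads.each(&:join)
  stats
end
```

The statistics hash mirrors the `:extracted`, `:failed`, and `:bytes` keys used in the examples above. In CRuby, threads mainly help with I/O-bound work, so the benefit for CPU-heavy decompression depends on how the workers are implemented.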
Error Handling
When to Use Auto-Detection
Performance Considerations
Detection Overhead
Format detection adds minimal overhead:
- Magic byte check: <1ms
- File open/close: ~1-5ms
- Total overhead: Typically <10ms
For batch operations on thousands of files, this is negligible compared to actual parsing time.
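
These figures depend on hardware and filesystem caching, so it can be worth measuring on your own system. The sketch below times only the cost of opening a file and reading its first bytes, using Ruby's standard `Benchmark` module; `measure_header_read` is a hypothetical helper and does not call Cabriolet.

```ruby
require 'benchmark'
require 'tempfile'

# Hypothetical helper: average time, in milliseconds, to open a file
# and read its first few bytes.
def measure_header_read(path, bytes: 8, runs: 100)
  total = Benchmark.realtime do
    runs.times { File.open(path, 'rb') { |f| f.read(bytes) } }
  end
  (total / runs) * 1000.0
end

# Usage: time header reads against a small temporary file.
Tempfile.create('sample') do |f|
  f.binmode
  f.write("MSCF" + "\x00" * 60)
  f.flush
  puts format('%.3f ms per header read', measure_header_read(f.path))
end
```

Repeated reads of the same file are served from the operating system's cache, so this measures a best case; the first read of a cold file will usually be slower.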
Optimization Tips
- Cache format detection:
  # Detect once, use multiple times
  format = Cabriolet.detect_format('archive.cab')
  if format == :cab
    # Use format-specific API for optimal performance
    cab = Cabriolet::CAB::Parser.new.parse('archive.cab')
  end
- Batch detection:
  # Detect all formats first
  files_by_format = Dir.glob('*.{cab,chm,hlp}').group_by do |file|
    Cabriolet.detect_format(file)
  end
  # Process by format
  files_by_format.each do |format, files|
    parser_class = Cabriolet::FormatDetector.format_to_parser(format)
    files.each { |f| parser_class.new.parse(f) }
  end