ATON Compression Guide
Overview
ATON V2 provides multiple compression strategies to optimize token usage for different use cases.
Compression Modes
FAST
No dictionary compression. Fastest encoding speed.
use Aton\Encoder;
use Aton\Enums\CompressionMode;
$encoder = new Encoder(compression: CompressionMode::FAST);
Best for:
- Small datasets (< 1KB)
- Real-time encoding requirements
- When speed is more important than size
Output example: @schema[id:int, category:str, status:str]
products(3):
1, "Electronics", "In Stock"
2, "Electronics", "In Stock"
3, "Electronics", "Out of Stock"
BALANCED (Default)
Dictionary compression for strings ≥5 characters appearing ≥3 times.
$encoder = new Encoder(compression: CompressionMode::BALANCED);
Best for:
- General purpose use
- Medium datasets (1KB - 100KB)
- Good balance of speed and compression
Output example: @dict[#0:"Electronics", #1:"In Stock"]
@schema[id:int, category:str, status:str]
products(3):
1, #0, #1
2, #0, #1
3, #0, "Out of Stock"
ULTRA
Aggressive dictionary compression for strings ≥3 characters appearing ≥2 times.
$encoder = new Encoder(compression: CompressionMode::ULTRA);
Best for:
- Large datasets (> 100KB)
- Bandwidth-constrained scenarios
- Maximum token savings
Output example: @dict[#0:"Electronics", #1:"In Stock", #2:"Out of Stock"]
@schema[id:int, category:str, status:str]
products(3):
1, #0, #1
2, #0, #1
3, #0, #2
ADAPTIVE
Automatically selects mode based on data size:
- < 1KB: FAST
- 1KB - 10KB: BALANCED
- > 10KB: ULTRA
$encoder = new Encoder(compression: CompressionMode::ADAPTIVE);
Best for:
- Variable dataset sizes
- When you don't know the data size in advance
- Automated pipelines
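The size thresholds above can be sketched directly. This is an illustration of the selection logic in Python (the logic is language-agnostic); the 1KB and 10KB cutoffs come from this guide, while the function name and the use of the JSON-serialized byte size as the measure are assumptions.

```python
import json

def select_mode(data) -> str:
    """Pick a compression mode from the serialized payload size."""
    size = len(json.dumps(data).encode("utf-8"))
    if size < 1_024:        # < 1KB
        return "FAST"
    if size <= 10_240:      # 1KB - 10KB
        return "BALANCED"
    return "ULTRA"          # > 10KB

small = {"id": 1}
large = {"rows": [{"id": i, "name": f"user-{i}"} for i in range(500)]}
print(select_mode(small))  # FAST
print(select_mode(large))  # ULTRA
```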
Dictionary Compression
How It Works
1. String Extraction: All strings in the data are collected.
2. Frequency Analysis: Occurrences of each string are counted.
3. Reference Creation: Strings meeting the thresholds get short references (#0, #1, etc.).
4. Replacement: Original strings are replaced with their references.
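The four steps above can be sketched in a few lines. This is a minimal Python illustration using the BALANCED thresholds (length ≥5, occurrences ≥3); the function names and reference-assignment order are assumptions, not the library's internals.

```python
from collections import Counter

def build_dict(strings, min_len=5, min_count=3):
    """Steps 1-3: extract, count, and assign #N references."""
    counts = Counter(s for s in strings if len(s) >= min_len)
    refs = {}
    for s, n in counts.items():  # first-seen order
        if n >= min_count:
            refs[s] = f"#{len(refs)}"
    return refs

def compress(strings, refs):
    """Step 4: replace originals with their references."""
    return [refs.get(s, s) for s in strings]

values = ["Electronics", "Electronics", "Electronics",
          "In Stock", "In Stock", "Out of Stock"]
refs = build_dict(values)
print(refs)  # {'Electronics': '#0'}
print(compress(values, refs))
```

Here "In Stock" appears only twice, so under the BALANCED occurrence threshold it is left uncompressed.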
Thresholds by Mode
| Mode | Min Length | Min Occurrences |
|------|------------|-----------------|
| FAST | - | - (no compression) |
| BALANCED | 5 chars | 3 times |
| ULTRA | 3 chars | 2 times |
Example
Input data: $data = [
'logs' => [
['level' => 'INFO', 'message' => 'Application started'],
['level' => 'INFO', 'message' => 'User logged in'],
['level' => 'INFO', 'message' => 'Request processed'],
['level' => 'ERROR', 'message' => 'Connection failed'],
]
];
BALANCED output: @dict[#0:"INFO"]
@schema[level:str, message:str]
logs(4):
#0, "Application started"
#0, "User logged in"
#0, "Request processed"
"ERROR", "Connection failed"
"INFO" appears 3 times and has 4 characters, so it's compressed.
"ERROR" appears only once, not compressed.
Default Values Optimization
When optimize: true (default), the encoder detects common values and sets them as defaults.
How It Works
1. Sample Analysis: The first 100 records are analyzed.
2. Frequency Detection: Values appearing in >60% of records become defaults.
3. Default Omission: Records with default values skip those fields.
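The detection steps above can be sketched as follows. A Python illustration: the 100-record sample and the >60% threshold come from this guide, while the function name and per-field counting details are assumptions.

```python
from collections import Counter

def detect_defaults(records, sample=100, threshold=0.6):
    """Promote a value to a field default if it covers >60% of the sample."""
    sampled = records[:sample]
    defaults = {}
    fields = sampled[0].keys() if sampled else []
    for field in fields:
        value, n = Counter(r.get(field) for r in sampled).most_common(1)[0]
        if n / len(sampled) > threshold:
            defaults[field] = value
    return defaults

users = [
    {"id": 1, "name": "Alice", "status": "active"},
    {"id": 2, "name": "Bob", "status": "active"},
    {"id": 3, "name": "Carol", "status": "active"},
    {"id": 4, "name": "Dave", "status": "inactive"},
]
print(detect_defaults(users))  # {'status': 'active'}
```

With 3 of 4 records sharing `status: "active"` (75% > 60%), only `status` qualifies; `id` and `name` are unique per record and never reach the threshold.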
Example
$data = [
'users' => [
['id' => 1, 'name' => 'Alice', 'status' => 'active'],
['id' => 2, 'name' => 'Bob', 'status' => 'active'],
['id' => 3, 'name' => 'Carol', 'status' => 'active'],
['id' => 4, 'name' => 'Dave', 'status' => 'inactive'],
]
];
Output: @schema[id:int, name:str, status:str]
@defaults[status:"active"]
users(4):
1, "Alice"
2, "Bob"
3, "Carol"
4, "Dave", "inactive"
Only Dave's status is encoded, since the others match the default.
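On the reading side, a decoder would fill omitted fields back in from @defaults. The guide does not show the decoder, so this Python sketch of the expansion step is an assumption about how the round-trip could work.

```python
def expand(row, schema, defaults):
    """Rebuild a full record from an encoded row plus field defaults."""
    record = dict(zip(schema, row))   # zip stops at the row's length
    for field, value in defaults.items():
        record.setdefault(field, value)  # only fill fields the row omitted
    return record

schema = ["id", "name", "status"]
defaults = {"status": "active"}
print(expand([1, "Alice"], schema, defaults))
# {'id': 1, 'name': 'Alice', 'status': 'active'}
print(expand([4, "Dave", "inactive"], schema, defaults))
# {'id': 4, 'name': 'Dave', 'status': 'inactive'}
```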
Compression Statistics
Get detailed compression metrics:
$encoder = new Encoder(compression: CompressionMode::BALANCED);
$stats = $encoder->getCompressionStats($data);
echo "Original tokens: {$stats['originalTokens']}\n";
echo "Compressed tokens: {$stats['compressedTokens']}\n";
echo "Compression ratio: {$stats['compressionRatio']}\n";
echo "Savings: {$stats['savingsPercent']}%\n";
echo "Dictionary size: {$stats['dictionarySize']} entries\n";
echo "Encoding time: {$stats['encodingTimeMs']}ms\n";
Metrics Explained
| Metric | Description |
|--------|-------------|
| originalTokens | Estimated tokens without compression |
| compressedTokens | Estimated tokens with compression |
| compressionRatio | compressedTokens / originalTokens |
| savingsPercent | Percentage reduction |
| dictionarySize | Number of dictionary entries |
| encodingTimeMs | Time to encode in milliseconds |
| modeUsed | Compression mode used |
Token Estimation
The encoder estimates tokens using a heuristic:
tokens ≈ (characters / 4) + (punctuation / 2) + (words / 3)
This approximates typical LLM tokenization. Actual tokens vary by model.
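The heuristic above can be implemented directly. A Python sketch: the three terms come from this guide, but the exact counting rules (which characters count as punctuation, what counts as a word) are assumptions.

```python
import re
import string

def estimate_tokens(text: str) -> int:
    """Estimate tokens as chars/4 + punctuation/2 + words/3."""
    punctuation = sum(text.count(c) for c in string.punctuation)
    words = len(re.findall(r"\w+", text))
    return round(len(text) / 4 + punctuation / 2 + words / 3)

print(estimate_tokens('{"id": 1, "name": "Alice"}'))  # 13
```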
Performance Comparison
| Dataset | JSON | ATON FAST | ATON BALANCED | ATON ULTRA |
|---------|------|-----------|---------------|------------|
| 100 records | 2,450 | 1,890 | 1,540 | 1,420 |
| 1,000 records | 24,500 | 18,900 | 12,100 | 10,800 |
| 10,000 records | 245,000 | 189,000 | 98,000 | 85,000 |
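These figures translate directly into the compressionRatio and savingsPercent metrics. A quick Python check using the 1,000-record row of the table above:

```python
# Token counts for the 1,000-record row: JSON vs ATON ULTRA.
json_tokens, ultra_tokens = 24_500, 10_800

ratio = ultra_tokens / json_tokens   # compressionRatio
savings = (1 - ratio) * 100          # savingsPercent
print(f"ratio={ratio:.2f}, savings={savings:.1f}%")  # ratio=0.44, savings=55.9%
```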
Best Practices
1. Choose the Right Mode
// Real-time chat responses
$encoder = new Encoder(compression: CompressionMode::FAST);
// API data exchange
$encoder = new Encoder(compression: CompressionMode::BALANCED);
// Large report generation
$encoder = new Encoder(compression: CompressionMode::ULTRA);
// Unknown/variable sizes
$encoder = new Encoder(compression: CompressionMode::ADAPTIVE);
2. Enable Optimization
// Always enable for best compression
$encoder = new Encoder(optimize: true);
3. Batch Similar Data
Compression works better with homogeneous data:
// Good: Same structure, similar values
$data = ['users' => $allUsers];
// Less optimal: Mixed structures
$data = ['users' => $users, 'settings' => $settings, 'logs' => $logs];
4. Monitor Compression Ratio
$stats = $encoder->getCompressionStats($data);
if ($stats['savingsPercent'] < 20) {
// Data might not benefit from compression
// Consider using FAST mode
}
Disable Compression
For debugging or compatibility:
// No compression at all
$encoder = new Encoder(
optimize: false,
compression: CompressionMode::FAST
);
// Or use encode with compress=false
$aton = $encoder->encode($data, compress: false);