ATON Compression Guide
Overview
ATON V2 provides multiple compression strategies to optimize token usage for different use cases.
Compression Modes
FAST
No dictionary compression. Fastest encoding speed.
use Aton\Encoder;
use Aton\Enums\CompressionMode;
$encoder = new Encoder(compression: CompressionMode::FAST);
Best for:
- Small datasets (< 1KB)
- Real-time encoding requirements
- When speed is more important than size
Output example: @schema[id:int, category:str, status:str]
products(3):
1, "Electronics", "In Stock"
2, "Electronics", "In Stock"
3, "Electronics", "Out of Stock"
BALANCED (Default)
Dictionary compression for strings ≥5 characters appearing ≥3 times.
$encoder = new Encoder(compression: CompressionMode::BALANCED);
Best for:
- General purpose use
- Medium datasets (1KB - 100KB)
- Good balance of speed and compression
Output example: @dict[#0:"Electronics", #1:"In Stock"]
@schema[id:int, category:str, status:str]
products(3):
1, #0, #1
2, #0, #1
3, #0, "Out of Stock"
ULTRA
Aggressive dictionary compression for strings ≥3 characters appearing ≥2 times.
$encoder = new Encoder(compression: CompressionMode::ULTRA);
Best for:
- Large datasets (> 100KB)
- Bandwidth-constrained scenarios
- Maximum token savings
Output example: @dict[#0:"Electronics", #1:"In Stock", #2:"Out of Stock"]
@schema[id:int, category:str, status:str]
products(3):
1, #0, #1
2, #0, #1
3, #0, #2
ADAPTIVE
Automatically selects mode based on data size:
- < 1KB: FAST
- 1KB - 10KB: BALANCED
- > 10KB: ULTRA
$encoder = new Encoder(compression: CompressionMode::ADAPTIVE);
Best for:
- Variable dataset sizes
- When you don't know the data size in advance
- Automated pipelines
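The size thresholds above can be sketched directly. This is an illustration of the selection logic in Python (the logic is language-agnostic); the 1KB and 10KB cutoffs come from this guide, while the function name and the use of the JSON-serialized byte size as the measure are assumptions.

```python
import json

def select_mode(data) -> str:
    """Pick a compression mode from the serialized payload size."""
    size = len(json.dumps(data).encode("utf-8"))
    if size < 1_024:        # < 1KB
        return "FAST"
    if size <= 10_240:      # 1KB - 10KB
        return "BALANCED"
    return "ULTRA"          # > 10KB

small = {"id": 1}
large = {"rows": [{"id": i, "name": f"user-{i}"} for i in range(500)]}
print(select_mode(small))  # FAST
print(select_mode(large))  # ULTRA
```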
Dictionary Compression
How It Works
1. String Extraction: All strings in the data are collected.
2. Frequency Analysis: Occurrences of each string are counted.
3. Reference Creation: Strings meeting the thresholds get short references (#0, #1, etc.).
4. Replacement: Original strings are replaced with their references.
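The four steps above can be sketched in a few lines. This is a minimal Python illustration using the BALANCED thresholds (length ≥5, occurrences ≥3); the function names and reference-assignment order are assumptions, not the library's internals.

```python
from collections import Counter

def build_dict(strings, min_len=5, min_count=3):
    """Steps 1-3: extract, count, and assign #N references."""
    counts = Counter(s for s in strings if len(s) >= min_len)
    refs = {}
    for s, n in counts.items():  # first-seen order
        if n >= min_count:
            refs[s] = f"#{len(refs)}"
    return refs

def compress(strings, refs):
    """Step 4: replace originals with their references."""
    return [refs.get(s, s) for s in strings]

values = ["Electronics", "Electronics", "Electronics",
          "In Stock", "In Stock", "Out of Stock"]
refs = build_dict(values)
print(refs)  # {'Electronics': '#0'}
print(compress(values, refs))
```

Here "In Stock" appears only twice, so under the BALANCED occurrence threshold it is left uncompressed.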
Thresholds by Mode
| Mode | Min Length | Min Occurrences |
|------|------------|-----------------|
| FAST | - | - (no compression) |
| BALANCED | 5 chars | 3 times |
| ULTRA | 3 chars | 2 times |
Example
Input data: $data = [
'logs' => [
['level' => 'INFO', 'message' => 'Application started'],
['level' => 'INFO', 'message' => 'User logged in'],
['level' => 'INFO', 'message' => 'Request processed'],
['level' => 'ERROR', 'message' => 'Connection failed'],
]
];
BALANCED output: @dict[#0:"INFO"]
@schema[level:str, message:str]
logs(4):
#0, "Application started"
#0, "User logged in"
#0, "Request processed"
"ERROR", "Connection failed"
"INFO" appears 3 times and has 4 characters, so it's compressed.
"ERROR" appears only once, not compressed.
Default Values Optimization
When optimize: true (default), the encoder detects common values and sets them as defaults.
How It Works
1. Sample Analysis: The first 100 records are analyzed.
2. Frequency Detection: Values appearing in >60% of records become defaults.
3. Default Omission: Records with default values skip those fields.
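The detection steps above can be sketched as follows. A Python illustration: the 100-record sample and the >60% threshold come from this guide, while the function name and per-field counting details are assumptions.

```python
from collections import Counter

def detect_defaults(records, sample=100, threshold=0.6):
    """Promote a value to a field default if it covers >60% of the sample."""
    sampled = records[:sample]
    defaults = {}
    fields = sampled[0].keys() if sampled else []
    for field in fields:
        value, n = Counter(r.get(field) for r in sampled).most_common(1)[0]
        if n / len(sampled) > threshold:
            defaults[field] = value
    return defaults

users = [
    {"id": 1, "name": "Alice", "status": "active"},
    {"id": 2, "name": "Bob", "status": "active"},
    {"id": 3, "name": "Carol", "status": "active"},
    {"id": 4, "name": "Dave", "status": "inactive"},
]
print(detect_defaults(users))  # {'status': 'active'}
```

With 3 of 4 records sharing `status: "active"` (75% > 60%), only `status` qualifies; `id` and `name` are unique per record and never reach the threshold.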
Example
$data = [
'users' => [
['id' => 1, 'name' => 'Alice', 'status' => 'active'],
['id' => 2, 'name' => 'Bob', 'status' => 'active'],
['id' => 3, 'name' => 'Carol', 'status' => 'active'],
['id' => 4, 'name' => 'Dave', 'status' => 'inactive'],
]
];
Output: @schema[id:int, name:str, status:str]
@defaults[status:"active"]
users(4):
1, "Alice"
2, "Bob"
3, "Carol"
4, "Dave", "inactive"
Only Dave's status is encoded, since the others match the default.
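On the reading side, a decoder would fill omitted fields back in from @defaults. The guide does not show the decoder, so this Python sketch of the expansion step is an assumption about how the round-trip could work.

```python
def expand(row, schema, defaults):
    """Rebuild a full record from an encoded row plus field defaults."""
    record = dict(zip(schema, row))   # zip stops at the row's length
    for field, value in defaults.items():
        record.setdefault(field, value)  # only fill fields the row omitted
    return record

schema = ["id", "name", "status"]
defaults = {"status": "active"}
print(expand([1, "Alice"], schema, defaults))
# {'id': 1, 'name': 'Alice', 'status': 'active'}
print(expand([4, "Dave", "inactive"], schema, defaults))
# {'id': 4, 'name': 'Dave', 'status': 'inactive'}
```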
Compression Statistics
Get detailed compression metrics:
$encoder = new Encoder(compression: CompressionMode::BALANCED);
$stats = $encoder->getCompressionStats($data);
echo "Original tokens: {$stats['originalTokens']}\n";
echo "Compressed tokens: {$stats['compressedTokens']}\n";
echo "Compression ratio: {$stats['compressionRatio']}\n";
echo "Savings: {$stats['savingsPercent']}%\n";
echo "Dictionary size: {$stats['dictionarySize']} entries\n";
echo "Encoding time: {$stats['encodingTimeMs']}ms\n";
Metrics Explained
| Metric | Description |
|--------|-------------|
| originalTokens | Estimated tokens without compression |
| compressedTokens | Estimated tokens with compression |
| compressionRatio | compressedTokens / originalTokens |
| savingsPercent | Percentage reduction |
| dictionarySize | Number of dictionary entries |
| encodingTimeMs | Time to encode in milliseconds |
| modeUsed | Compression mode used |
Token Estimation
The encoder estimates tokens using a heuristic:
tokens ≈ (characters / 4) + (punctuation / 2) + (words / 3)
This approximates typical LLM tokenization. Actual tokens vary by model.
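The heuristic above can be implemented directly. A Python sketch: the three terms come from this guide, but the exact counting rules (which characters count as punctuation, what counts as a word) are assumptions.

```python
import re
import string

def estimate_tokens(text: str) -> int:
    """Estimate tokens as chars/4 + punctuation/2 + words/3."""
    punctuation = sum(text.count(c) for c in string.punctuation)
    words = len(re.findall(r"\w+", text))
    return round(len(text) / 4 + punctuation / 2 + words / 3)

print(estimate_tokens('{"id": 1, "name": "Alice"}'))  # 13
```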
Performance Comparison
| Dataset | JSON | ATON FAST | ATON BALANCED | ATON ULTRA |
|---------|------|-----------|---------------|------------|
| 100 records | 2,450 | 1,890 | 1,540 | 1,420 |
| 1,000 records | 24,500 | 18,900 | 12,100 | 10,800 |
| 10,000 records | 245,000 | 189,000 | 98,000 | 85,000 |
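These figures translate directly into the compressionRatio and savingsPercent metrics. A quick Python check using the 1,000-record row of the table above:

```python
# Token counts for the 1,000-record row: JSON vs ATON ULTRA.
json_tokens, ultra_tokens = 24_500, 10_800

ratio = ultra_tokens / json_tokens   # compressionRatio
savings = (1 - ratio) * 100          # savingsPercent
print(f"ratio={ratio:.2f}, savings={savings:.1f}%")  # ratio=0.44, savings=55.9%
```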
Best Practices
1. Choose the Right Mode
// Real-time chat responses
$encoder = new Encoder(compression: CompressionMode::FAST);
// API data exchange
$encoder = new Encoder(compression: CompressionMode::BALANCED);
// Large report generation
$encoder = new Encoder(compression: CompressionMode::ULTRA);
// Unknown/variable sizes
$encoder = new Encoder(compression: CompressionMode::ADAPTIVE);
2. Enable Optimization
// Always enable for best compression
$encoder = new Encoder(optimize: true);
3. Batch Similar Data
Compression works better with homogeneous data:
// Good: Same structure, similar values
$data = ['users' => $allUsers];
// Less optimal: Mixed structures
$data = ['users' => $users, 'settings' => $settings, 'logs' => $logs];
4. Monitor Compression Ratio
$stats = $encoder->getCompressionStats($data);
if ($stats['savingsPercent'] < 20) {
// Data might not benefit from compression
// Consider using FAST mode
}
Disable Compression
For debugging or compatibility:
// No compression at all
$encoder = new Encoder(
optimize: false,
compression: CompressionMode::FAST
);
// Or use encode with compress=false
$aton = $encoder->encode($data, compress: false);