PHP Classes

File: docs/COMPRESSION.md

Role: Auxiliary data
Content type: text/markdown
Description: Auxiliary data
Class: ATON Format PHP
Encode and decode values using the ATON format
Author: Stefano D'Agostino
Date: 3 months ago
Size: 6,644 bytes
 

ATON Compression Guide

Overview

ATON V2 provides multiple compression strategies to optimize token usage for different use cases.

Compression Modes

FAST

No dictionary compression. Fastest encoding speed.

use Aton\Encoder;
use Aton\Enums\CompressionMode;

$encoder = new Encoder(compression: CompressionMode::FAST);

Best for:
- Small datasets (< 1KB)
- Real-time encoding requirements
- When speed is more important than size

Output example:

@schema[id:int, category:str, status:str]

products(3):
  1, "Electronics", "In Stock"
  2, "Electronics", "In Stock"
  3, "Electronics", "Out of Stock"

BALANCED (Default)

Dictionary compression for strings of ≥5 characters that appear ≥3 times.

$encoder = new Encoder(compression: CompressionMode::BALANCED);

Best for:
- General purpose use
- Medium datasets (1KB - 100KB)
- Good balance of speed and compression

Output example ("In Stock" appears only twice, below the 3-occurrence minimum, so it stays inline):

@dict[#0:"Electronics"]
@schema[id:int, category:str, status:str]

products(3):
  1, #0, "In Stock"
  2, #0, "In Stock"
  3, #0, "Out of Stock"

ULTRA

Aggressive dictionary compression for strings of ≥3 characters that appear ≥2 times.

$encoder = new Encoder(compression: CompressionMode::ULTRA);

Best for:
- Large datasets (> 100KB)
- Bandwidth-constrained scenarios
- Maximum token savings

Output example ("Out of Stock" appears only once, below the 2-occurrence minimum, so it stays inline):

@dict[#0:"Electronics", #1:"In Stock"]
@schema[id:int, category:str, status:str]

products(3):
  1, #0, #1
  2, #0, #1
  3, #0, "Out of Stock"

ADAPTIVE

Automatically selects a mode based on data size:
- < 1KB: FAST
- 1KB - 10KB: BALANCED
- > 10KB: ULTRA

$encoder = new Encoder(compression: CompressionMode::ADAPTIVE);

Best for:
- Variable dataset sizes
- When you don't know data size in advance
- Automated pipelines
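The size-based selection can be sketched in plain PHP. This is an illustration only, not the encoder's code: `pickModeBySize()` is a hypothetical helper, 1KB is assumed to mean 1024 bytes, and the size is measured on the `json_encode()` output, which may differ from how the encoder measures it.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: maps a payload size to the mode names used in this guide.
 * Assumes 1KB = 1024 bytes and measures the JSON-encoded size of the data.
 */
function pickModeBySize(array $data): string
{
    $bytes = strlen(json_encode($data, JSON_THROW_ON_ERROR));

    if ($bytes < 1024) {
        return 'FAST';       // < 1KB
    }
    if ($bytes <= 10 * 1024) {
        return 'BALANCED';   // 1KB - 10KB
    }
    return 'ULTRA';          // > 10KB
}

$small = ['users' => [['id' => 1, 'name' => 'Alice']]];
$large = ['blob' => str_repeat('x', 20000)];

echo pickModeBySize($small), "\n"; // FAST
echo pickModeBySize($large), "\n"; // ULTRA
```

In real code you would pass CompressionMode::ADAPTIVE to the encoder and let it decide; the sketch only makes the size bands concrete.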

Dictionary Compression

How It Works

  1. String Extraction: All strings in the data are collected
  2. Frequency Analysis: Count occurrences of each string
  3. Reference Creation: Strings meeting thresholds get short references (#0, #1, etc.)
  4. Replacement: Original strings replaced with references

Thresholds by Mode

| Mode | Min Length | Min Occurrences |
|------|------------|-----------------|
| FAST | - | - (no compression) |
| BALANCED | 5 chars | 3 times |
| ULTRA | 3 chars | 2 times |
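To make the four steps and the thresholds concrete, here is a minimal standalone sketch. It does not use the library: `buildDictionary()` and `applyDictionary()` are hypothetical names, string length is measured in bytes with `strlen()`, and references are stored as plain strings, which a real encoder would have to distinguish from literal values such as "#0".

```php
<?php
declare(strict_types=1);

/**
 * Steps 1-3 (illustrative): collect strings, count them, and assign
 * references to those meeting both thresholds.
 * Returns [reference => string], e.g. ['#0' => 'pending'].
 */
function buildDictionary(array $rows, int $minLength, int $minOccurrences): array
{
    $counts = [];
    array_walk_recursive($rows, function ($value) use (&$counts): void {
        if (is_string($value)) {
            $counts[$value] = ($counts[$value] ?? 0) + 1;
        }
    });

    $dictionary = [];
    foreach ($counts as $string => $count) {
        // Cast back to string: PHP turns numeric string keys into integers.
        $string = (string) $string;
        if (strlen($string) >= $minLength && $count >= $minOccurrences) {
            $dictionary['#' . count($dictionary)] = $string;
        }
    }
    return $dictionary;
}

/** Step 4 (illustrative): replace dictionary strings with their references. */
function applyDictionary(array $rows, array $dictionary): array
{
    $refs = array_flip($dictionary); // string => reference
    array_walk_recursive($rows, function (&$value) use ($refs): void {
        if (is_string($value) && isset($refs[$value])) {
            $value = $refs[$value];
        }
    });
    return $rows;
}

$rows = [
    ['state' => 'pending', 'tag' => 'new'],
    ['state' => 'pending', 'tag' => 'new'],
    ['state' => 'pending', 'tag' => 'old'],
];

echo json_encode(buildDictionary($rows, 5, 3)), "\n"; // {"#0":"pending"}
echo json_encode(buildDictionary($rows, 3, 2)), "\n"; // {"#0":"pending","#1":"new"}
```

With the BALANCED thresholds (5, 3) only "pending" qualifies. The ULTRA thresholds (3, 2) also pick up "new", while "old" appears once and stays inline in both cases.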

Example

Input data:

$data = [
    'logs' => [
        ['level' => 'INFO', 'message' => 'Application started'],
        ['level' => 'INFO', 'message' => 'User logged in'],
        ['level' => 'INFO', 'message' => 'Request processed'],
        ['level' => 'ERROR', 'message' => 'Connection failed'],
    ]
];

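ULTRA output:

@dict[#0:"INFO"]
@schema[level:str, message:str]

logs(4):
  #0, "Application started"
  #0, "User logged in"
  #0, "Request processed"
  "ERROR", "Connection failed"

"INFO" has 4 characters and appears 3 times, which meets the ULTRA thresholds (3 characters, 2 occurrences), so it is replaced with #0. Under BALANCED it would stay inline, because 4 characters is below the 5-character minimum. "ERROR" appears only once, so it is not compressed in either mode.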

Default Values Optimization

When optimize: true (default), the encoder detects common values and sets them as defaults.

How It Works

  1. Sample Analysis: First 100 records are analyzed
  2. Frequency Detection: Values appearing in >60% of records become defaults
  3. Default Omission: Records with default values skip those fields
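The three steps above can be sketched without the library. The function names `detectDefaults()` and `omitDefaults()` are hypothetical, and restricting detection to scalar values is an assumption of this sketch; the sample size of 100 and the 60% cutoff come from the list above.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative sketch: returns [field => defaultValue] for values found in
 * more than $threshold of the first $sampleSize records.
 */
function detectDefaults(array $records, int $sampleSize = 100, float $threshold = 0.6): array
{
    $sample = array_slice($records, 0, $sampleSize);
    $total = count($sample);
    if ($total === 0) {
        return [];
    }

    // Count each (field, value) pair; json_encode keeps 1 and "1" distinct.
    $counts = [];
    foreach ($sample as $record) {
        foreach ($record as $field => $value) {
            if (!is_scalar($value)) {
                continue;
            }
            $key = json_encode($value);
            $counts[$field][$key] = ($counts[$field][$key] ?? 0) + 1;
        }
    }

    $defaults = [];
    foreach ($counts as $field => $valueCounts) {
        arsort($valueCounts); // most frequent value first
        $topKey = array_key_first($valueCounts);
        if ($valueCounts[$topKey] / $total > $threshold) {
            $defaults[$field] = json_decode((string) $topKey, true);
        }
    }
    return $defaults;
}

/** Drops fields whose value equals the detected default. */
function omitDefaults(array $record, array $defaults): array
{
    foreach ($defaults as $field => $value) {
        if (array_key_exists($field, $record) && $record[$field] === $value) {
            unset($record[$field]);
        }
    }
    return $record;
}

$users = [
    ['id' => 1, 'name' => 'Alice', 'status' => 'active'],
    ['id' => 2, 'name' => 'Bob', 'status' => 'active'],
    ['id' => 3, 'name' => 'Carol', 'status' => 'active'],
    ['id' => 4, 'name' => 'Dave', 'status' => 'inactive'],
];

echo json_encode(detectDefaults($users)), "\n"; // {"status":"active"}
```

Here "active" occurs in 3 of 4 records (75%), which clears the 60% cutoff, while every id and name is unique and so never becomes a default.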

Example

$data = [
    'users' => [
        ['id' => 1, 'name' => 'Alice', 'status' => 'active'],
        ['id' => 2, 'name' => 'Bob', 'status' => 'active'],
        ['id' => 3, 'name' => 'Carol', 'status' => 'active'],
        ['id' => 4, 'name' => 'Dave', 'status' => 'inactive'],
    ]
];

Output:

@schema[id:int, name:str, status:str]
@defaults[status:"active"]

users(4):
  1, "Alice"
  2, "Bob"
  3, "Carol"
  4, "Dave", "inactive"

Only Dave's status is encoded since others match the default.

Compression Statistics

Get detailed compression metrics:

$encoder = new Encoder(compression: CompressionMode::BALANCED);
$stats = $encoder->getCompressionStats($data);

echo "Original tokens: {$stats['originalTokens']}\n";
echo "Compressed tokens: {$stats['compressedTokens']}\n";
echo "Compression ratio: {$stats['compressionRatio']}\n";
echo "Savings: {$stats['savingsPercent']}%\n";
echo "Dictionary size: {$stats['dictionarySize']} entries\n";
echo "Encoding time: {$stats['encodingTimeMs']}ms\n";

Metrics Explained

| Metric | Description |
|--------|-------------|
| originalTokens | Estimated tokens without compression |
| compressedTokens | Estimated tokens with compression |
| compressionRatio | compressedTokens / originalTokens |
| savingsPercent | Percentage reduction |
| dictionarySize | Number of dictionary entries |
| encodingTimeMs | Time to encode in milliseconds |
| modeUsed | Compression mode used |
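As a worked illustration of how the two derived metrics relate, the sketch below computes them from a pair of token counts. `summarizeSavings()` is a hypothetical helper and the rounding precision is an assumption; only the ratio definition comes from the table above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: derives compressionRatio and savingsPercent
 * from two token counts. Rounding precision is an assumption.
 */
function summarizeSavings(int $originalTokens, int $compressedTokens): array
{
    if ($originalTokens <= 0) {
        throw new InvalidArgumentException('originalTokens must be positive');
    }

    $ratio = $compressedTokens / $originalTokens;

    return [
        'compressionRatio' => round($ratio, 3),
        'savingsPercent' => round((1 - $ratio) * 100, 1),
    ];
}

// Made-up counts for illustration: 1000 tokens before, 650 after.
$summary = summarizeSavings(1000, 650);
echo $summary['compressionRatio'], "\n"; // 0.65
echo $summary['savingsPercent'], "\n";   // 35
```

A lower ratio means better compression: a ratio of 0.65 corresponds to a 35% saving.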

Token Estimation

The encoder estimates tokens using a heuristic:

tokens ≈ (characters / 4) + (punctuation / 2) + (words / 3)

This approximates typical LLM tokenization. Actual tokens vary by model.
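A minimal sketch of this heuristic follows. The guide does not say how characters, punctuation, or words are counted, so the choices here are assumptions: bytes via `strlen()`, ASCII punctuation via the POSIX `[:punct:]` class, whitespace-separated words, and rounding up. `estimateTokens()` is a hypothetical name.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative token estimate: characters/4 + punctuation/2 + words/3.
 * Counting rules and rounding are assumptions, not the library's.
 */
function estimateTokens(string $text): int
{
    $characters = strlen($text);
    $punctuation = preg_match_all('/[[:punct:]]/', $text);
    $words = count(preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY));

    return (int) ceil($characters / 4 + $punctuation / 2 + $words / 3);
}

// 18 characters, 5 punctuation marks, 2 words:
// 18/4 + 5/2 + 2/3 = 7.67, rounded up to 8.
echo estimateTokens('id:1, name:"Alice"'), "\n"; // 8
```

Punctuation is weighted more heavily than plain characters, which is why formats with fewer braces and quotes score lower under this estimate.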

Performance Comparison

| Dataset | JSON | ATON FAST | ATON BALANCED | ATON ULTRA |
|---------|------|-----------|---------------|------------|
| 100 records | 2,450 | 1,890 | 1,540 | 1,420 |
| 1,000 records | 24,500 | 18,900 | 12,100 | 10,800 |
| 10,000 records | 245,000 | 189,000 | 98,000 | 85,000 |

Best Practices

1. Choose the Right Mode

// Real-time chat responses
$encoder = new Encoder(compression: CompressionMode::FAST);

// API data exchange
$encoder = new Encoder(compression: CompressionMode::BALANCED);

// Large report generation
$encoder = new Encoder(compression: CompressionMode::ULTRA);

// Unknown/variable sizes
$encoder = new Encoder(compression: CompressionMode::ADAPTIVE);

2. Enable Optimization

// Always enable for best compression
$encoder = new Encoder(optimize: true);

3. Batch Similar Data

Compression works better with homogeneous data:

// Good: Same structure, similar values
$data = ['users' => $allUsers];

// Less optimal: Mixed structures
$data = ['users' => $users, 'settings' => $settings, 'logs' => $logs];

4. Monitor Compression Ratio

$stats = $encoder->getCompressionStats($data);

if ($stats['savingsPercent'] < 20) {
    // Data might not benefit from compression
    // Consider using FAST mode
}

Disable Compression

For debugging or compatibility:

// No compression at all
$encoder = new Encoder(
    optimize: false,
    compression: CompressionMode::FAST
);

// Or use encode with compress=false
$aton = $encoder->encode($data, compress: false);