ATON Format Specification V2
Overview
ATON V2 builds on V1 with advanced features for better compression, querying, and large dataset handling.
New Features in V2
- Dictionary Compression: Automatic deduplication of repeated strings
- Default Values: Skip encoding when values match defaults
- Query Language: SQL-like filtering and sorting
- Streaming Encoder: Process large datasets in chunks
- Compression Modes: FAST, BALANCED, ULTRA, ADAPTIVE
Format Structure
Complete Syntax
@dict[#0:"repeated value", #1:"another repeated"]
@schema[field1:type1, field2:type2, ...]
@defaults[field1:defaultValue, field2:defaultValue]
@queryable[tableName]
tableName(recordCount):
value1, value2, ...
value1, value2, ...
Dictionary Compression
Purpose
Reduces token usage by replacing repeated strings with short references.
Syntax
@dict[#0:"Long repeated string", #1:"Another common value"]
Usage in Data
@dict[#0:"Electronics", #1:"In Stock"]
@schema[id:int, name:str, category:str, status:str]
products(3):
1, "Laptop", #0, #1
2, "Mouse", #0, #1
3, "Keyboard", #0, "Out of Stock"
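As a rough illustration of how a reader might resolve the `#N` references above, the following sketch parses an `@dict` line and expands references in one row. The helper names `parseDict` and `expandRow` are hypothetical, not part of the Aton library, and the sketch assumes dictionary strings contain no escaped quotes.

```php
<?php
// Hypothetical helper: parse an @dict line into [index => string].
// Assumes entries contain no escaped quotes; a real parser must handle those.
function parseDict(string $line): array
{
    $dict = [];
    preg_match_all('/#(\d+):"([^"]*)"/', $line, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $dict[(int) $m[1]] = $m[2];
    }
    return $dict;
}

// Hypothetical helper: replace bare #N cells with the quoted dictionary string.
// Cells that are not references, or that reference an unknown index, pass through.
function expandRow(array $cells, array $dict): array
{
    return array_map(
        static function (string $cell) use ($dict): string {
            if (preg_match('/^#(\d+)$/', $cell, $m) && isset($dict[(int) $m[1]])) {
                return '"' . $dict[(int) $m[1]] . '"';
            }
            return $cell;
        },
        $cells
    );
}

$dict = parseDict('@dict[#0:"Electronics", #1:"In Stock"]');
$row  = expandRow(['3', '"Keyboard"', '#0', '"Out of Stock"'], $dict);

echo implode(', ', $row), PHP_EOL;
```

Only the third cell is a reference, so it is the only one rewritten; the literal "Out of Stock" passes through unchanged.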
Compression Thresholds
| Mode | Min Length | Min Occurrences |
|------|------------|-----------------|
| FAST | No compression | - |
| BALANCED | 5 chars | 3 times |
| ULTRA | 3 chars | 2 times |
| ADAPTIVE | Auto-selected based on data size | Auto-selected based on data size |
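One way to read the thresholds in the table is as a filter over value counts. The sketch below is an assumption about how such a selection could work, not the encoder's actual algorithm; `buildDict` and `THRESHOLDS` are hypothetical names, and length is measured in bytes with `strlen`, which differs from character count for multibyte strings.

```php
<?php
// Thresholds copied from the table; FAST is absent because it builds no dictionary.
const THRESHOLDS = [
    'BALANCED' => ['minLength' => 5, 'minOccurrences' => 3],
    'ULTRA'    => ['minLength' => 3, 'minOccurrences' => 2],
];

// Hypothetical helper: choose dictionary entries from a flat list of string values.
function buildDict(array $strings, string $mode): array
{
    if (!array_key_exists($mode, THRESHOLDS)) {
        return []; // FAST (or an unknown mode): no dictionary
    }
    $t = THRESHOLDS[$mode];
    $dict = [];
    // array_count_values keeps first-occurrence order, so indexes are stable.
    foreach (array_count_values($strings) as $value => $count) {
        $value = (string) $value; // numeric-looking keys come back as ints
        if (strlen($value) >= $t['minLength'] && $count >= $t['minOccurrences']) {
            $dict[] = $value;
        }
    }
    return $dict;
}

// Hypothetical sample values.
$values = [
    'Electronics', 'In Stock', 'Electronics', 'In Stock',
    'Electronics', 'Out of Stock', 'USB', 'USB',
];

$balanced = buildDict($values, 'BALANCED');
$ultra    = buildDict($values, 'ULTRA');
$fast     = buildDict($values, 'FAST');
```

With this sample, BALANCED keeps only "Electronics" (long enough and seen three times), while ULTRA also admits "In Stock" and "USB".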
Default Values
Purpose
Skips encoding values that match a field's declared default, which is typically the most common value for that field.
Syntax
@defaults[status:"active", verified:true]
Example
@schema[id:int, name:str, status:str, verified:bool]
@defaults[status:"active", verified:true]
users(4):
1, "Alice"
2, "Bob"
3, "Carol", "inactive"
4, "Dave", "active", false
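Users 1 and 2 use the default status and verified values, so neither is encoded.
User 3 has a non-default status; verified is omitted because it matches the default.
User 4 has a non-default verified value; its status is written out even though it matches the default, because values are positional and verified comes after status.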
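The decoding side of the example above can be sketched as filling in missing trailing positions from `@defaults`. This is a minimal sketch that assumes only trailing values may be omitted, as the example suggests; `applyDefaults` is a hypothetical helper, not a library function.

```php
<?php
// Hypothetical helper: rebuild a full record from a possibly shortened row.
// $fields is the ordered field list from @schema; $defaults comes from @defaults.
function applyDefaults(array $fields, array $defaults, array $row): array
{
    $record = [];
    foreach ($fields as $i => $field) {
        if (array_key_exists($i, $row)) {
            $record[$field] = $row[$i];          // value was encoded
        } elseif (array_key_exists($field, $defaults)) {
            $record[$field] = $defaults[$field]; // omitted: use the default
        } else {
            throw new InvalidArgumentException("Missing value for field '$field'");
        }
    }
    return $record;
}

$fields   = ['id', 'name', 'status', 'verified'];
$defaults = ['status' => 'active', 'verified' => true];

$alice = applyDefaults($fields, $defaults, [1, 'Alice']);
$carol = applyDefaults($fields, $defaults, [3, 'Carol', 'inactive']);
$dave  = applyDefaults($fields, $defaults, [4, 'Dave', 'active', false]);
```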
Query Language
Syntax
tableName [SELECT fields] [WHERE conditions] [ORDER BY field [ASC|DESC]] [LIMIT n] [OFFSET n]
Operators
| Operator | Description | Example |
|----------|-------------|---------|
| = | Equals | status = 'active' |
| !=, <> | Not equals | status != 'deleted' |
| < | Less than | age < 30 |
| > | Greater than | price > 100 |
| <= | Less or equal | count <= 10 |
| >= | Greater or equal | score >= 80 |
| LIKE | Pattern match | name LIKE '%john%' |
| IN | In set | category IN ('A', 'B') |
| NOT IN | Not in set | status NOT IN ('deleted') |
| BETWEEN | Range | price BETWEEN 10 AND 100 |
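Most of these operators map directly onto PHP comparisons; `LIKE` is the one that needs translation. The sketch below shows one plausible way to evaluate a `%` wildcard pattern. It assumes case-sensitive matching and covers only `%`, since the specification does not show other wildcards; `likeMatch` is a hypothetical helper.

```php
<?php
// Hypothetical helper: evaluate a LIKE pattern in which % matches any run of
// characters (including none). Case-sensitive matching is an assumption.
function likeMatch(string $value, string $pattern): bool
{
    // Quote each literal segment, then join the segments with ".*".
    $parts = array_map(
        static fn (string $part): string => preg_quote($part, '/'),
        explode('%', $pattern)
    );
    // s: dot also matches newlines; D: $ matches only at the very end.
    return preg_match('/^' . implode('.*', $parts) . '$/sD', $value) === 1;
}

$gmail    = likeMatch('alice@gmail.com', '%@gmail.com');
$notGmail = likeMatch('alice@gmail.com.au', '%@gmail.com');
$contains = likeMatch('big john smith', '%john%');
```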
Logical Operators
- `AND`: Both conditions must be true
- `OR`: Either condition must be true
- `NOT`: Negates condition
- Parentheses for grouping: `(a OR b) AND c`
Examples
-- Simple filter
users WHERE active = true
-- Multiple conditions
products WHERE price > 100 AND category = 'Electronics'
-- Pattern matching
users WHERE email LIKE '%@gmail.com'
-- Sorting and pagination
orders WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10
-- Field selection
users SELECT id, name, email WHERE verified = true
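To make the semantics of the clauses concrete, here is the sorting and pagination example expressed with plain PHP array functions. The sample rows are hypothetical, `LIMIT 2` is used instead of `LIMIT 10` to keep the data small, and the evaluation order (filter, then sort, then limit) is an assumption borrowed from SQL rather than something the specification states.

```php
<?php
// Hypothetical data for: orders WHERE status = 'pending' ORDER BY created_at DESC LIMIT 2
$orders = [
    ['id' => 1, 'status' => 'pending', 'created_at' => '2024-01-03'],
    ['id' => 2, 'status' => 'shipped', 'created_at' => '2024-01-04'],
    ['id' => 3, 'status' => 'pending', 'created_at' => '2024-01-05'],
    ['id' => 4, 'status' => 'pending', 'created_at' => '2024-01-01'],
];

// WHERE status = 'pending'
$rows = array_values(array_filter(
    $orders,
    static fn (array $o): bool => $o['status'] === 'pending'
));

// ORDER BY created_at DESC (ISO dates sort correctly as strings)
usort($rows, static fn (array $a, array $b): int => strcmp($b['created_at'], $a['created_at']));

// LIMIT 2 (OFFSET 0)
$rows = array_slice($rows, 0, 2);

$ids = array_column($rows, 'id');
```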
Streaming Format
Chunk Structure
First chunk includes full schema:
@schema[id:int, name:str]
records(100):
1, "First"
2, "Second"
...
Subsequent chunks use continuation syntax:
records+(100):
101, "Next"
102, "Another"
...
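The difference between the first and later chunks can be sketched as a generator that emits the schema once and switches to the `+` header afterwards. This is an illustration under stated assumptions: `chunkText` is a hypothetical helper, the count in each header is taken to be the number of records in that chunk, and JSON quoting stands in for ATON string quoting.

```php
<?php
// Hypothetical helper: yield ATON text chunks for rows of scalar values.
// $chunkSize must be a positive integer.
function chunkText(string $table, string $schema, array $rows, int $chunkSize): Generator
{
    foreach (array_chunk($rows, $chunkSize) as $i => $chunk) {
        $lines = [];
        if ($i === 0) {
            // First chunk: full schema plus the plain table header.
            $lines[] = $schema;
            $lines[] = sprintf('%s(%d):', $table, count($chunk));
        } else {
            // Later chunks: continuation header only.
            $lines[] = sprintf('%s+(%d):', $table, count($chunk));
        }
        foreach ($chunk as $row) {
            // json_encode quotes strings and leaves numbers bare (an assumption
            // about how closely ATON value syntax follows JSON).
            $lines[] = implode(', ', array_map('json_encode', $row));
        }
        yield implode("\n", $lines);
    }
}

$rows   = [[1, 'First'], [2, 'Second'], [3, 'Next']];
$chunks = iterator_to_array(
    chunkText('records', '@schema[id:int, name:str]', $rows, 2),
    false
);
```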
Metadata
Each chunk includes:
- chunkId: Current chunk number (0-indexed)
- totalChunks: Total number of chunks
- isFirst: Boolean, true for first chunk
- isLast: Boolean, true for last chunk
- metadata.table: Table name
- metadata.recordsInChunk: Records in this chunk
- metadata.startIdx: Starting record index
- metadata.endIdx: Ending record index
- metadata.totalRecords: Total records across all chunks
- metadata.progress: Completion fraction (0.0 to 1.0)
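The fields listed above follow from the total record count and the chunk size. The sketch below derives them for each chunk. Several details are assumptions rather than documented behavior: `endIdx` is treated as exclusive, `progress` is measured after the chunk is emitted, and the nested array layout simply mirrors the field names. `chunkMetadata` is a hypothetical helper.

```php
<?php
// Hypothetical helper: compute per-chunk metadata. $chunkSize must be positive.
function chunkMetadata(string $table, int $totalRecords, int $chunkSize): array
{
    $totalChunks = (int) ceil($totalRecords / $chunkSize);
    $chunks = [];
    for ($chunkId = 0; $chunkId < $totalChunks; $chunkId++) {
        $startIdx = $chunkId * $chunkSize;
        $endIdx   = min($startIdx + $chunkSize, $totalRecords); // exclusive (assumed)
        $chunks[] = [
            'chunkId'     => $chunkId,
            'totalChunks' => $totalChunks,
            'isFirst'     => $chunkId === 0,
            'isLast'      => $chunkId === $totalChunks - 1,
            'metadata'    => [
                'table'          => $table,
                'recordsInChunk' => $endIdx - $startIdx,
                'startIdx'       => $startIdx,
                'endIdx'         => $endIdx,
                'totalRecords'   => $totalRecords,
                'progress'       => (float) ($endIdx / $totalRecords),
            ],
        ];
    }
    return $chunks;
}

// 250 records in chunks of 100 gives chunks of 100, 100, and 50 records.
$meta = chunkMetadata('records', 250, 100);
```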
Compression Modes
FAST
- No dictionary compression
- Fastest encoding
- Best for: Small datasets, real-time encoding
BALANCED (Default)
- Dictionary compression for strings ≥5 chars appearing ≥3 times
- Good balance of speed and compression
- Best for: General purpose use
ULTRA
- Aggressive dictionary compression (≥3 chars, ≥2 times)
- Maximum compression
- Best for: Large datasets, bandwidth-constrained scenarios
ADAPTIVE
- Automatically selects mode based on data size:
  - < 1KB: FAST
  - 1KB - 10KB: BALANCED
  - > 10KB: ULTRA
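The size bands above amount to a two-threshold lookup. The following sketch assumes 1KB means 1024 bytes and that both ends of the 1KB to 10KB band select BALANCED; how the encoder actually measures data size is not specified here. `selectMode` is a hypothetical helper that returns mode names as strings rather than `CompressionMode` cases.

```php
<?php
// Hypothetical helper: map a payload size to a compression mode name.
// Assumes 1KB = 1024 bytes; the spec does not say how size is measured.
function selectMode(int $sizeInBytes): string
{
    if ($sizeInBytes < 1024) {
        return 'FAST';
    }
    if ($sizeInBytes <= 10 * 1024) {
        return 'BALANCED';
    }
    return 'ULTRA';
}

$small  = selectMode(512);
$medium = selectMode(4096);
$large  = selectMode(50000);
```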
PHP Implementation
Encoder
use Aton\Encoder;
use Aton\Enums\CompressionMode;
$encoder = new Encoder(
    optimize: true,
    compression: CompressionMode::BALANCED,
    queryable: true,
    validate: true
);
// Basic encoding
$aton = $encoder->encode($data);
// With query filter
$aton = $encoder->encodeWithQuery($data, "users WHERE active = true");
// Get stats
$stats = $encoder->getCompressionStats($data);
Decoder
use Aton\Decoder;
$decoder = new Decoder(validate: true);
$data = $decoder->decode($atonString);
Query Engine
use Aton\QueryEngine;
$engine = new QueryEngine();
$query = $engine->parse("products WHERE price > 100 ORDER BY price DESC");
$results = $engine->execute($data, $query);
Stream Encoder
use Aton\StreamEncoder;
use Aton\Enums\CompressionMode;
$encoder = new StreamEncoder(
    chunkSize: 100,
    compression: CompressionMode::BALANCED
);
// Each yielded chunk carries its encoded text under the 'data' key
foreach ($encoder->streamEncode($largeData) as $chunk) {
    processChunk($chunk['data']);
}
Migration from V1
V2 is fully backward compatible with V1. To use V1-style encoding:
use Aton\Encoder;
use Aton\Enums\CompressionMode;
$encoder = new Encoder(
    optimize: false, // Disable defaults optimization
    compression: CompressionMode::FAST // No dictionary compression
);
The V2 decoder reads both V1 and V2 formats without any changes.