File: docs/SPECIFICATION_V2.md

Role:	Auxiliary data
Content type:	`text/markdown`
Description:	Auxiliary data
Class:	ATON Format PHP (encode and decode values using the ATON format)
Author:	Stefano D'Agostino
Last change:
Date:	3 months ago
Size:	`5,840 bytes`

ATON Format Specification V2

Overview

ATON V2 builds on V1 with advanced features for better compression, querying, and large dataset handling.

New Features in V2

Dictionary Compression: Automatic deduplication of repeated strings
Default Values: Skip encoding when values match defaults
Query Language: SQL-like filtering and sorting
Streaming Encoder: Process large datasets in chunks
Compression Modes: FAST, BALANCED, ULTRA, ADAPTIVE

Format Structure

Complete Syntax

@dict[#0:"repeated value", #1:"another repeated"]
@schema[field1:type1, field2:type2, ...]
@defaults[field1:defaultValue, field2:defaultValue]
@queryable[tableName]

tableName(recordCount):
  value1, value2, ...
  value1, value2, ...

Dictionary Compression

Purpose

Reduces token usage by replacing repeated strings with short references.

Syntax

@dict[#0:"Long repeated string", #1:"Another common value"]

Usage in Data

@dict[#0:"Electronics", #1:"In Stock"]
@schema[id:int, name:str, category:str, status:str]

products(3):
  1, "Laptop", #0, #1
  2, "Mouse", #0, #1
  3, "Keyboard", #0, "Out of Stock"

Compression Thresholds

| Mode | Min Length | Min Occurrences |
|------|------------|-----------------|
| FAST | No compression | - |
| BALANCED | 5 chars | 3 times |
| ULTRA | 3 chars | 2 times |
| ADAPTIVE | Auto-selected based on data size | - |
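The thresholds can be read as a simple selection rule. The following is a minimal sketch of that rule, not the library's implementation: `buildDictionary()` is a hypothetical helper, and it assumes that "min" means "greater than or equal to" and that references are numbered in order of first appearance.

```php
<?php
// Hypothetical helper, not part of the Aton library: picks dictionary
// entries from a flat list of values using the thresholds in the table.
function buildDictionary(array $values, int $minLength, int $minOccurrences): array
{
    // Count how often each string value occurs (non-strings are ignored).
    $counts = array_count_values(array_filter($values, 'is_string'));

    $dict = [];
    foreach ($counts as $value => $count) {
        // PHP turns numeric-looking array keys into ints, so cast back.
        $value = (string) $value;
        // strlen() counts bytes; swap in mb_strlen() for multibyte text.
        if (strlen($value) >= $minLength && $count >= $minOccurrences) {
            $dict[$value] = '#' . count($dict);
        }
    }
    return $dict;
}

$categories = ['Electronics', 'Electronics', 'Electronics', 'Toys', 'Toys'];

// BALANCED thresholds: at least 5 chars, at least 3 occurrences.
$balanced = buildDictionary($categories, 5, 3);
// ULTRA thresholds: at least 3 chars, at least 2 occurrences.
$ultra = buildDictionary($categories, 3, 2);
```

With this sample input, BALANCED would only pick up "Electronics", while ULTRA would also assign a reference to the shorter, less frequent "Toys".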

Default Values

Purpose

Skips encoding of values that match a field's declared default, typically the most common value for that field.

Syntax

@defaults[status:"active", verified:true]

Example

@schema[id:int, name:str, status:str, verified:bool]
@defaults[status:"active", verified:true]

users(4):
  1, "Alice"
  2, "Bob"
  3, "Carol", "inactive"
  4, "Dave", "active", false

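Users 1 and 2 omit both status and verified, so the defaults apply. User 3 has a non-default status and omits verified. User 4 has a non-default verified; its status is written out even though it equals the default, which suggests that values are positional and only trailing defaults can be omitted.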
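A decoder has to put the omitted values back. The sketch below shows one way that could work, assuming values are positional and only trailing fields may be left out; `applyDefaults()` is a hypothetical helper, not a function of the library.

```php
<?php
// Hypothetical decoder step, not the library's code: fills fields that a
// row leaves out at the end with the values declared in @defaults.
function applyDefaults(array $fields, array $defaults, array $row): array
{
    $record = [];
    foreach ($fields as $i => $field) {
        if (array_key_exists($i, $row)) {
            $record[$field] = $row[$i];          // value was encoded
        } elseif (array_key_exists($field, $defaults)) {
            $record[$field] = $defaults[$field]; // value was skipped
        } else {
            throw new InvalidArgumentException("Missing value for '$field'");
        }
    }
    return $record;
}

$fields   = ['id', 'name', 'status', 'verified'];
$defaults = ['status' => 'active', 'verified' => true];

$alice = applyDefaults($fields, $defaults, [1, 'Alice']);
$carol = applyDefaults($fields, $defaults, [3, 'Carol', 'inactive']);
$dave  = applyDefaults($fields, $defaults, [4, 'Dave', 'active', false]);
```

Each result is a full four-field record, so downstream code never needs to know which values were skipped during encoding.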

Query Language

Syntax

tableName [SELECT fields] [WHERE conditions] [ORDER BY field [ASC|DESC]] [LIMIT n] [OFFSET n]

Operators

| Operator | Description | Example |
|----------|-------------|---------|
| = | Equals | status = 'active' |
| !=, <> | Not equals | status != 'deleted' |
| < | Less than | age < 30 |
| > | Greater than | price > 100 |
| <= | Less or equal | count <= 10 |
| >= | Greater or equal | score >= 80 |
| LIKE | Pattern match | name LIKE '%john%' |
| IN | In set | category IN ('A', 'B') |
| NOT IN | Not in set | status NOT IN ('deleted') |
| BETWEEN | Range | price BETWEEN 10 AND 100 |
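Most of these operators map directly onto PHP comparisons; `LIKE` is the one that needs a translation step. The sketch below assumes SQL semantics for `%` (any run of characters, including none) and case-sensitive matching, neither of which this specification states. `likeMatches()` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper, not the library's QueryEngine: tests a value against
// a LIKE pattern where % matches any run of characters (including none).
function likeMatches(string $pattern, string $value): bool
{
    // Escape each literal piece so characters like "." are not treated
    // as regex syntax, then join the pieces with ".*" in place of "%".
    $pieces = array_map(
        fn (string $piece) => preg_quote($piece, '/'),
        explode('%', $pattern)
    );
    // Case-sensitive matching is an assumption; the spec does not say.
    return preg_match('/^' . implode('.*', $pieces) . '$/sD', $value) === 1;
}

$gmail = likeMatches('%@gmail.com', 'alice@gmail.com');
$other = likeMatches('%@gmail.com', 'bob@example.org');
$john  = likeMatches('%john%', 'big john smith');
```

Here `$gmail` and `$john` are true and `$other` is false.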

Logical Operators

`AND`: Both conditions must be true
`OR`: Either condition must be true
`NOT`: Negates condition
Parentheses for grouping: `(a OR b) AND c`

Examples

-- Simple filter
users WHERE active = true

-- Multiple conditions
products WHERE price > 100 AND category = 'Electronics'

-- Pattern matching
users WHERE email LIKE '%@gmail.com'

-- Sorting and pagination
orders WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10

-- Field selection
users SELECT id, name, email WHERE verified = true
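To make the clause semantics concrete, here is roughly what `orders WHERE status = 'pending' ORDER BY created_at DESC LIMIT 2` would amount to in plain PHP. This is an illustrative sketch with made-up sample rows, and it assumes the SQL evaluation order (filter, then sort, then limit); it does not use the library's query engine.

```php
<?php
// Made-up sample rows for illustration only.
$orders = [
    ['id' => 1, 'status' => 'pending', 'created_at' => '2024-01-03'],
    ['id' => 2, 'status' => 'shipped', 'created_at' => '2024-01-04'],
    ['id' => 3, 'status' => 'pending', 'created_at' => '2024-01-05'],
    ['id' => 4, 'status' => 'pending', 'created_at' => '2024-01-01'],
];

// WHERE status = 'pending'
$rows = array_filter($orders, fn (array $o) => $o['status'] === 'pending');

// ORDER BY created_at DESC (ISO dates sort correctly as strings)
usort($rows, fn (array $a, array $b) => strcmp($b['created_at'], $a['created_at']));

// LIMIT 2 (no OFFSET)
$rows = array_slice($rows, 0, 2);

// SELECT id
$ids = array_column($rows, 'id');
```

The two newest pending orders come back, newest first, so `$ids` holds 3 and then 1.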

Streaming Format

Chunk Structure

The first chunk includes the full schema:

@schema[id:int, name:str]

records(100):
  1, "First"
  2, "Second"
  ...

Subsequent chunks omit the schema and use the continuation syntax, marked by a `+` after the table name:

records+(100):
  101, "Next"
  102, "Another"
  ...
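The two header forms can be produced with a few lines of code. The sketch below is not the library's `StreamEncoder`; `renderChunks()` is a hypothetical helper. It assumes the number in parentheses is the record count of the chunk itself, and it uses `json_encode` to quote values, which may not match ATON's exact escaping rules.

```php
<?php
// Hypothetical sketch: splits rows into chunks and renders each chunk
// with the header forms shown above.
function renderChunks(string $table, string $schema, array $rows, int $chunkSize): array
{
    $chunks = [];
    foreach (array_chunk($rows, $chunkSize) as $i => $chunkRows) {
        // The first chunk carries the schema; later ones use the "+" form.
        $header = $i === 0
            ? sprintf("@schema[%s]\n\n%s(%d):", $schema, $table, count($chunkRows))
            : sprintf("%s+(%d):", $table, count($chunkRows));

        $lines = array_map(
            // json_encode gives bare numbers and double-quoted strings.
            fn (array $row) => '  ' . implode(', ', array_map('json_encode', $row)),
            $chunkRows
        );
        $chunks[] = $header . "\n" . implode("\n", $lines);
    }
    return $chunks;
}

$rows = [[1, 'First'], [2, 'Second'], [3, 'Next']];
$chunks = renderChunks('records', 'id:int, name:str', $rows, 2);
```

With a chunk size of 2, the three sample rows yield one schema-bearing chunk and one continuation chunk.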

Metadata

Each chunk includes:

- chunkId: Current chunk number (0-indexed)
- totalChunks: Total number of chunks
- isFirst: Boolean, true for first chunk
- isLast: Boolean, true for last chunk
- metadata.table: Table name
- metadata.recordsInChunk: Records in this chunk
- metadata.startIdx: Starting record index
- metadata.endIdx: Ending record index
- metadata.totalRecords: Total records across all chunks
- metadata.progress: Completion fraction (0.0 to 1.0)
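These fields all follow from the total record count, the chunk size, and the chunk number. The sketch below shows one way to derive them. `chunkInfo()` is a hypothetical helper; the nested array layout, the 0-based inclusive `endIdx`, and progress being measured after the chunk completes are all assumptions not confirmed by this specification.

```php
<?php
// Hypothetical sketch: derives the per-chunk fields listed above.
function chunkInfo(int $totalRecords, int $chunkSize, int $chunkId, string $table): array
{
    $totalChunks = (int) ceil($totalRecords / $chunkSize);
    $startIdx = $chunkId * $chunkSize;
    // Inclusive end index; the last chunk may be shorter than $chunkSize.
    $endIdx = min($startIdx + $chunkSize, $totalRecords) - 1;

    return [
        'chunkId' => $chunkId,
        'totalChunks' => $totalChunks,
        'isFirst' => $chunkId === 0,
        'isLast' => $chunkId === $totalChunks - 1,
        'metadata' => [
            'table' => $table,
            'recordsInChunk' => $endIdx - $startIdx + 1,
            'startIdx' => $startIdx,
            'endIdx' => $endIdx,
            'totalRecords' => $totalRecords,
            // Fraction of all records emitted once this chunk is done.
            'progress' => fdiv($endIdx + 1, $totalRecords),
        ],
    ];
}

// Third chunk (chunkId 2) of 250 records split into chunks of 100.
$last = chunkInfo(250, 100, 2, 'records');
```

For 250 records in chunks of 100, the third chunk covers indices 200 to 249, holds 50 records, and is flagged as the last one.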

Compression Modes

FAST

No dictionary compression
Fastest encoding
Best for: Small datasets, real-time encoding

BALANCED (Default)

Dictionary compression for strings ≥5 chars appearing ≥3 times
Good balance of speed and compression
Best for: General purpose use

ULTRA

Aggressive dictionary compression (≥3 chars, ≥2 times)
Maximum compression
Best for: Large datasets, bandwidth-constrained scenarios

ADAPTIVE

Automatically selects mode based on data size:

- < 1KB: FAST
- 1KB to 10KB: BALANCED
- > 10KB: ULTRA
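The size bands translate into a short selection function. This is a sketch under stated assumptions: `pickAdaptiveMode()` is a hypothetical helper, 1KB is taken to mean 1024 bytes, and the function returns the mode name as a string instead of the library's `CompressionMode` enum. How the library measures data size is not specified here, so the caller passes a byte count directly.

```php
<?php
// Hypothetical helper mirroring the size bands listed above.
function pickAdaptiveMode(int $bytes): string
{
    if ($bytes < 1024) {
        return 'FAST';       // under 1KB
    }
    if ($bytes <= 10 * 1024) {
        return 'BALANCED';   // 1KB to 10KB inclusive
    }
    return 'ULTRA';          // over 10KB
}

$small  = pickAdaptiveMode(512);
$medium = pickAdaptiveMode(4096);
$large  = pickAdaptiveMode(50000);
```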

PHP Implementation

Encoder

use Aton\Encoder;
use Aton\Enums\CompressionMode;

$encoder = new Encoder(
    optimize: true,
    compression: CompressionMode::BALANCED,
    queryable: true,
    validate: true
);

// Basic encoding
$aton = $encoder->encode($data);

// With query filter
$aton = $encoder->encodeWithQuery($data, "users WHERE active = true");

// Get stats
$stats = $encoder->getCompressionStats($data);

Decoder

use Aton\Decoder;

$decoder = new Decoder(validate: true);
$data = $decoder->decode($atonString);

Query Engine

use Aton\QueryEngine;

$engine = new QueryEngine();
$query = $engine->parse("products WHERE price > 100 ORDER BY price DESC");
$results = $engine->execute($data, $query);

Stream Encoder

use Aton\StreamEncoder;
use Aton\Enums\CompressionMode;

$encoder = new StreamEncoder(
    chunkSize: 100,
    compression: CompressionMode::BALANCED
);

foreach ($encoder->streamEncode($largeData) as $chunk) {
    // processChunk() is a placeholder for your own handler.
    processChunk($chunk['data']);
}

Migration from V1

V2 is fully backward compatible with V1. To produce V1-style output, turn off the V2 optimizations:

use Aton\Encoder;
use Aton\Enums\CompressionMode;

$encoder = new Encoder(
    optimize: false,                    // Disable defaults optimization
    compression: CompressionMode::FAST  // No dictionary compression
);

The V2 decoder reads both V1 and V2 formats without any changes.