About the Service

How FormatSense works & why it exists

FormatSense is a service for automated analysis and decoding of binary data files — unknown, proprietary, and legacy storage and serialization formats — using coordinated LLM agents.

Origin

The project grew out of years of hands-on work with proprietary, legacy, and undocumented data formats. Along the way, from legacy systems to archives found in specialized communities, we accumulated a set of binary analysis tools and an extensive knowledge base of format signatures. Some of these tools were built for specific tasks that no public solution could handle: non-standard packings, hybrid containers, formats with platform-dependent alignment.

In parallel, we built a collection of test samples — real files from dozens of domains: telemetry files from industrial controllers, configuration dumps from medical equipment, geographic information systems, serialized database structures, financial mainframe exports, game engine archives. This collection became the foundation for algorithm validation and agent benchmarking.

In early 2025, we began experimenting with applying LLM agents to the analysis of unknown binary files. Early results were promising but exposed systemic limitations: agents were reluctant to use complex tools, lost context on long files, and failed to build coherent hypotheses. It took a year of iterative work — redesigning tools into formats comprehensible to agents, designing multi-level orchestration, creating an evaluation pipeline on real samples — to achieve consistently reproducible analysis quality.

How It Works

The core of the service is a pipeline of coordinated LLM agents, each with access to a set of specialized tools:

  • Binary analysis tools — utilities for inspecting file structure: pattern search, header extraction, entropy analysis, encoding detection.
  • Signature databases — agents cross-reference discovered patterns against an extensive database of known magic bytes, header structures, and characteristic format markers.
  • Coordination and workflow — agents work iteratively: forming hypotheses about file structure, verifying them with tools, refining and deepening the analysis. An orchestrator distributes tasks and monitors progress.
  • Report generation — results are compiled into a structured report with format classification, structure description, and, where possible, extracted data.

The tools are designed specifically for use by LLM agents — their output is optimized for machine-readable interpretation rather than human viewing.
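To illustrate what "machine-readable output" means in practice, here is a minimal sketch of a signature-matching tool that emits structured JSON instead of a human-oriented listing. The signature table and the function name `identify` are illustrative only; the service's actual database and tool interfaces are not published here.

```python
import json

# Illustrative signature table. A real database is far larger and
# also handles offset-dependent and masked signatures.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP container",
    b"SQLite format 3\x00": "SQLite database",
    b"\x1f\x8b": "gzip stream",
}

def identify(data: bytes) -> str:
    """Match known magic bytes at offset 0 and emit a JSON report
    that an agent can parse without guessing at layout."""
    matches = [
        {"offset": 0, "length": len(magic), "format": name}
        for magic, name in SIGNATURES.items()
        if data.startswith(magic)
    ]
    return json.dumps({"matches": matches, "bytes_scanned": len(data)})

# e.g. identify(b"PK\x03\x04" + b"\x00" * 16) reports a ZIP container match
```

Structured output like this is what lets an orchestrator route results between agents without a fragile text-parsing step in between.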

BYOK Model

The service operates on a Bring Your Own Key model — you connect your own LLM provider API key. We do not act as an intermediary and do not resell tokens. Keys are encrypted at rest and deleted along with job data after analysis is complete.

Limitations

Analysis quality depends on the amount of structural information available in the file. The most complete results are achieved on formats with prominent markers: magic bytes, fixed-length headers, repeating record structures, text labels.
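As a sketch of why such markers help, consider a hypothetical format with magic bytes, a fixed-length header, and repeating fixed-size records. Everything about the layout below (the `FSEN` magic, the field types) is invented for illustration; in real analysis this structure is what the agents have to infer, not assume.

```python
import struct

# Hypothetical layout: b"FSEN" magic, then little-endian u16 version
# and u32 record count, followed by fixed 8-byte records of
# (u32 timestamp, f32 value).
HEADER = struct.Struct("<4sHI")   # 10-byte header
RECORD = struct.Struct("<If")     # 8-byte record

def parse(data: bytes):
    magic, version, count = HEADER.unpack_from(data, 0)
    if magic != b"FSEN":
        raise ValueError("unknown magic: %r" % magic)
    records = [
        RECORD.unpack_from(data, HEADER.size + i * RECORD.size)
        for i in range(count)
    ]
    return version, records
```

Once a header like this is confirmed, the record count and fixed stride let the rest of the file be decoded mechanically, which is exactly what makes marker-rich formats the easiest case.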

More challenging to analyze:

  • Encrypted data — without a key, content is indistinguishable from random noise, making structural analysis impossible. The service can detect that content is encrypted, and sometimes identify the algorithm, but cannot decode it.
  • Custom compression without headers — if the compression format does not match known algorithms and has no signature, agents cannot decompress the data for further analysis.
  • Flat streams without boundaries — continuous data without record delimiters (e.g., raw sensor dumps with a fixed sample rate) can be described structurally, but without external context the semantics of fields remain hypothetical.

In such cases, the service returns a partial result: discovered patterns, statistical data profile, structural hypotheses with confidence levels — rather than an empty response or an error.
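The encryption check behind that partial result can be sketched as a simple entropy heuristic: encrypted or well-compressed data has near-maximal Shannon entropy per byte, while structured binary formats sit noticeably lower. The function names and the 7.9-bit threshold here are illustrative assumptions, not the service's actual detector.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: ~8.0 for encrypted or compressed
    data, noticeably lower for structured binary formats."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def looks_encrypted(data: bytes, threshold: float = 7.9) -> bool:
    # Crude heuristic: near-maximal entropy plus no known signature
    # is reported as "likely encrypted or compressed", not decoded.
    return shannon_entropy(data) >= threshold
```

A heuristic like this cannot distinguish encryption from strong compression on its own, which is why the report pairs it with signature checks and confidence levels rather than a definitive verdict.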

Who It's For

The service is designed for engineers, data analysts, and teams who encounter unknown binary formats: during legacy system migrations, data integration from external partners, digital archiving, security incident investigations, or reverse engineering.