Post-LLM Architecture: A Compression-Native Thought Experiment
Thesis
Current LLM efficiency work (TurboQuant, 1-bit quantization, GPTQ) compresses
the representation of a fundamentally redundant architecture. It doesn't fix
the underlying waste: transformers do dense matmul over learned dense vectors
when the underlying signal (language) is sparse, structured, and has ~1 bit/char
of Shannon entropy.
We're betting on a different stack with one coherent claim:
A small parametric reasoner, paired with external memory, with output
measured directly in bits, can match or beat a same-size transformer on
compression — and dominate it on inference cost.
Compression and intelligence are formally linked (Solomonoff, Hutter,
Schmidhuber; DeepMind 2023 "Language Modeling Is Compression"). If that link
is real, optimizing directly for bits/char is a more honest objective than
cross-entropy-on-tokens, and a model that does it well should be smaller.
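For concreteness, the objective swap is just a change of units and symbol granularity: cross-entropy measured in base 2 over raw bytes is bits/char. A minimal PyTorch sketch of the metric (function and variable names are illustrative, not part of the plan):

```python
import math

import torch
import torch.nn.functional as F

def bits_per_char(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Average code length in bits per byte implied by a byte-level model.

    logits:  (seq_len, 256) unnormalized scores over the next byte
    targets: (seq_len,)     the bytes that actually occurred
    """
    nats = F.cross_entropy(logits, targets)  # mean negative log-likelihood, in nats
    return nats.item() / math.log(2)         # nats -> bits: the coder's cost per char
```

The same number is directly comparable to gzip's compressed size divided by the corpus length, which is what the Phase 1 criteria rely on.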
Architecture (composed, not piled)
Four pieces, each attacking a different source of waste:
1. Mamba/SSM backbone: kills quadratic attention and the inference-time memory blowup.
2. KAN-style MLPs: learned activation functions on edges instead of fixed activations on nodes. Smaller and more interpretable per layer. Drop if it slows research iteration.
3. External retrieval memory: facts live in an index, not in weights. The parametric model learns how to think, not *what to recall*.
4. Arithmetic-coded output head: the model's next-byte distribution drives an arithmetic coder over the output stream. Trained directly on bits/char, not token cross-entropy. This is the least-standard piece and the one that makes the thesis testable (a toy coding sketch follows this list).
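To make item 4 concrete, here is a toy, float-precision sketch of the coding loop: the model supplies a next-byte distribution, the coder narrows an interval, and the cost it pays is -log2 p(byte). It omits the integer renormalization a real coder needs, and all names (`encode_interval`, `next_byte_probs`) are illustrative assumptions, not part of the plan.

```python
import math
from typing import Callable, Sequence

def encode_interval(data: bytes,
                    next_byte_probs: Callable[[bytes], Sequence[float]]) -> float:
    """Toy arithmetic-coding pass over `data`; returns the ideal code length in bits.

    next_byte_probs(prefix) must return a 256-entry probability vector for the
    next byte given the prefix; supplying that vector is the model's only job.
    Float precision limits this to short inputs; a real coder renormalizes.
    """
    low, high = 0.0, 1.0
    for i, byte in enumerate(data):
        probs = next_byte_probs(data[:i])
        # Cumulative sub-interval assigned to this byte by the model.
        cum_lo = sum(probs[:byte])
        cum_hi = cum_lo + probs[byte]
        # Narrow the coder's interval to that sub-interval.
        span = high - low
        high = low + span * cum_hi
        low = low + span * cum_lo
    # Naming a point inside [low, high) costs -log2(high - low) bits,
    # which equals the sum of -log2 p(byte) over the sequence.
    return -math.log2(high - low)

# A uniform model pays exactly 8 bits per byte (prints ~24.0 for 3 bytes):
print(encode_interval(b"abc", lambda prefix: [1 / 256] * 256))
```

Because the coder's cost is exactly the sum of -log2 p(observed byte), one candidate training path is to minimize that sum directly and keep the non-differentiable coder out of the gradient path entirely; whether that is the right reading of "trained directly on bits/char" is open question #1 below.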
Explicitly not combining: energy-based models (#5 from earlier discussion).
Different training paradigm; doesn't compose with autoregressive SSM.
Phase 1 — PyTorch prototype (research velocity)
Goal: find out whether the architecture works at all before investing in a
performance rewrite.
- Scale: ~10M parameters
- Corpus: enwik8 (the Hutter Prize benchmark — a natural fit for a compression objective)
- Components:
- Mamba/SSM backbone
- KAN-style MLPs (start simple; ablate if it slows us down)
- Arithmetic-coded output head
- Retrieval interface (stub for Phase 1; real index in Phase 2)
- Baselines (same parameter budget, same corpus; a sketch for measuring the gzip floor follows this list):
- gzip — the trivial floor
- Vanilla transformer — the real benchmark
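One way to pin down the gzip floor before any model exists, assuming a local enwik8 file (the path and compression level below are placeholders):

```python
import gzip
from pathlib import Path

def gzip_bits_per_char(path: str = "enwik8") -> float:
    """Bits per input byte that gzip spends at maximum compression."""
    raw = Path(path).read_bytes()
    compressed = gzip.compress(raw, compresslevel=9)
    return 8 * len(compressed) / len(raw)

if __name__ == "__main__":
    # The floor the Phase 1 model must beat on the same corpus.
    print(f"gzip floor: {gzip_bits_per_char():.3f} bits/char")
```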
Phase 1 success criteria (decided before building)
- Must: beat gzip on bits/char.
- Must: match the vanilla transformer baseline within 10% on bits/char.
- If both hold → proceed to Phase 2 (the decision rule is sketched in code below).
- If not → the thesis needs revision before we port anything to Rust.
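The criteria reduce to two inequalities; a trivial encoding of the decision rule, assuming "within 10%" means at most 1.10x the transformer's bits/char:

```python
def phase1_go(model_bpc: float, gzip_bpc: float, transformer_bpc: float) -> bool:
    """Go/no-go for Phase 2, reading 'within 10%' as <= 1.10x the transformer."""
    beats_gzip = model_bpc < gzip_bpc
    matches_transformer = model_bpc <= 1.10 * transformer_bpc
    return beats_gzip and matches_transformer
```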
Phase 2 — Rust inference port (thesis demonstration)
Only triggered if Phase 1 succeeds.
- Forward pass on Candle (Hugging Face's Rust ML framework)
- Native Rust arithmetic coder — bit-twiddling tight loops, where Rust shines
- Native Rust retrieval index (tantivy / qdrant-style)
- Target hardware: laptop CPU, no CUDA
- Measure: tokens/sec, peak RAM, bits/char (must match Phase 1)
Why Rust is for Phase 2, not Phase 1
~95% of training time is inside CUDA kernels. Rust doesn't make cuBLAS
faster. The Rust win is in the outer loop (arithmetic coding, retrieval,
inference orchestration) and on no-GPU targets (edge inference). Both of
those are Phase 2 concerns. Doing pure-Rust from day one would burn months
reimplementing autograd before we learn whether the architecture works at
all — wrong order of operations.
Open questions (to resolve before coding)
1. Arithmetic-coded head training. The least-standard piece; need to nail down the loss function and gradient path before writing the model.
2. Retrieval interface contract. What does the Phase 1 stub need to expose so that swapping in a real index in Phase 2 doesn't require a model rewrite? (A minimal interface sketch follows this list.)
3. Input representation. Raw bytes, or a subword tokenizer in Phase 1 (standard, but muddies the compression metric)?
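For question 2, one hedged candidate is to freeze a tiny query interface now and let both the Phase 1 stub and any Phase 2 index implement it. Everything below (names, shapes, the `retrieve` signature) is an assumption, not a settled design:

```python
from typing import Protocol

import torch

class Retriever(Protocol):
    def retrieve(self, query: torch.Tensor, k: int) -> torch.Tensor:
        """Map (batch, d_model) queries to (batch, k, d_model) memory vectors."""
        ...

class NullRetriever:
    """Phase 1 stub: returns an empty memory bank so the model runs without an index."""

    def retrieve(self, query: torch.Tensor, k: int) -> torch.Tensor:
        batch, d_model = query.shape
        return query.new_zeros(batch, 0, d_model)
```

If the model only ever touches the output of `retrieve`, swapping `NullRetriever` for a tantivy- or qdrant-backed implementation in Phase 2 becomes a constructor change rather than a model rewrite.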
Non-goals
- Beating GPT-4 on anything. This is a thesis test at 10M params.
- A general-purpose framework. We're building one specific architecture to test one specific thesis.
- Energy-based / diffusion / non-autoregressive variants. Different bet.
- Rust autograd from scratch. Use Candle if/when we get to Phase 2.
Status
Plan locked. Next step: resolve open question #1 (arithmetic-coded head
training) before writing model code.