Post-LLM Architecture: A Compression-Native Thought Experiment
Thesis
Current LLM efficiency work (TurboQuant, 1-bit quantization, GPTQ) compresses
the representation of a fundamentally redundant architecture. It doesn't fix
the underlying waste: transformers do dense matmul over learned dense vectors
when the underlying signal (language) is sparse, structured, and has ~1 bit/char
of Shannon entropy.
We're betting on a different stack with one coherent claim:
A small parametric reasoner, paired with external memory, with output
measured directly in bits, can match or beat a same-size transformer on
compression — and dominate it on inference cost.
Compression and intelligence are formally linked (Solomonoff, Hutter,
Schmidhuber; DeepMind 2023 "Language Modeling Is Compression"). If that link
is real, optimizing directly for bits/char is a more honest objective than
cross-entropy-on-tokens, and a model that does it well should be smaller.
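For concreteness, the objective swap is just a change of units and symbol granularity: cross-entropy measured in base 2 over raw bytes is bits/char. A minimal PyTorch sketch of the metric (function and variable names are illustrative, not part of the plan):

```python
import math

import torch
import torch.nn.functional as F

def bits_per_char(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Average code length in bits per byte implied by a byte-level model.

    logits:  (seq_len, 256) unnormalized scores over the next byte
    targets: (seq_len,)     the bytes that actually occurred
    """
    nats = F.cross_entropy(logits, targets)  # mean negative log-likelihood, in nats
    return nats.item() / math.log(2)         # nats -> bits: the coder's cost per char
```

The same number is directly comparable to gzip's compressed size divided by the corpus length, which is what the Phase 1 criteria rely on.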
Architecture (composed, not piled)
Four pieces, each attacking a different source of waste:
1. Mamba/SSM backbone: kills quadratic attention and the inference-time memory blowup.
2. KAN-style MLPs: learned activation functions on edges instead of fixed activations on nodes. Smaller and more interpretable per layer. Drop if it slows research iteration.
3. External retrieval memory: facts live in an index, not in weights. The parametric model learns how to think, not *what to recall*.
4. Arithmetic-coded output head: the model's next-byte distribution drives an arithmetic coder over the output stream. Trained directly on bits/char, not token cross-entropy. This is the least-standard piece and the one that makes the thesis testable (a toy coding sketch follows this list).
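To make item 4 concrete, here is a toy, float-precision sketch of the coding loop: the model supplies a next-byte distribution, the coder narrows an interval, and the cost it pays is -log2 p(byte). It omits the integer renormalization a real coder needs, and all names (`encode_interval`, `next_byte_probs`) are illustrative assumptions, not part of the plan.

```python
import math
from typing import Callable, Sequence

def encode_interval(data: bytes,
                    next_byte_probs: Callable[[bytes], Sequence[float]]) -> float:
    """Toy arithmetic-coding pass over `data`; returns the ideal code length in bits.

    next_byte_probs(prefix) must return a 256-entry probability vector for the
    next byte given the prefix; supplying that vector is the model's only job.
    Float precision limits this to short inputs; a real coder renormalizes.
    """
    low, high = 0.0, 1.0
    for i, byte in enumerate(data):
        probs = next_byte_probs(data[:i])
        # Cumulative sub-interval assigned to this byte by the model.
        cum_lo = sum(probs[:byte])
        cum_hi = cum_lo + probs[byte]
        # Narrow the coder's interval to that sub-interval.
        span = high - low
        high = low + span * cum_hi
        low = low + span * cum_lo
    # Naming a point inside [low, high) costs -log2(high - low) bits,
    # which equals the sum of -log2 p(byte) over the sequence.
    return -math.log2(high - low)

# A uniform model pays exactly 8 bits per byte (prints ~24.0 for 3 bytes):
print(encode_interval(b"abc", lambda prefix: [1 / 256] * 256))
```

Because the coder's cost is exactly the sum of -log2 p(observed byte), one candidate training path is to minimize that sum directly and keep the non-differentiable coder out of the gradient path entirely; whether that is the right reading of "trained directly on bits/char" is open question #1 below.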
Explicitly not combining: energy-based models (#5 from earlier discussion).
Different training paradigm; doesn't compose with autoregressive SSM.
Phase 1 — PyTorch prototype (research velocity)
Goal: find out whether the architecture works at all before investing in a
performance rewrite.
- Scale: ~10M parameters
- Corpus: enwik8 (the Hutter Prize benchmark — a natural fit for a compression objective)
- Components:
- Mamba/SSM backbone
- KAN-style MLPs (start simple; ablate if it slows us down)
- Arithmetic-coded output head
- Retrieval interface (stub for Phase 1; real index in Phase 2)
- Baselines (same parameter budget, same corpus; a sketch for measuring the gzip floor follows this list):
- gzip — the trivial floor
- Vanilla transformer — the real benchmark
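One way to pin down the gzip floor before any model exists, assuming a local enwik8 file (the path and compression level below are placeholders):

```python
import gzip
from pathlib import Path

def gzip_bits_per_char(path: str = "enwik8") -> float:
    """Bits per input byte that gzip spends at maximum compression."""
    raw = Path(path).read_bytes()
    compressed = gzip.compress(raw, compresslevel=9)
    return 8 * len(compressed) / len(raw)

if __name__ == "__main__":
    # The floor the Phase 1 model must beat on the same corpus.
    print(f"gzip floor: {gzip_bits_per_char():.3f} bits/char")
```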
Phase 1 success criteria (decided before building)
- Must: beat gzip on bits/char.
- Must: match the vanilla transformer baseline within 10% on bits/char.
- If both hold → proceed to Phase 2 (the decision rule is sketched in code below).
- If not → the thesis needs revision before we port anything to Rust.
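The criteria reduce to two inequalities; a trivial encoding of the decision rule, assuming "within 10%" means at most 1.10x the transformer's bits/char:

```python
def phase1_go(model_bpc: float, gzip_bpc: float, transformer_bpc: float) -> bool:
    """Go/no-go for Phase 2, reading 'within 10%' as <= 1.10x the transformer."""
    beats_gzip = model_bpc < gzip_bpc
    matches_transformer = model_bpc <= 1.10 * transformer_bpc
    return beats_gzip and matches_transformer
```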
Phase 2 — Rust inference port (thesis demonstration)
Only triggered if Phase 1 succeeds.
- Forward pass on Candle (Hugging Face's Rust ML framework)
- Native Rust arithmetic coder — bit-twiddling tight loops, where Rust shines
- Native Rust retrieval index (tantivy / qdrant-style)
- Target hardware: laptop CPU, no CUDA
- Measure: tokens/sec, peak RAM, bits/char (must match Phase 1)
Why Rust is for Phase 2, not Phase 1
~95% of training time is inside CUDA kernels. Rust doesn't make cuBLAS
faster. The Rust win is in the outer loop (arithmetic coding, retrieval,
inference orchestration) and on no-GPU targets (edge inference). Both of
those are Phase 2 concerns. Doing pure-Rust from day one would burn months
reimplementing autograd before we learn whether the architecture works at
all — wrong order of operations.
Open questions (to resolve before coding)
1. Arithmetic-coded head training. The least-standard piece; need to nail down the loss function and gradient path before writing the model.
2. Retrieval interface contract. What does the Phase 1 stub need to expose so that swapping in a real index in Phase 2 doesn't require a model rewrite? (A minimal interface sketch follows this list.)
3. Input representation. Raw bytes, or a subword tokenizer in Phase 1 (standard, but muddies the compression metric)?
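For question 2, one hedged candidate is to freeze a tiny query interface now and let both the Phase 1 stub and any Phase 2 index implement it. Everything below (names, shapes, the `retrieve` signature) is an assumption, not a settled design:

```python
from typing import Protocol

import torch

class Retriever(Protocol):
    def retrieve(self, query: torch.Tensor, k: int) -> torch.Tensor:
        """Map (batch, d_model) queries to (batch, k, d_model) memory vectors."""
        ...

class NullRetriever:
    """Phase 1 stub: returns an empty memory bank so the model runs without an index."""

    def retrieve(self, query: torch.Tensor, k: int) -> torch.Tensor:
        batch, d_model = query.shape
        return query.new_zeros(batch, 0, d_model)
```

If the model only ever touches the output of `retrieve`, swapping `NullRetriever` for a tantivy- or qdrant-backed implementation in Phase 2 becomes a constructor change rather than a model rewrite.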
Non-goals
- Beating GPT-4 on anything. This is a thesis test at 10M params.
- A general-purpose framework. We're building one specific architecture to test one specific thesis.
- Energy-based / diffusion / non-autoregressive variants. Different bet.
- Rust autograd from scratch. Use Candle if/when we get to Phase 2.
Status
Plan locked. Next step: resolve open question #1 (arithmetic-coded head
training) before writing model code.