[LOG/Development·04.28.2026]

Post-LLM Architecture: A Compression-Native Thought Experiment

Thesis

Current LLM efficiency work (TurboQuant, 1-bit quantization, GPTQ) compresses
the representation of a fundamentally redundant architecture. It doesn't fix
the underlying waste: transformers do dense matmul over learned dense vectors
when the underlying signal (language) is sparse, structured, and has ~1 bit/char
of Shannon entropy.

We're betting on a different stack with one coherent claim:

A small parametric reasoner, paired with external memory, with output
measured directly in bits, can match or beat a same-size transformer on
compression — and dominate it on inference cost.

Compression and intelligence are formally linked (Solomonoff, Hutter,
Schmidhuber; DeepMind 2023, "Language Modeling Is Compression"). If that link
is real, optimizing directly for bits/char is a more honest objective than
cross-entropy-on-tokens, and a model that does it well should be smaller.
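
To pin down the unit: for a byte-level model, bits/char is the mean negative
log-likelihood converted from nats to bits and normalized per character rather
than per token. A minimal sketch of the metric in PyTorch (the 256-way byte
vocabulary is an assumption, pending the tokenization question below):

```python
import math

import torch
import torch.nn.functional as F

def bits_per_char(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Average code length in bits per character.

    logits:  (batch, seq, 256) byte-level predictions
    targets: (batch, seq) ground-truth bytes as long ints

    F.cross_entropy returns the mean NLL in nats; dividing by ln(2) converts
    to bits. With byte-level tokens, one token is one character of enwik8,
    so the number is directly comparable to gzip's compressed_bits / raw_bytes.
    """
    nll_nats = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    return nll_nats / math.log(2)
```

Whether the arithmetic-coded head needs anything beyond this as its training
loss is exactly open question #1 below.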

Architecture (composed, not piled)

Four pieces, each attacking a different source of waste:

  • Mamba / SSM backbone — linear-time sequence modeling, no KV cache.
    Kills quadratic attention and the inference-time memory blowup.

  • KAN-style MLP replacement — learnable activations on edges (splines)
    instead of fixed activations on nodes. Smaller and more interpretable per
    layer. Drop if it slows research iteration.

  • Retrieval + tiny reasoner — knowledge lives in an external index, not in
    weights. The parametric model learns how to think, not *what to recall*.

  • Compression-native output head — model emits an arithmetic-coded stream.
    Trained directly on bits/char, not token cross-entropy. This is the
    least-standard piece and the one that makes the thesis testable
    (rough composition sketched below).
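
A rough sketch of how the four pieces might compose in the Phase 1 prototype.
Nothing here is the real design: the SSM and KAN blocks are toy stand-ins
(a gated causal convolution and an RBF basis) so the file runs without
external dependencies, the retrieval path is the Phase 1 stub, and all names
and sizes are invented. The point is only the data flow: bytes in, retrieval
conditioning added, backbone, KAN-style head, 256-way byte logits out.

```python
import torch
import torch.nn as nn


class SSMBlock(nn.Module):
    """Placeholder for a Mamba/selective-SSM layer: here just a causal
    gated 1-D convolution so the skeleton runs stand-alone."""

    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, 2 * d_model, kernel_size=4, padding=3)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, D)
        h = self.conv(x.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        h, gate = h.chunk(2, dim=-1)
        return x + self.proj(h * torch.sigmoid(gate))


class KANHead(nn.Module):
    """Placeholder for a KAN-style layer: per-input learnable 1-D functions
    approximated with a small RBF basis instead of true splines."""

    def __init__(self, d_in: int, d_out: int, basis: int = 8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-2.0, 2.0, basis))
        self.mix = nn.Linear(d_in * basis, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_in)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2))  # RBF basis
        return self.mix(phi.flatten(-2))


class CompressionNativeLM(nn.Module):
    """Data flow only: bytes -> embed (+ stubbed retrieval conditioning)
    -> SSM stack -> KAN-style head -> 256-way byte logits. The logits feed
    the bits/char loss in training and an arithmetic coder at inference."""

    def __init__(self, d_model: int = 256, n_layers: int = 4, vocab: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.retrieval_proj = nn.Linear(d_model, d_model)  # Phase 1 stub path
        self.backbone = nn.ModuleList([SSMBlock(d_model) for _ in range(n_layers)])
        self.head = KANHead(d_model, vocab)

    def forward(self, byte_ids: torch.Tensor, retrieved=None) -> torch.Tensor:
        x = self.embed(byte_ids)                   # (B, T, D)
        if retrieved is not None:                  # (B, T, D) from stub/index
            x = x + self.retrieval_proj(retrieved)
        for block in self.backbone:
            x = block(x)
        return self.head(x)                        # (B, T, 256)
```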

Explicitly not combining: energy-based models (#5 from earlier discussion).
Different training paradigm; doesn't compose with autoregressive SSM.

Phase 1 — PyTorch prototype (research velocity)

Goal: find out whether the architecture works at all before investing in a
performance rewrite.

  • Scale: ~10M parameters
  • Corpus: enwik8 (the Hutter Prize benchmark — natural fit for a compression
    thesis)
  • Components:
      • Mamba/SSM backbone
      • KAN-style MLPs (start simple; ablate if it slows us down)
      • Arithmetic-coded output head
      • Retrieval interface (stub for Phase 1; real index in Phase 2)
  • Baselines (same parameter budget, same corpus):
      • gzip — the trivial floor (measurement sketch below)
      • Vanilla transformer — the real benchmark
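
The gzip floor can be measured before any model code exists. A sketch using
only the standard library (the corpus path is a placeholder):

```python
import gzip
from pathlib import Path

def gzip_bits_per_char(path: str = "data/enwik8") -> float:
    """Compressed size in bits divided by raw size in bytes.

    Byte == character for enwik8, matching the Hutter Prize convention, so
    this number is directly comparable to the model's bits/char.
    """
    raw = Path(path).read_bytes()
    compressed = gzip.compress(raw, compresslevel=9)
    return 8 * len(compressed) / len(raw)
```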

Phase 1 success criteria (decided before building)

  • Must: beat gzip on bits/char.
  • Must: match the vanilla transformer baseline within 10% on bits/char.
  • If both hold → proceed to Phase 2 (the gate is written out below).
  • If not → the thesis needs revision before we port anything to Rust.
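
Read literally, the gate is small enough to write down now (a trivial sketch;
"within 10%" is taken as at most 1.10× the transformer's bits/char):

```python
def phase1_go(model_bpc: float, gzip_bpc: float, transformer_bpc: float) -> bool:
    """Both 'must' criteria: beat gzip outright, and stay within 10% of the
    same-budget transformer baseline on bits/char."""
    return model_bpc < gzip_bpc and model_bpc <= 1.10 * transformer_bpc
```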

Phase 2 — Rust inference port (thesis demonstration)

Only triggered if Phase 1 succeeds.

  • Forward pass on Candle (Hugging Face's Rust ML framework)
  • Native Rust arithmetic coder — the bit-twiddling tight loops where Rust
    genuinely dominates Python (10–100× is realistic)
  • Native Rust retrieval index (tantivy / qdrant-style)
  • Target hardware: laptop CPU, no CUDA
  • Measure: tokens/sec, peak RAM, bits/char (must match Phase 1)

Why Rust is for Phase 2, not Phase 1

~95% of training time is inside CUDA kernels. Rust doesn't make cuBLAS
faster. The Rust win is in the outer loop (arithmetic coding, retrieval,
inference orchestration) and on no-GPU targets (edge inference). Both of
those are Phase 2 concerns. Doing pure-Rust from day one would burn months
reimplementing autograd before we learn whether the architecture works at
all — wrong order of operations.

Open questions (to resolve before coding)

  • How does the arithmetic-coded head train end-to-end? This is the
    least-standard piece. Need to nail down the loss function and gradient
    path before writing the model.

  • Retrieval interface in Phase 1 — what does the stub look like such that
    swapping in a real index in Phase 2 doesn't require a model rewrite?
    (One possible shape is sketched after this list.)

  • KAN keep-or-cut decision rule — what ablation result kills KAN from
    Phase 1?

  • Tokenization — byte-level (cleanest for bits/char) vs. BPE (standard,
    but muddies the compression metric).
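
On the retrieval question, one possible shape for the boundary, offered only
as a starting point: the model depends on a single narrow interface that
returns a fixed-size block of retrieved embeddings, the Phase 1 stub returns
zeros through it, and the Phase 2 index implements the same method. All names
here are hypothetical.

```python
from typing import Protocol

import torch


class Retriever(Protocol):
    """The only contract the model would depend on: a query embedding in,
    a fixed-size block of retrieved context embeddings out."""

    def retrieve(self, query: torch.Tensor, k: int) -> torch.Tensor:
        """query: (B, D) -> returns (B, k, D)"""
        ...


class NullRetriever:
    """Phase 1 stub: returns zeros, so the conditioning path is wired and
    exercised end-to-end but contributes nothing to the prediction."""

    def retrieve(self, query: torch.Tensor, k: int) -> torch.Tensor:
        batch, dim = query.shape
        return query.new_zeros(batch, k, dim)


# A Phase 2 index (tantivy/qdrant-backed) would implement the same retrieve()
# signature; swapping it in is a constructor argument, not a model rewrite.
```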

Non-goals

  • Beating GPT-4 on anything. This is a thesis test at 10M params.
  • A general-purpose framework. We're building one specific architecture to
    test one specific claim.
  • Energy-based / diffusion / non-autoregressive variants. Different bet.
  • Rust autograd from scratch. Use Candle if/when we get to Phase 2.

Status

Plan locked. Next step: resolve open question #1 (arithmetic-coded head
training) before writing model code.