[LOG/Development·04.28.2026]

Phase 1 Results — Baseline Anchor

First end-to-end run: 14.7M-param vanilla transformer, 50k steps, byte-level CE on enwik8, scored against the gzip floor.

Headline numbers

Split    gzip (bits/byte)    transformer (bits/byte)    Improvement
val      2.9246              1.2727                     −1.65 bpb (−56.5%)
test     2.8180              1.2565                     −1.56 bpb (−55.4%)

Test below val (1.26 < 1.27) — no overfitting; 50k steps is on the undertrained side of the budget, not the overfit side.
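
The gzip column is just the compressed size of the same evaluation bytes, expressed in bits per byte. A minimal sketch of that floor computation (the file path and compression level here are assumptions, not details from the run):

```python
import gzip

def gzip_bits_per_byte(data: bytes, level: int = 9) -> float:
    """Compressed size of `data` in bits, divided by its length in bytes."""
    compressed = gzip.compress(data, compresslevel=level)
    return len(compressed) * 8 / len(data)

# Hypothetical usage on a validation split (path is a placeholder).
with open("enwik8.val", "rb") as f:
    print(f"gzip floor: {gzip_bits_per_byte(f.read()):.4f} bits/byte")
```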

What this proves (and doesn't)

Proved:
  • Byte-level cross-entropy in nats per scored byte, divided by ln(2), is the arithmetic-coded code length in bits/byte. Identity, not approximation. No exotic loss, no gradient through the coder, no learnable bits — and the number falls out of standard training cleanly (a minimal sketch of the arithmetic follows this list).

  • The DeepMind 2023 "Language Modeling Is Compression" framing is real on
hardware we own: 14.7M params crush gzip by a factor of ~2.3.
  • The metric pipeline is honest. The gzip baseline lands where independent sources put it; the model number sits inside the published range for ~10M-param byte-level models on enwik8 (typically 1.0–1.4 bpb at comparable budgets).
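
A minimal sketch of that arithmetic, assuming the standard PyTorch cross-entropy (names are illustrative, not the project's eval code):

```python
import math
import torch
import torch.nn.functional as F

def bits_per_byte(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Mean cross-entropy in nats over the scored bytes, converted to bits/byte.

    Up to a few bits of coder overhead, this equals the length of the
    arithmetic-coded stream divided by the number of bytes coded.
    """
    nats = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="mean")
    return nats.item() / math.log(2)
```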

NOT proved:
  • The actual thesis. The thesis is "small parametric reasoner + external memory + bits-as-output beats a same-size transformer." Tonight's run is the baseline transformer in that comparison. The experiment that decides the bet is the next one — same model + frozen retrieval — and we haven't run it yet.

  • Anything about inference cost. That's Phase 2.

Compute and wall time

  • Hardware: RTX 5070 Ti (Blackwell, sm_120), 16 GB VRAM, on WSL2.
  • PyTorch 2.12 nightly + CUDA 12.8 (stable wheels don't yet ship sm_120
kernels — Blackwell needs the nightly).
  • ~13 train steps/sec sustained, ~14 eval batches/sec. 50k training steps
in ~68 min. Full val eval in ~11 s.
  • Peak readings: 250 W power draw, 72 °C die temp, 77% fan. Watchdog
thresholds (88 °C / 100 °C / 115% TGP / fan-stuck) never approached.
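
The watchdog itself isn't shown in this log; a minimal sketch of that kind of guard using pynvml, where the poll interval and the hard-coded 88 °C / 115% TGP limits are assumptions mirroring the thresholds above:

```python
import time
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
tgp_mw = pynvml.nvmlDeviceGetPowerManagementLimit(gpu)  # board power limit, mW

while True:
    temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)  # °C
    power = pynvml.nvmlDeviceGetPowerUsage(gpu)                               # mW
    fan = pynvml.nvmlDeviceGetFanSpeed(gpu)                                   # %
    if temp >= 88 or power >= 1.15 * tgp_mw:
        print(f"watchdog: temp={temp}C power={power/1000:.0f}W fan={fan}%, pausing run")
        break  # the real guard would throttle or checkpoint-and-stop here
    time.sleep(5)
```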

What we changed mid-run

One real bug found and fixed: eval.py was iterating ByteSequenceDataset with stride 1, scoring every token ~seq_len times. Fixed by tiling non-overlapping (seq_len + 1)-length chunks and dividing total nats by the actual prediction count. The first eval run didn't finish; the second (with the fix) ran in 11 s and produced the numbers above.
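
A minimal sketch of what the fixed eval loop does, assuming a model that maps a [1, seq_len] byte tensor to [1, seq_len, 257] logits (names and shapes are illustrative, not the actual eval.py):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bits_per_byte(model, data: torch.Tensor, seq_len: int = 1024) -> float:
    """Non-overlapping eval: every byte is predicted exactly once."""
    # Sanity check: n_chunks should be on the order of total_bytes / seq_len;
    # the stride-1 bug made the queue ~seq_len times larger than that.
    n_chunks = len(data) // (seq_len + 1)
    total_nats, total_preds = 0.0, 0
    for i in range(n_chunks):
        chunk = data[i * (seq_len + 1):(i + 1) * (seq_len + 1)]
        x, y = chunk[:-1].unsqueeze(0), chunk[1:].unsqueeze(0)
        logits = model(x)  # [1, seq_len, 257]
        total_nats += F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      y.reshape(-1), reduction="sum").item()
        total_preds += y.numel()
    return total_nats / total_preds / math.log(2)
```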

Lesson worth keeping: training-time datasets and eval-time iteration have opposite sampling needs. Catching it required noticing eval was queueing ~150k batches when it should have been ~150. Always sanity-check eval batch count against total_bytes / seq_len.

What's locked vs. tentative

Locked (these are reproducible now and shouldn't change for the rest of Phase 1):

  • Byte-level 257-symbol vocab.
  • Plain CE loss; bits/byte computed in eval, not training.
  • Non-overlapping eval; gzip on the same source bytes.
  • 14.7M-param transformer baseline (a hair over the 10M target, but the closest "round" config — d=384, h=6, L=8, ff=1536 — and the right ballpark; a back-of-envelope parameter count follows these lists).

Tentative (likely to change as we run more variants):
  • 50k steps. The training curve was still dropping at the end. A 4-hour
overnight run at 200k steps would give us a less undertrained anchor.
  • seq_len=1024. Fine for a transformer; we'll revisit when the SSM
variant lands.
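
For the record, a back-of-envelope count of that config, assuming learned positional embeddings at seq_len=1024, bias-free attention/FF projections, and an output head tied to the byte embedding (all assumptions; the real module layout may differ slightly):

```python
d, ff, layers, vocab, seq_len = 384, 1536, 8, 257, 1024

attn = 4 * d * d        # q, k, v, out projections
mlp = 2 * d * ff        # up and down projections
norms = 2 * 2 * d       # two LayerNorms (weight + bias) per block
block = attn + mlp + norms

total = (layers * block   # 8 transformer blocks
         + vocab * d      # byte embedding (257 symbols)
         + seq_len * d    # learned positional embedding (assumed)
         + 2 * d)         # final LayerNorm; output head tied to embedding

print(f"~{total / 1e6:.2f}M params")  # ~14.66M, consistent with the 14.7M figure
```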

Anchor for next runs

Every future variant gets compared back to 1.2727 (val) and 1.2565 (test) at this same param/step budget. Variants that don't beat both are losing.

Next experiment on deck: same trained model, same eval, plus a frozen kNN-style retriever blending its next-byte distribution with the model's. That isolates the "retrieval-as-free-context" claim without retraining. If it drops bits/byte, the thesis is alive. If not, we know the retrieval framing needs more than plug-in blending — likely full RETRO-style retraining.
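
A minimal sketch of what that plug-in blending could look like, in the spirit of kNN-LM interpolation; the blend weight lam, the retriever's distribution, and every name here are placeholders for the upcoming experiment, not code that exists yet:

```python
import torch

def blended_next_byte_dist(model_logits: torch.Tensor,
                           retrieval_dist: torch.Tensor,
                           lam: float = 0.25) -> torch.Tensor:
    """Interpolate the model's next-byte distribution with a frozen retriever's.

    model_logits:   [batch, 257] raw logits from the trained transformer
    retrieval_dist: [batch, 257] next-byte distribution from the kNN retriever
    lam:            blend weight; lam = 0 recovers the baseline exactly
    """
    model_dist = torch.softmax(model_logits, dim=-1)
    return (1.0 - lam) * model_dist + lam * retrieval_dist

# Scoring would use the blended distribution in place of the model's softmax:
# bits/byte = -log2(blended[target]) averaged over the same non-overlapping eval.
```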