Phase 1 Results — Baseline Anchor
First end-to-end run: 14.7M-param vanilla transformer, 50k steps, byte-level CE on enwik8, scored against the gzip floor.
Headline numbers
| Split | gzip bits/byte | transformer bits/byte | Δ vs gzip |
|---|---|---|---|
| val | 2.9246 | 1.2727 | −1.65 bpb (−56.5%) |
| test | 2.8180 | 1.2565 | −1.56 bpb (−55.4%) |
Test comes in below val (1.2565 < 1.2727), so there is no sign of overfitting; 50k steps sits on the
undertrained side of the budget, not the overfit side.
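For reference, the Δ column is just the absolute and relative gap against the gzip floor; a quick check of the val row:

```python
gzip_bpb, model_bpb = 2.9246, 1.2727   # val row of the table above
delta = model_bpb - gzip_bpb           # -1.6519 -> reported as -1.65 bpb
relative = delta / gzip_bpb            # -0.5648 -> reported as -56.5%
```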
What this proves (and doesn't)
Proved:
- Byte-level cross-entropy in nats, divided by ln(2) per scored byte, is a legitimate bits-per-byte compression number (a short worked conversion is sketched after these lists). No exotic loss, no gradient through the coder, no learnable bits — and the number falls out of standard training cleanly.
- The DeepMind 2023 "Language Modeling Is Compression" framing is real on a model this small: plain next-byte prediction lands well under the gzip floor.
- The metric pipeline is honest. The gzip baseline lands where independent measurements put it, and the transformer lands in the range reported for ~10M-param byte-level models on enwik8 (typically 1.0–1.4 bpb at comparable budgets).
NOT proved:
- The actual thesis. The thesis is "small parametric reasoner + external retrieval", and this run only establishes the baseline transformer in that comparison. The experiment that decides the bet is the next one — same model + frozen retrieval — and we haven't run it yet.
- Anything about inference cost. That's Phase 2.
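To make the first "Proved" bullet concrete: the eval loss is mean cross-entropy in nats per predicted byte, and bits/byte is that value divided by ln(2). A minimal check of the conversion (toy tensors, not the actual eval code):

```python
import math
import torch
import torch.nn.functional as F

# Toy logits over the 257-symbol byte vocab for a handful of targets.
logits = torch.randn(8, 257)
targets = torch.randint(0, 257, (8,))

nats_per_byte = F.cross_entropy(logits, targets).item()  # mean CE in nats
bits_per_byte = nats_per_byte / math.log(2)              # divide by ln(2)
# Equivalent: bits_per_byte = nats_per_byte * math.log2(math.e)
```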
Compute and wall time
- Hardware: RTX 5070 Ti (Blackwell, sm_120), 16 GB VRAM, on WSL2.
- PyTorch 2.12 nightly + CUDA 12.8 (stable wheels don't yet ship sm_120 support).
- ~13 train steps/sec sustained, ~14 eval batches/sec. 50k training steps works out to roughly an hour of wall clock at that rate.
- Peak readings: 250 W power draw, 72 °C die temp, 77% fan. The watchdog never tripped.
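The write-up doesn't say how those readings were captured; a minimal way to poll the same counters during a run, assuming nvidia-smi is on the PATH (poll_gpu is a hypothetical helper, not part of the repo):

```python
import subprocess
import time

# Standard nvidia-smi --query-gpu fields for power, die temp, and fan speed.
FIELDS = "power.draw,temperature.gpu,fan.speed"

def poll_gpu(samples: int = 60, interval_s: float = 5.0) -> None:
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(out)  # e.g. "250.00 W, 72, 77 %"
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_gpu()
```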
What we changed mid-run
One real bug found and fixed: eval.py was iterating
ByteSequenceDataset with stride 1, scoring every token ~seq_len times.
Fixed by tiling non-overlapping (seq_len + 1)-length chunks and dividing
total nats by the actual prediction count. The first eval run didn't
finish; the second (with the fix) ran in 11 s and produced the numbers
above.
Lesson worth keeping: training-time datasets and eval-time iteration have
opposite sampling needs: overlapping windows are fine for training, but eval
should score each byte exactly once. Catching it required noticing eval was
queueing ~150k batches when it should have been ~150. Always sanity-check the
eval batch count against total_bytes / (seq_len × batch_size); a minimal
version of the fixed loop is sketched below.
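A sketch of the fixed eval iteration, assuming `model` is the trained transformer returning per-position logits and `data` is a 1-D uint8 tensor of raw eval bytes (both names are placeholders, not the actual eval.py API; seq_len=1024 matches the run, batch_size is an assumption):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bits_per_byte(model, data, seq_len=1024, batch_size=32, device="cuda"):
    """Non-overlapping eval: every byte is predicted exactly once."""
    n_chunks = (len(data) - 1) // seq_len            # full (seq_len + 1)-byte chunks
    starts = [k * seq_len for k in range(n_chunks)]  # non-overlapping offsets
    total_nats, total_preds, n_batches = 0.0, 0, 0

    for i in range(0, n_chunks, batch_size):
        chunks = torch.stack(
            [data[s : s + seq_len + 1].long() for s in starts[i : i + batch_size]]
        ).to(device)
        x, y = chunks[:, :-1], chunks[:, 1:]         # predict byte t+1 from bytes <= t
        logits = model(x)                            # (B, seq_len, vocab)
        total_nats += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"
        ).item()
        total_preds += y.numel()
        n_batches += 1

    # Sanity check from the lesson above: the batch count should be about
    # total_bytes / (seq_len * batch_size), not ~seq_len times larger.
    assert n_batches == math.ceil(n_chunks / batch_size)
    return total_nats / total_preds / math.log(2)    # nats/byte -> bits/byte
```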
What's locked vs. tentative
Locked (these are reproducible now and shouldn't change for the rest of Phase 1):
- Byte-level 257-symbol vocab.
- Plain CE loss; bits/byte computed in eval, not training.
- Non-overlapping eval; gzip on the same source bytes (a minimal gzip check is sketched after these lists).
- 14.7M-param transformer baseline (a hair over the 10M target, but close enough to serve as the anchor).
Tentative:
- 50k steps. The training curve was still dropping at the end; a 4-hour run would likely land lower.
- seq_len=1024. Fine for a transformer; we'll revisit when the SSM variants come in.
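The gzip side of that comparison is just compressed size in bits over raw size in bytes; a minimal sketch (the compression level is an assumption, the write-up doesn't state it):

```python
import gzip

def gzip_bits_per_byte(raw: bytes, level: int = 9) -> float:
    """bits/byte for gzip run on the same source bytes the model is scored on."""
    compressed = gzip.compress(raw, compresslevel=level)
    return 8 * len(compressed) / len(raw)

# With the val split loaded as `val_bytes`, this should land near the 2.9246
# reported in the headline table.
```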
Anchor for next runs
Every future variant gets compared back to 1.2727 (val) and 1.2565
(test) at this same param/step budget. Variants that don't beat both
are losing.
Next experiment on deck: same trained model, same eval, plus a frozen
kNN-style retriever blending its next-byte distribution with the model's.
That isolates the "retrieval-as-free-context" claim without retraining.
If it drops bits/byte, the thesis is alive. If not, we know the retrieval
framing needs more than plug-in blending — likely full RETRO-style
retraining.
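The blending step is kNN-LM-style interpolation of the two next-byte distributions; a minimal sketch, assuming the retriever already yields a probability vector for the current context (`lam` and the function name are placeholders, not settled choices):

```python
import torch

def blend_next_byte(model_logits: torch.Tensor,
                    knn_probs: torch.Tensor,
                    lam: float = 0.25) -> torch.Tensor:
    """Mix the frozen model's and the frozen retriever's next-byte distributions.

    model_logits: (vocab,) raw logits from the trained transformer.
    knn_probs:    (vocab,) probability distribution from the retriever.
    """
    p_model = torch.softmax(model_logits, dim=-1)
    return (1.0 - lam) * p_model + lam * knn_probs

# Scoring is unchanged from the baseline eval: bits/byte is the average of
# -log2(p_mix[true_next_byte]) over the eval stream, so any drop below 1.2727
# (val) / 1.2565 (test) is a point for the thesis.
```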