Phase 1 Results — Baseline Anchor
First end-to-end run: 14.7M-param vanilla transformer, 50k steps, byte-level CE on enwik8, scored against the gzip floor.
Headline numbers
| Split | gzip bits/byte | transformer bits/byte | Δ vs gzip |
|---|---|---|---|
| val | 2.9246 | 1.2727 | −1.65 bpb (−56.5%) |
| test | 2.8180 | 1.2565 | −1.56 bpb (−55.4%) |
Test comes in below val (1.2565 < 1.2727), so there is no sign of overfitting; 50k steps sits on the
undertrained side of the budget, not the overfit side.
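For reference, the Δ column is just the absolute and relative gap against the gzip floor; a quick check of the val row:

```python
gzip_bpb, model_bpb = 2.9246, 1.2727   # val row of the table above
delta = model_bpb - gzip_bpb           # -1.6519 -> reported as -1.65 bpb
relative = delta / gzip_bpb            # -0.5648 -> reported as -56.5%
```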
What this proves (and doesn't)
Proved:
- Byte-level cross-entropy in nats, divided by ln(2) per scored byte, is a legitimate bits-per-byte compression number (a short worked conversion is sketched after these lists). No exotic loss, no gradient through the coder, no learnable bits — and the number falls out of standard training cleanly.
- The DeepMind 2023 "Language Modeling Is Compression" framing is real on a model this small: plain next-byte prediction lands well under the gzip floor.
- The metric pipeline is honest. The gzip baseline lands where independent measurements put it, and the transformer lands in the range reported for ~10M-param byte-level models on enwik8 (typically 1.0–1.4 bpb at comparable budgets).
NOT proved:
- The actual thesis. The thesis is "small parametric reasoner + external retrieval", and this run only establishes the baseline transformer in that comparison. The experiment that decides the bet is the next one — same model + frozen retrieval — and we haven't run it yet.
- Anything about inference cost. That's Phase 2.
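To make the first "Proved" bullet concrete: the eval loss is mean cross-entropy in nats per predicted byte, and bits/byte is that value divided by ln(2). A minimal check of the conversion (toy tensors, not the actual eval code):

```python
import math
import torch
import torch.nn.functional as F

# Toy logits over the 257-symbol byte vocab for a handful of targets.
logits = torch.randn(8, 257)
targets = torch.randint(0, 257, (8,))

nats_per_byte = F.cross_entropy(logits, targets).item()  # mean CE in nats
bits_per_byte = nats_per_byte / math.log(2)              # divide by ln(2)
# Equivalent: bits_per_byte = nats_per_byte * math.log2(math.e)
```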
Compute and wall time
- Hardware: RTX 5070 Ti (Blackwell, sm_120), 16 GB VRAM, on WSL2.
- PyTorch 2.12 nightly + CUDA 12.8 (stable wheels don't yet ship sm_120 support).
- ~13 train steps/sec sustained, ~14 eval batches/sec. 50k training steps works out to roughly an hour of wall clock at that rate.
- Peak readings: 250 W power draw, 72 °C die temp, 77% fan. The watchdog never tripped.
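The write-up doesn't say how those readings were captured; a minimal way to poll the same counters during a run, assuming nvidia-smi is on the PATH (poll_gpu is a hypothetical helper, not part of the repo):

```python
import subprocess
import time

# Standard nvidia-smi --query-gpu fields for power, die temp, and fan speed.
FIELDS = "power.draw,temperature.gpu,fan.speed"

def poll_gpu(samples: int = 60, interval_s: float = 5.0) -> None:
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(out)  # e.g. "250.00 W, 72, 77 %"
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_gpu()
```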
What we changed mid-run
One real bug found and fixed: eval.py was iterating
ByteSequenceDataset with stride 1, scoring every token ~seq_len times.
Fixed by tiling non-overlapping (seq_len + 1)-length chunks and dividing
total nats by the actual prediction count. The first eval run didn't
finish; the second (with the fix) ran in 11 s and produced the numbers
above.
Lesson worth keeping: training-time datasets and eval-time iteration have
opposite sampling needs: overlapping windows are fine for training, but eval
should score each byte exactly once. Catching it required noticing eval was
queueing ~150k batches when it should have been ~150. Always sanity-check the
eval batch count against total_bytes / (seq_len × batch_size); a minimal
version of the fixed loop is sketched below.
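A sketch of the fixed eval iteration, assuming `model` is the trained transformer returning per-position logits and `data` is a 1-D uint8 tensor of raw eval bytes (both names are placeholders, not the actual eval.py API; seq_len=1024 matches the run, batch_size is an assumption):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def eval_bits_per_byte(model, data, seq_len=1024, batch_size=32, device="cuda"):
    """Non-overlapping eval: every byte is predicted exactly once."""
    n_chunks = (len(data) - 1) // seq_len            # full (seq_len + 1)-byte chunks
    starts = [k * seq_len for k in range(n_chunks)]  # non-overlapping offsets
    total_nats, total_preds, n_batches = 0.0, 0, 0

    for i in range(0, n_chunks, batch_size):
        chunks = torch.stack(
            [data[s : s + seq_len + 1].long() for s in starts[i : i + batch_size]]
        ).to(device)
        x, y = chunks[:, :-1], chunks[:, 1:]         # predict byte t+1 from bytes <= t
        logits = model(x)                            # (B, seq_len, vocab)
        total_nats += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"
        ).item()
        total_preds += y.numel()
        n_batches += 1

    # Sanity check from the lesson above: the batch count should be about
    # total_bytes / (seq_len * batch_size), not ~seq_len times larger.
    assert n_batches == math.ceil(n_chunks / batch_size)
    return total_nats / total_preds / math.log(2)    # nats/byte -> bits/byte
```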
What's locked vs. tentative
Locked (these are reproducible now and shouldn't change for the rest of Phase 1):
- Byte-level 257-symbol vocab.
- Plain CE loss; bits/byte computed in eval, not training.
- Non-overlapping eval; gzip on the same source bytes (a minimal gzip check is sketched after these lists).
- 14.7M-param transformer baseline (a hair over the 10M target, but close enough to serve as the anchor).
Tentative:
- 50k steps. The training curve was still dropping at the end; a 4-hour run would likely land lower.
- seq_len=1024. Fine for a transformer; we'll revisit when the SSM variants come in.
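The gzip side of that comparison is just compressed size in bits over raw size in bytes; a minimal sketch (the compression level is an assumption, the write-up doesn't state it):

```python
import gzip

def gzip_bits_per_byte(raw: bytes, level: int = 9) -> float:
    """bits/byte for gzip run on the same source bytes the model is scored on."""
    compressed = gzip.compress(raw, compresslevel=level)
    return 8 * len(compressed) / len(raw)

# With the val split loaded as `val_bytes`, this should land near the 2.9246
# reported in the headline table.
```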
Anchor for next runs
Every future variant gets compared back to 1.2727 (val) and 1.2565
(test) at this same param/step budget. Variants that don't beat both
are losing.
Next experiment on deck: same trained model, same eval, plus a frozen
kNN-style retriever blending its next-byte distribution with the model's.
That isolates the "retrieval-as-free-context" claim without retraining.
If it drops bits/byte, the thesis is alive. If not, we know the retrieval
framing needs more than plug-in blending — likely full RETRO-style
retraining.
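The blending step is kNN-LM-style interpolation of the two next-byte distributions; a minimal sketch, assuming the retriever already yields a probability vector for the current context (`lam` and the function name are placeholders, not settled choices):

```python
import torch

def blend_next_byte(model_logits: torch.Tensor,
                    knn_probs: torch.Tensor,
                    lam: float = 0.25) -> torch.Tensor:
    """Mix the frozen model's and the frozen retriever's next-byte distributions.

    model_logits: (vocab,) raw logits from the trained transformer.
    knn_probs:    (vocab,) probability distribution from the retriever.
    """
    p_model = torch.softmax(model_logits, dim=-1)
    return (1.0 - lam) * p_model + lam * knn_probs

# Scoring is unchanged from the baseline eval: bits/byte is the average of
# -log2(p_mix[true_next_byte]) over the eval stream, so any drop below 1.2727
# (val) / 1.2565 (test) is a point for the thesis.
```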