ML Frontier #05: Grokking — Delayed Generalization

Train a small network past the point of zero training loss and sometimes, thousands of steps later, test accuracy suddenly jumps from random to near perfect. The model didn't just memorize; it discovered the rule. This is grokking, and the research explaining it reframes generalization as a phase transition.

Fifth ML Frontier episode. You train a small network on a math task. Training loss drops to zero. Test accuracy stays at random chance. Classical intuition says you’re done — the model has overfit. But you keep training, and thousands of steps later, something surprising happens.

Resource	Link
Papers	4 papers
Video	ML Frontier 5: Grokking
Comments	Discord

The Phenomenon

Power et al. (2022) reported something that looked like a bug. They trained small transformers on modular arithmetic tasks and watched training loss crash to zero while test accuracy sat at random chance. By every classical signal, the models had overfit.

Then they left training running. Thousands of optimization steps past convergence — long after any reasonable early-stopping rule would have halted the run — test accuracy suddenly jumped from random to nearly perfect. The model hadn’t just memorized. It had discovered the underlying rule.

They called this grokking: delayed generalization.
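
To make the setup concrete, here is a minimal sketch of a modular addition dataset. The modulus p = 97, the 30% training fraction, and the function name are illustrative assumptions, not a record of the exact configuration in the paper. Because every possible pair is enumerated, the held-out pairs form a clean test of whether the rule itself was learned, and random chance on this task is 1/p.

```python
import random

def make_modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Enumerate every pair (a, b) with label (a + b) mod p, then split at random."""
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_fraction * len(pairs))
    # The first slice is the training set, the remainder is held out for testing.
    return pairs[:n_train], pairs[n_train:]

train, test = make_modular_addition_split()
print(len(train), len(test))  # 2822 6587
```

A model that only memorizes the training pairs has no information about the held-out pairs, which is why test accuracy can sit at chance while training loss is zero.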

Competing Circuits Inside the Network

Nanda et al. (2023) opened the network up to find out what was actually happening. Using mechanistic interpretability, they tracked the internal structure of a small transformer learning modular addition across training.

What they found: two solutions coexist and compete. A memorization circuit dominates early — it fits the training data by rote but doesn’t generalize. Underneath it, a generalizing solution slowly forms. For modular arithmetic, that solution turns out to be a Fourier-style trigonometric circuit that represents numbers on a circle and adds by rotating.
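
The "adds by rotating" idea can be written out by hand. The sketch below is not the trained network; it is a hand-coded version of the same trigonometric trick, with the modulus p = 113 and the three frequencies chosen purely for illustration. Each number becomes a point on a circle, the angle-addition identities combine the two points, and the answer is the candidate c whose angle lines up best with the sum.

```python
import math

def fourier_mod_add(a, b, p=113, freqs=(1, 2, 5)):
    """Compute (a + b) mod p using only sines, cosines, and an argmax."""
    logits = [0.0] * p
    for k in freqs:
        w = 2 * math.pi * k / p
        # Embed each input as a point on the unit circle at frequency k.
        cos_a, sin_a = math.cos(w * a), math.sin(w * a)
        cos_b, sin_b = math.cos(w * b), math.sin(w * b)
        # Angle-addition identities: rotate a's point by b's angle.
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        for c in range(p):
            # cos(w * (a + b - c)), which equals 1 only when c == (a + b) mod p.
            logits[c] += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
    return max(range(p), key=lambda c: logits[c])

print(fourier_mod_add(100, 50))  # 37
```

Since the whole computation depends only on the angles of a and b, it works for pairs never seen in training, which is what makes this kind of solution generalize.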

Generalization wasn’t absent during the long plateau. It was being constructed, gradually, while memorization held the foreground.

Why Regularization Wins

The handoff from memorization to generalization isn’t automatic. It needs pressure.

Weight decay is the key. Memorization circuits are large and brittle: they spread signal across many weights to pin down specific examples. The generalizing circuit is simpler and fits the same training data with a smaller total weight norm. Because the weight decay penalty grows with weight magnitude, it taxes the memorization circuit more heavily than the general one, so over many steps it suppresses memorization and lets the simpler solution take over.
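
A toy calculation shows the size of that tax. The two weight vectors below are made-up stand-ins for a large-norm memorizing circuit and a small-norm generalizing one, and the learning rate and decay coefficient are arbitrary assumptions; nothing here is measured from a real model.

```python
import math

def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights))

def l2_penalty(weights, weight_decay):
    """Penalty 0.5 * wd * ||w||^2, which grows with the square of the norm."""
    return 0.5 * weight_decay * sum(w * w for w in weights)

def decay_step(weights, lr, weight_decay):
    """One decoupled weight decay update with no loss gradient: w <- w - lr * wd * w."""
    return [w - lr * weight_decay * w for w in weights]

# Hypothetical circuits: many large weights versus a few small ones.
memorizing = [3.0] * 8
generalizing = [0.5] * 4
lr, wd = 0.1, 1.0

for name, w in [("memorizing", memorizing), ("generalizing", generalizing)]:
    shrink = l2_norm(w) - l2_norm(decay_step(w, lr, wd))
    print(name, "penalty:", l2_penalty(w, wd), "norm lost in one step:", round(shrink, 3))
```

Both vectors shrink by the same relative factor per step, but the larger one pays a far bigger penalty and loses more norm in absolute terms, so the loss gradient has to work harder to keep it in place.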

Omnigrok (Liu et al., 2022) generalized this picture beyond modular arithmetic, showing grokking-like dynamics across more varied tasks when the right regularization and initialization are in play. Kumar et al. (2024) reframe grokking as a transition from lazy training dynamics (features barely move from initialization) to rich training dynamics (features reorganize meaningfully), tying delayed generalization to well-studied concepts in deep learning theory.
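
One rough way to see where a run sits between lazy and rich is to measure how far the weights have moved from initialization. The sketch below is a generic proxy with a hypothetical function name, and it should not be read as the diagnostic used in either paper.

```python
import math

def relative_weight_change(w_init, w_now):
    """Return ||w_now - w_init|| / ||w_init|| for two flat lists of weights."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_now, w_init)))
    base = math.sqrt(sum(b * b for b in w_init))
    return diff / base

# Illustrative numbers only: a small value means the weights barely moved.
print(relative_weight_change([3.0, 4.0], [3.0, 4.5]))  # 0.1
```

Values near zero are consistent with lazy dynamics, while values of order one suggest the weights have reorganized substantially. Logging this alongside test accuracy is one possible way to check whether the jump coincides with the weights leaving the neighborhood of their initialization.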

Generalization as Phase Transition

The uncomfortable lesson: generalization doesn’t always emerge smoothly as loss decreases. It can appear suddenly, like a phase transition, after a long latent period where the model looks stuck.

Phase	Train loss	Test accuracy	What’s happening inside
Memorization	Drops to ~0	Random	Network fits training set by rote
Plateau	~0	Still random	Generalizing circuit slowly forming under regularization pressure
Grokking	~0	Jumps to ~100%	Generalizing circuit overtakes memorization

Classical early stopping would have halted training during the memorization or plateau phase and declared the model a failure. The generalizing solution existed but hadn’t yet taken over the output.
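
The gap between the phases in the table can be measured directly from a training log. This is a minimal sketch that assumes the log is a list of (step, metrics) pairs with hypothetical "train_acc" and "test_acc" keys; the history below is invented to mirror the table, not taken from a real run.

```python
def first_step_reaching(history, key, threshold):
    """Return the first logged step where metrics[key] >= threshold, or None."""
    for step, metrics in history:
        if metrics[key] >= threshold:
            return step
    return None

def grokking_delay(history, threshold=0.99):
    """Steps between fitting the training set and generalizing to the test set."""
    fit = first_step_reaching(history, "train_acc", threshold)
    gen = first_step_reaching(history, "test_acc", threshold)
    if fit is None or gen is None:
        return None
    return gen - fit

# Synthetic log shaped like the three phases above.
history = [
    (0, {"train_acc": 0.01, "test_acc": 0.01}),
    (1000, {"train_acc": 1.00, "test_acc": 0.01}),
    (10000, {"train_acc": 1.00, "test_acc": 0.02}),
    (20000, {"train_acc": 1.00, "test_acc": 0.40}),
    (30000, {"train_acc": 1.00, "test_acc": 0.995}),
]
print(grokking_delay(history))  # 29000
```

Any early-stopping rule whose patience is shorter than this delay would end the run before the jump.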

The Open Question for Large Models

Grokking is most cleanly documented on small networks trained on small algorithmic datasets with aggressive weight decay. The frontier question is whether a version of this happens quietly inside large language models during pretraining: on subsets of their data, on specific capabilities, hidden inside aggregate loss curves that look smooth.

Question	Status
Does grokking occur inside large-scale LLMs during pretraining?	Open — aggregate loss hides per-capability dynamics
Is grokking universal or tied to specific regimes?	Evidence leans toward regime-dependent (weight decay + small tasks), but lazy-to-rich framing broadens it
Can we detect grokking in practice?	Mechanistic progress measures help on small models; scaling them is open
Can we accelerate grokking?	Regularization tuning and initialization choices influence timing

If LLM pretraining hides local grokking events — capabilities that appear abruptly after a long latent phase — then loss curves alone aren’t enough to decide when a model is “done.” Something richer would be needed to know what it has actually learned.
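
A small synthetic example shows how an aggregate curve could hide abrupt events. Everything below is constructed for illustration: twenty hypothetical capabilities each jump from 0 to 1 at a different evaluation, and no claim is made that real pretraining runs look like this.

```python
def largest_jump(curve):
    """Return (index, size) of the largest increase between consecutive evaluations."""
    jumps = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    i = max(range(len(jumps)), key=lambda j: jumps[j])
    return i + 1, jumps[i]

n = 20
# Capability k is a step function that switches on at evaluation k.
curves = [[1.0 if t >= k else 0.0 for t in range(n + 1)] for k in range(1, n + 1)]
# The aggregate is the mean over capabilities at each evaluation.
aggregate = [sum(c[t] for c in curves) / n for t in range(n + 1)]

print("largest per-capability jump:", largest_jump(curves[4])[1])
print("largest aggregate jump:", round(largest_jump(aggregate)[1], 3))
```

Each individual curve is a sudden switch, yet their average climbs in small even steps. Tracking per-capability curves rather than only the mean is one possible form of the richer signal described above.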

Papers

Date	Paper	Link
Jan 2022	Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets (Power et al.)	arXiv 2201.02177
Oct 2022	Omnigrok: Grokking Beyond Algorithmic Data (Liu et al.)	arXiv 2210.01117
Jan 2023	Progress Measures for Grokking via Mechanistic Interpretability (Nanda et al.)	arXiv 2301.05217
Oct 2023	Grokking as the Transition from Lazy to Rich Training Dynamics (Kumar et al.)	arXiv 2310.06110

Generalization can be quiet, slow, and sudden all at once. Follow for more ML Frontier episodes exploring research at the edge.
