Five ML Concepts - #30: The Journey So Far | Software Wrighter Lab Blog

Episode 30 marks a milestone: 145 machine learning concepts covered across 30 episodes. From backpropagation to scaling laws, dropout to distribution shift, RAG to reward hacking. This retrospective celebrates the journey and announces what's next: Frontier ML Thinking—one concept, two minutes, deeper implications.

30 episodes. 145 machine learning concepts.

Resource	Link
Full Series	Five ML Concepts Episodes 1-29
Video	Five ML Concepts #30
Papers Index	Complete Concept Index
Comments	Discord

The Journey So Far

For the past thirty episodes, we’ve explored 145 machine learning concepts in under 30 seconds each.

From backpropagation to scaling laws. From dropout to distribution shift. From RAG to reward hacking.

We covered:

Foundations — the building blocks of neural networks and learning algorithms
Failure modes — how models break, overfit, forget, and hallucinate
Deployment realities — what happens when models meet production
Alignment challenges — ensuring models do what we actually want

What’s Changing

Machine learning is evolving rapidly. The foundational primitives are now well-established—the concepts we covered form a stable vocabulary.

But new research is reshaping how we apply these primitives:

Memory and retrieval architectures
Reasoning and planning systems
Sparsity and efficiency at scale
Robustness and generalization
Alignment and safety

30 seconds per concept was a good start. But some ideas deserve more depth.

What’s Next: Frontier ML Thinking

Starting soon: Frontier ML Thinking.

One concept. Two minutes. Deeper implications.

We’ll explore the cutting edge—ideas from papers published in the last 12 months that build on the foundations we’ve covered.

If You’re New Here

Start with Five ML Concepts Episodes 1–29. Each episode covers five concepts in five minutes total. The full series provides a foundation in modern machine learning vocabulary.

If You’ve Been Here the Whole Time

You’re ready for the frontier.

Why the Papers Look “Old”

When I tabulated the papers behind the 145 concepts in this series, something looked odd: almost none of the cited papers were from the last two years.

This is not a mistake—it’s a feature of how ML knowledge evolves.

Seminal papers don’t keep getting re-written

Most concepts in this series are primitives: backpropagation, transformers, RAG, dropout, calibration, and so on. Each primitive has an origin paper that introduced it. Once the primitive exists, later research focuses on:

Scaling it
Combining it with other ideas
Benchmarking it
Making it more efficient
Making it safer

That kind of work produces new papers, but not new “origin papers.”

What this reveals about the field

The core intellectual breakthroughs of modern ML largely occurred between 2016 and 2022. The frontier has since shifted from inventing new primitives to:

Memory and retrieval systems
Continual learning
Agent architectures
Tool use and planning
Sparsity and efficiency at scale
Alignment and safety

That’s exactly what Frontier ML Thinking will explore: ideas from papers published in the last 12 months that build on these foundations.

Complete Concept Index

All 145 concepts organized chronologically by seminal paper year.

Pre-1990

1958

Concept	Links (Post, Video, Paper)
Perceptron	Post 5 \| Video 5 \| (1958) The Perceptron

1960s


None

1970s


None

1986

Concept	Links (Post, Video, Paper)
Backpropagation	Post 1 \| Video 1 \| (1986) Learning representations by back-propagating errors
RNN	Post 11 \| Video 11 \| (1986) Learning representations

1989

Concept	Links (Post, Video, Paper)
Universal Approximation	Post 13 \| Video 13 \| (1989) Approximation by Superpositions

1990s

1995

Concept	Links (Post, Video, Paper)
Cross-Validation	Post 7 \| Video 7 \| (1995) A Study of Cross-Validation

1997

Concept	Links (Post, Video, Paper)
LSTM	Post 22 \| Video 22 \| (1997) Long Short-Term Memory

1998

Concept	Links (Post, Video, Paper)
Early Stopping	Post 13 \| Video 13 \| (1998) Early Stopping - But When?

2000s

2000

Concept	Links (Post, Video, Paper)
Ensembling	Post 18 \| Video 18 \| (2000) Ensemble Methods

2002

Concept	Links (Post, Video, Paper)
Cold Start Problems	Post 14 \| Video 14 \| (2002) Addressing Cold Start

2003

Concept	Links (Post, Video, Paper)
Perplexity	Post 15 \| Video 15 \| (2003) A Neural Probabilistic Language Model

2006

Concept	Links (Post, Video, Paper)
Autoencoders	Post 19 \| Video 19 \| (2006) Reducing Dimensionality
ROC / AUC	Post 14 \| Video 14 \| (2006) An Introduction to ROC Analysis

2007

Concept	Links (Post, Video, Paper)
Precision vs Recall	Post 12 \| Video 12 \| (2007) The Truth of the F-Measure

2009

Concept	Links (Post, Video, Paper)
A/B Testing Models	Post 16 \| Video 16 \| (2009) Controlled Experiments
Bias-Variance Tradeoff	Post 8 \| Video 8 \| (2009) Elements of Statistical Learning
Correlation vs Causation	Post 19 \| Video 19 \| (2009) Causality
Covariate Shift	Post 19 \| Video 19 \| (2009) Dataset Shift in ML
Curriculum Learning	Post 19 \| Video 19 \| (2009) Curriculum Learning
Curse of Dimensionality	Post 15 \| Video 15 \| (2009) Elements of Statistical Learning
Distribution Shift	Post 11 \| Video 11 \| (2009) Dataset Shift in ML
Why ML Is Fragile	Post 18 \| Video 18 \| (2009) Distribution Shift
Why More Data Beats Better Models	Post 22 \| Video 22 \| (2009) Unreasonable Effectiveness of Data

2010s

2010

Concept	Links (Post, Video, Paper)
Transfer Learning	Post 4 \| Video 4 \| (2010) A Survey on Transfer Learning
Weight Initialization	Post 15 \| Video 15 \| (2010) Understanding Difficulty of Training

2011

Concept	Links (Post, Video, Paper)
Spurious Correlations	Post 14 \| Video 14 \| (2011) Unbiased Look at Dataset Bias

2012

Concept	Links (Post, Video, Paper)
CNN	Post 10 \| Video 10 \| (2012) ImageNet Classification with Deep CNNs
Data Leakage	Post 24 \| Video 24 \| (2012) Leakage in Data Mining

2013

Concept	Links (Post, Video, Paper)
Adversarial Examples	Post 25 \| Video 25 \| (2013) Intriguing properties of neural networks
Embedding	Post 1 \| Video 1 \| (2013) Word2Vec
Gradient Clipping	Post 14 \| Video 14 \| (2013) Difficulty of Training RNNs
Latent Space	Post 5 \| Video 5 \| (2013) Auto-Encoding Variational Bayes
Representation Learning	Post 25 \| Video 25 \| (2013) Representation Learning: A Review
VAEs	Post 20 \| Video 20 \| (2013) Auto-Encoding Variational Bayes

2014

Concept	Links (Post, Video, Paper)
Adam	Post 4 \| Video 4 \| (2014) Adam: Stochastic Optimization
Attention	Post 2 \| Video 2 \| (2014) Neural Machine Translation
Dropout	Post 9 \| Video 9 \| (2014) Dropout: Prevent Overfitting
Encoder-Decoder	Post 10 \| Video 10 \| (2014) Sequence to Sequence Learning
GRU	Post 21 \| Video 21 \| (2014) Gated Recurrent Neural Networks
Memory-Augmented Networks	Post 27 \| Video 27 \| (2014) Neural Turing Machines
Mode Collapse	Post 24 \| Video 24 \| (2014) Generative Adversarial Nets
Overfitting	Post 3 \| Video 3 \| (2014) Dropout
Regularization	Post 6 \| Video 6 \| (2014) Dropout
Temperature	Post 2 \| Video 2 \| (2014) Properties of Neural MT

2015

Concept	Links (Post, Video, Paper)
Batch Normalization	Post 16 \| Video 16 \| (2015) Batch Normalization
Distillation	Post 10 \| Video 10 \| (2015) Distilling Knowledge
Label Smoothing	Post 25 \| Video 25 \| (2015) Rethinking Inception
Learning Rate	Post 2 \| Video 2 \| (2015) Cyclical Learning Rates
Tokenization	Post 3 \| Video 3 \| (2015) Subword Units

2016

Concept	Links (Post, Video, Paper)
Activation Functions	Post 4 \| Video 4 \| (2016) Deep Learning Book
Benchmark Leakage	Post 17 \| Video 17 \| (2016) Rethinking Inception
Checkpointing	Post 13 \| Video 13 \| (2016) Sublinear Memory Cost
Epoch	Post 18 \| Video 18 \| (2016) Deep Learning Book
Gradient Descent	Post 2 \| Video 2 \| (2016) Overview of Gradient Descent
Inference	Post 9 \| Video 9 \| (2016) Deep Learning Book
Learning Rate Schedules	Post 23 \| Video 23 \| (2016) SGDR: Warm Restarts
Loss Surface Sharpness	Post 23 \| Video 23 \| (2016) Large-Batch Training
Reward Hacking	Post 24 \| Video 24 \| (2016) Concrete Problems in AI Safety
Softmax	Post 11 \| Video 11 \| (2016) Deep Learning Book
Train/Validation/Test Split	Post 16 \| Video 16 \| (2016) Deep Learning Book

2017

Concept	Links (Post, Video, Paper)
Batch Size	Post 12 \| Video 12 \| (2017) Large-Batch Training
Calibration	Post 13 \| Video 13 \| (2017) On Calibration
Catastrophic Forgetting	Post 15 \| Video 15 \| (2017) Overcoming Catastrophic Forgetting
Conditional Computation	Post 28 \| Video 28 \| (2017) Sparsely-Gated MoE
Context Window	Post 7 \| Video 7 \| (2017) Attention Is All You Need
Elastic Weight Consolidation	Post 27 \| Video 27 \| (2017) Overcoming Catastrophic Forgetting (EWC)
Gradient Noise	Post 20 \| Video 20 \| (2017) SGD as Approximate Bayesian Inference
Loss Function	Post 3 \| Video 3 \| (2017) Survey of Loss Functions
Miscalibration	Post 25 \| Video 25 \| (2017) On Calibration
Mixed Precision	Post 8 \| Video 8 \| (2017) Mixed Precision Training
MoE	Post 11 \| Video 11 \| (2017) Sparsely-Gated MoE
OOD Inputs	Post 12 \| Video 12 \| (2017) Detecting Misclassified Examples
Optimization vs Generalization	Post 16 \| Video 16 \| (2017) Rethinking Generalization
Overconfidence	Post 16 \| Video 16 \| (2017) On Calibration
Parameter Routing	Post 27 \| Video 27 \| (2017) Sparsely-Gated MoE
Positional Encoding	Post 6 \| Video 6 \| (2017) Attention Is All You Need
Self-Attention	Post 7 \| Video 7 \| (2017) Attention Is All You Need
Sparse Activation	Post 28 \| Video 28 \| (2017) Sparsely-Gated MoE
Transformer	Post 1 \| Video 1 \| (2017) Attention Is All You Need
Uncertainty Estimation	Post 20 \| Video 20 \| (2017) Uncertainties in Bayesian DL
Warmup	Post 24 \| Video 24 \| (2017) Accurate Large Minibatch SGD
Why Interpretability Is Hard	Post 20 \| Video 20 \| (2017) Rigorous Science of Interpretability

2018

Concept	Links (Post, Video, Paper)
BERT	Post 6 \| Video 6 \| (2018) BERT: Pre-training
Concept Drift vs Data Drift	Post 17 \| Video 17 \| (2018) Learning under Concept Drift
Inductive Bias	Post 12 \| Video 12 \| (2018) Relational Inductive Biases
Loss Landscapes	Post 14 \| Video 14 \| (2018) Visualizing Loss Landscape
Pre-training	Post 5 \| Video 5 \| (2018) BERT

2019

Concept	Links (Post, Video, Paper)
Data Augmentation	Post 26 \| Video 26 \| (2019) Survey on Data Augmentation
Double Descent	Post 25 \| Video 25 \| (2019) Deep Double Descent
GPT	Post 7 \| Video 7 \| (2019) Language Models are Unsupervised Multitask Learners
Inference Parallelism	Post 28 \| Video 28 \| (2019) Megatron-LM
Lottery Ticket Hypothesis	Post 28 \| Video 28 \| (2019) The Lottery Ticket Hypothesis
Manifold Hypothesis	Post 26 \| Video 26 \| (2019) Intro to VAEs
Monitoring & Drift Detection	Post 15 \| Video 15 \| (2019) Detecting Dataset Shift
Replay Buffers	Post 27 \| Video 27 \| (2019) Experience Replay
Weight Decay	Post 17 \| Video 17 \| (2019) Decoupled Weight Decay

2020s

2020

Concept	Links (Post, Video, Paper)
Diffusion Models	Post 8 \| Video 8 \| (2020) Denoising Diffusion
Few-shot Learning	Post 10 \| Video 10 \| (2020) Language Models are Few-Shot Learners
Fine-tuning	Post 3 \| Video 3 \| (2020) Survey on Transfer Learning
ICL (In-Context Learning)	Post 5 \| Video 5 \| (2020) Language Models are Few-Shot Learners
Neural Collapse	Post 29 \| Video 29 \| (2020) Prevalence of Neural Collapse
Preference Learning	Post 18 \| Video 18 \| (2020) Learning to Summarize
Prompting	Post 6 \| Video 6 \| (2020) Language Models are Few-Shot Learners
RAG	Post 10 \| Video 10 \| (2020) Retrieval-Augmented Generation
Scaling Laws	Post 17 \| Video 17 \| (2020) Scaling Laws for Neural Language Models
Self-Training Instability	Post 29 \| Video 29 \| (2020) Understanding Self-Training
Shortcut Learning	Post 13 \| Video 13 \| (2020) Shortcut Learning in DNNs

2021

Concept	Links (Post, Video, Paper)
Failure Analysis	Post 19 \| Video 19 \| (2021) Practical ML for CV
Human-in-the-Loop Systems	Post 20 \| Video 20 \| (2021) Human-in-the-Loop ML
Latency vs Throughput	Post 12 \| Video 12 \| (2021) Efficient Large-Scale Training
LoRA	Post 3 \| Video 3 \| (2021) LoRA: Low-Rank Adaptation
Mechanistic Interpretability	Post 29 \| Video 29 \| (2021) Transformer Circuits
Quantization	Post 9 \| Video 9 \| (2021) Survey of Quantization Methods
RoPE	Post 6 \| Video 6 \| (2021) RoFormer
SAM	Post 29 \| Video 29 \| (2021) Sharpness-Aware Minimization
VLM	Post 4 \| Video 4 \| (2021) CLIP

2022

Concept	Links (Post, Video, Paper)
Chain of Thought	Post 11 \| Video 11 \| (2022) Chain-of-Thought Prompting
Compute Optimality Hypothesis	Post 28 \| Video 28 \| (2022) Chinchilla
Constitutional AI	Post 26 \| Video 26 \| (2022) Constitutional AI
Cost vs Quality Tradeoffs	Post 18 \| Video 18 \| (2022) Efficient Transformers
Emergent Behavior	Post 23 \| Video 23 \| (2022) Emergent Abilities
Flash Attention	Post 9 \| Video 9 \| (2022) FlashAttention
Goodhart’s Law	Post 26 \| Video 26 \| (2022) Goodhart’s Law and ML
Grokking	Post 29 \| Video 29 \| (2022) Grokking
KV Cache	Post 8 \| Video 8 \| (2022) Fast Transformer Decoding
RLHF	Post 9 \| Video 9 \| (2022) Training with Human Feedback
Shadow Deployment	Post 17 \| Video 17 \| (2022) Reliable ML
Speculative Decoding	Post 5 \| Video 5 \| (2022) Fast Inference via Speculative Decoding
Superposition	Post 4 \| Video 4 \| (2022) Toy Models of Superposition

2023

Concept	Links (Post, Video, Paper)
DPO	Post 2 \| Video 2 \| (2023) Direct Preference Optimization
GQA	Post 7 \| Video 7 \| (2023) GQA: Training Generalized Multi-Query
Hallucination	Post 1 \| Video 1 \| (2023) Survey of Hallucination
Jailbreaks	Post 21 \| Video 21 \| (2023) Jailbroken
Mamba	Post 1 \| Video 1 \| (2023) Mamba: Linear-Time Sequence Modeling
Model Editing	Post 27 \| Video 27 \| (2023) Editing LLMs
Model Steerability	Post 22 \| Video 22 \| (2023) Controllable Generation
Planning vs Prediction	Post 21 \| Video 21 \| (2023) AI/ML Gap
Prompt Injection	Post 21 \| Video 21 \| (2023) Prompt Injection Attack
RSFT	Post 22 \| Video 22 \| (2023) Scaling Mathematical Reasoning
Tool Use	Post 23 \| Video 23 \| (2023) Toolformer

2024

Concept	Links (Post, Video, Paper)
MLA	Post 8 \| Video 8 \| (2024) DeepSeek-V2

2025 and Beyond

Since 2024, no widely-adopted new fundamental ML concepts have emerged. Research has shifted from inventing primitives to composing, scaling, and applying them. Papers from 2025–2026 will be covered in our new series: Frontier ML Thinking—one concept, two minutes, deeper implications.

Thank you for following along. The journey continues.