Five ML Concepts - #30: The Journey So Far
3559 words • 18 min read

30 episodes. 145 machine learning concepts.
| Resource | Link |
|---|---|
| Full Series | Five ML Concepts Episodes 1–29 |
| Video | Five ML Concepts #30 |
| Papers Index | Complete Concept Index |
| Comments | Discord |
The Journey So Far
For the past thirty episodes, we’ve explored 145 machine learning concepts in under 30 seconds each.
From backpropagation to scaling laws. From dropout to distribution shift. From RAG to reward hacking.
We covered:
- Foundations — the building blocks of neural networks and learning algorithms
- Failure modes — how models break, overfit, forget, and hallucinate
- Deployment realities — what happens when models meet production
- Alignment challenges — ensuring models do what we actually want
What’s Changing
Machine learning is evolving rapidly, but the foundational primitives are now well established: the concepts we covered form a stable vocabulary.
But new research is reshaping how we apply these primitives:
- Memory and retrieval architectures
- Reasoning and planning systems
- Sparsity and efficiency at scale
- Robustness and generalization
- Alignment and safety
30 seconds per concept was a good start. But some ideas deserve more depth.
What’s Next: Frontier ML Thinking
Starting soon: Frontier ML Thinking.
One concept. Two minutes. Deeper implications.
We’ll explore the cutting edge—ideas from papers published in the last 12 months that build on the foundations we’ve covered.
If You’re New Here
Start with Five ML Concepts Episodes 1–29. Each episode covers five concepts in five minutes total. The full series provides a foundation in modern machine learning vocabulary.
If You’ve Been Here the Whole Time
You’re ready for the frontier.
Why the Papers Look “Old”
When I tabulated the papers behind the 145 concepts in this series, something looked odd: almost none of the cited papers were from the last two years.
This is not a mistake—it’s a feature of how ML knowledge evolves.
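If you're curious what that tabulation looked like, here's a minimal sketch. The `CONCEPTS` list is a hand-picked subset of (concept, origin-paper year) pairs from the index below, not the full 145:

```python
from collections import Counter

# A small, illustrative subset of (concept, origin-paper year) pairs
# taken from the Complete Concept Index at the end of this post.
CONCEPTS = [
    ("Backpropagation", 1986),
    ("LSTM", 1997),
    ("Embedding (Word2Vec)", 2013),
    ("Adam", 2014),
    ("Scaling Laws", 2020),
    ("Chain of Thought", 2022),
    ("DPO", 2023),
    ("MLA", 2024),
]

# Count origin papers per year, then check how few are recent.
by_year = Counter(year for _, year in CONCEPTS)
recent = sum(n for year, n in by_year.items() if year >= 2024)
print(f"{recent}/{len(CONCEPTS)} concepts have an origin paper from 2024 or later")
```

Run over the full index, the pattern is the same: the origin papers cluster a decade back, and almost nothing is recent.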
Seminal papers don't keep getting rewritten
Most concepts in this series are primitives: backpropagation, transformers, RAG, dropout, calibration, and so on. Each primitive has an origin paper that introduced it. Once the primitive exists, later research focuses on:
- Scaling it
- Combining it with other ideas
- Benchmarking it
- Making it more efficient
- Making it safer
That kind of work produces new papers, but not new “origin papers.”
What this reveals about the field
The core intellectual breakthroughs of modern ML largely occurred between 2012 and 2022. The frontier has since shifted from inventing new primitives to:
- Memory and retrieval systems
- Continual learning
- Agent architectures
- Tool use and planning
- Sparsity and efficiency at scale
- Alignment and safety
That’s exactly what Frontier ML Thinking will explore: ideas from papers published in the last 12 months that build on these foundations.
Complete Concept Index
All 145 concepts organized chronologically by seminal paper year.
Pre-1990
1958
| Concept | Links (Post, Video, Paper) |
|---|---|
| Perceptron | Post 5 | Video 5 | (1958) The Perceptron |
1960s
| None |
1970s
| None |
1986
| Concept | Links (Post, Video, Paper) |
|---|---|
| Backpropagation | Post 1 | Video 1 | (1986) Learning representations by back-propagating errors |
| RNN | Post 11 | Video 11 | (1986) Learning representations |
1989
| Concept | Links (Post, Video, Paper) |
|---|---|
| Universal Approximation | Post 13 | Video 13 | (1989) Approximation by Superpositions |
1990s
1995
| Concept | Links (Post, Video, Paper) |
|---|---|
| Cross-Validation | Post 7 | Video 7 | (1995) A Study of Cross-Validation |
1997
| Concept | Links (Post, Video, Paper) |
|---|---|
| LSTM | Post 22 | Video 22 | (1997) Long Short-Term Memory |
1998
| Concept | Links (Post, Video, Paper) |
|---|---|
| Early Stopping | Post 13 | Video 13 | (1998) Early Stopping - But When? |
2000s
2000
| Concept | Links (Post, Video, Paper) |
|---|---|
| Ensembling | Post 18 | Video 18 | (2000) Ensemble Methods |
2002
| Concept | Links (Post, Video, Paper) |
|---|---|
| Cold Start Problems | Post 14 | Video 14 | (2002) Addressing Cold Start |
2003
| Concept | Links (Post, Video, Paper) |
|---|---|
| Perplexity | Post 15 | Video 15 | (2003) A Neural Probabilistic Language Model |
2006
| Concept | Links (Post, Video, Paper) |
|---|---|
| Autoencoders | Post 19 | Video 19 | (2006) Reducing Dimensionality |
| ROC / AUC | Post 14 | Video 14 | (2006) An Introduction to ROC Analysis |
2007
| Concept | Links (Post, Video, Paper) |
|---|---|
| Precision vs Recall | Post 12 | Video 12 | (2007) The Truth of the F-Measure |
2009
| Concept | Links (Post, Video, Paper) |
|---|---|
| A/B Testing Models | Post 16 | Video 16 | (2009) Controlled Experiments |
| Bias-Variance Tradeoff | Post 8 | Video 8 | (2009) Elements of Statistical Learning |
| Correlation vs Causation | Post 19 | Video 19 | (2009) Causality |
| Covariate Shift | Post 19 | Video 19 | (2009) Dataset Shift in ML |
| Curriculum Learning | Post 19 | Video 19 | (2009) Curriculum Learning |
| Curse of Dimensionality | Post 15 | Video 15 | (2009) Elements of Statistical Learning |
| Distribution Shift | Post 11 | Video 11 | (2009) Dataset Shift in ML |
| Why ML Is Fragile | Post 18 | Video 18 | (2009) Distribution Shift |
| Why More Data Beats Better Models | Post 22 | Video 22 | (2009) Unreasonable Effectiveness of Data |
2010s
2010
| Concept | Links (Post, Video, Paper) |
|---|---|
| Transfer Learning | Post 4 | Video 4 | (2010) A Survey on Transfer Learning |
| Weight Initialization | Post 15 | Video 15 | (2010) Understanding Difficulty of Training |
2011
| Concept | Links (Post, Video, Paper) |
|---|---|
| Spurious Correlations | Post 14 | Video 14 | (2011) Unbiased Look at Dataset Bias |
2012
| Concept | Links (Post, Video, Paper) |
|---|---|
| CNN | Post 10 | Video 10 | (2012) ImageNet Classification with Deep CNNs |
| Data Leakage | Post 24 | Video 24 | (2012) Leakage in Data Mining |
2013
| Concept | Links (Post, Video, Paper) |
|---|---|
| Adversarial Examples | Post 25 | Video 25 | (2013) Intriguing properties of neural networks |
| Embedding | Post 1 | Video 1 | (2013) Word2Vec |
| Gradient Clipping | Post 14 | Video 14 | (2013) Difficulty of Training RNNs |
| Latent Space | Post 5 | Video 5 | (2013) Auto-Encoding Variational Bayes |
| Representation Learning | Post 25 | Video 25 | (2013) Representation Learning: A Review |
| VAEs | Post 20 | Video 20 | (2013) Auto-Encoding Variational Bayes |
2014
| Concept | Links (Post, Video, Paper) |
|---|---|
| Adam | Post 4 | Video 4 | (2014) Adam: Stochastic Optimization |
| Attention | Post 2 | Video 2 | (2014) Neural Machine Translation |
| Dropout | Post 9 | Video 9 | (2014) Dropout: Prevent Overfitting |
| Encoder-Decoder | Post 10 | Video 10 | (2014) Sequence to Sequence Learning |
| GRU | Post 21 | Video 21 | (2014) Gated Recurrent Neural Networks |
| Memory-Augmented Networks | Post 27 | Video 27 | (2014) Neural Turing Machines |
| Mode Collapse | Post 24 | Video 24 | (2014) Generative Adversarial Nets |
| Overfitting | Post 3 | Video 3 | (2014) Dropout |
| Regularization | Post 6 | Video 6 | (2014) Dropout |
| Temperature | Post 2 | Video 2 | (2014) Properties of Neural MT |
2015
| Concept | Links (Post, Video, Paper) |
|---|---|
| Batch Normalization | Post 16 | Video 16 | (2015) Batch Normalization |
| Distillation | Post 10 | Video 10 | (2015) Distilling Knowledge |
| Label Smoothing | Post 25 | Video 25 | (2015) Rethinking Inception |
| Learning Rate | Post 2 | Video 2 | (2015) Cyclical Learning Rates |
| Tokenization | Post 3 | Video 3 | (2015) Subword Units |
2016
| Concept | Links (Post, Video, Paper) |
|---|---|
| Activation Functions | Post 4 | Video 4 | (2016) Deep Learning Book |
| Benchmark Leakage | Post 17 | Video 17 | (2016) Rethinking Inception |
| Checkpointing | Post 13 | Video 13 | (2016) Sublinear Memory Cost |
| Epoch | Post 18 | Video 18 | (2016) Deep Learning Book |
| Gradient Descent | Post 2 | Video 2 | (2016) Overview of Gradient Descent |
| Inference | Post 9 | Video 9 | (2016) Deep Learning Book |
| Learning Rate Schedules | Post 23 | Video 23 | (2016) SGDR: Warm Restarts |
| Loss Surface Sharpness | Post 23 | Video 23 | (2016) Large-Batch Training |
| Reward Hacking | Post 24 | Video 24 | (2016) Concrete Problems in AI Safety |
| Softmax | Post 11 | Video 11 | (2016) Deep Learning Book |
| Train/Validation/Test Split | Post 16 | Video 16 | (2016) Deep Learning Book |
2017
| None |
2018
| Concept | Links (Post, Video, Paper) |
|---|---|
| BERT | Post 6 | Video 6 | (2018) BERT: Pre-training |
| Concept Drift vs Data Drift | Post 17 | Video 17 | (2018) Learning under Concept Drift |
| Inductive Bias | Post 12 | Video 12 | (2018) Relational Inductive Biases |
| Loss Landscapes | Post 14 | Video 14 | (2018) Visualizing Loss Landscape |
| Pre-training | Post 5 | Video 5 | (2018) BERT |
2019
| Concept | Links (Post, Video, Paper) |
|---|---|
| Data Augmentation | Post 26 | Video 26 | (2019) Survey on Data Augmentation |
| Double Descent | Post 25 | Video 25 | (2019) Deep Double Descent |
| GPT | Post 7 | Video 7 | (2019) Language Models are Unsupervised Multitask Learners |
| Inference Parallelism | Post 28 | Video 28 | (2019) Megatron-LM |
| Lottery Ticket Hypothesis | Post 28 | Video 28 | (2019) The Lottery Ticket Hypothesis |
| Manifold Hypothesis | Post 26 | Video 26 | (2019) Intro to VAEs |
| Monitoring & Drift Detection | Post 15 | Video 15 | (2019) Detecting Dataset Shift |
| Replay Buffers | Post 27 | Video 27 | (2019) Experience Replay |
| Weight Decay | Post 17 | Video 17 | (2019) Decoupled Weight Decay |
2020s
2020
| Concept | Links (Post, Video, Paper) |
|---|---|
| Diffusion Models | Post 8 | Video 8 | (2020) Denoising Diffusion |
| Few-shot Learning | Post 10 | Video 10 | (2020) Language Models are Few-Shot Learners |
| Fine-tuning | Post 3 | Video 3 | (2020) Survey on Transfer Learning |
| ICL (In-Context Learning) | Post 5 | Video 5 | (2020) Language Models are Few-Shot Learners |
| Neural Collapse | Post 29 | Video 29 | (2020) Prevalence of Neural Collapse |
| Preference Learning | Post 18 | Video 18 | (2020) Learning to Summarize |
| Prompting | Post 6 | Video 6 | (2020) Language Models are Few-Shot Learners |
| RAG | Post 10 | Video 10 | (2020) Retrieval-Augmented Generation |
| Scaling Laws | Post 17 | Video 17 | (2020) Scaling Laws for Neural Language Models |
| Self-Training Instability | Post 29 | Video 29 | (2020) Understanding Self-Training |
| Shortcut Learning | Post 13 | Video 13 | (2020) Shortcut Learning in DNNs |
2021
| Concept | Links (Post, Video, Paper) |
|---|---|
| Failure Analysis | Post 19 | Video 19 | (2021) Practical ML for CV |
| Human-in-the-Loop Systems | Post 20 | Video 20 | (2021) Human-in-the-Loop ML |
| Latency vs Throughput | Post 12 | Video 12 | (2021) Efficient Large-Scale Training |
| LoRA | Post 3 | Video 3 | (2021) LoRA: Low-Rank Adaptation |
| Mechanistic Interpretability | Post 29 | Video 29 | (2021) Transformer Circuits |
| Quantization | Post 9 | Video 9 | (2021) Survey of Quantization Methods |
| RoPE | Post 6 | Video 6 | (2021) RoFormer |
| SAM | Post 29 | Video 29 | (2021) Sharpness-Aware Minimization |
| VLM | Post 4 | Video 4 | (2021) CLIP |
2022
| Concept | Links (Post, Video, Paper) |
|---|---|
| Chain of Thought | Post 11 | Video 11 | (2022) Chain-of-Thought Prompting |
| Compute Optimality Hypothesis | Post 28 | Video 28 | (2022) Chinchilla |
| Constitutional AI | Post 26 | Video 26 | (2022) Constitutional AI |
| Cost vs Quality Tradeoffs | Post 18 | Video 18 | (2022) Efficient Transformers |
| Emergent Behavior | Post 23 | Video 23 | (2022) Emergent Abilities |
| Flash Attention | Post 9 | Video 9 | (2022) FlashAttention |
| Goodhart’s Law | Post 26 | Video 26 | (2022) Goodhart’s Law and ML |
| Grokking | Post 29 | Video 29 | (2022) Grokking |
| KV Cache | Post 8 | Video 8 | (2022) Fast Transformer Decoding |
| RLHF | Post 9 | Video 9 | (2022) Training with Human Feedback |
| Shadow Deployment | Post 17 | Video 17 | (2022) Reliable ML |
| Speculative Decoding | Post 5 | Video 5 | (2022) Fast Inference via Speculative Decoding |
| Superposition | Post 4 | Video 4 | (2022) Toy Models of Superposition |
2023
| Concept | Links (Post, Video, Paper) |
|---|---|
| DPO | Post 2 | Video 2 | (2023) Direct Preference Optimization |
| GQA | Post 7 | Video 7 | (2023) GQA: Training Generalized Multi-Query |
| Hallucination | Post 1 | Video 1 | (2023) Survey of Hallucination |
| Jailbreaks | Post 21 | Video 21 | (2023) Jailbroken |
| Mamba | Post 1 | Video 1 | (2023) Mamba: Linear-Time Sequence Modeling |
| Model Editing | Post 27 | Video 27 | (2023) Editing LLMs |
| Model Steerability | Post 22 | Video 22 | (2023) Controllable Generation |
| Planning vs Prediction | Post 21 | Video 21 | (2023) AI/ML Gap |
| Prompt Injection | Post 21 | Video 21 | (2023) Prompt Injection Attack |
| RSFT | Post 22 | Video 22 | (2023) Scaling Mathematical Reasoning |
| Tool Use | Post 23 | Video 23 | (2023) Toolformer |
2024
| Concept | Links (Post, Video, Paper) |
|---|---|
| MLA | Post 8 | Video 8 | (2024) DeepSeek-V2 |
2025 and Beyond
Since 2024, no widely adopted new fundamental ML concepts have emerged. Research has shifted from inventing primitives to composing, scaling, and applying them. Papers from 2025–2026 will be covered in our new series, Frontier ML Thinking: one concept, two minutes, deeper implications.
Thank you for following along. The journey continues.
Part 30 of the Five ML Concepts series. View all parts
Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.
