30 episodes. 145 machine learning concepts.

Resource | Link
Full Series | Five ML Concepts Episodes 1–29
Video | Five ML Concepts #30
Papers Index | Complete Concept Index
Comments | Discord

The Journey So Far

For the past thirty episodes, we’ve explored 145 machine learning concepts in under 30 seconds each.

From backpropagation to scaling laws. From dropout to distribution shift. From RAG to reward hacking.

We covered:

  • Foundations — the building blocks of neural networks and learning algorithms
  • Failure modes — how models break, overfit, forget, and hallucinate
  • Deployment realities — what happens when models meet production
  • Alignment challenges — ensuring models do what we actually want

What’s Changing

Machine learning is evolving rapidly. The foundational primitives are now well-established—the concepts we covered form a stable vocabulary.

But new research is reshaping how we apply these primitives:

  • Memory and retrieval architectures
  • Reasoning and planning systems
  • Sparsity and efficiency at scale
  • Robustness and generalization
  • Alignment and safety

30 seconds per concept was a good start. But some ideas deserve more depth.

What’s Next: Frontier ML Thinking

Starting soon: Frontier ML Thinking.

One concept. Two minutes. Deeper implications.

We’ll explore the cutting edge—ideas from papers published in the last 12 months that build on the foundations we’ve covered.

If You’re New Here

Start with Five ML Concepts Episodes 1–29. Each episode covers five concepts in five minutes total. The full series provides a foundation in modern machine learning vocabulary.

If You’ve Been Here the Whole Time

You’re ready for the frontier.


Why the Papers Look “Old”

When I tabulated the papers behind the 145 concepts in this series, something looked odd: almost none of the cited papers were from the last two years.

This is not a mistake—it’s a feature of how ML knowledge evolves.
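To see this concretely, here is a minimal sketch of that tabulation, assuming a concept-to-year mapping built from the index below (the `CONCEPT_YEARS` dictionary here is a hypothetical six-entry excerpt, not the full 145):

```python
from collections import Counter

# Hypothetical excerpt of the full concept -> seminal-paper-year index below.
CONCEPT_YEARS = {
    "Backpropagation": 1986,
    "LSTM": 1997,
    "Transformer": 2017,
    "RAG": 2020,
    "DPO": 2023,
    "MLA": 2024,
}

# Tally how many concepts trace back to each year, oldest first.
counts = Counter(CONCEPT_YEARS.values())
for year in sorted(counts):
    print(f"{year}: {counts[year]} concept(s)")

# How many origin papers fall in the last two years of the index?
recent = sum(n for year, n in counts.items() if year >= 2023)
print(f"From 2023 onward: {recent} of {len(CONCEPT_YEARS)}")
```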

Seminal papers don’t keep getting rewritten

Most concepts in this series are primitives: backpropagation, transformers, RAG, dropout, calibration, and so on. Each primitive has an origin paper that introduced it. Once the primitive exists, later research focuses on:

  • Scaling it
  • Combining it with other ideas
  • Benchmarking it
  • Making it more efficient
  • Making it safer

That kind of work produces new papers, but not new “origin papers.”

What this reveals about the field

The core intellectual breakthroughs of modern ML largely occurred between the mid-2010s and the early 2020s. The frontier has since shifted from inventing new primitives to:

  • Memory and retrieval systems
  • Continual learning
  • Agent architectures
  • Tool use and planning
  • Sparsity and efficiency at scale
  • Alignment and safety

That’s exactly what Frontier ML Thinking will explore: ideas from papers published in the last 12 months that build on these foundations.


Complete Concept Index

All 145 concepts organized chronologically by seminal paper year.

Pre-1990

1950s

1958

Concept | Post | Video | Paper
Perceptron | Post 5 | Video 5 | (1958) The Perceptron

1960s

None

1970s

None

1980s

1986

Concept | Post | Video | Paper
Backpropagation | Post 1 | Video 1 | (1986) Learning representations by back-propagating errors
RNN | Post 11 | Video 11 | (1986) Learning representations by back-propagating errors

1989

Concept | Post | Video | Paper
Universal Approximation | Post 13 | Video 13 | (1989) Approximation by Superpositions

1990s

1995

Concept | Post | Video | Paper
Cross-Validation | Post 7 | Video 7 | (1995) A Study of Cross-Validation

1997

Concept | Post | Video | Paper
LSTM | Post 22 | Video 22 | (1997) Long Short-Term Memory

1998

Concept | Post | Video | Paper
Early Stopping | Post 13 | Video 13 | (1998) Early Stopping - But When?

2000s

2000

Concept | Post | Video | Paper
Ensembling | Post 18 | Video 18 | (2000) Ensemble Methods

2002

Concept | Post | Video | Paper
Cold Start Problems | Post 14 | Video 14 | (2002) Addressing Cold Start

2003

Concept | Post | Video | Paper
Perplexity | Post 15 | Video 15 | (2003) A Neural Probabilistic Language Model

2006

Concept | Post | Video | Paper
Autoencoders | Post 19 | Video 19 | (2006) Reducing Dimensionality
ROC / AUC | Post 14 | Video 14 | (2006) An Introduction to ROC Analysis

2007

Concept | Post | Video | Paper
Precision vs Recall | Post 12 | Video 12 | (2007) The Truth of the F-Measure

2009

Concept | Post | Video | Paper
A/B Testing Models | Post 16 | Video 16 | (2009) Controlled Experiments
Bias-Variance Tradeoff | Post 8 | Video 8 | (2009) Elements of Statistical Learning
Correlation vs Causation | Post 19 | Video 19 | (2009) Causality
Covariate Shift | Post 19 | Video 19 | (2009) Dataset Shift in ML
Curriculum Learning | Post 19 | Video 19 | (2009) Curriculum Learning
Curse of Dimensionality | Post 15 | Video 15 | (2009) Elements of Statistical Learning
Distribution Shift | Post 11 | Video 11 | (2009) Dataset Shift in ML
Why ML Is Fragile | Post 18 | Video 18 | (2009) Distribution Shift
Why More Data Beats Better Models | Post 22 | Video 22 | (2009) Unreasonable Effectiveness of Data

2010s

2010

Concept | Post | Video | Paper
Transfer Learning | Post 4 | Video 4 | (2010) A Survey on Transfer Learning
Weight Initialization | Post 15 | Video 15 | (2010) Understanding Difficulty of Training

2011

Concept | Post | Video | Paper
Spurious Correlations | Post 14 | Video 14 | (2011) Unbiased Look at Dataset Bias

2012

Concept | Post | Video | Paper
CNN | Post 10 | Video 10 | (2012) ImageNet Classification with Deep CNNs
Data Leakage | Post 24 | Video 24 | (2012) Leakage in Data Mining

2013

Concept | Post | Video | Paper
Adversarial Examples | Post 25 | Video 25 | (2013) Intriguing properties of neural networks
Embedding | Post 1 | Video 1 | (2013) Word2Vec
Gradient Clipping | Post 14 | Video 14 | (2013) Difficulty of Training RNNs
Latent Space | Post 5 | Video 5 | (2013) Auto-Encoding Variational Bayes
Representation Learning | Post 25 | Video 25 | (2013) Representation Learning: A Review
VAEs | Post 20 | Video 20 | (2013) Auto-Encoding Variational Bayes

2014

Concept | Post | Video | Paper
Adam | Post 4 | Video 4 | (2014) Adam: Stochastic Optimization
Attention | Post 2 | Video 2 | (2014) Neural Machine Translation
Dropout | Post 9 | Video 9 | (2014) Dropout: Prevent Overfitting
Encoder-Decoder | Post 10 | Video 10 | (2014) Sequence to Sequence Learning
GRU | Post 21 | Video 21 | (2014) Gated Recurrent Neural Networks
Memory-Augmented Networks | Post 27 | Video 27 | (2014) Neural Turing Machines
Mode Collapse | Post 24 | Video 24 | (2014) Generative Adversarial Nets
Overfitting | Post 3 | Video 3 | (2014) Dropout
Regularization | Post 6 | Video 6 | (2014) Dropout
Temperature | Post 2 | Video 2 | (2014) Properties of Neural MT

2015

Concept | Post | Video | Paper
Batch Normalization | Post 16 | Video 16 | (2015) Batch Normalization
Distillation | Post 10 | Video 10 | (2015) Distilling Knowledge
Label Smoothing | Post 25 | Video 25 | (2015) Rethinking Inception
Learning Rate | Post 2 | Video 2 | (2015) Cyclical Learning Rates
Tokenization | Post 3 | Video 3 | (2015) Subword Units

2016

Concept | Post | Video | Paper
Activation Functions | Post 4 | Video 4 | (2016) Deep Learning Book
Benchmark Leakage | Post 17 | Video 17 | (2016) Rethinking Inception
Checkpointing | Post 13 | Video 13 | (2016) Sublinear Memory Cost
Epoch | Post 18 | Video 18 | (2016) Deep Learning Book
Gradient Descent | Post 2 | Video 2 | (2016) Overview of Gradient Descent
Inference | Post 9 | Video 9 | (2016) Deep Learning Book
Learning Rate Schedules | Post 23 | Video 23 | (2016) SGDR: Warm Restarts
Loss Surface Sharpness | Post 23 | Video 23 | (2016) Large-Batch Training
Reward Hacking | Post 24 | Video 24 | (2016) Concrete Problems in AI Safety
Softmax | Post 11 | Video 11 | (2016) Deep Learning Book
Train/Validation/Test Split | Post 16 | Video 16 | (2016) Deep Learning Book

2017

Concept | Post | Video | Paper
Batch Size | Post 12 | Video 12 | (2017) Large-Batch Training
Calibration | Post 13 | Video 13 | (2017) On Calibration
Catastrophic Forgetting | Post 15 | Video 15 | (2017) Overcoming Catastrophic Forgetting
Conditional Computation | Post 28 | Video 28 | (2017) Sparsely-Gated MoE
Context Window | Post 7 | Video 7 | (2017) Attention Is All You Need
Elastic Weight Consolidation | Post 27 | Video 27 | (2017) Overcoming Catastrophic Forgetting (EWC)
Gradient Noise | Post 20 | Video 20 | (2017) SGD as Approximate Bayesian Inference
Loss Function | Post 3 | Video 3 | (2017) Survey of Loss Functions
Miscalibration | Post 25 | Video 25 | (2017) On Calibration
Mixed Precision | Post 8 | Video 8 | (2017) Mixed Precision Training
MoE | Post 11 | Video 11 | (2017) Sparsely-Gated MoE
OOD Inputs | Post 12 | Video 12 | (2017) Detecting Misclassified Examples
Optimization vs Generalization | Post 16 | Video 16 | (2017) Rethinking Generalization
Overconfidence | Post 16 | Video 16 | (2017) On Calibration
Parameter Routing | Post 27 | Video 27 | (2017) Sparsely-Gated MoE
Positional Encoding | Post 6 | Video 6 | (2017) Attention Is All You Need
Self-Attention | Post 7 | Video 7 | (2017) Attention Is All You Need
Sparse Activation | Post 28 | Video 28 | (2017) Sparsely-Gated MoE
Transformer | Post 1 | Video 1 | (2017) Attention Is All You Need
Uncertainty Estimation | Post 20 | Video 20 | (2017) Uncertainties in Bayesian DL
Warmup | Post 24 | Video 24 | (2017) Accurate Large Minibatch SGD
Why Interpretability Is Hard | Post 20 | Video 20 | (2017) Rigorous Science of Interpretability

2018

Concept | Post | Video | Paper
BERT | Post 6 | Video 6 | (2018) BERT: Pre-training
Concept Drift vs Data Drift | Post 17 | Video 17 | (2018) Learning under Concept Drift
Inductive Bias | Post 12 | Video 12 | (2018) Relational Inductive Biases
Loss Landscapes | Post 14 | Video 14 | (2018) Visualizing Loss Landscape
Pre-training | Post 5 | Video 5 | (2018) BERT

2019

Concept | Post | Video | Paper
Data Augmentation | Post 26 | Video 26 | (2019) Survey on Data Augmentation
Double Descent | Post 25 | Video 25 | (2019) Deep Double Descent
GPT | Post 7 | Video 7 | (2019) Language Models are Unsupervised Multitask Learners
Inference Parallelism | Post 28 | Video 28 | (2019) Megatron-LM
Lottery Ticket Hypothesis | Post 28 | Video 28 | (2019) The Lottery Ticket Hypothesis
Manifold Hypothesis | Post 26 | Video 26 | (2019) Intro to VAEs
Monitoring & Drift Detection | Post 15 | Video 15 | (2019) Detecting Dataset Shift
Replay Buffers | Post 27 | Video 27 | (2019) Experience Replay
Weight Decay | Post 17 | Video 17 | (2019) Decoupled Weight Decay

2020s

2020

Concept | Post | Video | Paper
Diffusion Models | Post 8 | Video 8 | (2020) Denoising Diffusion
Few-shot Learning | Post 10 | Video 10 | (2020) Language Models are Few-Shot Learners
Fine-tuning | Post 3 | Video 3 | (2020) Survey on Transfer Learning
ICL (In-Context Learning) | Post 5 | Video 5 | (2020) Language Models are Few-Shot Learners
Neural Collapse | Post 29 | Video 29 | (2020) Prevalence of Neural Collapse
Preference Learning | Post 18 | Video 18 | (2020) Learning to Summarize
Prompting | Post 6 | Video 6 | (2020) Language Models are Few-Shot Learners
RAG | Post 10 | Video 10 | (2020) Retrieval-Augmented Generation
Scaling Laws | Post 17 | Video 17 | (2020) Scaling Laws for Neural Language Models
Self-Training Instability | Post 29 | Video 29 | (2020) Understanding Self-Training
Shortcut Learning | Post 13 | Video 13 | (2020) Shortcut Learning in DNNs

2021

Concept | Post | Video | Paper
Failure Analysis | Post 19 | Video 19 | (2021) Practical ML for CV
Human-in-the-Loop Systems | Post 20 | Video 20 | (2021) Human-in-the-Loop ML
Latency vs Throughput | Post 12 | Video 12 | (2021) Efficient Large-Scale Training
LoRA | Post 3 | Video 3 | (2021) LoRA: Low-Rank Adaptation
Mechanistic Interpretability | Post 29 | Video 29 | (2021) Transformer Circuits
Quantization | Post 9 | Video 9 | (2021) Survey of Quantization Methods
RoPE | Post 6 | Video 6 | (2021) RoFormer
SAM | Post 29 | Video 29 | (2021) Sharpness-Aware Minimization
VLM | Post 4 | Video 4 | (2021) CLIP

2022

Concept | Post | Video | Paper
Chain of Thought | Post 11 | Video 11 | (2022) Chain-of-Thought Prompting
Compute Optimality Hypothesis | Post 28 | Video 28 | (2022) Chinchilla
Constitutional AI | Post 26 | Video 26 | (2022) Constitutional AI
Cost vs Quality Tradeoffs | Post 18 | Video 18 | (2022) Efficient Transformers
Emergent Behavior | Post 23 | Video 23 | (2022) Emergent Abilities
Flash Attention | Post 9 | Video 9 | (2022) FlashAttention
Goodhart’s Law | Post 26 | Video 26 | (2022) Goodhart’s Law and ML
Grokking | Post 29 | Video 29 | (2022) Grokking
KV Cache | Post 8 | Video 8 | (2022) Fast Transformer Decoding
RLHF | Post 9 | Video 9 | (2022) Training with Human Feedback
Shadow Deployment | Post 17 | Video 17 | (2022) Reliable ML
Speculative Decoding | Post 5 | Video 5 | (2022) Fast Inference via Speculative Decoding
Superposition | Post 4 | Video 4 | (2022) Toy Models of Superposition

2023

Concept | Post | Video | Paper
DPO | Post 2 | Video 2 | (2023) Direct Preference Optimization
GQA | Post 7 | Video 7 | (2023) GQA: Training Generalized Multi-Query
Hallucination | Post 1 | Video 1 | (2023) Survey of Hallucination
Jailbreaks | Post 21 | Video 21 | (2023) Jailbroken
Mamba | Post 1 | Video 1 | (2023) Mamba: Linear-Time Sequence Modeling
Model Editing | Post 27 | Video 27 | (2023) Editing LLMs
Model Steerability | Post 22 | Video 22 | (2023) Controllable Generation
Planning vs Prediction | Post 21 | Video 21 | (2023) AI/ML Gap
Prompt Injection | Post 21 | Video 21 | (2023) Prompt Injection Attack
RSFT | Post 22 | Video 22 | (2023) Scaling Mathematical Reasoning
Tool Use | Post 23 | Video 23 | (2023) Toolformer

2024

Concept | Post | Video | Paper
MLA | Post 8 | Video 8 | (2024) DeepSeek-V2

2025 and Beyond

Since 2024, no widely adopted new fundamental ML concepts have emerged. Research has shifted from inventing primitives to composing, scaling, and applying them. Papers from 2025–2026 will be covered in our new series, Frontier ML Thinking: one concept, two minutes, deeper implications.


Thank you for following along. The journey continues.