Five ML Concepts - #28

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #28 |
| Comments | Discord |
References
| Concept | Reference |
|---|---|
| Lottery Ticket Hypothesis | The Lottery Ticket Hypothesis (Frankle & Carbin 2019) |
| Sparse Activation | Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017) |
| Conditional Computation | Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017); Switch Transformers (Fedus et al. 2021) |
| Inference Parallelism | Megatron-LM (Shoeybi et al. 2019) |
| Compute Optimality | Chinchilla Scaling Laws (Hoffmann et al. 2022) |
Today’s Five
1. Lottery Ticket Hypothesis
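Large neural networks contain smaller subnetworks that, when trained in isolation from their original initialization, reach accuracy comparable to the full network. These “winning tickets” exist before training begins.
The key insight: once a winning subnetwork is identified, it can be trained on its own. In the original paper, tickets are found by training the full network, pruning the low-magnitude weights, and resetting the surviving weights to their initial values.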
Like finding a winning lottery ticket hidden among many losing ones.
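The prune-and-reset step can be sketched in a few lines. This is a minimal illustration, not the paper's full procedure: the weights are made-up numbers, `magnitude_mask` and `winning_ticket` are hypothetical helper names, and a single pruning round stands in for the iterative pruning used in practice.

```python
def magnitude_mask(trained, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights."""
    k = max(1, round(len(trained) * keep_fraction))
    # Indices of the k weights with the largest absolute value after training.
    top = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)[:k]
    keep = set(top)
    return [1 if i in keep else 0 for i in range(len(trained))]


def winning_ticket(initial, trained, keep_fraction):
    """Apply the mask to the INITIAL weights, not the trained ones."""
    mask = magnitude_mask(trained, keep_fraction)
    ticket = [w0 * m for w0, m in zip(initial, mask)]
    return ticket, mask


# Made-up weights for illustration only.
initial = [0.10, -0.20, 0.05, 0.30, -0.15, 0.02]
trained = [0.90, -0.01, 0.02, -1.20, 0.03, 0.40]

ticket, mask = winning_ticket(initial, trained, keep_fraction=0.5)
print(mask)    # which connections survive pruning
print(ticket)  # surviving connections, reset to their initial values
```

The subnetwork defined by `mask` would then be retrained starting from `ticket`.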
2. Sparse Activation
Only a subset of neurons activates for each input, even in models with many parameters. This gives the model large capacity without using every parameter on every input.
Mixture-of-experts architectures explicitly design for this pattern.
Like a library where only relevant books light up for each query.
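A rough sketch of how a mixture-of-experts gate produces sparse activation: keep the top-k expert scores, apply a softmax over only those, and zero out the rest. The scores are invented, `top_k_gates` is a hypothetical name, and the noise term used in Shazeer et al.'s gating is left out for brevity.

```python
import math


def top_k_gates(scores, k):
    """Softmax over the k highest scores; every other gate is exactly 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


# Made-up gate scores for four experts.
gates = top_k_gates([1.0, 3.0, 0.5, 2.0], k=2)
active = [i for i, g in enumerate(gates) if g > 0]
print(gates)
print(active)  # indices of the experts that would actually run
```

Experts with a gate of zero contribute nothing to the output, so they never need to be evaluated.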
3. Conditional Computation
The model dynamically activates only certain components depending on the input. Different inputs route to different experts or pathways.
This lets the parameter count grow without a proportional increase in compute per input.
Like routing patients to the right specialist instead of seeing every doctor.
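To make the compute saving concrete, here is a toy top-1 router in the spirit of Switch-style routing. Everything here is an assumption for illustration: the "experts" just scale their input, the router picks the expert whose index matches the largest input feature, and call counters stand in for compute cost.

```python
def make_expert(scale):
    """Build a toy expert that scales its input and counts how often it runs."""
    def expert(x):
        expert.calls += 1
        return [scale * v for v in x]
    expert.calls = 0
    return expert


experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]


def route(x):
    """Hypothetical router: pick the expert at the index of the largest feature."""
    return max(range(len(x)), key=lambda i: x[i])


def forward(x):
    # Only the selected expert is evaluated for this input.
    return experts[route(x)](x)


batch = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8], [0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
outputs = [forward(x) for x in batch]

calls = [e.calls for e in experts]
dense_calls = len(batch) * len(experts)  # cost if every expert saw every input
print(calls, sum(calls), dense_calls)
```

With top-1 routing the number of expert evaluations equals the batch size, regardless of how many experts exist.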
4. Inference Parallelism
Model execution can be split across multiple devices to reduce latency or increase throughput. Tensor parallelism splits the weights within a layer across devices; pipeline parallelism assigns consecutive groups of layers (stages) to different devices.
Essential for serving large models in production.
Like dividing a puzzle so multiple people work on it simultaneously.
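The two splitting strategies can be illustrated without any real hardware. In this sketch the "devices" are just separate Python lists and nothing runs concurrently; the weights `W` and `W2` are made up, and the point is only to show where the model is cut.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


W = [[1, 2], [3, 4], [5, 6], [7, 8]]  # layer 1: 2 inputs -> 4 outputs
W2 = [[1, 0, 0, 1]]                   # layer 2: 4 inputs -> 1 output
x = [1, 1]

# Single device: the whole layer in one place.
full = matvec(W, x)

# Tensor parallelism: each device holds half of layer 1's output rows,
# computes its slice, and the slices are concatenated.
shard_a, shard_b = W[:2], W[2:]
tensor_parallel = matvec(shard_a, x) + matvec(shard_b, x)

# Pipeline parallelism: layer 1 lives on one device, layer 2 on another,
# and activations are handed from stage to stage.
def stage1(inp):
    return matvec(W, inp)

def stage2(hidden):
    return matvec(W2, hidden)

pipeline = stage2(stage1(x))

print(full, tensor_parallel, pipeline)
```

Tensor parallelism reproduces the single-device result for one layer, while pipeline parallelism keeps layers whole and splits the model by depth.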
5. Compute Optimality Hypothesis
Empirical scaling laws suggest that, for a fixed compute budget, performance is best when model size and training data are scaled together. Growing only one of them yields diminishing returns.
Chinchilla showed that many large models were undertrained relative to their size: they had too many parameters for the number of tokens they were trained on.
Like baking a cake where proportions matter more than just adding extra ingredients.
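A back-of-the-envelope version of this balance, under two assumptions that are not from this post: training compute is approximated as C ≈ 6·N·D FLOPs (N parameters, D tokens), and the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter. The function name and defaults are hypothetical, and real scaling-law fits are more nuanced.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget into (parameters, tokens) under C = 6 * N * D and D = 20 * N.

    Both constants are rule-of-thumb assumptions, not exact values.
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget chosen to match roughly 70B parameters trained on 1.4T tokens.
n, d = compute_optimal(5.88e23)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions, both parameters and tokens grow with the square root of the compute budget, so doubling compute does not mean doubling model size.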
Quick Reference
| Concept | One-liner |
|---|---|
| Lottery Ticket Hypothesis | Small winning subnetworks hidden in large models |
| Sparse Activation | Using only part of a model per input |
| Conditional Computation | Dynamically routing inputs for efficiency |
| Inference Parallelism | Distributing inference across devices |
| Compute Optimality | Balancing model size, data, and compute |
Short, accurate ML explainers. Follow for more.
Part 28 of the Five ML Concepts series.
Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.
