5 machine learning concepts. Under 30 seconds each.

Resource | Link
Papers | Links in References section
Video | Five ML Concepts #28
Comments | Discord

References

Concept | Reference
Lottery Ticket Hypothesis | The Lottery Ticket Hypothesis (Frankle & Carbin 2019)
Sparse Activation | Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
Conditional Computation | Sparsely-Gated MoE (Shazeer et al. 2017) + Switch Transformers (Fedus et al. 2021)
Inference Parallelism | Megatron-LM (Shoeybi et al. 2019)
Compute Optimality | Chinchilla Scaling Laws (Hoffmann et al. 2022)

Today’s Five

1. Lottery Ticket Hypothesis

Large neural networks contain smaller subnetworks that, when trained in isolation from their original initialization, reach accuracy comparable to the full network. These “winning tickets” exist before training begins.

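The key insight: in principle you could train just the winning subnetwork. In practice, the original method finds it by training the full network, pruning the smallest-magnitude weights, and resetting the surviving weights to their initial values.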

Like finding a winning lottery ticket hidden among many losing ones.
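
A minimal sketch of the prune-and-rewind step behind winning tickets, in plain Python. The weight lists are made-up stand-ins for a real network, and `magnitude_mask` and `winning_ticket` are hypothetical helper names. The procedure in the paper is iterative and retrains the ticket after each round; this shows a single round only.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def winning_ticket(init_weights, trained_weights, sparsity):
    """Prune by trained magnitude, then rewind the survivors to their initial values."""
    mask = magnitude_mask(trained_weights, sparsity)
    ticket = [w0 if m else 0.0 for w0, m in zip(init_weights, mask)]
    return mask, ticket


# Toy values (assumed for illustration): weights before and after training.
init = [0.5, -0.1, 0.3, -0.7, 0.2, 0.05]
trained = [0.9, -0.02, 0.01, -1.2, 0.4, 0.03]

mask, ticket = winning_ticket(init, trained, sparsity=0.5)
print(mask)    # which weights survive
print(ticket)  # surviving weights at their initial values
```

The candidate ticket is the masked network at its initial values; whether it really "wins" is only known after training it and comparing against the full model.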

2. Sparse Activation

Only a subset of neurons activates for each input, even in models with many parameters. This gives the model large capacity while each input pays the compute cost of only the active part.

Mixture-of-experts architectures explicitly design for this pattern.

Like a library where only relevant books light up for each query.
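
One way to picture sparse activation is top-k gating, as used in mixture-of-experts layers. The sketch below assumes eight experts and made-up gate scores; `top_k_gate` is a hypothetical helper, not an API from any library.

```python
import math


def top_k_gate(scores, k):
    """Keep the k highest gate scores, softmax over them, and zero out the rest."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


# Assumed gate scores for a single input, one score per expert.
gate_scores = [0.1, 2.0, -1.0, 1.5, 0.3, -0.5, 0.0, 0.7]

weights = top_k_gate(gate_scores, k=2)
active = [i for i, w in enumerate(weights) if w > 0]
print(active)   # indices of the experts that run for this input
print(weights)  # mixing weights; inactive experts get exactly 0.0
```

With k=2 of 8 experts, six experts contribute nothing for this input, so their parameters exist but are never touched.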

3. Conditional Computation

The model dynamically activates only certain components depending on the input. Different inputs route to different experts or pathways.

This lets the parameter count grow without a proportional increase in compute per input.

Like routing patients to the right specialist instead of seeing every doctor.
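
A rough sketch of input-dependent routing, assuming top-1 routing in the spirit of Switch Transformers. The experts here are toy scaling functions and the router weights are invented; `make_expert`, `route`, and `forward` are hypothetical names.

```python
def make_expert(scale):
    """Toy expert: multiplies every feature by a fixed scale."""
    def expert(x):
        return [scale * v for v in x]
    return expert


def route(x, router_weights):
    """Pick the expert whose router row has the largest dot product with x."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    return max(range(len(scores)), key=lambda i: scores[i])


def forward(batch, experts, router_weights):
    """Run each input through exactly one expert and count calls per expert."""
    calls = [0] * len(experts)
    outputs = []
    for x in batch:
        idx = route(x, router_weights)
        calls[idx] += 1
        outputs.append(experts[idx](x))
    return outputs, calls


experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # assumed, not learned
batch = [[2.0, 0.5], [0.1, 3.0], [-1.0, -2.0], [4.0, 1.0]]

outputs, calls = forward(batch, experts, router_weights)
print(calls)  # expert calls sum to the batch size, however many experts exist
```

Adding more experts raises the parameter count, but each input still triggers one expert call, which is the efficiency argument in miniature.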

4. Inference Parallelism

Model execution can be split across multiple devices to reduce latency or increase throughput. Tensor parallelism splits the weight matrices inside each layer across devices; pipeline parallelism assigns consecutive groups of layers (stages) to different devices.

Essential for serving large models in production.

Like dividing a puzzle so multiple people work on it simultaneously.
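
To make the tensor-parallel case concrete, here is a small sketch that splits a weight matrix by columns across two pretend devices and checks that the stitched result matches the unsplit product. Everything runs in one process, the "devices" are just list slices, and the helper names are hypothetical. It assumes the column count divides evenly by the number of parts.

```python
def matmul(a, b):
    """Plain nested-list matrix product: rows of a times columns of b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split weight matrix w into `parts` equal column blocks, one per device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each device multiplies x by its own column shard; results are concatenated."""
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]  # would run concurrently
    # Concatenate the partial outputs column-wise (the "all-gather" step).
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
print(full == sharded)
```

Pipeline parallelism would instead keep each layer whole and hand activations from one device to the next, which this sketch does not cover.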

5. Compute Optimality Hypothesis

Empirical scaling laws suggest that, for a fixed compute budget, performance is best when model size and training data are scaled together. Adding only one resource may not yield optimal gains.

Chinchilla showed that many large language models were undertrained: they saw too little data for their parameter count.

Like baking a cake where proportions matter more than just adding extra ingredients.
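
A back-of-the-envelope sketch of the balance, using two widely quoted approximations rather than the paper's fitted curves: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and roughly 20 tokens per parameter at the optimum. Both constants are assumptions here, and `compute_optimal` is a hypothetical helper.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a FLOP budget C into parameters N and tokens D.

    Assumes C = flops_per_param_token * N * D and D = tokens_per_param * N,
    which gives N = sqrt(C / (flops_per_param_token * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Illustrative budget, chosen only for the example.
n, d = compute_optimal(5.76e23)
print(f"params ≈ {n:.3g}, tokens ≈ {d:.3g}")
```

Under these assumptions, quadrupling the compute budget doubles both the suggested parameter count and the suggested token count, instead of pouring everything into model size.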

Quick Reference

Concept | One-liner
Lottery Ticket Hypothesis | Small winning subnetworks hidden in large models
Sparse Activation | Using only part of a model per input
Conditional Computation | Dynamically routing inputs for efficiency
Inference Parallelism | Distributing inference across devices
Compute Optimality | Balancing model size, data, and compute

Short, accurate ML explainers. Follow for more.