5 machine learning concepts. Under 30 seconds each.

Resource: Link
Papers: Links in References section
Video: Five ML Concepts #7

References

Concept: Reference
Cross-Validation: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi 1995)
GPT: Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
GQA: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al. 2023)
Context Window: Attention Is All You Need (Vaswani et al. 2017)
Self-Attention: Attention Is All You Need (Vaswani et al. 2017)

Today’s Five

1. Cross-Validation

A technique that splits data into multiple folds to evaluate model performance on data it wasn’t trained on. By rotating which data is held out, it gives a more reliable estimate of generalization.

Essential for honest model evaluation.

Like practicing with different sets of flashcards to see if you actually learned the material.
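The rotation can be sketched in a few lines. A minimal k-fold loop on toy data (the linear toy problem and `k_fold_indices` helper are illustrative, not from any particular library):

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal folds."""
    return np.array_split(np.arange(n_samples), k)

# Toy 1-D data: y = 2x, fit with a least-squares slope.
X = np.arange(10, dtype=float)
y = 2.0 * X

scores = []
for test_idx in k_fold_indices(len(X), 5):
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    # Fit on the training folds only.
    slope = (X[train_idx] @ y[train_idx]) / (X[train_idx] @ X[train_idx])
    # Score on the held-out fold the model never saw.
    mse = np.mean((slope * X[test_idx] - y[test_idx]) ** 2)
    scores.append(mse)

print(np.mean(scores))  # average held-out error across all 5 folds
```

Every sample gets held out exactly once, so the averaged score reflects performance on unseen data rather than memorization.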

2. GPT

Generative Pre-trained Transformer. A family of autoregressive language models trained to predict the next token in a sequence.

Many AI assistants and chatbots are built on this approach.

Like autocomplete, but scaled up and trained on vast text data.
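The autoregressive loop itself is simple. In this sketch a hypothetical bigram lookup table stands in for the transformer's learned next-token distribution; real GPT models predict a probability over the whole vocabulary at each step:

```python
# Stand-in "model": maps each token to its most likely successor.
bigram = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
}

def generate(prompt_tokens, max_new_tokens):
    """Repeatedly predict the next token and append it, GPT-style."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = bigram.get(tokens[-1])
        if next_token is None:  # no known continuation: stop early
            break
        tokens.append(next_token)
    return tokens

print(generate(["the"], 3))  # ['the', 'cat', 'sat', 'down']
```

The key structural point survives the simplification: each new token is conditioned on everything generated so far, one step at a time.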

3. GQA (Grouped Query Attention)

An attention variant where multiple query heads share key and value projections. This reduces memory usage and can speed up inference compared to standard multi-head attention.

Widely adopted in efficient transformer architectures.

Like several students sharing one set of notes instead of copying everything separately.
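The sharing can be shown directly. A minimal numpy sketch (random projections, no masking or learned weights) where 8 query heads share 2 key/value heads, 4 queries per group:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_head = 4, 8
n_q_heads, n_kv_heads = 8, 2          # 4 query heads per shared K/V head
group = n_q_heads // n_kv_heads

Q = rng.normal(size=(n_q_heads, seq, d_head))
K = rng.normal(size=(n_kv_heads, seq, d_head))  # only 2 K/V heads stored
V = rng.normal(size=(n_kv_heads, seq, d_head))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

outputs = []
for h in range(n_q_heads):
    kv = h // group                   # which shared K/V head this query uses
    scores = Q[h] @ K[kv].T / np.sqrt(d_head)
    outputs.append(softmax(scores) @ V[kv])

out = np.stack(outputs)
print(out.shape)  # (8, 4, 8): full query heads, but K/V stored only twice
```

The memory win is in the KV cache: at inference you store 2 key/value heads instead of 8, while keeping all 8 query heads' expressiveness.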

4. Context Window

The maximum number of tokens a model can process in a single forward pass. Larger context windows allow longer inputs but increase memory and compute costs, which grow quadratically with length for standard attention.

A key constraint in language model design.

Like the size of a desk that limits how many papers you can spread out at once.
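In practice this constraint shows up as truncation. A minimal sketch (the `fit_to_context` helper and the 4096-token budget are illustrative assumptions, not any specific model's API):

```python
def fit_to_context(tokens, max_context, reserve_for_output=16):
    """Keep only the most recent tokens that fit the window,
    leaving room for the model's generated tokens."""
    budget = max_context - reserve_for_output
    return tokens[-budget:] if len(tokens) > budget else tokens

tokens = list(range(5000))            # stand-in for a long tokenized document
window = fit_to_context(tokens, max_context=4096)
print(len(window))  # 4080: the oldest 920 tokens were dropped
```

Keeping the most recent tokens is the simplest policy; real systems may instead summarize, chunk, or retrieve relevant passages to work around the window.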

5. Self-Attention

A mechanism where each token computes attention scores against every token in the same sequence, including itself. This lets the model weigh which parts of the input are most relevant to each position.

The core operation inside transformers.

Like everyone in a meeting deciding who to listen to based on the conversation.
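The core math fits in a few lines. A single-head sketch with identity query/key/value projections for brevity (real transformers use learned projection matrices):

```python
import numpy as np

def self_attention(X):
    """Each row (token) of X attends over all rows of X."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)               # token-to-token similarity
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax over positions
    return w @ X                                # weighted mix of token vectors

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                      # 3 tokens, 2-dim embeddings
out = self_attention(X)
print(out.shape)  # (3, 2): one mixed vector per token
```

Each output row is a convex combination of the input rows, weighted by how strongly that token "listens to" every other token.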

Quick Reference

Concept: One-liner
Cross-Validation: Rotate held-out data for reliable evaluation
GPT: Predict next token, at scale
GQA: Shared keys/values for efficient attention
Context Window: How much the model sees at once
Self-Attention: Each token attends to all others

Short, accurate ML explainers. Follow for more.