It was 2020 when GPT-3 shocked everyone: it could learn from examples placed directly in the prompt, without updating its weights. We called it In-Context Learning (ICL). But was it magic, or was the model doing something deeper?


Phase 1: The Empirical Discovery (2020)

The GPT-3 paper showed that large models could perform few-shot learning. Give them examples, and they generalize. No gradient updates. No retraining. Just forward passes.

The surprising part was that scaling alone seemed to unlock it.

Paper: Language Models are Few-Shot Learners

ELI5: Show a big language model a few examples of a task in your prompt, and it figures out how to do the task—without any retraining. Nobody told it to do this. It just emerged when models got big enough.

Main idea: Scale unlocks emergent capabilities. ICL was discovered, not designed.
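The mechanic behind few-shot prompting is simple to sketch: concatenate labeled demonstrations and a query into a single prompt, and let the model's next-token prediction complete the pattern. A minimal prompt builder, with an illustrative Input/Output format and toy examples (not taken from the paper):

```python
def build_few_shot_prompt(demos, query):
    """Format (input, output) demonstrations plus a final query into one prompt.

    The model sees the pattern and is expected to continue it after the
    trailing "Output:" marker. No weights change; the "learning" lives
    entirely in the prompt.
    """
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Toy English-to-French demonstrations (illustrative only).
demos = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt(demos, "cat")
```

The prompt ends mid-pattern, so a capable model completes it with the translation of the query word.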

Phase 2: Mechanistic Explanations (2022)

By 2022, researchers began probing the internal mechanisms. Several papers proposed that transformers implement implicit meta-learning. The model appears to learn during inference by performing gradient-descent-like operations internally.

Paper: What Explains In-Context Learning in Transformers?

ELI5: When you give a transformer examples, its attention layers do something that looks like fitting a simple linear model to those examples—on the fly, during the forward pass. It’s not memorizing; it’s computing a mini-solution.

Main idea: ICL works because attention can simulate linear regression internally.

Paper: Transformers Learn In-Context by Gradient Descent

ELI5: The transformer’s forward pass is secretly doing something similar to training. The attention mechanism acts like one step of gradient descent over the examples you provided. Learning happens inside inference.

Main idea: ICL is implicit gradient descent—learning hidden inside prediction.
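The core identity behind both claims can be checked numerically in a toy setting. Starting from a zero-initialized linear model, one gradient-descent step on least-squares over the in-context examples makes exactly the same prediction as one pass of linear (un-normalized) attention with keys = inputs, values = targets, and query = the test point. This is a simplified sketch of the construction argued in these papers, not their full result:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))      # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                   # in-context scalar targets y_i
x_q = rng.normal(size=d)         # query point
eta = 0.1                        # learning rate

# One GD step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, starting from w0 = 0:
#   w1 = w0 - eta * grad L(w0) = eta * sum_i y_i * x_i
w1 = eta * (y @ X)
pred_gd = w1 @ x_q

# Linear attention over the demonstrations, keys = x_i, values = y_i:
#   output = eta * sum_i y_i * (x_i . x_q)
pred_attn = eta * np.sum(y * (X @ x_q))

# The two predictions coincide: the attention pass IS the gradient step.
```

By linearity, `w1 @ x_q = eta * sum_i y_i (x_i . x_q)`, so the two computations are the same expression rearranged, which is why "learning hidden inside prediction" is more than a metaphor here.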

Phase 3: Engineering the Effect (2023-24)

Once researchers understood that ordering and structure affect ICL, prompt design became less of an art and more of an optimization problem. The quality and arrangement of demonstrations directly shape performance.

ICL became tunable. Researchers could now deliberately improve it rather than just observe it.
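One concrete form of that optimization is demonstration selection: instead of fixed examples, pick the demonstrations most similar to each query. A minimal sketch, assuming a retrieval-by-similarity setup; the bag-of-words embedding here is a toy stand-in for the sentence encoders real systems use:

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding over a fixed vocabulary (illustrative)."""
    v = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def select_demos(pool, query, vocab, k=2):
    """Return the k demonstrations whose inputs are most similar (cosine)
    to the query. These become the in-context examples for this query."""
    q = embed(query, vocab)
    return sorted(pool, key=lambda d: -(embed(d[0], vocab) @ q))[:k]

vocab = ["capital", "country", "color", "animal"]
pool = [
    ("name the capital of france", "Paris"),
    ("what color is the sky", "blue"),
    ("name the capital of spain", "Madrid"),
]
best = select_demos(pool, "name the capital of italy", vocab, k=2)
```

The selected demonstrations would then be formatted into the prompt, so prompt construction becomes a retrieval problem with a measurable objective.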

Phase 4: Interactive ICL (2026)

Recent work pushes this further. Models are trained to predict natural language critiques and feedback. If a model can predict what a teacher would say, it can internalize that signal. External correction becomes an internal capability.

Paper: Improving Interactive In-Context Learning from Natural Language Feedback

ELI5: Train a model to guess what feedback a human would give. Now the model has internalized the “teacher” and can improve itself without needing the actual teacher present. Self-correction without weight updates.

Main idea: Models can learn to learn from feedback, making ICL interactive and self-improving.
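The resulting inference-time loop can be sketched abstractly. Here `generate` and `critique` are hypothetical stand-ins for model calls (the critic being the model's learned prediction of teacher feedback); the loop structure, not the stubs, is the point:

```python
def refine(generate, critique, prompt, max_rounds=3):
    """Draft, predict feedback, revise; stop when the internal critic
    predicts no further feedback. No weight updates occur."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if feedback is None:  # critic predicts the teacher would be satisfied
            break
        draft = generate(
            f"{prompt}\nDraft: {draft}\nFeedback: {feedback}\nRevised:"
        )
    return draft

# Toy stubs standing in for actual model calls (illustrative only).
def toy_generate(p):
    return "revised answer" if "Feedback:" in p else "first answer"

def toy_critique(p, d):
    return None if d == "revised answer" else "add detail"

final = refine(toy_generate, toy_critique, "Summarize the paper.")
```

The external teacher has been folded into `critique`: self-correction driven entirely by in-context signals.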

Beyond Language

Newer work applies ICL to neuroscience discovery, showing that the mechanism is not limited to text tasks. It becomes a flexible reasoning substrate across domains. That’s when you know a concept has matured.

The Arc

Phase             Era      Key Insight
Discovery         2020     Emerges from scale
Explanation       2022     Implicit gradient descent
Engineering       2023-24  Prompt design as optimization
Self-improvement  2026     Learning from feedback

The Deeper Insight

In-Context Learning started as an emergent surprise. Now it’s becoming an engineered learning substrate inside transformers.

It was not magic. It was meta-learning hiding in plain sight.

References

Paper                                                     Link
Language Models are Few-Shot Learners (GPT-3)             arXiv:2005.14165
What Explains In-Context Learning in Transformers?        arXiv:2202.12837
Transformers Learn In-Context by Gradient Descent         arXiv:2212.07677
Improving Interactive ICL from Natural Language Feedback  arXiv:2602.16066

ICL started as "whoa, it works." Then we came to understand why it works. Now we are engineering it deliberately.