ML Frontier #02: In-Context Reinforcement Learning
823 words • 5 min read

Second ML Frontier episode. This one covers In-Context Reinforcement Learning—how transformers learn decision policies from trajectory examples in the prompt, without weight updates.
| Resource | Link |
|---|---|
| Papers | 5 papers covered |
| Video | ML Frontier 2: ICRL |
| Related | Saw (2/?): agentrail-rs — practical ICRL application |
| Comments | Discord |
What is In-Context Reinforcement Learning?
The model observes sequences of states, actions, and rewards stored in the prompt. Instead of updating weights through gradient descent, the transformer approximates a policy from those trajectory examples.
Think of it like learning to cook by reading a journal of recipes that worked—and ones that didn’t—with notes on what went wrong.
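To make the idea concrete, here is a minimal sketch of how trajectories might be serialized into a prompt. The text format and example states are invented for illustration; the point is that the model conditions on past (state, action, reward) experience placed in its context window, not that any particular syntax is required.

```python
# Sketch: serializing (state, action, reward) trajectories into prompt text.
# The format here is hypothetical -- any consistent encoding works, since
# the model learns from the examples in context, not from weight updates.

def format_trajectory(steps):
    """Render one trajectory, one step per line."""
    lines = []
    for state, action, reward in steps:
        lines.append(f"state: {state} | action: {action} | reward: {reward}")
    return "\n".join(lines)

def build_prompt(trajectories, current_state):
    """Concatenate past trajectories, then ask for the next action."""
    examples = "\n---\n".join(format_trajectory(t) for t in trajectories)
    return f"{examples}\n---\nstate: {current_state} | action:"

good_run = [("file missing", "create file", 1), ("file exists", "write data", 1)]
bad_run = [("file missing", "write data", -1)]
prompt = build_prompt([good_run, bad_run], "file missing")
print(prompt)
```

The reward signal is what separates this from plain few-shot prompting: the model sees both what worked and what failed, and can weight the examples accordingly.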
Why Does This Matter?
AI agents often lose procedural knowledge when context is truncated between sessions. They know the goal but forget which API to call, which flags to use, which client library to reference, or how to validate output.
The traditional approach—writing instructions in markdown files—isn’t reliable. Agents ignore rules even when they’re present. ICRL offers a different path: instead of telling the agent what to do, show it what worked and what didn’t, with reward signals attached.
By embedding successful execution traces in the prompt, agents can reuse proven approaches instead of improvising from scratch.
Research Evidence
Decision Transformer (Chen et al., 2021)
Paper: arXiv 2106.01345
In brief: The paper that started it all. Instead of training an RL agent with value functions and policy gradients, just frame the problem as sequence prediction. Feed the transformer trajectories of (return-to-go, state, action) and let it predict the next action conditioned on the desired return. The transformer learns a policy by modeling sequences—no Bellman equations needed.
Why it matters: Reframed RL as something transformers already do well: sequence modeling.
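The key data transformation in the paper is computing return-to-go, the sum of future rewards from each timestep, and interleaving it with states and actions. A toy sketch of that sequence construction (values are made up):

```python
# Sketch: building the (return-to-go, state, action) sequence that
# Decision Transformer is trained on. States/rewards are toy values.

def returns_to_go(rewards):
    """Suffix sums: R_t = sum of rewards from step t to the end."""
    rtg = []
    total = 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]

rtg = returns_to_go(rewards)  # [3.0, 2.0, 2.0]
# Interleave into the token sequence the transformer models:
sequence = [tok for triple in zip(rtg, states, actions) for tok in triple]
print(sequence)
```

At inference time you set the initial return-to-go to the return you want, and the model predicts actions consistent with achieving it. That is what "conditioned on the desired return" means in practice.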
Transformers Learn TD Methods (Wang et al., ICLR 2025)
Paper: arXiv 2405.13861
In brief: This paper shows that transformers don’t just pattern-match on trajectories—they actually approximate temporal-difference (TD) learning algorithms during the forward pass. The model internally implements something resembling TD updates, without being explicitly trained to do so.
Why it matters: Transformers aren’t just memorizing trajectories. They’re learning the underlying RL algorithm in-context.
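For reference, this is the tabular TD(0) update the paper argues transformers approximate in their forward pass, shown on a toy two-state chain (the environment is invented for illustration):

```python
# Sketch: tabular TD(0) value update -- the algorithm the paper shows
# transformers implicitly implement in-context.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))"""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * td_error
    return V

V = {"A": 0.0, "B": 0.0}
# Repeatedly observe the transition A -> B with reward 1:
for _ in range(100):
    V = td0_update(V, "A", 1.0, "B")
print(V["A"])  # converges toward 1.0, since V(B) stays 0
```

The striking claim is that nothing like this loop is written anywhere in the transformer; the update emerges inside the attention layers from training on trajectory data.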
OmniRL (2025)
Paper: arXiv 2502.02869
In brief: OmniRL proposes a transformer architecture that emulates actor-critic RL in-context, improving decision quality across multiple tasks. Rather than specializing in one environment, the model generalizes its in-context RL capabilities across diverse settings.
Why it matters: ICRL isn’t limited to one task—it scales across environments.
Reflexion (Shinn et al., NeurIPS 2023)
Paper: arXiv 2303.11366
In brief: Reflexion takes a different angle: instead of embedding raw trajectories, the agent generates verbal reflections on its failures and successes. These self-critiques are stored and injected into future prompts. The agent learns from its own written analysis of what went wrong.
Why it matters: Shows that trajectory-based learning doesn’t require structured (state, action, reward) tuples—natural language reflections work too.
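The Reflexion loop can be sketched in a few lines. This is a simplified illustration, not the paper's code: the stored reflection text and prompt wording are stand-ins.

```python
# Sketch: a Reflexion-style memory. After a failed attempt, the agent
# stores a natural-language self-critique and prepends it next time.

class ReflexionMemory:
    def __init__(self):
        self.reflections = []

    def add(self, reflection):
        self.reflections.append(reflection)

    def as_prompt_prefix(self):
        """Turn accumulated reflections into a prompt preamble."""
        if not self.reflections:
            return ""
        notes = "\n".join(f"- {r}" for r in self.reflections)
        return f"Lessons from earlier attempts:\n{notes}\n\n"

memory = ReflexionMemory()
memory.add("Attempt 1 failed: forgot to validate the output schema.")
prompt = memory.as_prompt_prefix() + "Task: generate the report again."
print(prompt)
```

The reflections play the role of the reward signal: they tell the model which behaviors to avoid, in plain language rather than numeric rewards.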
Voyager (Wang et al., 2023)
Paper: arXiv 2305.16291
In brief: An open-ended Minecraft agent that builds a skill library from successful code executions. When Voyager solves a task, it stores the working code as a reusable skill. Future tasks can retrieve and compose these skills. The agent explores, learns, and accumulates capabilities without any weight updates.
Why it matters: Demonstrates ICRL at scale—an agent that gets better over time by accumulating proven solutions.
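The skill-library mechanism can be sketched as a store-and-retrieve loop. This toy version uses naive keyword overlap for retrieval, where Voyager uses embedding-based search; the skill names and code strings are invented.

```python
# Sketch: a Voyager-style skill library. Code that solved a task is
# stored under a description and retrieved for similar tasks later.

class SkillLibrary:
    def __init__(self):
        self.skills = {}  # description -> code string

    def add(self, description, code):
        self.skills[description] = code

    def retrieve(self, query, top_k=1):
        """Rank stored skills by word overlap with the query (toy metric)."""
        def overlap(desc):
            return len(set(desc.lower().split()) & set(query.lower().split()))
        ranked = sorted(self.skills, key=overlap, reverse=True)
        return [(d, self.skills[d]) for d in ranked[:top_k]]

lib = SkillLibrary()
lib.add("craft wooden pickaxe", "def craft_pickaxe(bot): ...")
lib.add("mine iron ore", "def mine_iron(bot): ...")
print(lib.retrieve("craft a stone pickaxe"))
```

Because retrieved skills are injected into the prompt rather than fine-tuned into weights, the library is the agent's entire long-term memory, and it grows with every solved task.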
Papers
| Date | Paper | Link |
|---|---|---|
| Jun 2021 | Decision Transformer: RL via Sequence Modeling | arXiv 2106.01345 |
| Mar 2023 | Reflexion: Language Agents with Verbal Reinforcement Learning | arXiv 2303.11366 |
| May 2023 | Voyager: Open-Ended Embodied Agent with LLMs | arXiv 2305.16291 |
| May 2024 | Transformers Learn TD Methods for In-Context RL | arXiv 2405.13861 |
| Feb 2025 | OmniRL: In-Context RL Across Multiple Tasks | arXiv 2502.02869 |
Practical Application: agentrail-rs
This isn’t just theory. I’m building agentrail-rs to apply ICRL to AI coding agents used for non-coding tasks—TTS generation, video compositing, file manipulation. The tool records trajectories (state, action, result, reward) and injects successful examples into future agent prompts. Early days, but the research says this should work.
See Saw (2/?): agentrail-rs for more on the engineering side.
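The recording-and-injection idea can be sketched as follows. agentrail-rs itself is written in Rust; this Python sketch only illustrates the data flow, and the field names, example tasks, and reward threshold are my assumptions, not the tool's actual schema.

```python
# Sketch of the trajectory-recording idea behind agentrail-rs (illustrative
# Python; the real tool is Rust, and all field names here are assumptions).

import json

def record_step(log, state, action, result, reward):
    """Append one (state, action, result, reward) step to the trajectory log."""
    log.append({"state": state, "action": action,
                "result": result, "reward": reward})

def successful_examples(log, threshold=0.0):
    """Select steps worth reinjecting into future agent prompts."""
    return [step for step in log if step["reward"] > threshold]

log = []
record_step(log, "need speech audio", "ran TTS with explicit voice flag", "ok", 1.0)
record_step(log, "need speech audio", "ran TTS with defaults", "wrong voice", -1.0)

prompt_examples = json.dumps(successful_examples(log), indent=2)
print(prompt_examples)
```

Filtering on reward before injection is the ICRL step: the agent's future prompts contain proven actions, not a raw transcript of everything it tried.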
Key Takeaways
| Concept | One-liner |
|---|---|
| ICRL | Learn RL policies from trajectory examples in the prompt |
| No Weight Updates | The transformer adapts during inference, not training |
| TD in the Forward Pass | Transformers approximate RL algorithms internally |
| Verbal Reflection | Natural language self-critique works as trajectory data |
| Skill Libraries | Accumulate proven solutions for reuse across sessions |
In-Context RL turns agents from amnesiacs into learners. Follow for more ML Frontier episodes exploring research at the edge.
Part 2 of the Machine Learning Frontier series. View all parts | Next: Part 3 →
Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.
