Most AI agents can use tools. Far fewer can remember how to use them better next time. XSkill addresses that gap with a structured memory layer that accumulates know-how from past runs—without retraining the model.

| Resource | Link |
| --- | --- |
| Paper | arXiv 2603.12056 |
| Project | XSkill Project Page |
| Code | GitHub (MIT) |
| Video | XSkill: Memory Layer |
| Comments | Discord |

The Problem: Isolated Episodes

Multimodal agents solve complex visual and tool-heavy tasks, but each run starts from scratch. An agent might figure out a multi-step workflow for extracting color data from an image—only to lose that knowledge when the next task begins. Useful lessons evaporate between sessions.

Two Kinds of Memory

XSkill introduces a dual-memory architecture that separates strategic knowledge from tactical knowledge:

Skills are structured Markdown documents containing workflows and tool templates for a class of tasks. A skill says: here is the overall approach for this kind of problem.

Experiences are smaller tactical lessons with triggering conditions, recommended actions, and failure notes. An experience says: when this specific pattern appears, use this tactic instead of guessing.

That split matters. Ablation analysis shows that removing either type hurts performance—skills alone aren’t enough, and experiences alone aren’t enough.
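The two memory types can be pictured as two small record schemas. The sketch below is illustrative only: the field names (`workflow_md`, `trigger`, `failure_notes`, etc.) are assumptions, not the paper's actual schema, which stores skills as Markdown documents and experiences as trigger/action/failure triples.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Strategic memory: a Markdown workflow for a class of tasks."""
    name: str
    task_class: str
    workflow_md: str                                  # step-by-step approach
    tool_templates: list[str] = field(default_factory=list)

@dataclass
class Experience:
    """Tactical memory: a situational lesson with a triggering condition."""
    trigger: str                                      # pattern that activates it
    action: str                                       # recommended tactic
    failure_notes: str = ""                           # what goes wrong otherwise

# Example entries for the color-extraction task discussed later:
skill = Skill(
    name="color-extraction",
    task_class="visual color queries",
    workflow_md="1. Locate text\n2. Isolate region\n3. Sample pixels",
    tool_templates=["code_interpreter"],
)
exp = Experience(
    trigger="question asks for a color behind text",
    action="sample pixels with the code interpreter instead of guessing",
)
```

The split is visible in the types themselves: a `Skill` describes a whole class of tasks, while an `Experience` fires only when its trigger pattern appears.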

Two-Phase Framework

The framework operates in a loop:

Phase 1 — Accumulation. After each batch of rollouts, the agent reviews its past trajectories through visually grounded summarization, cross-rollout critique, and hierarchical consolidation. This produces skill documents and experience items stored in persistent banks.

Phase 2 — Inference. Given a new task, the agent decomposes it, retrieves relevant skills and experiences via semantic search, adapts them to the current visual context, and injects them into the system prompt.

The key claim: agents improve through memory accumulation and retrieval, not parameter updates. No fine-tuning required.

Results

Evaluated across five benchmarks spanning visual tool use (VisualToolBench, TIR-Bench), multimodal search (MMSearch-Plus, MMBrowseComp), and comprehensive agent tasks (AgentVista):

| Backbone | Avg@4 | Pass@4 |
| --- | --- | --- |
| Gemini-3-Flash + XSkill | 40.34 | 58.95 |
| Gemini-2.5-Pro + XSkill | 28.63 | 46.38 |
| o4-mini + XSkill | 23.72 | 39.07 |
| GPT-4o-mini + XSkill | 23.19 | 38.90 |

XSkill delivers average gains of 2.6 to 6.7 points over memory baselines (Agent Workflow Memory, Dynamic CheatSheet, Agent-KB), and performance consistently improves as the rollout count increases from 1 to 4.

Practical impact: with skills in memory, syntax errors drop from 20.3% to 11.4%, and tool-name errors fall from 2.85% to 0.32%.

Concrete Example

A visual task requires identifying the color of a region behind specific text in an image. Without memory, the agent guesses. With XSkill, it retrieves a structured workflow: locate the text, isolate the region, sample pixels via code interpreter, and infer the color from actual data. Code interpreter usage increases from 66.6% to 77.0% on VisualToolBench—the agent learns to measure instead of guess.
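The "measure instead of guess" step can be made concrete. Below is a toy version that samples pixels in a region and names the dominant channel; the image is a synthetic RGB grid, whereas a real agent would load the actual image (e.g. with PIL) inside its code interpreter. The function name `dominant_color` is illustrative.

```python
def dominant_color(pixels: list[list[tuple[int, int, int]]],
                   box: tuple[int, int, int, int]) -> str:
    """Average RGB over box=(x0, y0, x1, y1), then name the strongest channel."""
    x0, y0, x1, y1 = box
    samples = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    n = len(samples)
    avg = tuple(sum(p[c] for p in samples) / n for c in range(3))
    return ("red", "green", "blue")[max(range(3), key=lambda c: avg[c])]

# A 4x4 synthetic image: top half mostly blue, bottom half mostly red.
img = ([[(10, 20, 200)] * 4 for _ in range(2)]
       + [[(200, 20, 10)] * 4 for _ in range(2)])
print(dominant_color(img, (0, 0, 2, 2)))  # → blue
```

The point of the retrieved workflow is exactly this shift: the answer comes from sampled pixel data, not from the model's prior about what color "should" be there.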

Why This Matters

XSkill sits at the intersection of agents, tools, multimodal reasoning, and continual improvement. The practical takeaway isn’t just that memory helps—it’s that different kinds of memory help in different ways. Strategic workflows and situational tactics serve complementary roles.

Not a bigger model. A smarter memory layer.


References

| Reference | Link |
| --- | --- |
| XSkill paper | arXiv 2603.12056 |
| Project page | xskill-agent.github.io |
| GitHub repo (MIT) | XSkill-Agent/XSkill |