# AI Tools (1/?): XSkill --- A Memory Layer for Multimodal Agents

| Resource | Link |
|---|---|
| Paper | arXiv 2603.12056 |
| Project | XSkill Project Page |
| Code | GitHub (MIT) |
| Video | XSkill: Memory Layer |
| Comments | Discord |

## The Problem: Isolated Episodes
Multimodal agents solve complex visual and tool-heavy tasks, but each run starts from scratch. An agent might figure out a multi-step workflow for extracting color data from an image—only to lose that knowledge when the next task begins. Useful lessons evaporate between sessions.

## Two Kinds of Memory
XSkill introduces a dual-memory architecture that separates strategic knowledge from tactical knowledge:
**Skills** are structured Markdown documents containing workflows and tool templates for a class of tasks. A skill says: here is the overall approach for this kind of problem.
**Experiences** are smaller tactical lessons with triggering conditions, recommended actions, and failure notes. An experience says: when this specific pattern appears, use this tactic instead of guessing.
That split matters. Ablation analysis shows that removing either type hurts performance—skills alone aren’t enough, and experiences alone aren’t enough.
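The split is easiest to see as two record types. This is a minimal sketch, not the paper's actual schema: the class and field names (`Skill`, `Experience`, `workflow`, `trigger`, and so on) are my assumptions, chosen to mirror the descriptions above.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Strategic memory: a reusable workflow for a class of tasks."""
    name: str
    task_class: str
    workflow: list[str]                      # ordered high-level steps
    tool_templates: dict[str, str] = field(default_factory=dict)

@dataclass
class Experience:
    """Tactical memory: a lesson bound to a triggering condition."""
    trigger: str        # when this specific pattern appears...
    action: str         # ...use this tactic instead of guessing
    failure_note: str = ""  # what went wrong when it was ignored

# Hypothetical entries matching the color-extraction example later in the post.
skill = Skill(
    name="region-color-extraction",
    task_class="visual color queries",
    workflow=["locate text", "isolate region", "sample pixels", "infer color"],
)
exp = Experience(
    trigger="question asks for the color behind text in an image",
    action="sample pixels with the code interpreter instead of guessing",
)
```

The point of the two types is retrieval granularity: a skill is fetched once per task class, while many small experiences can fire on narrow triggers within a single rollout.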

## Two-Phase Framework
The framework operates in a loop:
Phase 1 — Accumulation. After completing rollout batches, the agent reviews past trajectories through visually-grounded summarization, cross-rollout critique, and hierarchical consolidation. This produces skill documents and experience items stored in persistent banks.
Phase 2 — Inference. Given a new task, the agent decomposes it, retrieves relevant skills and experiences via semantic search, adapts them to the current visual context, and injects them into the system prompt.
The key claim: agents improve through memory accumulation and retrieval, not parameter updates. No fine-tuning required.
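The inference phase can be sketched in a few lines. To stay self-contained, this toy version ranks memory items by word overlap with the task; the paper's actual retrieval is embedding-based semantic search, and the function names here (`retrieve`, `build_prompt`) are my own.

```python
def retrieve(query: str, bank: list[str], k: int = 2) -> list[str]:
    """Rank memory items by word overlap with the task description.
    A crude stand-in for embedding-based semantic search."""
    q = set(query.lower().split())
    scored = sorted(bank, key=lambda item: -len(q & set(item.lower().split())))
    return scored[:k]

def build_prompt(task: str, skills: list[str], experiences: list[str]) -> str:
    """Inject the top-ranked skills and experiences into the system prompt."""
    retrieved = retrieve(task, skills) + retrieve(task, experiences)
    memory = "\n".join(f"- {m}" for m in retrieved)
    return f"Relevant memory:\n{memory}\n\nTask: {task}"

# Hypothetical memory banks accumulated in Phase 1.
skills = [
    "color extraction: locate text, isolate region, sample pixels, infer color",
    "web search: decompose query, search, cross-check sources",
]
experiences = [
    "if asked for a color behind text, measure pixels instead of guessing",
]
prompt = build_prompt("find the color behind the caption", skills, experiences)
```

Because the improvement lives entirely in these text banks and this prompt-assembly step, swapping the backbone model requires no retraining, which is what makes the "no fine-tuning required" claim work.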

## Results
XSkill is evaluated across five benchmarks spanning visual tool use (VisualToolBench, TIR-Bench), multimodal search (MMSearch-Plus, MMBrowseComp), and comprehensive agent tasks (AgentVista):
| Backbone | Avg@4 | Pass@4 |
|---|---|---|
| Gemini-3-Flash + XSkill | 40.34 | 58.95 |
| Gemini-2.5-Pro + XSkill | 28.63 | 46.38 |
| o4-mini + XSkill | 23.72 | 39.07 |
| GPT-4o-mini + XSkill | 23.19 | 38.90 |
XSkill yields average gains of 2.6 to 6.7 points over memory baselines (Agent Workflow Memory, Dynamic CheatSheet, Agent-KB), and performance improves consistently as the rollout count grows from 1 to 4.
Practical impact: syntax errors drop from 20.3% to 11.4% with skills, and tool name errors fall from 2.85% to 0.32%.

## Concrete Example
A visual task requires identifying the color of a region behind specific text in an image. Without memory, the agent guesses. With XSkill, it retrieves a structured workflow: locate the text, isolate the region, sample pixels via code interpreter, and infer the color from actual data. Code interpreter usage increases from 66.6% to 77.0% on VisualToolBench—the agent learns to measure instead of guess.
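The "sample pixels" step of that workflow is simple to implement. Here is a dependency-free sketch over a row-major grid of RGB tuples (a real agent would crop an actual image, e.g. with Pillow); the function name and the toy image are mine, not from the paper.

```python
from collections import Counter

def dominant_color(pixels, box):
    """Return the most common RGB tuple inside box=(left, top, right, bottom)
    of a row-major pixel grid -- measuring the region instead of guessing."""
    left, top, right, bottom = box
    samples = [pixels[y][x] for y in range(top, bottom) for x in range(left, right)]
    return Counter(samples).most_common(1)[0][0]

# Hypothetical 8x8 image: a white top stripe over a red background.
RED, WHITE = (200, 30, 30), (255, 255, 255)
grid = [[WHITE if y == 0 else RED for x in range(8)] for y in range(8)]
print(dominant_color(grid, (1, 0, 7, 7)))  # (200, 30, 30)
```

Counting actual pixel values is exactly the "measure instead of guess" behavior the retrieved skill encodes.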

## Why This Matters
XSkill sits at the intersection of agents, tools, multimodal reasoning, and continual improvement. The practical takeaway isn’t just that memory helps—it’s that different kinds of memory help in different ways. Strategic workflows and situational tactics serve complementary roles.
Not a bigger model. A smarter memory layer.

## References
| Reference | Link |
|---|---|
| XSkill paper | arXiv 2603.12056 |
| Project page | xskill-agent.github.io |
| GitHub repo (MIT) | XSkill-Agent/XSkill |
Part 1 of the AI Tools series.
Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.
