Multi-Hop Reasoning (2/2): The Distribution Trap
796 words • 4 min read

In Part 1, a tiny 135M model achieved 75% accuracy on multi-hop reasoning. This time we scale up to 360M and discover that rejection-sampling fine-tuning (RSFT) on easy examples makes performance worse.
| Resource | Link |
|---|---|
| Paper | KG-Guided RAG (arXiv) |
| Code | multi-hop-reasoning |
| ELI5 | eli5.md |
| Demo | Live Demo |
| Explainer | Coming soon |
Scaling Up: SmolLM-360M
Part 1 used the 135M model. For better reasoning traces and demo quality, we trained the 360M variant:
| Model | Parameters | Platform |
|---|---|---|
| SmolLM-135M-Instruct | 135M | MLX (macOS) |
| SmolLM-360M-Instruct | 360M | MLX + Unsloth (cross-platform) |
The 360M model produces more coherent traces and is used by the live inference demo.
The Distribution Trap
Here’s what happened when we ran RSFT on the “easy” training data:
| Phase | Training Data | Accuracy | Notes |
|---|---|---|---|
| Base | — | 0% | No format compliance |
| SFT (500 iters) | Easy (1-3 hop) | 37% | Learns TRACE + ANSWER format |
| RSFT | Easy (1-3 hop) | 27% | Worse than SFT! |
RSFT on easy examples performed worse than the SFT baseline.
Why?
The training examples (1-3 hops) don’t match the evaluation distribution (4-5 hops). The model learns shortcuts that work on easy problems but fail on hard ones.
| Training Distribution | Eval Distribution | Result |
|---|---|---|
| Easy (1-3 hop) | Hard (4-5 hop) | 27% (worse) |
| Hard (4-5 hop) | Hard (4-5 hop) | 75% (Part 1 result) |
The rejection sampling “winners” from easy examples teach strategies that don’t generalize.
The Key Finding
Rejection sampling must match your target distribution.
This is counterintuitive. You might expect that training on more examples (even easy ones) would help. Instead:
- Easy winners use shortcuts (fewer reasoning steps)
- Hard eval requires full chain reasoning
- Model learns the wrong patterns
The fix: train RSFT on eval.jsonl (hard examples), not train.jsonl (easy examples).
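The selection step at the heart of RSFT is simple to sketch. The snippet below is a minimal illustration, not the repo’s actual `core.rsft` implementation: sample k completions per question, keep only the ones the scorer accepts, and fine-tune on the survivors. The `sample` and `score` stand-ins here are hypothetical toys.

```python
import random

def select_winners(question, sample_fn, score_fn, k=8):
    """Sample k completions and keep only the ones the scorer accepts."""
    completions = [sample_fn(question) for _ in range(k)]
    return [c for c in completions if score_fn(question, c)]

# Toy stand-ins: a "model" that guesses A or B, and a scorer that only
# accepts the known-correct answer. In the real pipeline the scorer
# checks the reasoning trace against the knowledge graph.
random.seed(0)
sample = lambda q: random.choice(["ANSWER: A", "ANSWER: B"])
score = lambda q, c: c.endswith("B")

winners = select_winners("What causes TLSHandshakeError?", sample, score)
print(f"kept {len(winners)} of 8 samples")
```

The trap is in *which* questions you run this loop on: winners harvested from 1–3 hop questions are valid answers to easy problems, so the fine-tuning set quietly drifts away from the 4–5 hop target distribution.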
Demo Improvements
The demo now includes four interactive tabs:
| Tab | Feature |
|---|---|
| Training | Animated SFT→RSFT visualization with KG scoring |
| Inference | Pre-recorded inference examples |
| Try It | Live inference with 360M model |
| Distribution | Interactive visualization of the key finding |
Try It: Live Inference
Ask DevOps troubleshooting questions and watch the model reason:
```
Question: What causes TLSHandshakeError?

TRACE: TLSHandshakeError is caused by ClockSkew,
and ClockSkew leads to CertificateExpired,
and CertificateExpired is fixed by RenewCert...
ANSWER: B
```
The knowledge graph scores the reasoning path during training, but at inference the model reasons independently.
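As a rough sketch of what “the knowledge graph scores the reasoning path” means (the real scorer lives in the repo; the edges here are invented to match the example above): a trace is accepted only if every hop it claims exists as an edge in the graph.

```python
# A trace is a list of (subject, relation, object) hops; it scores as
# valid only if every hop is an actual edge in the knowledge graph.
def score_trace(kg_edges, hops):
    return all(hop in kg_edges for hop in hops)

kg_edges = {
    ("TLSHandshakeError", "caused_by", "ClockSkew"),
    ("ClockSkew", "leads_to", "CertificateExpired"),
    ("CertificateExpired", "fixed_by", "RenewCert"),
}
good = [("TLSHandshakeError", "caused_by", "ClockSkew"),
        ("ClockSkew", "leads_to", "CertificateExpired")]
bad = good + [("CertificateExpired", "fixed_by", "RebootHost")]  # edge not in KG

print(score_trace(kg_edges, good), score_trace(kg_edges, bad))  # True False
```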
Cross-Platform Support
The pipeline now runs on both platforms:
| Platform | Framework | Command |
|---|---|---|
| macOS (Apple Silicon) | MLX | make train-360m |
| Linux (NVIDIA CUDA) | Unsloth | make train-360m-unsloth |
Unsloth provides 2x faster training with 60% less memory on NVIDIA GPUs.
Current Status
| Component | Status |
|---|---|
| SFT training (360M) | Complete |
| RSFT (wrong distribution) | Complete (27%) |
| RSFT (correct distribution) | Next step |
| Live demo with Try It | Complete |
| Cross-platform support | Complete |
Next Steps
| Priority | Task | Expected Result |
|---|---|---|
| High | Retrain RSFT on eval.jsonl | 75%+ accuracy |
| Medium | Update demo to use corrected model | Better live inference |
| Medium | Curriculum learning (easy→hard) | Smoother training |
| Low | Larger models (1B+) | Higher ceiling |
The corrected RSFT training (note that `--examples` now points at the hard set):

```bash
python3 -m core.rsft \
  --examples data/eval.jsonl \
  --kg data/kg.json \
  --sft-adapter data/runs/run_360m/models/sft \
  --output data/runs/run_360m/models/rsft_eval \
  --model HuggingFaceTB/SmolLM-360M-Instruct \
  --k-samples 8 \
  --max-examples 50
```
Lessons Learned
1. Distribution Matching is Non-Negotiable
This isn’t a minor optimization—it’s the difference between 27% and 75% accuracy. Wrong distribution = wrong winners = wrong model.
2. Easy Examples Can Hurt
More training data isn’t always better. Easy examples teach shortcuts that fail on hard problems.
3. Verify Your Pipeline
We trained a full RSFT model before realizing the distribution mismatch. Always check that training data matches eval distribution.
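One cheap check is to histogram hop depths in both files before training. The sketch below assumes each JSONL row carries a `hops` field (the repo’s actual schema may differ) and flags depths that appear in eval but never in training:

```python
import json
import os
import tempfile
from collections import Counter

def hop_histogram(path):
    """Count examples per hop depth in a JSONL file with a 'hops' field."""
    with open(path) as f:
        return Counter(json.loads(line)["hops"] for line in f if line.strip())

def write_jsonl(rows):
    f = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
    for row in rows:
        f.write(json.dumps(row) + "\n")
    f.close()
    return f.name

# Toy files standing in for train.jsonl (easy) and eval.jsonl (hard).
train_path = write_jsonl([{"hops": h} for h in (1, 2, 2, 3)])
eval_path = write_jsonl([{"hops": h} for h in (4, 4, 5)])

train_hist, eval_hist = hop_histogram(train_path), hop_histogram(eval_path)
mismatch = set(eval_hist) - set(train_hist)
print(sorted(train_hist), sorted(eval_hist))  # [1, 2, 3] [4, 5]
print(sorted(mismatch))  # hop depths the model never saw in training

os.remove(train_path)
os.remove(eval_path)
```

A non-empty `mismatch` set is exactly the symptom described above: the model is evaluated on hop depths it was never trained on.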
4. The Fix is Simple
Once identified, the fix is a one-flag change: `--examples data/eval.jsonl` instead of `train.jsonl`.
Resources
- Repository: multi-hop-reasoning
- Live Demo
- Part 1: Training Wheels for Small LLMs
- Paper: Knowledge Graph-Guided RAG
- Training Status
Training distribution matters. Easy examples teach easy shortcuts.
Part 2 of the Multi-Hop Reasoning series. View all parts