-
457 words • 3 min read
Five ML Concepts - #29

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #29 |
References
| Concept | Reference |
| --- | --- |
| Neural Collapse | Prevalence of Neural Collapse (Papyan et al. 2020) |
| Grokking | Grokking: Generalization Beyond Overfitting (Power et al. 2022) |
| SAM | Sharpness-Aware Minimization (Foret et al. 2021) |
| Mechanistic Interpretability | Transformer Circuits (Anthropic 2021) |
| Self-Training Instability | Understanding Self-Training (Wei et al. 2020) |

Today’s Five
1. Neural Collapse
In overparameterized networks trained to zero loss, class representations converge late in training to a symmetric, maximally separated structure. The last-layer features and classifiers align into a simplex equiangular tight frame.
This geometric phenomenon appears consistently across architectures and datasets.
Like students settling into evenly spaced seats by the end of class.
2. Grokking
In some tasks, especially small algorithmic ones, models memorize quickly but only later suddenly generalize. The jump from memorization to understanding can happen long after training loss reaches zero.
Weight decay and longer training appear necessary for this phase transition.
Like cramming facts for an exam, then later realizing you truly understand.
3. SAM (Sharpness-Aware Minimization)
Instead of minimizing loss at a single point, SAM minimizes loss under small weight perturbations, finding flatter regions. Flatter minima tend to generalize better than sharp ones.
The optimizer seeks robustness to parameter noise.
Like choosing a wide hilltop instead of balancing on a sharp peak.
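As a toy illustration (not the full batched algorithm), here is a NumPy sketch of the two-step SAM update on a simple quadratic loss; the `lr` and `rho` values are illustrative:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One simplified SAM step:
    1. Ascend to the worst-case nearby point w + e, with e = rho * g / ||g||.
    2. Descend using the gradient evaluated at that perturbed point."""
    g = grad_fn(w)
    e = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case perturbation
    g_sharp = grad_fn(w + e)                   # gradient at perturbed weights
    return w - lr * g_sharp

# Toy loss L(w) = 0.5 * ||w||^2, so grad(w) = w
grad = lambda w: w
w = np.array([2.0, -1.0])
for _ in range(100):
    w = sam_step(w, grad)
print(w)  # settles near the minimum at 0
```

The perturbed gradient steers updates away from regions where a small weight shift would spike the loss.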
4. Mechanistic Interpretability
Researchers analyze activations and internal circuits to understand how specific computations are implemented inside models. The goal is reverse-engineering neural networks into understandable components.
This reveals attention heads, induction heads, and other interpretable patterns.
Like mapping the wiring of an unknown machine to see how it works.
5. Self-Training Instability
When models train on their own generated data, feedback loops can amplify small errors over time. Each iteration compounds mistakes, causing distributional drift.
Careful filtering and external grounding help mitigate this.
Like copying a copy repeatedly until the meaning drifts.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Neural Collapse | Late-stage geometric convergence of class representations |
| Grokking | Sudden generalization after prolonged memorization |
| SAM | Optimizing for flat loss regions under perturbations |
| Mechanistic Interpretability | Analyzing internal circuits of neural networks |
| Self-Training Instability | Feedback loops that amplify errors in self-generated data |
Short, accurate ML explainers. Follow for more.
-
443 words • 3 min read
Five ML Concepts - #28

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #28 |
References
| Concept | Reference |
| --- | --- |
| Lottery Ticket Hypothesis | The Lottery Ticket Hypothesis (Frankle & Carbin 2019) |
| Sparse Activation | Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017) |
| Conditional Computation | Sparsely-Gated MoE + Switch Transformers |
| Inference Parallelism | Megatron-LM (Shoeybi et al. 2019) |
| Compute Optimality | Chinchilla Scaling Laws (Hoffmann et al. 2022) |

Today’s Five
1. Lottery Ticket Hypothesis
Large neural networks contain smaller subnetworks that, when trained from the right initialization, achieve similar performance. These “winning tickets” exist before training begins.
The key insight: you can find and train just the winning subnetwork.
Like finding a winning lottery ticket hidden among many losing ones.
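A minimal NumPy sketch of the core recipe, magnitude pruning plus rewinding survivors to their original initialization; the random multiplier is a placeholder for actual training:

```python
import numpy as np

def winning_ticket_mask(trained_w, sparsity=0.8):
    """Keep the largest-magnitude trained weights; prune the rest."""
    k = int(trained_w.size * (1 - sparsity))              # weights to keep
    threshold = np.sort(np.abs(trained_w).ravel())[::-1][k - 1]
    return (np.abs(trained_w) >= threshold).astype(float)

rng = np.random.default_rng(0)
w_init = rng.normal(size=(4, 4))                          # original initialization
w_trained = w_init * rng.uniform(0.1, 2.0, size=(4, 4))   # stand-in for training
mask = winning_ticket_mask(w_trained, sparsity=0.75)
ticket = w_init * mask   # rewind survivors to their ORIGINAL init values
print(int(mask.sum()))   # 4 of 16 weights survive
```

Retraining `ticket` with the mask held fixed is what tests whether the subnetwork is a "winner".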
2. Sparse Activation
Only a subset of neurons activate for each input, even in models with many parameters. This allows large capacity without using everything at once.
Mixture-of-experts architectures explicitly design for this pattern.
Like a library where only relevant books light up for each query.
3. Conditional Computation
The model dynamically activates only certain components depending on the input. Different inputs route to different experts or pathways.
This improves efficiency and scalability without proportional compute increase.
Like routing patients to the right specialist instead of seeing every doctor.
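A toy NumPy sketch of top-k gating, the routing mechanism behind sparse activation and conditional computation; the gate matrix here is random and purely illustrative:

```python
import numpy as np

def top_k_route(x, gate_w, k=2):
    """Route input x to the k experts with the highest gate scores.
    Returns the chosen expert indices and their softmax weights."""
    scores = x @ gate_w                     # one score per expert
    top = np.argsort(scores)[::-1][:k]      # indices of the k best experts
    logits = scores[top]
    weights = np.exp(logits - logits.max())
    return top, weights / weights.sum()

rng = np.random.default_rng(1)
gate = rng.normal(size=(8, 4))              # 8-dim input, 4 experts
experts, weights = top_k_route(rng.normal(size=8), gate, k=2)
print(experts, weights)                     # only 2 of 4 experts activate
```

Only the selected experts run their forward pass, so compute per token stays roughly constant as the expert count grows.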
4. Inference Parallelism
Model execution can be split across multiple devices to reduce latency or increase throughput. Tensor parallelism splits layers; pipeline parallelism splits stages.
Essential for serving large models in production.
Like dividing a puzzle so multiple people work on it simultaneously.
5. Compute Optimality Hypothesis
Empirical scaling laws suggest performance improves when model size, data, and compute are balanced. Adding only one resource may not yield optimal gains.
Chinchilla showed many models were undertrained relative to their size.
Like baking a cake where proportions matter more than just adding extra ingredients.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Lottery Ticket Hypothesis | Small winning subnetworks hidden in large models |
| Sparse Activation | Using only part of a model per input |
| Conditional Computation | Dynamically routing inputs for efficiency |
| Inference Parallelism | Distributing inference across devices |
| Compute Optimality | Balancing model size, data, and compute |
Short, accurate ML explainers. Follow for more.
-
894 words • 5 min read
How AI Learns Part 7: Designing a Continuous Learning Agent

| Resource | Link |
| --- | --- |
| Related | RLM \| Engram \| Sleepy Coder |

The Layered Architecture

Continuous learning is layered coordination.

Layer by Layer
Layer 4: Core Weights (Bottom)
The foundation. Trained once, changed rarely.
| Aspect | Details |
| --- | --- |
| Contains | General reasoning, language, base knowledge |
| Update frequency | Months or never |
| Update method | Full fine-tune or major consolidation |
| Risk of change | High (forgetting, capability shifts) |

Rule: Don’t touch this unless you have a very good reason.
Layer 3: Adapters (Parameter-Efficient Fine-Tuning (PEFT) / Low-Rank Adaptation (LoRA))
Modular skills that plug into the base.
| Aspect | Details |
| --- | --- |
| Contains | Task-specific capabilities |
| Update frequency | Weekly to monthly |
| Update method | Lightweight PEFT training |
| Risk of change | Medium (isolated, but validate) |

Rule: Train adapters for validated, recurring patterns. Version them. Enable rollback.
Layer 2: External Memory
Facts, experiences, and retrieved knowledge.
| Aspect | Details |
| --- | --- |
| Contains | Documents, logs, structured data |
| Update frequency | Continuous |
| Update method | Database writes |
| Risk of change | Low (doesn’t affect weights) |

Rule: Store experiences here first. Memory is cheap and safe.
Layer 1: Context Manager (Top)
The RLM-style interface that rebuilds focus each step.
| Aspect | Details |
| --- | --- |
| Contains | Current context, retrieved data, active state |
| Update frequency | Per call |
| Update method | Reconstruction from memory + query |
| Risk of change | None (ephemeral) |

Rule: Don’t drag context forward. Rebuild it.
The Feedback Loop
Logging
Capture everything the agent does:
- Prompts received
- Actions taken
- Tool calls made
- Errors encountered
- User signals
This is your training data.
Evaluation
Before any update reaches production:
| Check | Purpose |
| --- | --- |
| Retention tests | Did old skills degrade? |
| Forward transfer | Did new skills improve? |
| Regression suite | Known failure cases |
| Safety checks | Harmful outputs? |

Without evaluation, you’re updating blind.
Deployment
Updates should be:
- Modular: Can isolate and rollback
- Versioned: Know what changed when
- Staged: Test before full rollout
- Monitored: Track post-deployment metrics
The Error Flow
Where do errors go?
Error occurs
↓ Log it (immediate)
↓ Store in memory (same day)
↓ Pattern emerges over multiple occurrences
↓ Train adapter update (weekly/monthly)
↓ Validate update (before deployment)
↓ Deploy with rollback capability

Errors feed into memory first. Only validated, recurring improvements reach adapters. Core weights almost never change.
What This Architecture Achieves
| Problem | Solution |
| --- | --- |
| Catastrophic forgetting | Core weights frozen; adapters isolated |
| Context rot | RLM rebuilds focus each step |
| Hallucination | Memory grounds responses |
| Slow adaptation | Memory updates continuously |
| Unsafe changes | Evaluation before deployment |

Design Principles
1. Separate Storage from Reasoning
Facts belong in memory. Reasoning belongs in weights. Don’t blur them.
2. Separate Speed from Permanence
Fast learning (memory) is temporary. Slow learning (weights) is permanent. Match the update speed to the desired permanence.
3. Evaluate Before Consolidating
Every update to adapters or weights must be validated. Regressions are silent killers.
4. Enable Rollback
Version everything. If an update causes problems, you must be able to undo it.
5. Log Everything
You cannot improve what you cannot measure. Structured logging is the foundation of continuous learning.
The Big Picture
AI does not learn in one place.
It learns in layers:
- Permanent (weights)
- Modular (adapters)
- External (memory)
- Temporary (context)
Continuous learning is not constant weight updates.
It is careful coordination across time scales.
Continuous learning systems don’t constantly retrain. They carefully consolidate what works.
References
| Concept | Paper |
| --- | --- |
| LoRA | LoRA: Low-Rank Adaptation (Hu et al. 2021) |
| RAG | Retrieval-Augmented Generation (Lewis et al. 2020) |
| RLM | Recursive Language Models (Zhou et al. 2024) |
| Share | Shared LoRA Subspaces (2025) |
| Engram | Engram: Conditional Memory (DeepSeek 2025) |

Series Summary
| Part | Key Insight |
| --- | --- |
| 1. Time Scales | Learning happens at different layers and speeds |
| 2. Forgetting vs Rot | Different failures need different fixes |
| 3. Weight-Based | Change the brain carefully |
| 4. Memory-Based | Store facts outside the brain |
| 5. Context & RLM | Rebuild focus instead of dragging baggage |
| 6. Continuous Learning | Learn in memory, consolidate in weights |
| 7. Full Architecture | Layered coordination enables safe improvement |
Continuous learning is layered coordination.
-
419 words • 3 min read
Five ML Concepts - #27

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #27 |
References
| Concept | Reference |
| --- | --- |
| Elastic Weight Consolidation | Overcoming catastrophic forgetting (Kirkpatrick et al. 2017) |
| Replay Buffers | Experience Replay for Continual Learning (Rolnick et al. 2019) |
| Parameter Routing | Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017) |
| Memory-Augmented Networks | Neural Turing Machines (Graves et al. 2014) |
| Model Editing | Editing Large Language Models (Yao et al. 2023) |

Today’s Five
1. Elastic Weight Consolidation
Adding a penalty that discourages changing parameters important to previous tasks. Importance is estimated using Fisher information from prior training.
This helps models learn new tasks without catastrophic forgetting.
Like protecting well-worn neural pathways while building new ones.
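The penalty can be sketched in a few lines of NumPy; the Fisher values here are made up purely to show the effect:

```python
import numpy as np

def ewc_loss(task_loss, w, w_star, fisher, lam=100.0):
    """New-task loss plus a quadratic penalty anchoring important weights.
    fisher approximates each parameter's importance to the old task;
    w_star are the weights learned on that task."""
    penalty = 0.5 * lam * np.sum(fisher * (w - w_star) ** 2)
    return task_loss + penalty

w_star = np.array([1.0, -2.0])
fisher = np.array([5.0, 0.01])  # first weight matters, second barely does
w = np.array([1.5, 0.0])
loss = ewc_loss(0.0, w, w_star, fisher)
print(loss)  # ~64.5, dominated by moving the important weight
```

Unimportant parameters stay free to change, which is what preserves plasticity.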
2. Replay Buffers
Storing examples from earlier tasks and mixing them into new training. Past data is replayed alongside current examples during optimization.
This reinforces previous knowledge while learning new data.
Like reviewing old flashcards while studying new material.
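A minimal Python sketch using reservoir sampling to keep a bounded, uniform sample of the stream; real systems often add task-balanced or prioritized selection:

```python
import random

class ReplayBuffer:
    """Reservoir-style buffer: a bounded uniform sample of past examples."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)  # reservoir sampling
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))

buf = ReplayBuffer(capacity=100)
for x in range(10_000):                 # stream of "old task" examples
    buf.add(x)
mixed_batch = buf.sample(8) + ["new_example"] * 8   # replay old + new
print(len(buf.items), len(mixed_batch))             # 100 16
```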
3. Parameter Routing
Activating different subsets of model parameters depending on the task or input. Mixture-of-experts and conditional computation route inputs to specialized weights.
Enables specialization without fully separate models.
Like having different experts handle different questions.
4. Memory-Augmented Networks
Adding external memory modules that neural networks can read from and write to. The model learns to store and retrieve information during inference.
Extends beyond purely weight-based memory to explicit storage.
Like giving a calculator access to a notepad.
5. Model Editing
Targeted weight updates to modify specific behaviors without full retraining. Locate and adjust the parameters responsible for particular facts or behaviors.
Allows fast corrections and knowledge updates post-training.
Like editing a specific entry in an encyclopedia instead of rewriting the whole book.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Elastic Weight Consolidation | Protecting important parameters during new learning |
| Replay Buffers | Mixing past examples to prevent forgetting |
| Parameter Routing | Activating task-specific parameter subsets |
| Memory-Augmented Networks | External memory modules for neural networks |
| Model Editing | Targeted weight updates without full retraining |
Short, accurate ML explainers. Follow for more.
-
691 words • 4 min read
How AI Learns Part 6: Toward Continuous Learning

| Resource | Link |
| --- | --- |
| Related | Sleepy Coder Part 1 \| Sleepy Coder Part 2 |

The Continuous Learning Loop

Periodic consolidation, not constant updates.

The Core Tradeoff
| Goal | Description |
| --- | --- |
| Plasticity | Learn new things quickly |
| Stability | Retain old things reliably |

You cannot maximize both simultaneously. The art is in the balance.
Approaches to Continuous Learning
1. Replay-Based Methods
Keep (or synthesize) some old data. Periodically retrain on old + new.
How it works:
- Store representative examples from each task
- Mix old data into new training batches
- Periodically consolidate
Recent work: FOREVER adapts replay timing using “model-centric time” (based on optimizer update magnitude) rather than fixed training steps.
| Pros | Cons |
| --- | --- |
| Strong retention | Storage costs |
| Conceptually simple | Privacy concerns |
| Well-understood | Data governance complexity |

2. Replay-Free Regularization
Constrain weight updates to avoid interference, without storing old data.
Efficient Lifelong Learning Algorithm (ELLA) (Jan 2026): Regularizes updates using subspace de-correlation. Reduces interference while allowing transfer.
Share (Feb 2026): Maintains a single evolving shared low-rank subspace. Integrates new tasks without storing many adapters.
| Pros | Cons |
| --- | --- |
| No replay needed | Still active research |
| Privacy-friendly | Evaluation complexity |
| Constant memory | Subtle failure modes |

3. Modular Adapters
Keep base model frozen. Train task-specific adapters. Merge or switch as needed.
Evolution:
- Low-Rank Adaptation (LoRA): Individual adapters per task
- Shared LoRA spaces: Adapters share subspace
- Adapter banks: Library of skills to compose
| Pros | Cons |
| --- | --- |
| Modular, versioned | Adapter proliferation |
| Low forgetting risk | Routing complexity |
| Easy rollback | Composition challenges |

4. Memory-First Learning
Store experiences in external memory. Only consolidate to weights what’s proven stable.
Pattern:
- New information → Memory (fast)
- Validated patterns → Adapters (slow)
- Fundamental capabilities → Weights (rare)
This separates the speed of learning from the permanence of changes.
The Practical Loop
A working continuous learning system:
1. Run agent (with Recursive Language Model (RLM) context management)
2. Collect traces: prompts, tool calls, outcomes, failures
3. Score outcomes: tests, static analysis, user signals
4. Cluster recurring failure patterns
5. Train lightweight updates (LoRA/adapters)
6. Validate retention (did old skills degrade?)
7. Deploy modular update (with rollback capability)

This is not real-time learning. It’s periodic consolidation.
Human analogy: Sleep. Process experiences, consolidate important patterns, prune noise.
Time Scales of Update
| Frequency | What Changes | Method |
| --- | --- | --- |
| Every query | Nothing (inference only) | - |
| Per session | Memory | Retrieval-Augmented Generation (RAG)/Engram |
| Daily | Adapters (maybe) | Lightweight Parameter-Efficient Fine-Tuning (PEFT) |
| Weekly | Validated adapters | Reviewed updates |
| Monthly | Core weights | Major consolidation |

Most systems should:
- Update memory frequently
- Update adapters occasionally
- Update core weights rarely
Evaluation Is Critical
Continuous learning without continuous evaluation is dangerous.
Required:
- Retention tests (what got worse?)
- Forward transfer tests (what got better?)
- Regression detection
- Rollback capability
Without these, you’re flying blind.
References
| Concept | Paper |
| --- | --- |
| ELLA | Subspace Learning for Lifelong ML (2024) |
| Share | Shared LoRA Subspaces (2025) |
| FOREVER | Model-Centric Replay (2024) |
| EWC | Overcoming Catastrophic Forgetting (Kirkpatrick et al. 2017) |

Coming Next
In Part 7, we’ll put it all together: designing a practical continuous learning agent with layered architecture, logging, feedback loops, and safety.
Learn often in memory. Consolidate carefully in weights.
-
424 words • 3 min read
Five ML Concepts - #26

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #26 |
References
| Concept | Reference |
| --- | --- |
| Data Augmentation | A survey on Image Data Augmentation (Shorten & Khoshgoftaar 2019) |
| Caching Strategies | Systems engineering practice (no canonical paper) |
| Constitutional AI | Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022) |
| Goodhart’s Law | Goodhart’s Law and Machine Learning (Sevilla et al. 2022) |
| Manifold Hypothesis | An Introduction to Variational Autoencoders (Kingma & Welling 2019) |

Today’s Five
1. Data Augmentation
Creating additional training examples using label-preserving transformations. Rotate, flip, crop, or color-shift images without changing what they represent.
Effectively increases dataset size and improves generalization.
Like practicing piano pieces at different tempos to build flexibility.
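A minimal NumPy sketch of two label-preserving transforms (flip and small shift); real pipelines layer on crops, rotations, and color jitter:

```python
import numpy as np

def augment(img, rng):
    """Label-preserving transforms: random horizontal flip and small shift."""
    if rng.random() < 0.5:
        img = img[:, ::-1]               # horizontal flip
    shift = rng.integers(-2, 3)
    img = np.roll(img, shift, axis=1)    # small horizontal translation
    return img

rng = np.random.default_rng(0)
img = np.arange(64).reshape(8, 8)        # stand-in for an image
aug = augment(img, rng)
print(aug.shape)  # (8, 8): same shape, same label, "new" example
```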
2. Caching Strategies
Storing previous computation results to reduce repeated work and latency. Cache embeddings, KV states, or frequently requested outputs.
Essential for production inference at scale.
Like keeping frequently used books on your desk instead of the library.
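A tiny Python sketch using the standard library's `functools.lru_cache`; the `embed` function is a placeholder for a real, expensive embedding call:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def embed(text):
    """Stand-in for an expensive embedding call; cached by input text."""
    return tuple(ord(c) % 7 for c in text)  # placeholder computation

embed("hello world")
embed("hello world")             # second call served from cache
info = embed.cache_info()
print(info.hits, info.misses)    # 1 1
```

The same idea scales up to KV-cache reuse and response caches in serving stacks.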
3. Constitutional AI
Training models to follow explicit written principles alongside other alignment methods. The constitution provides clear rules for behavior.
Models critique and revise their own outputs against these principles.
Like giving someone written house rules instead of vague instructions.
4. Goodhart’s Law
When a measure becomes a target, it can stop being a good measure. Optimizing for a proxy metric can diverge from the true objective.
A core challenge in reward modeling and evaluation design.
Like studying only for the test instead of learning the subject.
5. Manifold Hypothesis
The idea that real-world data lies on lower-dimensional structures within high-dimensional space. Images of faces don’t fill all possible pixel combinations.
This structure is what representation learning exploits.
Like faces varying along a few key features instead of every pixel independently.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Data Augmentation | Expanding training data with transformations |
| Caching Strategies | Reducing latency by reusing computation |
| Constitutional AI | Training models to follow explicit principles |
| Goodhart’s Law | Optimizing metrics distorts objectives |
| Manifold Hypothesis | Data lies on lower-dimensional structures |
Short, accurate ML explainers. Follow for more.
-
697 words • 4 min read
music-pipe-rs: Web Demo and Multi-Instrument Arrangements

Since the initial music-pipe-rs post, the project has grown. There’s now a web demo with playable examples, a new `seq` stage for explicit note sequences, and multi-instrument arrangements that work in GarageBand.

| Resource | Link |
| --- | --- |
| Video | YouTube |
| Live Demo | music-pipe-rs Samples |
| Source | GitHub |
| Previous | Unix Pipelines for MIDI |

Web Demo
The live demo showcases pre-built examples with playable audio:
| Tab | Style | Description |
| --- | --- | --- |
| Bach Toccata (Organ) | Classical | Multi-voice church organ with octave doubling and pedal bass |
| Bach Toccata (8-bit) | Chiptune | Gyruss-inspired arcade version with square wave |
| Bach-esque | Algorithmic | Procedurally generated baroque-style background music |
| Baroque Chamber | Ensemble | Six-channel piece with strings, harpsichord, and recorder |

Each tab shows the pipeline script alongside playable audio. See exactly what commands produce each result.
The seq Stage
The new `seq` stage allows explicit note sequences instead of algorithmic generation:

```shell
seed | seq "C4/4 D4/4 E4/4 F4/4 G4/2" | to-midi --out scale.mid
```

Notation is `NOTE/DURATION`, where duration is in beats. Combine with other stages:

```shell
seed | seq "D5/4 C#5/8 R/4 B4/4" | transpose --semitones 5 | humanize | to-midi --out melody.mid
```

`R` represents a rest. This enables transcribing existing melodies or composing precise phrases.

Multi-Instrument Arrangements
The Baroque chamber piece demonstrates six-channel composition:
```shell
{
  seed 42 | seq "..." --ch 0 --patch 48;  # Strings melody
  seed 42 | seq "..." --ch 1 --patch 6;   # Harpsichord
  seed 42 | seq "..." --ch 2 --patch 74;  # Recorder
  # ... additional voices
} | humanize | to-midi --out baroque.mid
```

Each instrument gets its own channel and General MIDI patch. The same seed ensures timing coherence across parts.
GarageBand Integration
Import the MIDI files directly into GarageBand:
- Generate arrangement: `./examples/trio-demo.sh`
- Open GarageBand, create new project
- Drag the `.mid` file into the workspace
- GarageBand creates tracks for each channel
- Assign software instruments to taste
The demo includes a jazz trio arrangement:
- Piano: Bluesy melody with chords and swing
- Bass: Walking bass line with acoustic bass patch
- Drums: Hi-hat, snare, kick with dynamic variation
All generated from pipeline scripts.
Inspiration
This project was inspired by research into generative music tools and techniques:
References
| Topic | Link |
| --- | --- |
| Analog Synthesizers | Code Self Study |
| Drum Synthesis | JavaScript Drum Synthesis |
| Generative Music | Code Self Study |
| Music Projects | Software and Hardware |
| FOSS Music Tools | Open Source Music Production |
| Eurorack Programming | Patch.Init() Tutorial |
| Opusmodus | Algorithmic Composition in Lisp |

The key insight from Opusmodus: algorithmic composition isn’t random music—it’s programmable composition. Motif transformation, rule systems, deterministic generation. music-pipe-rs brings these ideas to Unix pipes.
What’s Next
The pipeline architecture makes extension natural:
- More generators: Markov chains, L-systems, cellular automata
- More transforms: Inversion, retrograde, quantization
- Live mode: Real-time MIDI output with clock sync
Each new capability is just another stage in the pipeline.
Series: Personal Software (Part 5)
Previous: music-pipe-rs: Unix Pipelines

Disclaimer
You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.
Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.
-
900 words • 5 min read
Lucy 20%: Upgrading My Home AI Cluster

Lucy is getting an upgrade. I’m adding an X99 motherboard with an RTX 3090 to expand my AI cluster from 10% to 20% brain power.
| Resource | Link |
| --- | --- |
| Video | Lucy 20% Upgrade |
| Previous | Lucy 10% |
New Hardware: Queenbee
The cluster uses bee-themed naming. The new node is called queenbee:
| Component | Specification |
| --- | --- |
| Motherboard | X99 |
| CPU | Intel Xeon E5-2660 v4 (28 threads) |
| RAM | 64 GB DDR4 ECC |
| GPU | RTX 3090 (24GB VRAM) |
| Storage | 1TB NVMe SSD + 4TB HDD |

New AI Capabilities
With queenbee online, Lucy gains several new abilities:
| Capability | Model | What It Does |
| --- | --- | --- |
| Voice Cloning | VoxCPM | High-quality text-to-speech with voice cloning |
| Text-to-Image | FLUX schnell | Fast image generation from text prompts |
| Text-to-Video | Wan 2.2 | Generate video clips from text descriptions |
| Image-to-Video | SVD | Animate still images into video |

The Active Cluster
Currently active for AI workloads:
| Node | Role | GPU |
| --- | --- | --- |
| hive | MuseTalk lip-sync | 2x P40 (48GB total) |
| queenbee | Generative AI workloads | RTX 3090 (24GB) |

Together, they handle the full pipeline: generate images, animate them to video, add lip-synced speech, and produce the final output. See the full apiary inventory below.
Why Local AI?
Running AI locally means:
- Privacy - Data never leaves my network
- No API costs - Unlimited generations after hardware investment
- Customization - Full control over models and parameters
- Learning - Deep understanding of how these systems work
The 24GB of VRAM on the 3090 opens up models that wouldn’t fit on smaller cards. FLUX schnell produces high-quality images in seconds. VoxCPM creates natural-sounding speech that can clone voices from short audio samples.
Bee-Themed Host Names
The full apiary (current and planned nodes):
| Host | System | CPU | Cores | RAM | GPU |
| --- | --- | --- | --- | --- | --- |
| apiary | HPE DL360 G10 | 1x Xeon Gold 5118 | 12C/24T | 188G | - |
| bees | HPE DL360 G9 | 2x E5-2650 v4 | 24C/48T | 128G | - |
| brood | HPE DL380 G9 | 2x E5-2680 v4 | 28C/56T | 64G | 2x P100-16G |
| colony | Supermicro 6028U | 2x E5-2680 v3 | 24C/48T | TBD | 2x K80-24G |
| drones | HPE DL380 G9 | 2x E5-2620 v4 | 16C/32T | 256G | - |
| hive | HPE DL380 G9 | 2x E5-2698 v3 | 32C/64T | 128G | 2x P40-24G |
| honeycomb | HPE DL180 G9 | 1x E5-2609 v4 | 8C/8T | TBD | - |
| queenbee | X99 | 1x E5-2660 v4 | 14C/28T | 64G | RTX 3090-24G |
| swarm | HPE DL380 G9 | 2x E5-2698 v3 | 32C/64T | 374G | 2x P100-12G |
| workers | HPE DL560 G8 | 4x E5-4617 v1 | TBD | 640G | TBD |

Notes: Some nodes are pending upgrade or configuration. Workers may upgrade to 4x E5-4657L v2 (48C/96T). Honeycomb needs unbricking. The K80 GPUs are old and difficult to configure (limited CUDA version support) and will be replaced with M40 GPUs.
Power and Control
Remote management is essential for a home datacenter. The HPE servers include iLO (Integrated Lights-Out) for out-of-band access to BIOS, diagnostics, monitoring, and power control—even when the OS is down.
| Category | Technology | Purpose |
| --- | --- | --- |
| Remote Management | HPE iLO | BIOS access, diagnostics, monitoring, power control |
| IP KVM | JetKVM, Sipeed KVM | Console access for non-HPE servers (planned) |
| Power Monitoring | Kill-A-Watt, clones | Per-outlet power consumption tracking |
| Smart Outlets | Home Assistant + Zigbee | Remote power control, scheduling, automation |
| Additional Circuits | Bluetti LFP power stations | Extra capacity to run more servers, remote control via BT/WiFi/Zigbee |

The combination of iLO and smart outlets means I can remotely power-cycle any server, access its console, and monitor power draw—all from my phone or Home Assistant dashboard. The Bluetti stations primarily provide additional circuits so I can run more servers simultaneously—home electrical limits are a real constraint. More LFP power stations will be needed to power Lucy at 100%.
Networking
Each server has 3 or more NICs, segmented by purpose:
| Speed | Purpose | Switch |
| --- | --- | --- |
| 1G | iLO/KVM management | 1G switch |
| 2.5G | SSH, SCP, Chrome Remote Desktop | 2x 2.5G switches |
| 10G fiber | Server-to-server data transfer (large models) | 10G switch |

The 10G backbone is essential for moving multi-gigabyte model files between nodes. Loading a 70B parameter model over 1G would take forever—10G fiber makes it practical. The 2.5G network handles interactive work and smaller transfers (using USB NICs where needed), while the 1G management network stays isolated for out-of-band access.
Additional networking notes:
- WiFi 7 for wireless connectivity
- Managed switches with VLANs planned for better network segmentation
- Linux network bonding experiments to increase aggregate transfer rates
- Sneaker net - most servers have hot-swap SAS SSDs and hard drives, so physically moving drives between nodes is sometimes the fastest option for very large transfers
What’s Next
The 20% milestone is just a step. Future upgrades could include:
- Additional GPU nodes for parallel processing
- Larger language models for local inference
- Real-time video generation pipelines
- Integration with more specialized models
The bee hive keeps growing.
Building AI infrastructure one node at a time.
-
631 words • 4 min read
How AI Learns Part 5: Context Engineering & Recursive Reasoning

Large context windows are not a complete solution.
As context grows:
- Attention dilutes
- Errors compound
- Reasoning quality degrades
| Resource | Link |
| --- | --- |
| Related | RLM \| ICL Revisited |

The Context Problem
Transformers have finite attention. With limited attention heads and capacity, the model cannot attend equally to everything. As tokens accumulate:
- Earlier instructions lose influence
- Patterns average toward generic responses
- Multi-step reasoning fails
This is context rot—not forgetting weights, but losing signal in noise.
In-Context Learning (ICL)
The model adapts temporarily via examples in the prompt.
| Aspect | ICL |
| --- | --- |
| Updates weights? | No |
| Persists across sessions? | No |
| Speed | Instant |
| Mechanism | Activations, not gradients |

ICL is powerful but ephemeral. It’s working memory, not learning.
Limitation: As context grows, ICL examples compete with other content for attention.
Recursive Language Models (RLM)
Rebuild context each step instead of dragging it forward. RLMs decompose reasoning into multiple passes. Instead of dragging entire context forward:
- Query relevant memory
- Retrieve what’s needed now
- Execute tools
- Evaluate results
- Reconstruct focused context
- Repeat
This treats context as a dynamic environment, not a static blob.
Why RLM Works
Traditional approach:
```
[System prompt + 50k tokens of history + query]
```

RLM approach:

```
[System prompt + retrieved relevant context + current query]
```

Each reasoning step starts fresh with focused attention.
Context Engineering Techniques
| Technique | How It Helps |
| --- | --- |
| Summarization | Compress old context, preserve essentials |
| Chunking | Process in segments, aggregate results |
| Retrieval | Pull relevant content, not everything |
| Tool offloading | Store state externally, query on demand |
| Structured prompts | Clear sections, explicit priorities |

Tool Use as Context Management
Tools aren’t just for actions—they’re for state management.
Instead of keeping everything in context:
- Store in files, databases, or structured formats
- Query when needed
- Return focused results
This converts unbounded context into bounded queries.
The Agent Loop
Modern agents combine these ideas:
```python
while not done:
    # 1. Assess current state
    relevant = retrieve_from_memory(query)
    # 2. Build focused context
    context = [system_prompt, relevant, current_task]
    # 3. Reason
    action = llm(context)
    # 4. Execute
    result = execute_tool(action)
    # 5. Update memory
    memory.store(result)
    # 6. Evaluate
    if goal_achieved(result):
        done = True
```

Each iteration rebuilds context. No rot accumulation.
Test-Time Adaptation
A related technique: temporarily update weights during inference.
| Aspect | Test-Time Learning |
| --- | --- |
| Updates weights? | Yes, lightly (LoRA) |
| Persists? | No (rolled back) |
| Purpose | Adapt to input distribution |

This sits between ICL (no updates) and fine-tuning (permanent updates).
Key Insight
Context is not a static buffer. It’s a dynamic workspace.
Systems that treat context as “append everything” will rot. Systems that actively manage context stay coherent.
References
| Concept | Paper |
| --- | --- |
| RLM | Recursive Language Models (Zhou et al. 2024) |
| ICL | What Can Transformers Learn In-Context? (Garg et al. 2022) |
| Test-Time Training | TTT for Language Models (2024) |
| Chain-of-Thought | Chain-of-Thought Prompting (Wei et al. 2022) |

Coming Next
In Part 6, we’ll connect all of this to continuous learning: replay methods, subspace regularization, adapter evolution, and consolidation loops.
Rebuild focus instead of dragging baggage.
-
406 words • 3 min read
Five ML Concepts - #25

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #25 |
References
| Concept | Reference |
| --- | --- |
| Label Smoothing | Rethinking the Inception Architecture (Szegedy et al. 2015) |
| Miscalibration | On Calibration of Modern Neural Networks (Guo et al. 2017) |
| Representation Learning | Representation Learning: A Review (Bengio et al. 2013) |
| Adversarial Examples | Intriguing properties of neural networks (Szegedy et al. 2013) |
| Double Descent | Deep Double Descent (Nakkiran et al. 2019) |

Today’s Five
1. Label Smoothing
Replacing hard one-hot labels with softened target distributions during training. Instead of 100% confidence in one class, distribute small probability to other classes.
Reduces overconfidence and can improve generalization.
Like allowing small uncertainty instead of absolute certainty.
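One common formulation, sketched in NumPy (here spreading `eps` over the non-target classes; some variants spread it over all classes including the target):

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """Turn hard class indices into softened target distributions."""
    targets = np.full((len(y), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(y)), y] = 1.0 - eps
    return targets

t = smooth_labels(np.array([2]), num_classes=4, eps=0.1)
print(t)  # target class gets 0.9, the rest share the remaining 0.1
```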
2. Miscalibration
When predicted confidence does not match observed accuracy. A model that says “90% confident” should be right 90% of the time.
Modern neural networks tend to be overconfident. Temperature scaling can help.
Like a forecast that sounds certain but is often wrong.
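Temperature scaling is a single learned divisor applied to the logits before softmax. A self-contained sketch (the logits here are made up for illustration):

```python
from math import exp

def softmax_with_temperature(logits, T: float = 1.0):
    """Divide logits by T before softmax; T > 1 flattens overconfident predictions."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]
confident  = softmax_with_temperature(logits, T=1.0)
calibrated = softmax_with_temperature(logits, T=2.0)
assert max(calibrated) < max(confident)   # higher T lowers top-class confidence
```

In practice T is fit on a held-out validation set; predictions (argmax) are unchanged, only confidences move.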
3. Representation Learning
Learning useful internal features automatically from raw data. Instead of hand-crafting features, the model discovers what matters.
The foundation of deep learning’s success across domains.
Like detecting edges before recognizing full objects.
4. Adversarial Examples
Inputs modified to cause incorrect predictions. Small, often imperceptible changes can flip model outputs.
A security concern and a window into model vulnerabilities.
Like subtle changes that fool a system without obvious differences.
5. Double Descent
Test error that decreases, increases, then decreases again as model capacity grows. The classical bias-variance tradeoff captures only the first part.
Modern overparameterized models operate in the second descent regime.
Like getting worse before getting better—twice.
Quick Reference
| Concept | One-liner |
|---|---|
| Label Smoothing | Softening targets to reduce overconfidence |
| Miscalibration | Confidence not matching accuracy |
| Representation Learning | Automatically learning useful features |
| Adversarial Examples | Inputs crafted to cause errors |
| Double Descent | Test error decreasing twice with model size |
Short, accurate ML explainers. Follow for more.
-
627 words • 4 min read • Abstract
How AI Learns Part 4: Memory-Based Learning

Modern AI systems increasingly rely on external memory.
This shifts “learning” away from parameters.
| Resource | Link |
|---|---|
| Related | Engram \| Engram Revisited \| Multi-hop RAG |

The Memory Paradigm

Store facts outside the brain.

Why External Memory?
Most “learning new facts” should not modify weights.
Weights are for generalization. They encode reasoning patterns, language structure, and capability.
Memory is for storage. It holds specific facts, documents, and experiences.
If you store everything in weights:
- You create interference
- You risk forgetting
- You must retrain
If you store facts in memory:
- No forgetting
- Fast updates
- Survives model upgrades
Retrieval-Augmented Generation (RAG)
Documents are embedded into vectors. At query time:
- Embed the query
- Search the vector database
- Retrieve relevant documents
- Inject into prompt
- Generate grounded response
The model does not need to remember facts internally. It retrieves them on demand.
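The embed-search-inject steps can be sketched with a toy vector store using cosine similarity. The hand-written embeddings and document strings are stand-ins; a real pipeline would call an embedding model and an LLM.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=2):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

# Toy index of (embedding, document) pairs -- a real system embeds the text.
index = [
    ([1.0, 0.0, 0.0], "Doc about cats"),
    ([0.9, 0.1, 0.0], "Doc about kittens"),
    ([0.0, 0.0, 1.0], "Doc about finance"),
]

query_vec = [1.0, 0.05, 0.0]            # pretend-embedded query about cats
docs = retrieve(query_vec, index)
prompt = "Context:\n" + "\n".join(docs) + "\nAnswer the question."
```

Only the retrieved passages enter the prompt; the irrelevant document never consumes context.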
RAG Benefits
| Benefit | Description |
|---|---|
| No forgetting | External storage, not weights |
| Persistent | Survives restarts and model changes |
| Scalable | Add documents without retraining |
| Verifiable | Can cite sources |

RAG Challenges
- Retrieval precision (wrong docs = bad answers)
- Latency (search takes time)
- Index maintenance
- Chunk boundaries
Cache-Augmented Generation (CAG)
Instead of retrieving from vector DB, cache previous context or KV states.
Use cases:
- Repeated knowledge tasks
- Multi-turn conversations
- Pre-computed context windows
Benefits over RAG:
- Often faster (no embedding + search)
- More deterministic
- Good for structured repeated workflows
Trade-offs:
- Less flexible
- Cache management complexity
Engram-Style Memory
Recent proposals (e.g., DeepSeek research) introduce conditional memory modules with direct indexing.
Instead of scanning long context or searching vectors:
- Memory slots indexed directly
- O(1) lookup instead of O(n) attention
- Separates static knowledge from dynamic reasoning
The goal: Constant-time memory access that doesn’t scale with context length.
This changes the compute story:
- Don’t waste attention on “known facts”
- Reserve compute for reasoning
- Avoid context rot
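The direct-indexing idea reduces, at its simplest, to a keyed store with constant-time average lookup. This is a loose dict-based sketch of the concept, not the actual Engram design, and the key scheme is invented for illustration.

```python
class FactMemory:
    """Toy direct-indexed memory: O(1) average lookup via a hash table,
    instead of scanning a long context with O(n) attention."""
    def __init__(self):
        self._slots = {}

    def write(self, key: str, value: str) -> None:
        self._slots[key] = value

    def read(self, key: str, default: str = "<unknown>") -> str:
        return self._slots.get(key, default)

memory = FactMemory()
memory.write("capital:Australia", "Canberra")
assert memory.read("capital:Australia") == "Canberra"
```

Lookup cost stays constant no matter how many facts are stored, which is the point: known facts shouldn't compete for attention with reasoning.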
Model Editing
A related technique: surgically patch specific facts without full fine-tuning.
Example: The model says “The capital of Australia is Sydney.” You edit the specific association to “Canberra” without retraining.
Pros:
- Targeted fixes
- Fast
Cons:
- Side effects possible
- Consistency not guaranteed
The Key Distinction
| Aspect | Weight Learning | Memory Learning |
|---|---|---|
| Location | Parameters | External storage |
| Persistence | Model lifetime | Storage lifetime |
| Forgetting risk | High | None |
| Update speed | Slow (training) | Fast (database) |
| Survives model change? | No | Yes |

When to Use What
| Situation | Approach |
|---|---|
| Need new reasoning capability | Weight-based (fine-tune) |
| Need to know new facts | Memory-based (RAG) |
| Need domain expertise | Weight-based (LoRA) |
| Need to cite sources | Memory-based (RAG) |
| Frequently changing data | Memory-based (RAG/CAG) |

References
| Concept | Paper |
|---|---|
| RAG | Retrieval-Augmented Generation (Lewis et al. 2020) |
| Engram | Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025) |
| REALM | REALM: Retrieval-Augmented Pre-Training (Guu et al. 2020) |
| Model Editing | Editing Factual Knowledge (De Cao et al. 2021) |

Coming Next
In Part 5, we’ll examine context engineering and recursive reasoning: ICL, RLM, and techniques that prevent context rot during inference.
The brain stays stable. The notebook grows.
-
426 words • 3 min read • Abstract
Five ML Concepts - #24

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #24 |
References
| Concept | Reference |
|---|---|
| Warmup | Accurate, Large Minibatch SGD (Goyal et al. 2017) |
| Data Leakage | Leakage in Data Mining (Kaufman et al. 2012) |
| Mode Collapse | Generative Adversarial Nets (Goodfellow et al. 2014) |
| Blue/Green Deployment | MLOps best practice (no canonical paper) |
| Reward Hacking | Concrete Problems in AI Safety (Amodei et al. 2016) |

Today’s Five
1. Warmup
Gradually increasing the learning rate at the start of training as part of a learning rate schedule. This helps stabilize early training when gradients can be noisy.
Warmup is especially important for large batch training.
Like stretching before a sprint instead of starting at full speed.
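A minimal linear-warmup schedule, one common variant (function and parameter names are illustrative):

```python
def warmup_lr(step: int, warmup_steps: int = 1000, base_lr: float = 1e-3) -> float:
    """Linear warmup: ramp the learning rate from near zero up to base_lr,
    then hold it (a decay schedule usually takes over afterwards)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

Early steps use a tiny rate while gradient statistics are still noisy; by step 1000 the schedule reaches the full `base_lr`.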
2. Data Leakage
When information unavailable at deployment accidentally influences model training. This creates artificially high validation scores that don’t reflect real-world performance.
Common sources include future data, preprocessing on full dataset, or duplicate samples.
Like memorizing test answers instead of learning the material.
3. Mode Collapse
When a generative model produces limited output diversity. The generator learns to produce only a few outputs that fool the discriminator.
A major challenge in GAN training that various architectures attempt to address.
Like a musician who only plays one song no matter the request.
4. Blue/Green Deployment
Maintaining two production environments and switching traffic between them. One serves live traffic while the other is updated and tested.
Enables instant rollback if problems occur.
Like having a backup stage ready so the show never stops.
5. Reward Hacking
When agents exploit reward functions in unintended ways. The agent optimizes the reward signal rather than the intended objective.
A key challenge in reinforcement learning and AI alignment.
Like gaming the grading rubric instead of learning the material.
Quick Reference
| Concept | One-liner |
|---|---|
| Warmup | Gradually increasing learning rate at start |
| Data Leakage | Training on unavailable deployment info |
| Mode Collapse | Limited generative output variety |
| Blue/Green Deployment | Switching between parallel environments |
| Reward Hacking | Exploiting reward function flaws |
Short, accurate ML explainers. Follow for more.
-
1231 words • 7 min read • Abstract
TBT (5/?): IBM 1130 System Emulator - Experience 1960s Computing

| Resource | Link |
|---|---|
| Live Demo | IBM 1130 System Emulator |
| Source | GitHub |
| Video | IBM 1130 System Emulator |
| IBM Docs | Functional Characteristics (GA26-5881) |
| More Docs | Bitsavers IBM 1130 Collection |

The System
This isn’t just an assembly emulator—it’s a full system visualization:
| Component | What It Does |
|---|---|
| Console Panel | Authentic indicator lights, toggle switches, speed control |
| Assembler Game | Write and execute IBM 1130 code with real-time visualization |
| Keypunch | IBM 029 text cards and 1442 object deck visualization |
| Printer | IBM 1131 console printer with greenbar paper |

Console Panel
The console panel recreates the physical operator interface with all indicator light groups documented in IBM’s Functional Characteristics manual.
Register Display (6 rows × 16 positions)
| Row | Register | Bits Shown | Purpose |
|---|---|---|---|
| 1 | IAR | 15 | Instruction Address Register (program counter) |
| 2 | SAR | 15 | Storage Address Register (memory access) |
| 3 | SBR | 16 | Storage Buffer Register (data word) |
| 4 | AFR | 16 | Arithmetic Factor Register (operand) |
| 5 | ACC | 16 | Accumulator (main arithmetic register) |
| 6 | EXT | 16 | Extension (double-precision, multiply/divide) |

Right-Side Indicators
Beyond the register displays, the console shows:
- Operation Register (5 bits) - Binary op-code of current instruction
- Format/Tag Indicators - Long instruction format, index register selection
- Cycle Control (T0-T7) - Internal timing pulses for debugging
- Status Lights - Wait, Run, Fetch, Execute, Indirect Address
Control Panel Lights
| Light | Purpose |
|---|---|
| DISK UNLOCK | Safe to swap 2315 disk cartridge |
| FILE READY | Disk drive up to speed |
| FORMS CHECK | Printer out of paper |
| RUN | CPU executing instructions |
| PARITY | Memory parity error |
| FREEZE | Fatal hardware error |

Operator Controls
- 16-bit toggle switches for manual data entry
- 7-position speed knob - Single Step, SMC, INT RUN, RUN, SI, DISP, LOAD
- Lamp test to verify all indicators function
- Emergency stop button
Assembler Game
Learn the IBM 1130 instruction set interactively:
- Complete instruction set - LD, STO, LDX, STX, A, S, AND, OR, SLA, SRA, BSC, BSI, WAIT
- Memory-mapped index registers - XR1-3 at addresses 1, 2, 3 (historically accurate)
- Step-by-step execution with change highlighting
- Interactive examples covering arithmetic, indexing, shifts
- Progressive challenges with validation
Keypunch
The keypunch simulation supports two card types:
IBM 029 Text Cards
- Hollerith encoding - Standard character-to-punch mapping
- Visual card display - Watch holes appear as you type
- Multi-card decks - Manage multiple cards
IBM 1130 Object Deck (1442 Output)
- Binary card visualization - Machine code punch patterns
- Object deck format - Matches authentic assembler output
- No character printing - Pure binary data cards
The IBM 029 Keypunch produced human-readable text cards. For binary object decks (compiled programs), the IBM 1442 Card Read-Punch would create cards with arbitrary punch patterns that don’t map to characters.
Printer
The IBM 1131 Console Printer simulation:
- Greenbar paper rendering - Authentic line printer output
- Typewriter-style characters - Period-appropriate appearance
- Console output - System messages and program output
Technology
| Component | Choice |
|---|---|
| Language | Rust |
| Target | WebAssembly |
| UI Framework | Yew |
| Build Tool | Trunk |
| Hosting | GitHub Pages |

Planned Enhancements
This is a work in progress. Planned features include:
- Additional challenges (10 total)
- Code save/load functionality
- URL sharing of programs
- Breakpoints and memory watches
- Keyboard shortcuts
- Full 1442 Card Read-Punch integration
IBM Documentation References
| Document | Description |
|---|---|
| GA26-5881 | Functional Characteristics - Console panel details |
| GA26-5717 | Operating Procedures - Operator instructions |
| GA26-5914 | Physical Planning - System dimensions |
| Bitsavers Collection | Complete IBM 1130 documentation archive |

Project Goals
This is an early proof-of-concept for trying out components that could be extended to produce a more realistic system of devices that could actually run programs. The modular architecture allows each peripheral (console, keypunch, printer) to be developed and refined independently.
A key goal is educational challenges that teach assembly language step by step. The assembler game provides progressive exercises that build understanding from basic load/store operations through arithmetic, indexing, and control flow.
Historical Significance
The IBM 1130 was the first computer for many programmers in the late 1960s and 1970s. Its clean architecture and accessible price point (~$32,000) made it ideal for education.
A Transitional Technology
The IBM 1130 arrived after mechanical calculators and vacuum tube computers, but before dense integrated circuits and microprocessors. This was a unique moment in computing history when machines were complex enough to be powerful, yet simple enough to be fully understood by one person.
The system shipped with complete schematics and diagnostic listings. A field engineer could use an oscilloscope to probe the pins on every transistor. The “integrated circuit” of the era was a small can with a 4×4 pin grid containing just two transistors, mounted on a pluggable card connected via a wire-wrapped backplane. When something failed, you could see it, touch it, and replace it.
Non-Volatile Core Memory
One remarkable feature: magnetic core memory was non-volatile. You could stop the system, power down overnight, come back in the morning, power up, and start your program exactly where it left off—without reloading from cards, tape, or disk.
Each bit was stored as the magnetic polarity of a tiny ferrite ring. No electricity required to maintain state. This made the 1130 remarkably resilient and practical for environments where power wasn’t guaranteed.
Notable fact: The Forth programming language was developed on the IBM 1130 by Charles Moore in the late 1960s.
Personal Experience
In the late 1970s, I worked as an IBM Customer Engineer maintaining a large number of IBM 1130 and 1800 systems used primarily by IBM manufacturing facilities in Kingston, Poughkeepsie, and East Fishkill, New York.
Field service on these machines was hands-on in ways that seem almost unimaginable today. I would often hand-assemble code on paper, converting mnemonics to binary, then enter machine code via the console toggle switches to create a small program. That program’s job? To punch another program onto a card.
I could then insert that punched card into a diagnostic deck to loop on an error condition while I used an oscilloscope and logic schematics to diagnose a failing circuit card. The blinking lights weren’t decoration—they were essential debugging tools that showed exactly what the CPU was doing at each moment.
This emulator recreates that experience: the same indicator lights, the same toggle switches, the same intimate connection between human and machine that made these systems so memorable to work with.
Experience 1960s computing in your browser. Work in progress.
-
649 words • 4 min read • Abstract
How AI Learns Part 3: Weight-Based Learning

Weight-based learning modifies the neural network itself.
It is slow. It is powerful. It is dangerous.
| Resource | Link |
|---|---|
| Related | Sleepy Coder: When Fine-Tuning Fails \| 5MLC #3: LoRA |

The Weight-Based Methods

Weight-based learning modifies the brain itself.

Pretraining
This creates the base model.
It encodes language structure, reasoning patterns, and general world knowledge. The process:
- Trains on terabytes of text
- Uses self-supervised learning (predict next token)
- Runs for weeks or months
- Costs millions of dollars
This learning is rarely repeated for cost reasons. The result is a foundation that everything else builds upon.
Fine-Tuning
Fine-tuning adapts models for specific tasks.
Standard Fine-Tuning
Adjust some or all weights using task-specific data.
Pros:
- Can significantly change behavior
- Works with small datasets
Cons:
- Risk of catastrophic forgetting
- Expensive if you modify all weights
- Hard to undo
Supervised Fine-Tuning (SFT)
Train on instruction → response pairs.
This teaches the model to:
- Follow directions
- Produce helpful outputs
- Maintain conversation structure
Risk: Can reduce other capabilities if data is narrow.
Preference Optimization
Instead of “correct answers,” train from comparisons: preferred vs rejected responses.
| Method | Description |
|---|---|
| Reinforcement Learning from Human Feedback (RLHF) | Reward model + reinforcement learning |
| Direct Preference Optimization (DPO) | Simpler alternative to RLHF |
| RLAIF | AI-generated preferences |

Pros: Strong style/safety/helpfulness steering
Cons: Can drift (“over-align”), may conflict with domain competence
Parameter-Efficient Fine-Tuning (PEFT)
Instead of changing all weights, inject small trainable modules.
LoRA (Low-Rank Adaptation)
Insert small low-rank matrices into transformer layers. Only train these matrices.
Benefits:
- Faster training: Fewer parameters to update
- Modular: Can swap adapters
- Version control: Different adapters for different tasks
- Lower forgetting risk: Base weights frozen
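The core of LoRA is a forward pass of the form y = Wx + α·B(Ax), where W is frozen and only the low-rank pair (A, B) trains. A plain-Python sketch with tiny illustrative shapes (real implementations apply this inside attention and MLP projections):

```python
def matvec(M, v):
    """Matrix-vector product over nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = W x + alpha * B (A x): frozen base weight W plus a
    trainable low-rank update B @ A (rank = number of rows of A)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))   # project down to rank r, then back up
    return [b + alpha * l for b, l in zip(base, low_rank)]

# 2x2 frozen base weight, rank-1 adapter
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 1.0]]         # r x d_in  (1 x 2), trainable
B = [[0.5], [0.0]]       # d_out x r (2 x 1), trainable
x = [2.0, 3.0]
y = lora_forward(x, W, A, B)   # [4.5, 3.0]
```

Swapping adapters just means swapping (A, B) while W stays untouched, which is why forgetting risk is low.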
Other PEFT Methods
- Prompt tuning: Learn soft prompts
- Prefix tuning: Prepend learned vectors
- Adapters: Small bottleneck layers
- IA³: Learned vectors that scale activations
Shared LoRA Subspaces
Multiple tasks share adapter subspaces to reduce interference.
Recent work (ELLA, Share) maintains evolving shared low-rank subspaces that:
- Reduce interference between tasks
- Enable continual learning
- Keep memory constant
Distillation
Train a smaller model using a larger model as teacher.
| Aspect | Teacher | Student |
|---|---|---|
| Size | Large | Small |
| Cost | High inference | Low inference |
| Knowledge | Full | Compressed |

Distillation benefits:
- Speeds up inference
- Often improves consistency
- Can reduce hallucination
- Makes deployment cheaper
This is not runtime learning—it’s offline structural learning.
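The training signal in Hinton-style distillation is a KL divergence between temperature-softened teacher and student distributions. A minimal sketch (logits are made up; the usual T² loss scaling is omitted for brevity):

```python
from math import exp, log

def softmax(logits, T=1.0):
    """Temperature-softened softmax."""
    exps = [exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions:
    the student is pushed toward the teacher's soft targets."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
matched = distill_loss(teacher, [3.0, 1.0, 0.2])   # identical logits -> ~0
off = distill_loss(teacher, [0.2, 1.0, 3.0])       # mismatched student -> large
```

The soft targets carry more information than hard labels: the teacher's relative confidence in wrong classes is part of what gets compressed into the student.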
The Alignment Pipeline
Modern models typically go through:
- Pretraining → General competence
- SFT → Follow instructions
- RLHF/DPO → Align with preferences
- Safety fine-tuning → Reduce harmful outputs
Each step modifies weights. Each step risks forgetting previous capabilities.
Key Insight
Fine-tuning changes the brain. RAG changes the notes on the desk.
Weight-based learning is the core capability layer. It’s slow to change, expensive to update, and risky to modify—but it forms the stable foundation that everything else builds upon.
References
| Concept | Paper |
|---|---|
| LoRA | LoRA: Low-Rank Adaptation (Hu et al. 2021) |
| RLHF | Training LMs with Human Feedback (Ouyang et al. 2022) |
| DPO | Direct Preference Optimization (Rafailov et al. 2023) |
| Distillation | Distilling Knowledge in Neural Networks (Hinton et al. 2015) |
| Adapters | Parameter-Efficient Transfer Learning (Houlsby et al. 2019) |

Coming Next
In Part 4, we’ll explore memory-based learning: RAG, CAG, Engram, and other techniques that learn without touching weights.
Change the brain carefully.
-
440 words • 3 min read • Abstract
Five ML Concepts - #23

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #23 |
References
| Concept | Reference |
|---|---|
| Emergent Behavior | Emergent Abilities of Large Language Models (Wei et al. 2022) |
| Tool Use | Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al. 2023) |
| Loss Surface Sharpness | On Large-Batch Training for Deep Learning (Keskar et al. 2016) |
| Learning Rate Schedules | SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter 2016) |
| Canary Deployment | MLOps best practice (no canonical paper) |

Today’s Five
1. Emergent Behavior
Some capabilities appear only when models reach sufficient scale. These behaviors were not directly programmed but arise from learned representations.
Emergence is a key phenomenon in large language models.
Like a child learning words and then suddenly understanding full sentences.
2. Tool Use
Modern AI systems can generate structured commands to call external tools. These include search engines, calculators, or code interpreters.
This extends model capabilities beyond internal knowledge.
Like asking a librarian to look something up instead of guessing.
3. Loss Surface Sharpness
Sharp minima are sensitive to small weight changes. Flatter minima tend to be more robust and often generalize better.
Training methods that find flatter regions can improve test performance.
Like standing on a plateau instead of balancing on a narrow peak.
4. Learning Rate Schedules
Instead of keeping the learning rate constant, training often starts high and gradually reduces it. Schedules like step decay or cosine annealing improve convergence.
Warm restarts can help escape local minima.
Like running fast at first, then slowing down to finish precisely.
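A cosine-annealing schedule with warm restarts (the SGDR idea) fits in a few lines; the cycle length and rate bounds here are illustrative defaults:

```python
from math import cos, pi

def cosine_restart_lr(step, cycle_len=100, lr_max=0.1, lr_min=0.001):
    """SGDR-style schedule: cosine decay from lr_max to lr_min,
    restarting at lr_max every cycle_len steps."""
    t = step % cycle_len                       # position within current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / cycle_len))
```

Within a cycle the rate decays smoothly; each restart jumps it back up, which is the "escape local minima" mechanism.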
5. Canary Deployment
A new model version is rolled out to a small percentage of users first. If problems appear, rollout stops before affecting everyone.
Essential MLOps practice for safe production updates.
Like tasting food before serving it to all your guests.
Quick Reference
| Concept | One-liner |
|---|---|
| Emergent Behavior | Capabilities appearing at sufficient scale |
| Tool Use | AI calling external tools |
| Loss Surface Sharpness | Flatter minima generalize better |
| Learning Rate Schedules | Adjusting learning rate during training |
| Canary Deployment | Gradually rolling out new models safely |
Short, accurate ML explainers. Follow for more.
-
641 words • 4 min read • Abstract
How AI Learns Part 2: Catastrophic Forgetting vs Context Rot

There are two fundamentally different failure modes in modern AI systems.
They are often confused. They should not be.
| Resource | Link |
|---|---|
| Related | Sleepy Coder: Routing Prevents Forgetting \| RLM |

The Two Failures

Two different failure modes require two different solutions.

Catastrophic Forgetting (Weight-Space Failure)
When you fine-tune a model on new tasks, performance on older tasks may degrade.
This happens because gradient descent updates overlap in parameter space. The model does not “know” which weights correspond to which task. It optimizes globally.
Example: Fine-tune a model on medical text. Its ability to write code degrades. The new learning overwrote old capabilities.
Why It Happens
Neural networks store knowledge distributed across many weights. When you update those weights for Task D, you modify the same parameters that encoded Task A. The old knowledge gets overwritten.
This is the stability vs plasticity tradeoff:
- Plasticity: Learn new things quickly
- Stability: Retain old things reliably
You cannot maximize both simultaneously.
Solutions
| Method | How It Helps |
|---|---|
| Replay | Train on old + new data |
| Subspace regularization | Constrain weight updates to avoid interference |
| Shared Low-Rank Adaptation (LoRA) spaces | Modular updates that don’t overwrite base weights |
| Freezing base weights | Keep foundation stable, train adapters only |

Context Rot (Inference-Time Failure)
Context rot is not weight damage.
It happens when:
- Prompts grow too large
- Earlier instructions get diluted
- Attention spreads thin
- The model begins averaging patterns instead of reasoning
Example: A 50,000 token conversation. The original system prompt is still there, but the model stops following it. Earlier context gets “forgotten” even though it’s technically present.
Why It Happens
Transformer attention is finite. With limited attention heads and capacity, the model cannot attend equally to everything. As context grows, earlier tokens receive less attention weight.
This creates:
- Instruction drift: Original instructions lose influence
- Pattern averaging: The model reverts to generic responses
- Lost coherence: Multi-step reasoning fails
Solutions
| Method | How It Helps |
|---|---|
| Retrieval-based context | Pull relevant passages, not everything |
| Recursive Language Models (RLM) | Rebuild context each step |
| Summarization | Compress old context |
| Memory indexing | Constant-time lookup instead of linear attention |
| Structured tool calls | Offload state to external systems |

The Critical Distinction
| Aspect | Catastrophic Forgetting | Context Rot |
|---|---|---|
| Where | Weights | Prompt window |
| When | During training | During inference |
| Persists? | Permanently | Session only |
| Analogy | Brain damage | Working memory overload |

Why This Matters
If you confuse these failure modes, you apply the wrong fix.
- Forgetting problem? Don’t add more context. Fix your training.
- Context rot problem? Don’t retrain. Fix your context management.
Many “AI agents that forget” discussions conflate both. Modern systems need solutions for both simultaneously.
References
| Concept | Paper |
|---|---|
| Catastrophic Forgetting | Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017) |
| Continual Learning Survey | A Comprehensive Survey of Continual Learning (Wang et al. 2023) |
| ELLA | ELLA: Subspace Learning for Lifelong Machine Learning (2024) |
| Share | Share: Shared LoRA Subspaces for Continual Learning (2025) |
| RLM | Recursive Language Models (Zhou et al. 2024) |

Coming Next
In Part 3, we’ll examine weight-based learning in detail: pretraining, fine-tuning, LoRA, alignment methods, and distillation.
Different failures need different fixes.
-
472 words • 3 min read • Abstract
Five ML Concepts - #22

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #22 |
References
| Concept | Reference |
|---|---|
| RSFT | Scaling Relationship on Learning Mathematical Reasoning (Yuan et al. 2023) |
| Model Steerability | Controllable Generation from Pre-trained Language Models (Zhang et al. 2023) |
| LSTM | Long Short-Term Memory (Hochreiter & Schmidhuber 1997) |
| More Data Beats Better Models | The Unreasonable Effectiveness of Data (Halevy et al. 2009) |
| System Reliability vs Quality | MLOps best practice (no canonical paper) |

Today’s Five
1. RSFT (Rejection Sampling Fine-Tuning)
A method where many model outputs are generated, weaker ones are filtered out, and the best samples are used for further fine-tuning. It improves output quality without full reinforcement learning.
The model learns from its own best attempts.
Like practicing many attempts and studying only your best ones.
2. Model Steerability
The ability to adjust a model’s behavior through prompts, parameters, or control mechanisms. This allows flexible behavior without retraining.
Steerable models can adapt to different tasks or styles at inference time.
Like steering a car instead of letting it move in a fixed direction.
3. LSTM (Long Short-Term Memory)
A recurrent neural network architecture with gates that regulate memory flow. It was designed to mitigate vanishing gradient problems in sequence modeling.
LSTMs decide what to remember and what to forget at each time step.
Like a notebook where you choose what to keep and what to forget.
4. Why More Data Beats Better Models
In many cases, adding high-quality data improves performance more than small architecture improvements. Data scale often matters as much as model design.
This is sometimes called “the unreasonable effectiveness of data.”
Like practicing with many real conversations instead of perfecting one grammar rule.
5. System Reliability vs Model Quality
A slightly less accurate model that runs reliably can outperform a fragile but slightly better one. Engineers balance uptime, latency, and stability against pure accuracy.
Production systems need both correctness and dependability.
Like choosing a reliable car over a faster one that breaks down often.
Quick Reference
| Concept | One-liner |
|---|---|
| RSFT | Fine-tuning on filtered best outputs |
| Model Steerability | Adjusting behavior at inference time |
| LSTM | Gated memory for sequence modeling |
| More Data Beats Better Models | Data scale trumps architecture tweaks |
| System Reliability vs Quality | Balancing accuracy with uptime |
Short, accurate ML explainers. Follow for more.
-
1393 words • 7 min read • Abstract
Many-Eyes Learning: Intrinsic Rewards and Diversity

In Part 1, we demonstrated that multiple scouts dramatically improve learning in sparse-reward environments. Five scouts achieved 60% success where a single scout achieved 0%.
This post explores how scouts explore: intrinsic rewards that drive novelty-seeking behavior, and what happens when you mix different exploration strategies.
| Resource | Link |
|---|---|
| Code | many-eyes-learning |
| Part 1 | Solving Sparse Rewards with Many Eyes |
| Video | Many-Eyes Learning: Watch AI Scouts Explore |
Recap: The Many-Eyes Architecture
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Scout 1   │   │   Scout 2   │   │   Scout N   │
│ (strategy A)│   │ (strategy B)│   │ (strategy N)│
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       v                 v                 v
┌─────────────────────────────────────────────────┐
│               Experience Buffer                 │
└─────────────────────────────────────────────────┘
                         │
                         v
┌─────────────────────────────────────────────────┐
│                Shared Learner                   │
└─────────────────────────────────────────────────┘
```

Scouts are information gatherers, not independent learners. They explore with different strategies, pool their discoveries, and a shared learner benefits from the combined experience.
New Scout Strategies
CuriousScout: Count-Based Novelty
IRPO formalizes intrinsic rewards as the mechanism that drives scout exploration. CuriousScout implements count-based curiosity:
```python
from collections import defaultdict
from math import sqrt

class CuriousScout(Scout):
    def __init__(self, bonus_scale: float = 1.0):
        self.state_counts = defaultdict(int)
        self.bonus_scale = bonus_scale

    def intrinsic_reward(self, state):
        count = self.state_counts[state]
        return self.bonus_scale / sqrt(count + 1)
```

How it works:

- Track how many times each state has been visited
- Reward = `bonus_scale / √(count + 1)`
- Novel states get high rewards; familiar states get diminishing returns
The intuition: A curious scout is drawn to unexplored territory. The first visit to a state is exciting (reward = 1.0). The fourth visit is mundane (reward = 0.5). This creates natural pressure to explore widely.
OptimisticScout: Optimism Under Uncertainty
A different philosophy: assume unknown states are valuable until proven otherwise.
```python
class OptimisticScout(Scout):
    def __init__(self, optimism: float = 10.0):
        self.optimism = optimism

    def initial_q_value(self):
        return self.optimism  # Instead of 0
```

How it works:
- Initialize all Q-values to a high value (e.g., 10.0)
- The agent is “optimistic” about unvisited state-action pairs
- As it explores and receives actual rewards, Q-values decay toward reality
The intuition: If you’ve never tried something, assume it might be great. This drives exploration without explicit novelty bonuses.
Strategy Comparison
| Strategy | Mechanism | Best For |
|---|---|---|
| Random | Uniform random actions | Baseline, maximum coverage |
| Epsilon-Greedy | Random with probability ε, greedy otherwise | Balancing exploit/explore |
| CuriousScout | Novelty bonus for unvisited states | Systematic coverage |
| OptimisticScout | High initial Q-values | Early exploration pressure |

The Diversity Experiment
Does mixing strategies help, or is it enough to have multiple scouts with the same good strategy?
Setup
- 7x7 sparse grid, 100 training episodes
- All configurations use exactly 5 scouts (fair comparison)
- 5 random seeds for statistical significance
Configurations
- Homogeneous Random: 5 identical random scouts
- Homogeneous Epsilon: 5 identical epsilon-greedy scouts (ε=0.2)
- Diverse Mix: Random + 2 epsilon-greedy (ε=0.1, 0.3) + CuriousScout + OptimisticScout
Results
| Configuration | Success Rate |
|---|---|
| Random baseline | 7% |
| Homogeneous random | 20% |
| Homogeneous epsilon | 40% |
| Diverse mix | 40% |

Analysis
Finding: Strategy quality matters more than diversity in simple environments.
- Epsilon-greedy (homogeneous or mixed) outperforms pure random
- Diverse mix performs the same as homogeneous epsilon-greedy
- Having 5 good scouts beats having 5 diverse but weaker scouts
Why doesn’t diversity help here?
In a simple 7x7 grid, the exploration problem is primarily about coverage, not strategy complementarity. Five epsilon-greedy scouts with different random seeds already explore different regions due to stochastic action selection.
Diversity likely provides more benefit in:
- Complex environments with multiple local optima
- Tasks requiring different behavioral modes
- Environments with deceptive reward structures
Web Visualization
The web visualization demonstrates Many-Eyes Learning with real-time parallel scout movement. (The upcoming video walks through this demo—the post focuses on the underlying mechanism.)

How It Works
The web version uses Q-learning with a shared Q-table (simpler than DQN for clarity). All scouts contribute to the same Q-table—the core “many eyes” concept: more explorers = faster Q-value convergence.
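The shared-table mechanism can be sketched as standard tabular Q-learning where every scout writes into the same table. The grid coordinates, reward, and hyperparameters below are illustrative, not taken from the project's code.

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step; every scout calls this on the SAME table."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)   # one table shared by all scouts

# Two scouts independently discover the same goal transition:
q_update(Q, s=(2, 2), a="right", r=1.0, s_next="goal")   # scout 1's experience
q_update(Q, s=(2, 2), a="right", r=1.0, s_next="goal")   # scout 2 reinforces it

assert Q[((2, 2), "right")] == 0.75   # 0.5 after one update, 0.75 after two
```

Because all scouts feed one table, each discovery immediately benefits every scout: more explorers means faster Q-value convergence.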
| Scout | Role | Epsilon | Behavior |
|---|---|---|---|
| Random | Baseline | 1.0 (constant) | Always random, never follows policy |
| Scouts 1-N | Learning Agents | 0.5-0.8 → 0.01 | Epsilon-greedy with decay |

Exploration Modes
The UI provides a dropdown to select different exploration strategies:
| Mode | Heatmap Diversity | Learning Performance |
|---|---|---|
| Shared Policy | Low (identical paths) | Best (lowest avg steps) |
| Diverse Paths | High (distinct paths) | Worse (biases override optimal) |
| High Exploration | High | Worst (never fully exploits) |
| Boltzmann | Moderate | Moderate |

The Diversity vs Performance Trade-off
There’s a fundamental trade-off between visual diversity and learning performance:
- Shared Policy wins on performance: The “many eyes” benefit comes from diverse exploration during learning (finding the goal faster). But once Q-values converge, all scouts should follow the same optimal policy.
- Diverse Paths sacrifices performance for visuals: Scout-specific directional biases (Scout 1 prefers right, Scout 2 prefers down) create visually interesting heatmaps but suboptimal behavior.
- High Exploration never converges: Fixed 50% random actions means scouts never fully exploit the learned policy.
Key insight: For best learning, use Shared Policy. Use other modes to visualize how different exploration strategies affect the learning process, but expect higher average steps.
Learning Phases
| Phase | Episodes | Avg Steps | Behavior |
|---|---|---|---|
| Random | 1-5 | ~70 | All scouts exploring randomly |
| Early Learning | 5-15 | 40-60 | Policy starts forming |
| Convergence | 15-30 | 15-25 | Clear optimal path emerges |
| Stable | 30+ | 12-18 | Near-optimal with random scout noise |

Why “Average Steps to Goal”?
Success rate is coarse-grained—with 5 scouts, only 6 values are possible (0%, 20%, 40%, 60%, 80%, 100%). After ~10 episodes, all scouts typically reach the goal. Average steps shows continued policy refinement, dropping from ~70 (random) to ~8 (optimal).
Running the Visualization
./scripts/serve.sh  # Open http://localhost:3200

- Yew/WASM frontend with FastAPI backend
- Speed control from 1x to 100x
- Replay mode to step through recorded training
What’s Next
Potential future directions:
| Direction | Why It Matters |
|---|---|
| Larger environments | Test scaling to 15x15, 25x25 grids |
| Scout communication | Real-time sharing vs passive pooling |
| Adaptive intrinsic rewards | Learn the reward function (closer to full IRPO) |
| Multi-goal environments | Multiple sparse rewards to discover |

Key Takeaways
- Intrinsic rewards drive exploration. CuriousScout and OptimisticScout implement different philosophies: novelty bonuses vs optimistic initialization.
- Strategy quality > diversity in simple environments. Five good scouts beat five diverse but weaker scouts.
- Diversity during learning, convergence after. The “many eyes” benefit comes from diverse exploration during learning. Once Q-values converge, all scouts should follow the same optimal policy.
- Shared Q-table enables collective learning. All scouts contribute to one Q-table—more explorers means faster convergence.
- Visual diversity costs performance. Modes like “Diverse Paths” create interesting heatmaps but suboptimal behavior. Use “Shared Policy” for best learning results.
References
| Concept | Paper |
|---|---|
| IRPO | Intrinsic Reward Policy Optimization (Cho & Tran 2026) |
| Reagent | Reasoning Reward Models for Agents (Fan et al. 2026) |
| ICM | Curiosity-driven Exploration (Pathak et al. 2017) |
Diverse exploration, convergent execution. Many eyes see more, but the best path is the one they all agree on.
-
592 words • 3 min read • Abstract
How AI Learns Part 1: The Many Meanings of Learning

When people say, “AI learned something,” they usually mean one of at least four very different things.
Large Language Models (LLMs) do not learn in one single way. They learn at different time scales, in different locations, and with very different consequences. To understand modern AI systems—especially agents—we need to separate these layers.
Resource Link Related ICL Revisited | RLM | Engram

Four Time Scales of Learning
Learning happens at different layers with different persistence and speed.

1. Pretraining (Years)
This is the foundation.
The model trains on massive datasets using gradient descent. The result is a set of weights—billions of parameters—encoding statistical structure of language and knowledge.
This learning:
- Is slow and expensive
- Persists across restarts
- Cannot easily be reversed
- Is vulnerable to interference if modified later
Think of this as long-term biological memory.
2. Fine-Tuning (Days to Weeks)
Fine-tuning modifies the weights further, but with narrower data.
This includes:
- Instruction tuning (following directions)
- Alignment methods (Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO))
- Domain adaptation
- Parameter-efficient methods like Low-Rank Adaptation (LoRA)
This is still weight-based learning.
It persists across restarts. It risks catastrophic forgetting. It modifies the brain itself.
3. Memory-Based Learning (Seconds to Minutes)
This is where many modern systems shift.
Instead of changing weights, they store information externally:
- RAG (Retrieval-Augmented Generation)
- CAG (Cache-Augmented Generation)
- Vector databases
- Engram-style memory modules
The model retrieves relevant memory per query.
The brain stays stable. The notebook grows.
This learning:
- Persists across restarts
- Survives model upgrades
- Does not cause forgetting
- Is fast
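The notebook-vs-brain distinction can be made concrete with a toy sketch (all names are illustrative, and the word-overlap retrieval rule stands in for the embedding similarity real systems use):

```python
# Toy sketch of memory-based learning: the model's "weights" (the functions
# below) never change; only the external store grows across sessions.
memory = []  # external store: survives restarts and model upgrades

def remember(fact):
    memory.append(fact)

def retrieve(query, k=1):
    """Return the k stored facts sharing the most words with the query."""
    words = set(query.lower().split())
    return sorted(memory,
                  key=lambda f: len(words & set(f.lower().split())),
                  reverse=True)[:k]

remember("The deploy script lives in scripts/deploy.sh")
remember("Paris is the capital of France")
print(retrieve("capital of France"))  # → ['Paris is the capital of France']
```

Adding a new fact is an append, not a gradient update — which is why this kind of learning cannot cause catastrophic forgetting.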
4. In-Context Learning (Milliseconds)
This is temporary reasoning scaffolding.
Information exists only in the prompt window.
It:
- Does not update weights
- Does not persist across sessions
- Is powerful but fragile
- Suffers from context rot
This is working memory.
Why This Matters
Most discussions collapse all of this into “the model learned.”
But:
- Updating weights risks forgetting
- Updating memory does not
- Updating prompts does not persist
- Updating adapters can be modular and reversible
Continuous learning systems must coordinate all four.
Persistence Comparison
| Mechanism | Persists Across Chat? | Persists Across Restart? | Persists Across Model Change? |
|---|---|---|---|
| Pretraining | Yes | Yes | No |
| Fine-tune | Yes | Yes | No |
| LoRA | Yes | Yes | Usually |
| Distillation | Yes | Yes | No |
| ICL | No | No | No |
| RAG | Yes | Yes | Yes |
| Engram | Yes | Yes | Yes |
| CAG | Yes | Yes | Yes |

That last column is subtle but powerful for agents.
References
| Concept | Paper |
|---|---|
| LoRA | LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021) |
| RAG | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020) |
| ICL | What Can Transformers Learn In-Context? (Garg et al. 2022) |
| Engram | Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025) |
| DPO | Direct Preference Optimization (Rafailov et al. 2023) |

Coming Next
In Part 2, we’ll examine the two fundamental failure modes that arise from confusing these layers: catastrophic forgetting and context rot.
Learning happens in layers of permanence.
-
1173 words • 6 min read • Abstract
music-pipe-rs: Unix Pipelines for MIDI Composition

After building midi-cli-rs for quick mood-based generation, I wanted something more surgical. What if music generation worked like Unix commands—small tools connected by pipes?
Resource Link Code music-pipe-rs Related midi-cli-rs Next Web Demo and Multi-Instrument

The Unix Philosophy for Music
Most generative music tools are monolithic. You get one application with a closed workflow. If you want to inspect intermediate results, you can’t. If you want to swap one transformation for another, you rebuild everything.
Unix solved this decades ago: small tools that do one thing well, connected by pipes. Each tool reads from stdin, writes to stdout. You can inspect any point in the pipeline with head, filter with grep, transform with jq. music-pipe-rs applies this philosophy to MIDI composition.
A Pipeline in Action
seed 12345 | motif --notes 16 --bpm 120 | humanize | to-midi --out melody.mid

Four stages:
- seed establishes the random seed for the entire pipeline
- motif generates a melodic pattern (using the pipeline seed)
- humanize adds timing and velocity variation (using the same seed)
- to-midi converts the event stream to a standard .mid file
The output plays in any DAW.
Seed-First Architecture
The seed stage goes at the head of the pipeline:

# Explicit seed for reproducibility
seed 12345 | motif --notes 16 | humanize | to-midi --out melody.mid

# Auto-generated seed (printed to stderr)
seed | motif --notes 16 | humanize | to-midi --out melody.mid
# stderr: seed: 1708732845

All downstream stages read the seed from the event stream. No --seed arguments scattered across the pipeline. One seed, set once, used everywhere. This means:
- Same seed = identical output across all random stages
- Different seed = different composition with same structure
- Reproducibility is trivial: just save the seed number
JSONL: The Intermediate Format
Between stages, events flow as JSONL (JSON Lines). Each line is a complete event:
{"type":"Seed","seed":12345}
{"type":"NoteOn","t":0,"ch":0,"key":60,"vel":80}
{"type":"NoteOff","t":480,"ch":0,"key":60}

This format is human-readable and tool-friendly:
# See the first 10 events
seed 42 | motif --notes 8 | head -10

# Count how many NoteOn events
seed 42 | motif --notes 16 | grep NoteOn | wc -l

# Pretty-print with jq
seed 42 | motif --notes 4 | jq .

No binary formats to decode. No proprietary protocols. Just text.
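Because each line is a standalone JSON object, any language can consume the stream directly. A minimal Python sketch (hypothetical helper, not part of music-pipe-rs; event field names are taken from the example above):

```python
import json

def note_on_count(lines):
    """Count NoteOn events in a JSONL event stream (one JSON object per line)."""
    return sum(1 for line in lines
               if line.strip() and json.loads(line).get("type") == "NoteOn")

# Events in the format shown above:
stream = [
    '{"type":"Seed","seed":12345}',
    '{"type":"NoteOn","t":0,"ch":0,"key":60,"vel":80}',
    '{"type":"NoteOff","t":480,"ch":0,"key":60}',
]
print(note_on_count(stream))  # → 1
```

The same function works on a file handle or on `sys.stdin`, so it can sit at the end of a pipeline just like `grep` or `jq`.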
Visualization with viz
The viz stage prints a sparkline to stderr while passing events through:

seed 12345 | motif --notes 16 | viz | humanize | to-midi --out melody.mid

Output on stderr:
▃▅▇▅▃▁▂▄▆▇▆▄▂▁▃▅For more detail, use piano roll mode:
seed 12345 | motif --notes 16 | viz --roll

G6  │···█············│
F#6 │·····█··········│
F6  │····█···········│
G5  │·██·········█···│
F5  │···········█····│
E5  │·········██···█·│
C5  │█·····███····█·█│

The visualization goes to stderr; the JSONL events continue to stdout. You can inspect the music without breaking the pipeline.
Available Stages
| Stage | Type | Description |
|---|---|---|
| seed | Start | Establish random seed for pipeline |
| motif | Generate | Create melodic patterns |
| euclid | Generate | Euclidean rhythm generation |
| transpose | Transform | Shift notes by semitones |
| scale | Transform | Constrain notes to a scale |
| humanize | Transform | Add timing/velocity variation |
| viz | Inspect | Print sparkline visualization |
| to-midi | Output | Convert to .mid file |

Each stage is a separate binary. Mix and match as needed.
Euclidean Rhythms
The euclid stage generates Euclidean rhythms—mathematically optimal distributions of hits across steps:

# 3 hits distributed across 8 steps (Cuban tresillo)
seed | euclid --pulses 3 --steps 8 --note 36 | to-midi --out kick.mid

# 4-on-the-floor kick pattern
seed | euclid --pulses 4 --steps 16 --note 36 | to-midi --out four-floor.mid

These patterns appear in music worldwide because they “feel right”—the spacing is as even as possible.
Scale Locking
The scale stage constrains notes to a musical scale:

seed 42 | motif --notes 16 | scale --root C --mode minor | to-midi --out c-minor.mid

No wrong notes. Every pitch fits the harmonic context.
Layering Streams
Generate drum and melody separately, then combine:
{ seed 100 | euclid --pulses 4 --steps 16 --note 36 --ch 9; seed 100 | motif --notes 16 | scale --root C --mode pentatonic; } | to-midi --out layered.mid

Channel 9 is General MIDI drums. Same seed ensures coherence between parts. Both streams merge into a single MIDI file.
Why Not Just Use midi-cli-rs?
Different tools for different needs:
| Tool | Strength | Use Case |
|---|---|---|
| midi-cli-rs | Quick mood presets | “Give me 5 seconds of jazz” |
| music-pipe-rs | Compositional control | “Generate a motif, constrain to scale, add swing” |

midi-cli-rs is high-level: pick a mood, get music. music-pipe-rs is low-level: build compositions from primitive operations.
Both are useful. Both work with AI coding agents.
The Personal Software Pattern
This continues the theme: build small tools that compose well. Don’t try to solve everything in one application. Let Unix handle orchestration.
The best part? Standard tools still work. head, grep, jq, wc—all participate in the pipeline. No special music knowledge required to inspect the data.
Series: Personal Software (Part 4) Previous: midi-cli-rs: Custom Mood Packs Next: music-pipe-rs: Web Demo

Disclaimer
You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.
Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.
-
447 words • 3 min read • Abstract
Five ML Concepts - #21

5 machine learning concepts. Under 30 seconds each.
Resource Link Papers Links in References section Video Five ML Concepts #21
References
| Concept | Reference |
|---|---|
| Prompt Injection | Prompt Injection attack against LLM-integrated Applications (Liu et al. 2023) |
| Jailbreaks | Jailbroken: How Does LLM Safety Training Fail? (Wei et al. 2023) |
| GRU | Empirical Evaluation of Gated Recurrent Neural Networks (Chung et al. 2014) |
| Planning vs Prediction | Between accurate prediction and poor decision making (Zaffalon et al. 2023) |
| Production Rollbacks | MLOps best practice (no canonical paper) |

Today’s Five
1. Prompt Injection
Malicious instructions embedded in user input that override intended system behavior. An attacker crafts text that tricks an AI into ignoring its original instructions.
This is a major security concern for LLM-integrated applications.
Like slipping a forged instruction into a trusted document.
2. Jailbreaks
Techniques that attempt to bypass safety constraints in AI systems. These attacks exploit gaps between a model’s capabilities and its safety training.
Safety training can fail due to competing objectives or mismatched generalization.
Like convincing a guard to bend the rules.
3. GRU (Gated Recurrent Unit)
A recurrent neural network unit with gates that control memory flow. GRUs decide what information to keep and what to discard at each time step.
Simpler than LSTM but designed for similar sequence modeling tasks.
Like a notepad where you decide what to keep and what to erase.
4. Planning vs Prediction
Prediction forecasts likely outcomes. Planning evaluates actions across possible futures. Accurate predictions don’t guarantee good decisions—you also need to model how actions affect outcomes.
This is a key gap in many AI/ML systems.
Like knowing it will rain versus deciding whether to bring an umbrella.
5. Production Rollbacks
Reverting to a previous stable model version after deployment issues. When a new model causes problems in production, rolling back quickly minimizes impact.
Essential MLOps practice for maintaining system reliability.
Like reloading a saved game state when something breaks.
Quick Reference
| Concept | One-liner |
|---|---|
| Prompt Injection | Malicious instructions overriding AI behavior |
| Jailbreaks | Bypassing safety constraints |
| GRU | Gated memory for sequence modeling |
| Planning vs Prediction | Action evaluation vs forecasting |
| Production Rollbacks | Reverting to stable model versions |
Short, accurate ML explainers. Follow for more.
-
1300 words • 7 min read • Abstract
midi-cli-rs: Extending with Custom Mood Packs

Personal Software doesn’t stop at “it works.” It evolves. After building midi-cli-rs for AI agents to generate music, I wanted more moods—without recompiling Rust every time.
The solution: a plugin system that lets anyone create custom mood packs using simple TOML files.
Resource Link Examples Listen to Samples Wiki Plugin Documentation Video Plugin Moods Explainer Code midi-cli-rs

The Problem with Built-in Only
The original midi-cli-rs shipped with a handful of mood presets: suspense, eerie, upbeat, calm, ambient, jazz. Useful, but limited. What if you want synthwave? Chillout? Something faster or in a different key?
Hardcoding every possible mood isn’t practical. And asking users to modify Rust source code isn’t friendly.
Three Levels of Extensibility
| Level | What You Get | What You Change | Skill Required |
|---|---|---|---|
| ✓ Built-in Moods | 9 curated generators | Nothing—use as-is | None |
| ✓ Plugin Moods | Parameter variations | TOML config files | Text editing |
| ✗ Custom Generators (future) | New musical patterns | Rust source code | Programming |

This post covers Plugin Moods—the middle tier. You can preset combinations of tempo, key, and intensity, but you’re still using the built-in generators’ musical logic. Want a “smooth-jazz” preset (slower, mellower)? Plugin mood. Want bebop or Latin jazz with different chord progressions? That requires a custom generator.
Custom generators (writing new Rust code) will be covered in a future post when the plugin editor ships.
The Plugin Architecture
Custom moods live in ~/.midi-cli-rs/moods/ as TOML files. Each file is a “mood pack” that can define multiple moods. The CLI discovers them automatically.

Here’s how it works:
~/.midi-cli-rs/
└── moods/
    ├── electronic.toml   # Your synthwave, techno, etc.
    ├── cinematic.toml    # Epic, tension, wonder
    └── seasonal.toml     # Holiday themes

Creating a Mood Pack
A plug-in mood pack has two parts: pack metadata and mood definitions.
[pack]
name = "electronic"
version = "1.0.0"
author = "Your Name"
description = "Electronic music styles"

[[moods]]
name = "synthwave"
base_mood = "upbeat"
default_tempo = 118
default_key = "Am"
default_intensity = 65
description = "80s synthwave vibes"
tags = ["electronic", "retro"]

[[moods]]
name = "chillout"
base_mood = "ambient"
default_tempo = 85
default_key = "Em"
default_intensity = 40
description = "Relaxed electronic"

Each mood delegates to a built-in generator (base_mood) but overrides specific parameters. You get the musical logic of the built-in mood with your customizations applied.

Available Base Moods
Your custom moods can extend any of the nine built-in generators:
| Base Mood | Character |
|---|---|
| suspense | Tense, building |
| eerie | Dark, unsettling |
| upbeat | Energetic, positive |
| calm | Peaceful, slow |
| ambient | Atmospheric, textural |
| jazz | Swing, improvisation |
| chiptune | 8-bit, retro gaming |
| orchestral | Classical instruments |
| show | Broadway, theatrical |

Configuration Options
Each mood definition supports these overrides:
| Field | Description | Example |
|---|---|---|
| name | CLI name (required) | "synthwave" |
| base_mood | Built-in to extend (required) | "upbeat" |
| default_tempo | BPM | 118 |
| default_key | Musical key | "Am", "C", "Eb" |
| default_intensity | 0-100 energy level | 65 |
| description | Human-readable description | "80s vibes" |
| tags | Discovery tags | ["electronic", "retro"] |

How Seeds Create Variation
Seeds aren’t random—they’re deterministic variation selectors. The same mood + same seed always produces identical output. But different seeds create observable musical differences across multiple dimensions:
| Parameter | Variation Range |
|---|---|
| Tempo | ±15% from base |
| Layer inclusion | Which instruments appear |
| Melodic contour | 16 different phrase shapes |
| Note density | 0.6x to 1.4x |
| Rest probability | 0% to 35% silence |
| Phrase length | 3-8 notes |
| Velocity | -15 to +15 offset |

The system uses hash-based mixing with unique salts for each parameter. This means adjacent seeds (42 vs 43) produce completely different outputs—no gradual transitions between seeds.
When you combine plugin moods with seed variation, you get a matrix: your custom tempo/key/intensity settings applied across different seed-driven variations of the underlying generator’s patterns.
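Hash-based mixing with per-parameter salts can be sketched like this (illustrative; the actual hash function and ranges in midi-cli-rs may differ):

```python
import hashlib

def mixed(seed, salt):
    """Map (seed, salt) to a deterministic float in [0, 1) via a salted hash."""
    digest = hashlib.sha256(f"{seed}:{salt}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def tempo(seed, base_bpm):
    """Vary tempo ±15% from the base, deterministically per seed."""
    return base_bpm * (0.85 + 0.30 * mixed(seed, "tempo"))

# Same seed always yields the same tempo; seeds 42 and 43 share nothing,
# because the hash scrambles any relationship between adjacent inputs.
print(tempo(42, 120) == tempo(42, 120))  # → True
```

Each parameter gets its own salt ("tempo", "density", …), so the parameters vary independently even though they all derive from the single pipeline seed.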
Using Custom Moods
Once your TOML file is in place, the mood appears automatically:
# List all moods (built-in + plugins)
midi-cli-rs moods

# Generate with your custom mood
midi-cli-rs preset -m synthwave -d 5 -s 42 -o output.wav

The seed system still works—same mood + same seed = identical output.
Example: Electronic Pack
Here’s a complete pack with four electronic moods:
[pack]
name = "electronic"
version = "1.0.0"
description = "Electronic music styles"

[[moods]]
name = "synthwave"
base_mood = "upbeat"
default_tempo = 118
default_key = "Am"
default_intensity = 65

[[moods]]
name = "chillout"
base_mood = "ambient"
default_tempo = 85
default_key = "Em"
default_intensity = 40

[[moods]]
name = "techno"
base_mood = "upbeat"
default_tempo = 130
default_key = "Dm"
default_intensity = 85

[[moods]]
name = "8bit"
base_mood = "chiptune"
default_tempo = 140
default_key = "C"
default_intensity = 70

Drop this in ~/.midi-cli-rs/moods/electronic.toml and you have four new moods.

What’s Next
This plugin system handles mood variations—different tempos, keys, and intensities applied to existing generators. A future update will add a plugin editor for creating entirely new musical patterns without writing Rust.
For now, the delegation model covers most use cases: want faster jazz? Darker ambient? Major-key suspense? Create a TOML file and you’re done.
The Personal Software Pattern
This follows the Personal Software philosophy: start with something that works, then extend it as needs emerge. The plugin system wasn’t in the original design. It grew from actual use—wanting more moods without recompiling.
Good personal software leaves room to grow.
Series: Personal Software (Part 3) Previous: midi-cli-rs: Music for AI Agents Next: music-pipe-rs: Unix Pipelines

Disclaimer
You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.
Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.
-
456 words • 3 min read • Abstract
Five ML Concepts - #20

5 machine learning concepts. Under 30 seconds each.
Resource Link Papers Links in References section Video Five ML Concepts #20
References
| Concept | Reference |
|---|---|
| VAEs | Auto-Encoding Variational Bayes (Kingma & Welling 2013) |
| Uncertainty Estimation | What Uncertainties Do We Need in Bayesian Deep Learning? (Kendall & Gal 2017) |
| Interpretability | Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez & Kim 2017) |
| Gradient Noise | Stochastic Gradient Descent as Approximate Bayesian Inference (Mandt et al. 2017) |
| Human-in-the-Loop | Human-in-the-Loop Machine Learning (Monarch 2021) |

Today’s Five
1. Variational Autoencoders (VAEs)
VAEs are probabilistic autoencoders that learn a structured latent distribution. By sampling from that distribution, they can generate new examples similar to the training data.
The key innovation is regularizing the latent space to be smooth and continuous.
Like learning not just to summarize books, but to create new ones in a similar style.
2. Uncertainty Estimation
Models can estimate how confident they should be in predictions. Some uncertainty comes from noisy data (aleatoric), and some from limited knowledge (epistemic).
Knowing when a model is uncertain enables safer decision-making.
Like a weather forecast giving seventy percent chance of rain instead of a simple yes or no.
3. Why Interpretability Is Hard
Neural networks represent information across many interacting parameters. No single component cleanly maps to a human concept.
Distributed representations enable powerful learning but resist simple explanations.
Like trying to explain a dream by pointing to individual neurons.
4. Gradient Noise
When training with mini-batches, gradients vary from step to step. A little noise can help exploration, but too much can slow convergence.
Batch size, learning rate, and gradient clipping all influence this noise level.
Like getting slightly different directions each time you ask for help.
5. Human-in-the-Loop Systems
Humans review, supervise, or override model decisions in critical workflows. This improves safety and accountability in high-stakes applications.
The approach combines model efficiency with human judgment where it matters most.
Like a pilot monitoring autopilot and stepping in when necessary.
Quick Reference
| Concept | One-liner |
|---|---|
| VAEs | Generative models with structured latent spaces |
| Uncertainty Estimation | Know when you don’t know |
| Interpretability | Distributed representations resist explanation |
| Gradient Noise | Mini-batch variation in training |
| Human-in-the-Loop | Human oversight for critical decisions |
Short, accurate ML explainers. Follow for more.
-
643 words • 4 min read • Abstract
In-Context Learning Revisited: From Mystery to Engineering

It was 2020 when GPT-3 shocked everyone. It could learn from examples in the query—without updating its weights. We called it In-Context Learning. But was it magic, or was it doing something deeper?
Resource Link Video ICL Revisited Papers 4 References

Phase 1: The Empirical Discovery (2020)
The GPT-3 paper showed that large models could perform few-shot learning. Give them examples, and they generalize. No gradient updates. No retraining. Just forward passes.
The surprising part was that scaling alone seemed to unlock it.
Paper: Language Models are Few-Shot Learners
ELI5: Show a big language model a few examples of a task in your prompt, and it figures out how to do the task—without any retraining. Nobody told it to do this. It just emerged when models got big enough.
Main idea: Scale unlocks emergent capabilities. ICL was discovered, not designed.
Phase 2: Mechanistic Explanations (2022)
By 2022, researchers began probing the internal mechanisms. Several papers proposed that transformers implement implicit meta-learning. The model appears to learn during inference by performing gradient-descent-like operations internally.
Paper: What Explains In-Context Learning in Transformers?
ELI5: When you give a transformer examples, its attention layers do something that looks like fitting a simple linear model to those examples—on the fly, during the forward pass. It’s not memorizing; it’s computing a mini-solution.
Main idea: ICL works because attention can simulate linear regression internally.
Paper: Transformers Learn In-Context by Gradient Descent
ELI5: The transformer’s forward pass is secretly doing something similar to training. The attention mechanism acts like one step of gradient descent over the examples you provided. Learning happens inside inference.
Main idea: ICL is implicit gradient descent—learning hidden inside prediction.
Phase 3: Engineering the Effect
Once researchers understood that ordering and structure affect ICL, prompt design became less of an art and more of an optimization problem. The quality and arrangement of demonstrations directly shape performance.
ICL became tunable. Researchers could now deliberately improve it rather than just observe it.
Phase 4: Interactive ICL (2026)
Recent work pushes this further. Models are trained to predict natural language critiques and feedback. If a model can predict what a teacher would say, it can internalize that signal. External correction becomes an internal capability.
Paper: Improving Interactive In-Context Learning from Natural Language Feedback
ELI5: Train a model to guess what feedback a human would give. Now the model has internalized the “teacher” and can improve itself without needing the actual teacher present. Self-correction without weight updates.
Main idea: Models can learn to learn from feedback, making ICL interactive and self-improving.
Beyond Language
Newer work applies ICL to neuroscience discovery, showing that the mechanism is not limited to text tasks. It becomes a flexible reasoning substrate across domains. That’s when you know a concept has matured.
The Arc
| Phase | Era | Key Insight |
|---|---|---|
| Discovery | 2020 | Emerges from scale |
| Explanation | 2022 | Implicit gradient descent |
| Engineering | 2023-24 | Prompt design as optimization |
| Self-improvement | 2026 | Learning from feedback |

The Deeper Insight
In-Context Learning started as an emergent surprise. Now it’s becoming an engineered learning substrate inside transformers.
It was not magic. It was meta-learning hiding in plain sight.
References
| Paper | Link |
|---|---|
| Language Models are Few-Shot Learners (GPT-3) | arXiv:2005.14165 |
| What Explains In-Context Learning in Transformers? | arXiv:2202.12837 |
| Transformers Learn In-Context by Gradient Descent | arXiv:2212.07677 |
| Improving Interactive ICL from Natural Language Feedback | arXiv:2602.16066 |
ICL started as “whoa, it works.” Now we understand “why it works.” Next: engineering it deliberately.
-
451 words • 3 min read • Abstract
Five ML Concepts - #19

5 machine learning concepts. Under 30 seconds each.
Resource Link Papers Links in References section Video Five ML Concepts #19
References
| Concept | Reference |
|---|---|
| Autoencoders | Reducing the Dimensionality of Data with Neural Networks (Hinton & Salakhutdinov 2006) |
| Correlation vs Causation | Causality (Pearl 2009) |
| Curriculum Learning | Curriculum Learning (Bengio et al. 2009) |
| Failure Analysis | Practical Machine Learning for Computer Vision (Lakshmanan et al. 2021) |
| Covariate Shift | Dataset Shift in Machine Learning (Quinonero-Candela et al. 2009) |

Today’s Five
1. Autoencoders
Autoencoders are neural networks trained to compress inputs into a smaller representation and reconstruct them. The bottleneck forces the model to capture essential structure.
This learned compression is useful for dimensionality reduction, denoising, and feature learning.
Like summarizing a book into key points and then rebuilding the story from that summary.
2. Correlation vs Causation
Two variables can move together without one causing the other. Models typically learn correlations present in data, not true cause-and-effect relationships.
This matters because interventions based on correlation alone may not produce intended effects.
Like noticing umbrella sales rise with rain—umbrellas don’t cause rain.
3. Curriculum Learning
Training starts with easier examples and gradually introduces harder ones. This can improve stability and learning speed in some settings.
The approach mirrors how humans learn complex subjects incrementally.
Like teaching math by starting with addition before moving to calculus.
4. Failure Analysis
Failure analysis groups model errors into categories to understand where performance breaks down. This helps target improvements instead of guessing.
Systematic error analysis often reveals actionable patterns invisible in aggregate metrics.
Like a teacher reviewing which types of questions students miss most often.
5. Covariate Shift
Covariate shift occurs when the input distribution changes between training and deployment, while the task itself remains the same. The model may underperform because it sees unfamiliar inputs.
Monitoring input distributions helps detect this shift early.
Like training a driver in sunny weather and testing them in snow.
Quick Reference
| Concept | One-liner |
|---|---|
| Autoencoders | Compress and reconstruct to learn structure |
| Correlation vs Causation | Co-occurrence isn’t cause |
| Curriculum Learning | Start easy, progress to hard |
| Failure Analysis | Categorize errors to guide fixes |
| Covariate Shift | New inputs, same task |
Short, accurate ML explainers. Follow for more.
-
2241 words • 12 min read • Abstract
JSON et al: A Deep Dive into Data Serialization Formats

JSON is everywhere. APIs. Logs. Databases. Configuration files. But it’s not alone. A whole ecosystem of formats exists—each optimizing for different tradeoffs.
This post expands on the JSON et al short, providing technical depth on each format: when it was created, where it’s specified, and what problems it solves.
The Tradeoff Triangle
Before diving in, understand the fundamental constraint. Data formats balance three competing goals:
| Goal | Description |
|---|---|
| Human Readability | Can a developer read and edit it directly? |
| Compactness | How many bytes does it take to represent data? |
| Query Performance | How fast can you access specific fields? |

You usually only get two. JSON optimizes readability. Protobuf optimizes compactness. JSONB optimizes query performance. No format wins everywhere.
JSON: The Ubiquitous Baseline
Created: 2001 (discovered/formalized by Douglas Crockford)
Specification: ECMA-404 (2013), RFC 8259 (2017)
File Extension: .json

JSON (JavaScript Object Notation) emerged from JavaScript’s object literal syntax but became language-agnostic. Crockford didn’t invent it—he “discovered” it already existing in JavaScript and formalized the specification.
Technical Details
- Encoding: UTF-8 text (UTF-16/32 allowed but rare)
- Data Types: Objects {}, arrays [], strings, numbers, booleans, null
- Comments: Not allowed in strict JSON
Strengths
- Universal parser support (every language has one)
- Human readable without tools
- Web-native (JavaScript parses it natively)
- Simple specification (fits on a business card)
Weaknesses
- Verbose (field names repeated for every object)
- No binary data type (must base64-encode)
- No comments (frustrating for config files)
- Parsing overhead (tokenization + string decoding every time)
ELI5
Like typing a long email instead of sending a terse text. Every message spells everything out—clear, but verbose.
When to Use
REST APIs, configuration (when comments aren’t needed), data interchange between systems, anywhere human readability matters more than efficiency.
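A quick illustration of the round trip and the verbosity weakness, using Python's standard `json` module (the records are made up):

```python
import json

# Round trip is lossless, but every record repeats its field names --
# the verbosity weakness noted above.
rows = [{"name": "Alice", "score": 95}, {"name": "Bob", "score": 87}]
text = json.dumps(rows)
assert json.loads(text) == rows
assert text.count('"name"') == len(rows)   # key spelled out per record
```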
JSONL / NDJSON: Streaming JSON
- Created: ~2013 (formalized)
- Specification: JSON Lines, NDJSON
- File Extension: `.jsonl`, `.ndjson`

JSONL (JSON Lines) and NDJSON (Newline-Delimited JSON) are the same concept: one valid JSON object per line, separated by newlines.
Technical Details
    {"name": "Alice", "score": 95}
    {"name": "Bob", "score": 87}
    {"name": "Carol", "score": 92}

No wrapping array. Each line is independently parseable.
Strengths
- Streaming: Process line-by-line without loading entire file
- Append-only: Add records without rewriting the file
- Parallel processing: Split by line, distribute to workers
- Fault-tolerant: One corrupt line doesn’t invalidate the file
Weaknesses
- Not valid JSON (can’t parse with standard JSON parser)
- Still text-based (same verbosity as JSON)
- No random access by index
ELI5
Like removing one comma per line to save some typing. Each line is self-contained, so you can grab and process them one at a time.
When to Use
Log files, big data pipelines (Spark, Pandas), ML datasets, event streams, anywhere you need to process records incrementally.
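The streaming property is easy to demonstrate with the standard library. Each line parses on its own, so records can be processed one at a time (the data here is illustrative):

```python
import io
import json

jsonl = '{"name": "Alice", "score": 95}\n{"name": "Bob", "score": 87}\n'

# Each line is a complete JSON document, so records stream one at a time --
# no wrapping array, no loading the whole file first.
records = [json.loads(line) for line in io.StringIO(jsonl) if line.strip()]
```

Note that `json.loads(jsonl)` on the whole string would fail, which is the "not valid JSON" weakness above.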
JSONB: Binary JSON for Databases
- Created: 2014 (PostgreSQL 9.4)
- Specification: Implementation-specific (no universal standard)
- Storage: Database column type
JSONB isn’t a file format—it’s a database storage optimization. PostgreSQL’s JSONB differs from MongoDB’s BSON, which differs from other implementations.
PostgreSQL JSONB Details
- Parsed once: Text converted to binary on INSERT
- Keys sorted: Deterministic ordering for indexing
- Duplicates removed: Last value wins
- Offset table: O(log n) field lookup instead of O(n) text scanning
MongoDB BSON
Specification: bsonspec.org
BSON (Binary JSON) is MongoDB’s serialization format. Unlike PostgreSQL’s JSONB, BSON is a standalone binary format:
- Type-prefixed values
- Supports additional types (Date, Binary, ObjectId)
- Length-prefixed for fast skipping
- ~10-15% smaller than JSON typically
Strengths
- Fast queries without re-parsing
- Indexable (GIN indexes on JSONB in PostgreSQL)
- Type coercion at storage time
Weaknesses
- Not portable (implementation-specific)
- Not human-readable
- INSERT overhead (parsing cost upfront)
ELI5
Instead of cooking from scratch every time, you heat a pre-made meal. The prep work happens once (on INSERT), so serving (queries) is fast.
When to Use
Database storage where you query into JSON structures. PostgreSQL JSONB + GIN indexes enable fast `@>` containment queries.
Protocol Buffers: Google’s Schema-First Format
- Created: 2001 (internal Google), 2008 (open-sourced)
- Specification: developers.google.com/protocol-buffers
- File Extension: `.proto` (schema), binary wire format

Protocol Buffers (Protobuf) is Google’s language-neutral, schema-required serialization format. It powers gRPC.
Technical Details
Schema definition:
    message Sensor {
      int32 temperature = 1;
      int32 humidity = 2;
    }

Wire format uses field numbers, not names:

    Field 1: 72
    Field 2: 40

Key Features
- Varint encoding: Small integers use fewer bytes
- Field numbers: Enable backward compatibility
- Code generation: `.proto` → language-specific classes
- No self-description: Receiver must know schema
Strengths
- Extremely compact (3-10x smaller than JSON typically)
- Fast serialization/deserialization
- Strong versioning semantics
- gRPC integration
Weaknesses
- Requires schema agreement
- Not human-readable
- Tooling required for debugging
- Schema evolution has rules
ELI5
Everyone agrees upfront what “field 1” means. You don’t waste space spelling out “temperature”—you just send the number 1 and the value. Both sides know the code.
When to Use
Microservices (gRPC), internal APIs, anywhere bandwidth and latency matter more than debuggability.
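To make the varint point concrete, here is a minimal Python sketch of protobuf-style base-128 varint encoding. It is illustrative only, not the official implementation, and it handles non-negative integers:

```python
def encode_varint(n: int) -> bytes:
    """Protobuf-style base-128 varint: 7 payload bits per byte,
    high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert encode_varint(72) == b"\x48"        # small values fit in one byte
assert encode_varint(300) == b"\xac\x02"   # larger values grow as needed
```

This is why small integers dominate the size savings: a field that is usually under 128 costs one byte on the wire, regardless of its declared `int32` type.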
ASN.1: The Telecom Veteran
- Created: 1984 (ITU-T X.208)
- Specification: ITU-T X.680-X.683
- Encoding Rules: BER, DER, PER, XER, and more
ASN.1 (Abstract Syntax Notation One) predates all modern formats. It defines both schema and encoding, with multiple encoding rules for different use cases.
Encoding Rules Comparison
| Rule | Use Case |
|---|---|
| BER (Basic Encoding Rules) | Flexible, general purpose |
| DER (Distinguished Encoding Rules) | Deterministic, for cryptography |
| PER (Packed Encoding Rules) | Most compact, for bandwidth-constrained links |
| XER (XML Encoding Rules) | XML-based, for interop |

Where You See ASN.1
- X.509 certificates (SSL/TLS certs are DER-encoded ASN.1)
- LDAP (directory services)
- SNMP (network management)
- Telecom protocols (SS7, GSM, LTE)
Strengths
- Bit-level precision
- Proven over 40 years
- Multiple encoding options
- Formal verification possible
Weaknesses
- Complex specification
- Steep learning curve
- Tooling can be expensive
- Security vulnerabilities in parsers (historically)
ELI5
Same idea as Protobuf—everyone agrees upfront what each field number means. ASN.1 just got there 20 years earlier and handles even more edge cases.
When to Use
You probably won’t choose ASN.1 for new projects. You’ll encounter it in cryptography, certificates, and legacy telecom systems.
YAML: Human-Friendly Configuration
- Created: 2001 (Clark Evans, Ingy döt Net, Oren Ben-Kiki)
- Specification: yaml.org/spec/1.2.2
- File Extension: `.yaml`, `.yml`

YAML (YAML Ain’t Markup Language) prioritizes human readability. It’s a superset of JSON—any valid JSON is valid YAML.
Technical Details
    # Comments allowed!
    server:
      host: localhost
      port: 8080
    features:
      - auth
      - logging

Key Features
- Indentation-based: Whitespace matters
- Comments: `#` for single-line
- Anchors/aliases: `&name` and `*name` for references
- Multiple documents: `---` separator
Strengths
- Highly readable
- Comments supported
- Multi-line strings without escaping
- Complex data structures
Weaknesses
- “Norway problem”: `NO` parses as boolean `false`
- Whitespace sensitivity causes errors
- Multiple ways to express same data
- Security concerns (arbitrary code execution in some parsers)
ELI5
Optimized for clarity, not bandwidth. YAML is for humans editing config files—not for machines exchanging data over networks.
When to Use
Configuration files (Kubernetes, Docker Compose, CI/CD), anywhere humans edit data directly and comments help.
TOML: Minimal Configuration
- Created: 2013 (Tom Preston-Werner)
- Specification: toml.io
- File Extension: `.toml`

TOML (Tom’s Obvious Minimal Language) emerged as a reaction to YAML’s complexity. It’s used by Rust (Cargo.toml), Python (pyproject.toml), and others.
Technical Details
    [server]
    host = "localhost"
    port = 8080

    [server.features]
    auth = true
    logging = true

Key Features
- Explicit typing: Dates, times, arrays have clear syntax
- Sections: `[section]` and `[section.subsection]`
- No anchors: Intentionally simpler than YAML
- Deterministic: Same data = same representation
Strengths
- Easy to read and write
- Unambiguous parsing
- Clear error messages
- Growing ecosystem support
Weaknesses
- Less expressive than YAML
- Nested structures can be verbose
- Smaller ecosystem than JSON/YAML
ELI5
Same goal as YAML—clarity for humans, not bandwidth for machines—but with stricter rules so you make fewer mistakes.
When to Use
Configuration files where YAML’s complexity isn’t needed. Rust projects (mandatory). Python packaging (pyproject.toml).
TOON: Token-Optimized for LLMs
- Created: October 2025 (toon-format organization)
- Specification: github.com/toon-format/toon (v3.0)
- File Extension: `.toon`
- Media Type: `text/toon` (provisional)

TOON (Token Oriented Object Notation) is the newest format in this list, designed specifically for LLM input. It’s a lossless representation of JSON that minimizes tokens.
Technical Details
TOON combines YAML-style indentation for nested objects with CSV-like tabular layouts for uniform arrays:
    users[2]{name,age}:
      Alice,25
      Bob,30

Equivalent JSON:

    {"users": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]}

Key Features
- Header-based: Field names declared once, values follow
- 40% fewer tokens: Than equivalent JSON typically
- Lossless: Round-trips to JSON perfectly
- UTF-8 always: No encoding ambiguity
Performance
| Metric | JSON | TOON |
|---|---|---|
| Accuracy | 69.7% | 73.9% |
| Efficiency (acc/1K tokens) | 15.3 | 26.9 |

Strengths
- Significant token savings at scale
- Better context window utilization
- Lower API costs for LLM applications
- Human-readable (unlike binary formats)
Weaknesses
- New format (October 2025)
- Limited tooling compared to JSON
- Requires conversion layer for existing systems
- Not yet widely adopted
ELI5
Like having one header row for each column in a table instead of repeating the column name for every single row. You declare field names once, then just list the values.
When to Use
LLM prompts with structured data, RAG applications, anywhere token efficiency matters. Especially useful for large datasets with uniform object arrays.
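To illustrate the header-based layout, here is a toy Python sketch of the tabular encoding for a uniform array. The `to_toon` function is an illustration only, not the official encoder, which also handles quoting, escaping, and nesting:

```python
def to_toon(key, rows):
    """Toy sketch of TOON's tabular layout for a uniform array of objects.
    Not the official encoder -- real TOON also handles quoting and nesting."""
    fields = list(rows[0])
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    body = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header] + body)

toon = to_toon("users", [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
```

The field names appear once in the header, which is where the token savings come from on large uniform arrays.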
Implementations
- TypeScript: Reference implementation
- Python: toons (Rust-based, fast)
- Go, Rust, .NET: Available via toon-format org
Alternatives Not in the Video
MessagePack
Created: 2008 (Sadayuki Furuhashi) Specification: msgpack.org
Binary JSON without schema. Type-prefixed values, efficient numeric encoding.
Use when: You want JSON semantics but smaller/faster.
CBOR
Created: 2013 (IETF) Specification: RFC 8949
Concise Binary Object Representation. Designed for constrained environments (IoT).
Use when: Resource-constrained devices, need a standard binary format.
Apache Avro
Created: 2009 (Apache, Doug Cutting) Specification: avro.apache.org
Schema-based, row-oriented binary format. Schema embedded or stored separately. Strong schema evolution support.
Use when: Big data pipelines (Hadoop, Kafka), schema evolution is critical.
Apache Parquet
Created: 2013 (Twitter + Cloudera) Specification: parquet.apache.org
Columnar storage format. Not for serialization—for analytics storage.
Use when: Large-scale analytics, data warehousing, Spark/Pandas workflows.
Cap’n Proto
Created: 2013 (Kenton Varda, ex-Protobuf author) Specification: capnproto.org
Zero-copy serialization. The serialized form is the in-memory form.
Use when: Extreme performance requirements, inter-process communication.
FlatBuffers
Created: 2014 (Google) Specification: google.github.io/flatbuffers
Zero-copy like Cap’n Proto but with better tooling. Used in games, mobile.
Use when: Games, mobile apps, anywhere memory allocation matters.
Quick Reference
| Format | Year | Schema | Binary | Human-Readable | Best For |
|---|---|---|---|---|---|
| JSON | 2001 | No | No | Yes | APIs, interchange |
| JSONL | 2013 | No | No | Yes | Logs, streaming |
| JSONB | 2014 | No | Yes | No | Database queries |
| Protobuf | 2008 | Yes | Yes | No | Microservices |
| ASN.1 | 1984 | Yes | Yes | No | Crypto, telecom |
| YAML | 2001 | No | No | Yes | Config files |
| TOML | 2013 | No | No | Yes | Simple config |
| TOON | 2025 | No | No | Yes | LLM prompts |
| MessagePack | 2008 | No | Yes | No | Fast JSON |
| CBOR | 2013 | Optional | Yes | No | IoT |
| Avro | 2009 | Yes | Yes | No | Big data |
Key Takeaways
- No “best” format exists. Each optimizes for different constraints.
- Text formats favor humans. JSON, YAML, TOML prioritize readability over efficiency.
- Binary formats favor machines. Protobuf, MessagePack, CBOR prioritize compactness and speed.
- Schema formats favor correctness. Protobuf, Avro, ASN.1 catch errors at compile time.
- The tradeoff triangle is real. Readability, compactness, query performance—pick two.
The question isn’t “which format wins?” The question is: what problem are you solving?
Resources
- ECMA-404 JSON Specification
- RFC 8259 JSON
- JSON Lines Specification
- PostgreSQL JSONB Documentation
- Protocol Buffers Documentation
- YAML 1.2.2 Specification
- TOML v1.0.0 Specification
- RFC 8949 CBOR
- MessagePack Specification
- Apache Avro Specification
Data formats are design decisions. Choose based on your constraints, not trends.
Questions? Find me on YouTube @SoftwareWrighter.
-
444 words • 3 min read • Abstract
Five ML Concepts - #18

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #18 |
References
| Concept | Reference |
|---|---|
| Preference Learning | Learning to summarize from human feedback (Stiennon et al. 2020) |
| Ensembling | Ensemble Methods in Machine Learning (Dietterich 2000) |
| ML Fragility | Distribution Shift (Quinonero-Candela et al. 2009) |
| Epoch | Deep Learning (Goodfellow et al. 2016), Chapter 8 |
| Cost vs Quality | Efficient Transformers: A Survey (Tay et al. 2022) |

Today’s Five
1. Preference Learning
Instead of learning from fixed labels, models are trained from comparisons between outputs. This helps align model behavior with human judgments.
The approach works well when absolute quality is hard to define but relative preferences are easier to express.
Like learning to cook by asking which dish tastes better.
2. Ensembling
Ensembling combines predictions from multiple models. Different models make different errors, and combining them can improve robustness.
Common strategies include voting, averaging, and stacking models together.
Like asking several experts and averaging their opinions.
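The averaging strategy above can be sketched in plain Python. The three classifiers and their probabilities are hypothetical; this is soft voting over per-class probabilities:

```python
def ensemble_average(prob_lists):
    """Soft voting: average per-class probabilities from several models."""
    n = len(prob_lists)
    return [sum(p[i] for p in prob_lists) / n for i in range(len(prob_lists[0]))]

# Three hypothetical binary classifiers; one is wrong, but the average
# still favors the correct class.
avg = ensemble_average([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
```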
3. Why ML Is Fragile
Models rely on statistical patterns learned from data. When those patterns shift, performance can degrade quickly.
This fragility emerges because models optimize for training distributions, not arbitrary future scenarios.
Like a spell checker that works on common words but struggles with unusual ones.
4. Epoch
An epoch is one complete pass through the training dataset. Multiple epochs allow the model to refine its weights over repeated passes.
Training typically continues for many epochs until validation performance stops improving.
Like reading a textbook from beginning to end more than once.
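The loop structure is simple; in this sketch, `train`, `model_step`, and `dataset` are hypothetical names standing in for a real training loop:

```python
# Sketch of the epoch loop; `model_step` and `dataset` are hypothetical names.
def train(model_step, dataset, epochs):
    for epoch in range(epochs):          # one epoch = one full pass
        for example in dataset:          # every example seen once per epoch
            model_step(example)

seen = []
train(seen.append, dataset=[1, 2, 3], epochs=2)
assert seen == [1, 2, 3, 1, 2, 3]        # two complete passes over the data
```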
5. Cost vs Quality Tradeoffs
Increasing model size or compute often improves performance, but also increases cost and latency. Engineers balance quality against budget and responsiveness.
Production systems often use smaller, faster models rather than the largest available.
Like choosing between a luxury car and an economy car depending on your needs.
Quick Reference
| Concept | One-liner |
|---|---|
| Preference Learning | Train from comparisons, not labels |
| Ensembling | Combine models for robustness |
| ML Fragility | Statistical models break on distribution shift |
| Epoch | One pass through training data |
| Cost vs Quality | Bigger isn’t always better in production |
-
1063 words • 6 min read • Abstract
midi-cli-rs: Music Generation for AI Coding Agents

AI coding agents can write code, generate images, and produce text. But what about music? When I needed background audio for explainer videos, I wanted a tool that AI agents could use directly—no music theory required.
| Resource | Link |
|---|---|
| Video | midi-cli-rs Explainer |
| Examples | Listen to Samples |
| Code | midi-cli-rs |

The Problem
Generating music programmatically is hard. Traditional approaches require understanding music theory, MIDI specifications, instrument mappings, and audio synthesis. That’s a lot to ask of an AI agent that just needs a 5-second intro.
I wanted something simpler: a CLI tool where an agent could say “give me 5 seconds of suspenseful music” and get a usable WAV file.
The Solution: Mood Presets
midi-cli-rs solves this with mood presets—curated musical generators that produce complete compositions from a single command:
    # Generate a 5-second suspenseful intro
    midi-cli-rs preset --mood suspense --duration 5 -o intro.wav

    # Upbeat outro with specific key
    midi-cli-rs preset -m upbeat -d 7 --key C --seed 42 -o outro.wav

Six moods are available:

| Mood | Character |
|---|---|
| suspense | Low drones, tremolo strings, tension |
| eerie | Sparse tones, diminished harmony |
| upbeat | Rhythmic chords, energetic |
| calm | Warm pads, gentle arpeggios |
| ambient | Textural drones, pentatonic bells |
| jazz | Walking bass, brushed drums, piano trio |

Each mood generates multi-layer compositions with appropriate instruments, rhythms, and harmonies. The `--seed` parameter ensures reproducibility—same seed, same output. Different seeds produce meaningful variations in melody contour, rhythm patterns, and instrument choices.

Melodic Variation
The presets don’t just randomize notes—they use a contour-based variation system. Changing the seed produces melodies that follow different shapes (ascending, descending, arch, wave) while staying musically coherent. This means you can generate multiple versions of a mood and pick the one that fits best.
How It Works
The tool generates MIDI programmatically, then renders to WAV using FluidSynth:
    Mood Preset → MIDI Generation → FluidSynth → WAV Output

MIDI generation uses the `midly` crate to create standard MIDI files. Each preset generates multiple tracks with different instruments, note patterns, and dynamics.

Audio rendering calls FluidSynth as a subprocess with a SoundFont (instrument samples). This avoids LGPL licensing complications—subprocess execution doesn’t trigger copyleft.
Note-Level Control
When presets aren’t enough, you can specify exact notes:
    # Note format: PITCH:DURATION:VELOCITY[@OFFSET]
    midi-cli-rs generate \
      --notes "C4:0.5:80@0,E4:0.5:80@0.5,G4:0.5:80@1,C5:1:90@1.5" \
      -i piano -t 120 -o arpeggio.wav

Or use JSON for complex multi-track arrangements:

    echo '{"tempo":90,"instrument":"piano","notes":[
      {"pitch":"C4","duration":0.5,"velocity":80,"offset":0},
      {"pitch":"E4","duration":0.5,"velocity":80,"offset":0.5},
      {"pitch":"G4","duration":1,"velocity":90,"offset":1}
    ]}' | midi-cli-rs generate --json -o output.wav

Web UI
For interactive composition, there’s a browser-based interface:
    midi-cli-rs serve  # Starts on http://127.0.0.1:3105

The Presets tab lets you adjust mood, key, duration, intensity, and tempo with immediate audio preview. Click the clock button to generate a time-based seed for unique but reproducible results.
The Melodies tab provides note-by-note composition with keyboard shortcuts:
- `a`-`g` for note pitch
- `[` / `]` to adjust duration
- `+` / `-` to change octave
- `Tab` to navigate between notes
For AI Agents
The CLI is designed for AI agent usage:
- Simple commands: One line generates complete audio
- Reproducible: Seed values ensure consistent output
- Self-documenting: `--help` includes agent-specific instructions
- Composable: Generate tracks separately, combine with ffmpeg

    # AI agent workflow
    midi-cli-rs preset -m suspense -d 5 --seed 1 -o intro.wav
    midi-cli-rs preset -m upbeat -d 10 --seed 2 -o main.wav
    ffmpeg -i intro.wav -i main.wav -filter_complex concat=n=2:v=0:a=1 final.wav

SoundFont Quality Matters
The quality of generated audio depends heavily on the SoundFont used. SoundFonts are collections of audio samples for each instrument—a tiny SoundFont with compressed samples will sound thin and artificial, while a larger one with high-quality recordings produces professional results.
| SoundFont | Size | Quality | License |
|---|---|---|---|
| TimGM6mb | ~6MB | Basic | GPL v2 |
| GeneralUser GS | ~30MB | Good | Permissive |
| FluidR3_GM | ~140MB | Very Good | MIT |
| MuseScore_General | ~200MB | Excellent | MIT |

For anything beyond quick prototypes, use a quality SoundFont. The difference is dramatic—the same MIDI file can sound like a toy keyboard or a real instrument depending on the samples.

The tool auto-detects SoundFonts in common locations (`~/.soundfonts/`, `/opt/homebrew/share/soundfonts/`, etc.), or specify one explicitly with `--soundfont`.

Technical Details
Built with Rust 2024 edition using permissively licensed dependencies:
| Crate | Purpose |
|---|---|
| midly | MIDI file generation |
| clap | CLI argument parsing |
| serde | JSON serialization |
| rand | Randomization for presets |
| axum | Web server (for `serve` command) |

FluidSynth is called as a subprocess for WAV rendering, keeping the main codebase MIT-licensed.
Try It
Listen to sample outputs, or build locally:
    git clone https://github.com/softwarewrighter/midi-cli-rs.git
    cd midi-cli-rs
    cargo build --release
    ./target/release/midi-cli-rs preset -m jazz -d 5 -o jazz.wav

Requires FluidSynth for WAV output (`brew install fluid-synth` on macOS).
Series: Personal Software (Part 2)
Previous: cat-finder: Local ML in Rust
Next: midi-cli-rs: Custom Mood Packs

Disclaimer
You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.
Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.
-
472 words • 3 min read • Abstract
Five ML Concepts - #17

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #17 |
References
| Concept | Reference |
|---|---|
| Benchmark Leakage | Rethinking the Inception Architecture for Computer Vision (Szegedy et al. 2016) |
| Concept/Data Drift | Learning under Concept Drift: A Review (Lu et al. 2018) |
| Weight Decay | Decoupled Weight Decay Regularization (Loshchilov & Hutter 2019) |
| Scaling Laws | Scaling Laws for Neural Language Models (Kaplan et al. 2020) |
| Shadow Deployment | Reliable Machine Learning (Cathy Chen et al. 2022) |

Today’s Five
1. Benchmark Leakage
When benchmark or test data influences training, tuning, or model selection, evaluation results become unreliable. This inflates reported performance beyond real-world capability.
Strict separation between development and evaluation data is essential for honest assessment.
Like practicing with the exact questions that will appear on the final exam.
2. Concept Drift vs Data Drift
Data drift occurs when input distributions change. Concept drift occurs when the relationship between inputs and outputs changes. Both can degrade model performance over time.
Data drift: customers buy different products. Concept drift: what “good” means has changed.
Like customers buying different products versus products changing what they mean.
3. Weight Decay
A regularization method that penalizes large weights, often implemented as L2 regularization. This encourages simpler models that generalize better.
Weight decay adds a term proportional to the squared magnitude of weights to the loss function.
Like encouraging shorter, simpler answers instead of overly complicated ones.
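The penalty term described above can be sketched in a couple of lines; the weights, task loss, and `lam` value here are illustrative:

```python
def l2_penalty(weights, lam):
    """Weight decay term added to the task loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

task_loss = 0.5                                             # hypothetical data-fit loss
total_loss = task_loss + l2_penalty([3.0, -4.0], lam=0.01)  # 0.5 + 0.01 * 25 = 0.75
```

Minimizing `total_loss` instead of `task_loss` pushes the optimizer toward smaller weights.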
4. Scaling Laws
Empirical relationships showing how performance tends to improve as model size, data, or compute increase. These relationships follow predictable power-law curves.
Scaling laws help predict resource requirements for target performance levels.
Like noticing that adding horsepower often increases a car’s speed, but with diminishing returns.
5. Shadow Deployment
Running a new model in parallel with production without affecting live user decisions. The shadow model processes real traffic but its outputs are only logged, not served.
This allows safe evaluation before full deployment.
Like a new chef preparing the same dishes in the back kitchen before serving customers.
Quick Reference
| Concept | One-liner |
|---|---|
| Benchmark Leakage | Test data contaminating training/selection |
| Concept vs Data Drift | Changed relationships vs changed inputs |
| Weight Decay | L2 penalty discourages large weights |
| Scaling Laws | Performance scales predictably with resources |
| Shadow Deployment | Test safely alongside production |
-
1069 words • 6 min read • Abstract
TBT (4/?): ToonTalk - Teaching Robots to Program

I first discovered ToonTalk during the Windows XP era—probably around 2003 or 2004. It was unlike anything I’d seen: a programming environment disguised as a video game where you trained robots by showing them what to do. The concept stuck with me for two decades.
| Resource | Link |
|---|---|
| Video | ToonTalk in Rust |
| tt-rs Demo | Live Demo |
| tt-rs Repo | tt-rs |

What is ToonTalk?
ToonTalk is a visual programming environment created by Ken Kahn in 1995. The “Toon” stands for cartoon—every abstract programming concept is mapped to a concrete, animated metaphor:
| Concept | ToonTalk Metaphor |
|---|---|
| Variables | Boxes with numbered holes |
| Values | Numbers, text, images in boxes |
| Comparison | Scales that tip when values differ |
| Functions | Robots that watch and learn |
| Message passing | Birds that carry items to nests |
| Garbage collection | Trucks that haul away unused items |

The design was influenced by games like The Legend of Zelda and Robot Odyssey—the kind of games that made you think while you played.
Programming by Demonstration
The core idea is radical: you don’t write code, you show a robot what to do.
- Create a robot and put it in “training mode”
- Perform actions while the robot watches (move items, compare values, etc.)
- The robot records your actions as a program
- Give the robot a box matching the training pattern—it executes the learned behavior
This is programming by demonstration. The robot generalizes from your example, matching patterns and applying transformations. It’s the same conceptual model as teaching a child: “Watch what I do, then you try.”
Three Generations
ToonTalk has existed in three forms:
| Version | Era | Technology |
|---|---|---|
| Original ToonTalk | 1995-2009 | C++, 3D desktop application |
| ToonTalk Reborn | 2014-2017 | JavaScript/jQuery web app |
| tt-rs | 2025-2026 | Rust/WebAssembly/Yew |

The original was a full 3D world—cities, houses, helicopters, even bombs for debugging. Ken Kahn later created ToonTalk Reborn, a simplified JavaScript version that runs in browsers.
Why I Built tt-rs
When I rediscovered ToonTalk Reborn a few years ago, I wanted to experiment with the concepts myself. But diving into a large jQuery codebase wasn’t appealing. So I did what any reasonable person would do: I vibe coded my own version in Rust.
tt-rs is a modern reimplementation using:
- Rust for core logic
- WebAssembly for browser execution
- Yew for reactive UI
- SVG/CSS for graphics and animations
It’s not a port—it’s a fresh implementation inspired by the same ideas. Building it myself lets me understand the concepts deeply and experiment with variations.
Three Learning Levels
The demo introduces concepts progressively through three levels:
| Level | Concepts | Widgets |
|---|---|---|
| tt1 | Basics | Numbers, boxes, scales, wand, vacuum |
| tt2 | Messaging | Birds and nests for communication |
| tt3 | Automation | Sensors (time, random) + robots |

Level one covers the fundamentals: numbers with arithmetic, boxes as containers, scales for comparison, and tools for copying and removing. Level two adds asynchronous messaging—birds carry items to their paired nests. Level three brings sensors that produce values and robots that automate actions.
Current Features
The live demo includes:
Widgets:
- Numbers: Rational arithmetic with +, -, *, / operators
- Boxes: Configurable containers with 0-9 holes (resize with keyboard)
- Text: Basic text display
- Scales: Visual comparison that tips when values differ
- Robot: Training mode, action recording, execution
- Bird/Nest: Message passing with pairing and delivery
- Sensors: Time (milliseconds) and random number generation
Tools:
- Wand: Copy any widget
- Vacuum: Remove widgets
- Magnifier: Inspect nest message queues and robot actions
Interactions:
- Drag-and-drop with visual feedback
- Box joining (drop box on edge of another)
- Box splitting (drop box on a number)
- Contextual help panel with level-specific content
- Puzzle system with animated “Show Me” demos
Robot Training
The core feature is programming by demonstration:
- Click robot to enter training mode (yellow glow indicates “I’m watching”)
- Perform actions while the robot records (arithmetic, copy, remove, move to box)
- Click robot again to stop training
- Click robot to replay—it executes the recorded sequence
The tutorials demonstrate this workflow step by step. In the “Train Robot” tutorial, you teach a robot to move a number into a box. In “Robot Sensors,” you train a robot to generate random numbers, apply modulo, and send results to a nest via a bird.
Interactive Tutorials
Each tutorial has two parts:
- Show Me: Watch an animated demonstration where a cursor walks through the solution
- Practice: Try it yourself with the same widgets
The tutorials cover:
- Fill a box with numbers
- Add numbers together
- Copy widgets with the wand
- Send messages with birds and nests
- Train your first robot
- Combine robots with sensors
What’s Next
The immediate priorities:
- Pattern matching - Robot generalizes from specific values to “any number”
- Watched execution - See robot work step-by-step with animated cursor
- Persistence - Save and load workspaces
Long term, I’d like to add the 3D elements from the original—the cities, the houses, the helicopter view. But that’s a much larger project.
The Enduring Appeal
What makes ToonTalk fascinating isn’t just the visual metaphors—it’s the computational model. Under the hood, ToonTalk implements concurrent constraint logic programming. The robots are essentially guarded Horn clauses. The birds and nests implement the actor model.
Heavy concepts, but you don’t need to know any of that to use it. You just train robots by example. The abstraction is complete.
That’s why it stuck with me for twenty years. Good abstractions are rare. When you find one, it’s worth understanding deeply.
References
| Resource | Link |
|---|---|
| ToonTalk Website | toontalk.com |
| ToonTalk on Wikipedia | Wikipedia |
| ToonTalk Reborn (JS) | github.com/ToonTalk/ToonTalk |
| ToonTalk Reborn Demo | toontalk.github.io/ToonTalk |
| ToonTalk Reborn Wiki | Wiki |
| Ken Kahn’s Page | Ken Kahn |
| Original Paper (1995) | ERIC - ToonTalk: An Animated Programming Environment |
| Ken Kahn’s Research | Academia.edu |
Some ideas are worth rediscovering. ToonTalk is one of them.
-
468 words • 3 min read • Abstract
Five ML Concepts - #16

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #16 |
References
| Concept | Reference |
|---|---|
| Train/Val/Test Split | Deep Learning (Goodfellow et al. 2016), Chapter 5 |
| Overconfidence | On Calibration of Modern Neural Networks (Guo et al. 2017) |
| Batch Normalization | Batch Normalization: Accelerating Deep Network Training (Ioffe & Szegedy 2015) |
| Optimization vs Generalization | Understanding Deep Learning Requires Rethinking Generalization (Zhang et al. 2017) |
| A/B Testing | Controlled Experiments on the Web (Kohavi et al. 2009) |

Today’s Five
1. Train / Validation / Test Split
Data is divided into training, validation, and test sets. Training learns patterns, validation tunes hyperparameters, test evaluates final performance.
Never use test data for any decisions during development—it should only be touched once.
Like practicing on homework, checking with practice tests, then taking the real exam.
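A minimal sketch of the three-way split; the fractions, seed, and `split` helper are illustrative, and real pipelines often stratify by label:

```python
import random

def split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off validation and test sets (sketch)."""
    data = data[:]                      # don't mutate the caller's list
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    return data[n_test + n_val:], data[n_test:n_test + n_val], data[:n_test]

train_set, val_set, test_set = split(list(range(100)))
```

The fixed seed makes the split reproducible, so the test set stays the same across experiments.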
2. Overconfidence
Models can assign very high probabilities to incorrect predictions. This is often related to poor calibration and can be dangerous in high-stakes applications.
Temperature scaling and other calibration methods can help align confidence with accuracy.
Like a student who is absolutely certain of a wrong answer.
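Temperature scaling follows directly from the softmax definition; the logits here are illustrative, and a real calibrator fits the temperature on held-out data:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature > 1 flattens the distribution without changing the argmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 1.0, 0.5]                   # illustrative, overconfident logits
p_sharp = softmax(logits)                  # near-certain top class
p_soft = softmax(logits, temperature=4.0)  # same prediction, tempered confidence
```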
3. Batch Normalization
Normalizes layer activations during training to improve stability and convergence. Each mini-batch’s activations are normalized to have zero mean and unit variance.
This reduces internal covariate shift and often allows higher learning rates.
Like keeping everyone on a similar pace during training so no one runs too far ahead.
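A sketch of the per-batch normalization for a single unit, omitting the learnable scale (gamma) and shift (beta) parameters that real batch norm also applies; the activations are made up:

```python
import statistics

def batch_norm(activations, eps=1e-5):
    """Normalize one unit's mini-batch activations to zero mean, unit variance.
    Real batch norm also applies a learnable scale (gamma) and shift (beta)."""
    mu = statistics.fmean(activations)
    var = statistics.pvariance(activations)
    return [(a - mu) / (var + eps) ** 0.5 for a in activations]

normed = batch_norm([2.0, 4.0, 6.0, 8.0])
```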
4. Optimization vs Generalization
Training loss can decrease while test performance does not improve. Good optimization does not guarantee good generalization.
A model can perfectly fit training data while failing on new examples—this is overfitting.
Like memorizing last year’s exam instead of understanding the subject.
5. A/B Testing Models
Comparing two model versions using controlled live traffic experiments. Users are randomly assigned to see predictions from model A or model B.
Statistical analysis determines which model performs better on real-world metrics.
Like taste-testing two recipes with real customers to see which works better.
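The "statistical analysis" step is often a two-proportion z-test on conversion counts (a sketch with made-up numbers):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing conversion rates of model A vs model B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(conv_a=480, n_a=5000, conv_b=560, n_b=5000)
print(round(z, 2))  # 2.62; |z| > 1.96 → significant at the 5% level
```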
Quick Reference
| Concept | One-liner |
| --- | --- |
| Train/Val/Test | Separate data for learning, tuning, and evaluation |
| Overconfidence | High probability on wrong predictions |
| Batch Normalization | Normalize activations for stable training |
| Optimization vs Generalization | Low train loss ≠ good test performance |
| A/B Testing | Compare models with live experiments |
Short, accurate ML explainers. Follow for more.
-
796 words • 4 min read • Abstract
Multi-Hop Reasoning (2/2): The Distribution Trap

In Part 1, a tiny 135M model achieved 75% accuracy on multi-hop reasoning. This time we scale up to 360M—and discover that RSFT on easy examples makes performance worse.
| Resource | Link |
| --- | --- |
| Paper | KG-Guided RAG (arXiv) |
| Code | multi-hop-reasoning |
| ELI5 | eli5.md |
| Demo | Live Demo |
| Explainer | Coming soon |

Scaling Up: SmolLM-360M
Part 1 used the 135M model. For better reasoning traces and demo quality, we trained the 360M variant:
| Model | Parameters | Platform |
| --- | --- | --- |
| SmolLM-135M-Instruct | 135M | MLX (macOS) |
| SmolLM-360M-Instruct | 360M | MLX + Unsloth (cross-platform) |

The 360M model produces more coherent traces and is used by the live inference demo.
The Distribution Trap
Here’s what happened when we trained RSFT on the “easy” training data:
| Phase | Training Data | Accuracy | Notes |
| --- | --- | --- | --- |
| Base | — | 0% | No format compliance |
| SFT (500 iters) | Easy (1-3 hop) | 37% | Learns TRACE + ANSWER format |
| RSFT | Easy (1-3 hop) | 27% | Worse than SFT! |

RSFT on easy examples performed worse than the SFT baseline.
Why?
The training examples (1-3 hops) don’t match the evaluation distribution (4-5 hops). The model learns shortcuts that work on easy problems but fail on hard ones.
| Training Distribution | Eval Distribution | Result |
| --- | --- | --- |
| Easy (1-3 hop) | Hard (4-5 hop) | 27% (worse) |
| Hard (4-5 hop) | Hard (4-5 hop) | 75% (Part 1 result) |

The rejection sampling “winners” from easy examples teach strategies that don’t generalize.
The Key Finding
Rejection sampling must match your target distribution.
This is counterintuitive. You might expect that training on more examples (even easy ones) would help. Instead:
- Easy winners use shortcuts (fewer reasoning steps)
- Hard eval requires full chain reasoning
- Model learns the wrong patterns
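The rejection-sampling loop itself is simple; the distribution of `examples` is what decides which shortcuts get reinforced (a sketch, with `sample_fn` and `score_fn` standing in for model sampling and KG scoring):

```python
def select_winners(examples, sample_fn, score_fn, k=8, threshold=1.0):
    """Rejection sampling: keep only the best-scoring sample per example,
    and only if it clears the threshold."""
    winners = []
    for ex in examples:
        candidates = [sample_fn(ex) for _ in range(k)]
        best = max(candidates, key=lambda c: score_fn(ex, c))
        if score_fn(ex, best) >= threshold:
            winners.append((ex, best))  # becomes fine-tuning data
    return winners

# Toy stand-ins: "sampling" doubles the input, "scoring" is the value itself
winners = select_winners([1, 2, 3], lambda ex: ex * 2, lambda ex, c: c, k=3, threshold=4)
print(winners)  # [(2, 4), (3, 6)]
```

If `examples` only contains easy cases, the winners encode easy-case strategies, no matter how good the scoring is.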
The fix: train RSFT on eval.jsonl (hard examples), not train.jsonl (easy examples).
Demo Improvements
The demo now includes four interactive tabs:
| Tab | Feature |
| --- | --- |
| Training | Animated SFT→RSFT visualization with KG scoring |
| Inference | Pre-recorded inference examples |
| Try It | Live inference with 360M model |
| Distribution | Interactive visualization of the key finding |

Try It: Live Inference
Ask DevOps troubleshooting questions and watch the model reason:
```
Question: What causes TLSHandshakeError?
TRACE: TLSHandshakeError is caused by ClockSkew, and ClockSkew leads to CertificateExpired, and CertificateExpired is fixed by RenewCert...
ANSWER: B
```

The knowledge graph scores the reasoning path during training, but at inference the model reasons independently.
Cross-Platform Support
The pipeline now runs on both platforms:
| Platform | Framework | Command |
| --- | --- | --- |
| macOS (Apple Silicon) | MLX | make train-360m |
| Linux (NVIDIA CUDA) | Unsloth | make train-360m-unsloth |

Unsloth provides 2x faster training with 60% less memory on NVIDIA GPUs.
Current Status
| Component | Status |
| --- | --- |
| SFT training (360M) | Complete |
| RSFT (wrong distribution) | Complete (27%) |
| RSFT (correct distribution) | Next step |
| Live demo with Try It | Complete |
| Cross-platform support | Complete |

Next Steps
| Priority | Task | Expected Result |
| --- | --- | --- |
| High | Retrain RSFT on eval.jsonl | 75%+ accuracy |
| Medium | Update demo to use corrected model | Better live inference |
| Medium | Curriculum learning (easy→hard) | Smoother training |
| Low | Larger models (1B+) | Higher ceiling |

The corrected RSFT training:
```shell
# --examples points at the hard examples!
python3 -m core.rsft \
  --examples data/eval.jsonl \
  --kg data/kg.json \
  --sft-adapter data/runs/run_360m/models/sft \
  --output data/runs/run_360m/models/rsft_eval \
  --model HuggingFaceTB/SmolLM-360M-Instruct \
  --k-samples 8 \
  --max-examples 50
```

Lessons Learned
1. Distribution Matching is Non-Negotiable
This isn’t a minor optimization—it’s the difference between 27% and 75% accuracy. Wrong distribution = wrong winners = wrong model.
2. Easy Examples Can Hurt
More training data isn’t always better. Easy examples teach shortcuts that fail on hard problems.
3. Verify Your Pipeline
We trained a full RSFT model before realizing the distribution mismatch. Always check that training data matches eval distribution.
4. The Fix is Simple
Once identified, the fix is one flag change:
--examples data/eval.jsonl instead of train.jsonl.
Resources
- Repository: multi-hop-reasoning
- Live Demo
- Part 1: Training Wheels for Small LLMs
- Paper: Knowledge Graph-Guided RAG
- Training Status
Training distribution matters. Easy examples teach easy shortcuts.
-
775 words • 4 min read • Abstract
Towards Continuous LLM Learning (2): Routing Prevents Forgetting

In Part 1, naive LoRA fine-tuning caused catastrophic forgetting. Now we’re implementing the Share algorithm properly—and we’re about 60% of the way to verifying the paper’s claims.
| Resource | Link |
| --- | --- |
| Code | sleepy-coder |
| Part 1 | When Fine-Tuning Fails |
| ELI5 | eli5.md |
| Share Paper | arXiv:2602.06043 |

Paper Claims vs Implementation Status
We’re systematically verifying the claims from the Share and UWSH papers:
| Paper Claim | Infrastructure | Demonstrated? |
| --- | --- | --- |
| Shared basis via SVD | Complete | Yes |
| ~100x parameter reduction | Complete (76x) | Yes |
| Task routing beats averaging | Tested (Exp 1b) | Partial |
| Prevents catastrophic forgetting | Tested (Exp 1b) | Partial |
| Sequential learning | Not tested | No |
| UWSH subspace stability | Not tested | No |

Overall: ~60% complete. Infrastructure is solid. Routing tested. Sequential learning remains.
What We Built
The full Share algorithm implementation:
- Phase 1: SVD-based subspace extraction from 51 LoRA adapters (60% variance threshold)
- Phase 2: Coefficient-only training with frozen basis (83K params vs 1.6M full LoRA)
- Phase 3: Basis merging and updates
- Routing: Error pattern classification for coefficient selection
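Phase 1 can be sketched with a plain SVD over the stacked adapter updates (illustrative numpy; the shapes, data, and helper name are made up, only the 60% variance threshold follows the description above):

```python
import numpy as np

def extract_shared_basis(deltas, variance_threshold=0.60):
    """Extract a shared basis from per-task weight updates via SVD.

    deltas: list of (d, d) weight-update matrices, one per adapter.
    Returns the top-k left singular vectors covering the variance threshold.
    """
    stacked = np.hstack(deltas)                  # (d, d * num_adapters)
    U, S, _ = np.linalg.svd(stacked, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)      # cumulative explained variance
    k = int(np.searchsorted(energy, variance_threshold)) + 1
    return U[:, :k]                              # frozen shared basis

rng = np.random.default_rng(0)
deltas = [rng.standard_normal((16, 16)) * 0.01 for _ in range(5)]
basis = extract_shared_basis(deltas)
print(basis.shape)  # (16, k) with k <= 16
```

Phase 2 then trains only the small per-task coefficients against this frozen basis.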
Bug Fixes That Unlocked Progress
Two critical bugs blocked proper Phase 2 training:
Bug 1: Zero-Gradient Saddle Point
Both coefficient matrices initialized to zero:
```
eps_beta = 0, eps_alpha = 0
→ delta_W = 0 @ 0 = 0
→ zero gradients, no learning
```

Fix: Dual small-random initialization.
Bug 2: Half-Parameter Training
LoRA-style initialization only trained one coefficient set:
```
Before: 112/224 parameters getting gradients
After:  224/224 parameters getting gradients
```

Fix: Both coefficient matrices need random initialization.
Experiment 1b: Routing Works
With gradient-trained v4 coefficients and proper routing:
| Strategy | Pass Rate | BC | RH | TB | Regressions |
| --- | --- | --- | --- | --- | --- |
| Baseline (no LoRA) | 46.7% | 70% | 40% | 30% | – |
| Averaged | 50.0% | 70% | 40% | 40% | 1 |
| Routed | 50.0% | 70% | 50% | 30% | 0 |

Result handling improved 40% → 50%. Zero regressions. This is the first positive transfer from Share coefficients.
The Forgetting Heatmap
We applied each coefficient individually to all 30 koans:
| Koan | BL | mut_bc | dbl_mt | ret_lr | mis_cl | mis_hs | mis_or | opt_ok | res_me | ROUTED |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bc_001-009 | P | P | P | P | P | P | P | P | P | P |
| bc_003,5,10 | . | . | . | . | . | . | . | . | . | . |
| rh_002 | . | . | +GAIN | . | . | +GAIN | +GAIN | +GAIN | +GAIN | +GAIN |
| rh_008 | P | -LOST | -LOST | -LOST | -LOST | -LOST | -LOST | -LOST | -LOST | P |
| tb_005 | P | P | P | P | P | -LOST | P | P | P | P |

Key finding: rh_008 regresses under every coefficient applied globally. But routing saves it by falling back to the base model when no pattern matches.
This is exactly what the Share paper predicts: task-specific coefficients improve targeted patterns without interfering with unrelated ones.
What the Papers Claim vs What We’ve Verified
Verified
-
Shared basis via SVD — We extract principal components from 51 adapters. Works.
-
76x parameter reduction — 83K coefficient parameters vs 1.6M full LoRA. Verified.
-
Routing prevents forgetting — Zero regressions with routed inference. The fragile rh_008 koan survives because it falls back to base model.
-
Positive transfer possible — Result handling improved 40% → 50% with routed coefficients.
Not Yet Verified
-
Sequential learning — The core continual learning claim. Train task 1 → eval → train task 2 → eval (verify task 1 still passes). This is next.
-
UWSH subspace stability — Do different adapter subsets converge to similar subspaces? Grassmann distance measurement needed.
Next Experiments
| Priority | Experiment | Target |
| --- | --- | --- |
| High | Sequential learning curve | No degradation on prior tasks |
| High | Fix k_alpha=32 (paper recommends) | Match paper exactly |
| Medium | UWSH verification | >70% subspace overlap |
| Medium | Add rank update vectors | Full algorithm |

The Architecture
```
Day: Agent attempts to fix Rust errors
  ↓ Successes and failures logged
Night: Train coefficients (frozen basis)
  ↓ 83K params per task
Eval: Route to appropriate coefficients
  ↓ Pattern-matched inference
  ↓ (repeat)
```

The key insight: train small, route smart. The shared basis captures common structure. Per-task coefficients specialize without interference.
Resources
- sleepy-coder Repository
- Part 1: When Fine-Tuning Fails
- Paper Checklists
- Share Paper (arXiv:2602.06043)
- UWSH Paper (arXiv:2512.05117)
60% of the way to verifying the papers. Sequential learning is next.
-
470 words • 3 min read • Abstract
Five ML Concepts - #15

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #15 |
References
| Concept | Reference |
| --- | --- |
| Perplexity | A Neural Probabilistic Language Model (Bengio et al. 2003) |
| Catastrophic Forgetting | Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017) |
| Weight Initialization | Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010) |
| Curse of Dimensionality | The Elements of Statistical Learning (Hastie et al. 2009), Chapter 2 |
| Monitoring & Drift | Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift (Rabanser et al. 2019) |

Today’s Five
1. Perplexity
A metric for language models that reflects how well the model predicts the next token. Lower perplexity means better predictive performance.
Perplexity is the exponentiated average negative log-likelihood per token.
Like a test where lower scores mean you found the answers easier to guess.
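That definition translates directly to code (a sketch over per-token probabilities):

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood per token)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0: like guessing among 4 tokens
print(round(perplexity([0.9, 0.8, 0.95]), 2))          # 1.13: a confident model
```

A uniform guess over a vocabulary of size V gives perplexity exactly V, which is why perplexity is read as an "effective branching factor".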
2. Catastrophic Forgetting
When training on new tasks causes a model to lose performance on previously learned tasks. This is a key challenge in continual learning.
Techniques like elastic weight consolidation help preserve important weights.
Like learning a new phone number and forgetting the old one.
3. Weight Initialization
The starting values of model weights influence how well training progresses. Poor initialization can cause vanishing or exploding gradients.
Xavier and He initialization are common strategies for setting initial weights appropriately.
Like starting a race from a good position instead of stuck in a ditch.
4. Curse of Dimensionality
In high-dimensional spaces, data becomes sparse and distances behave differently, making learning harder. Points that seem close in low dimensions can be far apart in high dimensions.
Feature selection and dimensionality reduction help mitigate this effect.
Like searching for a friend in a city versus across the entire universe.
5. Monitoring & Drift Detection
Continuously tracking model performance and detecting shifts in input data distributions. Production models can degrade silently without proper monitoring.
Automated alerts help catch problems before they affect users.
Like a weather station alerting you when conditions change.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Perplexity | How surprised the model is by the data |
| Catastrophic Forgetting | New learning erases old knowledge |
| Weight Initialization | Starting values affect training stability |
| Curse of Dimensionality | High dimensions make data sparse |
| Monitoring & Drift | Track performance and data changes |
Short, accurate ML explainers. Follow for more.
-
448 words • 3 min read • Abstract
Five ML Concepts - #14

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #14 |
References
| Concept | Reference |
| --- | --- |
| ROC/AUC | An Introduction to ROC Analysis (Fawcett 2006) |
| Spurious Correlations | Unbiased Look at Dataset Bias (Torralba & Efros 2011) |
| Gradient Clipping | On the Difficulty of Training Recurrent Neural Networks (Pascanu et al. 2013) |
| Loss Landscapes | Visualizing the Loss Landscape of Neural Nets (Li et al. 2018) |
| Cold Start | Addressing Cold Start in Recommender Systems (Schein et al. 2002) |

Today’s Five
1. ROC / AUC
ROC curves plot true positive rate against false positive rate across all classification thresholds. AUC (Area Under the Curve) summarizes overall ranking performance in a single number.
AUC of 0.5 means random guessing; 1.0 means perfect ranking.
Like judging a student by considering every possible passing grade cutoff.
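AUC has a neat pairwise interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative (a minimal sketch with toy scores):

```python
def auc(y_true, scores):
    """AUC = P(random positive ranked above random negative); ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0]
s = [0.9, 0.4, 0.6, 0.2]
print(auc(y, s))  # 0.75: the positive outranks the negative in 3 of 4 pairs
```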
2. Spurious Correlations
Coincidental patterns in training data that don’t reflect true relationships. Models that rely on them can fail when the coincidence disappears.
Dataset curation and diverse evaluation help detect spurious features.
Like assuming umbrellas cause rain because you always see them together.
3. Gradient Clipping
Limiting the size of gradients during backpropagation. This helps prevent exploding gradients and unstable training, especially in recurrent networks.
Clipping can be by value or by global norm.
Like putting a speed limit on a car so it doesn’t lose control.
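Clipping by global norm rescales the whole gradient vector, preserving its direction (a minimal sketch):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their global L2 norm exceeds max_norm."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]  # global norm = 5.0
print([round(g, 6) for g in clip_by_global_norm(grads, max_norm=1.0)])  # [0.6, 0.8]
```

Clipping by value instead would cap each component independently, which can change the gradient's direction.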
4. Loss Landscapes
How model error changes across different parameter settings. Training is like navigating this surface toward regions of lower loss.
Flat minima may generalize better than sharp ones.
Like hiking through mountains searching for the lowest valley, feeling the slope beneath your feet.
5. Cold Start Problems
Difficulty predicting for new users or items with no history. Without prior data, personalization becomes difficult.
Solutions include content-based features, popularity fallbacks, or asking initial questions.
Like a librarian trying to recommend books to someone who just walked in.
Quick Reference
| Concept | One-liner |
| --- | --- |
| ROC / AUC | Classifier performance across thresholds |
| Spurious Correlations | Coincidental patterns that don’t generalize |
| Gradient Clipping | Limit gradient size for stability |
| Loss Landscapes | Error surface over parameter space |
| Cold Start | No history for new users/items |
Short, accurate ML explainers. Follow for more.
-
448 words • 3 min read • Abstract
Five ML Concepts - #13

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #13 |
References
| Concept | Reference |
| --- | --- |
| Calibration | On Calibration of Modern Neural Networks (Guo et al. 2017) |
| Shortcut Learning | Shortcut Learning in Deep Neural Networks (Geirhos et al. 2020) |
| Early Stopping | Early Stopping - But When? (Prechelt 1998) |
| Universal Approximation | Approximation by Superpositions of a Sigmoidal Function (Cybenko 1989) |
| Checkpointing | Training Deep Nets with Sublinear Memory Cost (Chen et al. 2016) |

Today’s Five
1. Calibration
How well a model’s predicted probabilities match real-world outcomes. If a model predicts 70% confidence many times, it should be correct about 70% of those cases.
Well-calibrated models enable better decision-making under uncertainty.
Like a weather forecast that predicts rain 30% of the time and is right about 30% of those forecasts.
2. Shortcut Learning
When models rely on superficial patterns instead of meaningful features. For example, identifying cows by detecting grass and failing when cows appear indoors.
Shortcuts can inflate benchmark scores while masking poor real-world performance.
Like passing a test by memorizing answer positions instead of learning the material.
3. Early Stopping
Training is stopped when validation performance stops improving. This helps prevent overfitting by halting before the model memorizes training data.
Patience hyperparameters control how long to wait before stopping.
Like knowing when to stop practicing before you start reinforcing mistakes.
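The patience logic is a small counter over the validation curve (a sketch; the loss values are made up):

```python
def train_with_early_stopping(val_losses, patience=3):
    """Return the epoch at which training stops: when validation loss
    has not improved for `patience` consecutive epochs."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop here; reload the best checkpoint
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74]
print(train_with_early_stopping(losses, patience=3))  # 5: three epochs without improvement
```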
4. Universal Approximation
The theorem stating that neural networks can approximate any continuous function, given enough capacity. In practice, finding the right weights through optimization is the challenge.
The theorem guarantees existence, not learnability.
Like having enough Lego blocks to build almost any shape—assembly is still hard.
5. Checkpointing
Saving the model’s state during training. This allows recovery from interruptions and comparison across training stages.
Checkpoints also enable selecting the best model rather than just the final one.
Like saving your game progress so you can reload if something goes wrong.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Calibration | Predicted probabilities match outcomes |
| Shortcut Learning | Exploiting spurious patterns |
| Early Stopping | Stop when validation plateaus |
| Universal Approximation | NNs can approximate any function |
| Checkpointing | Save model state during training |
Short, accurate ML explainers. Follow for more.
-
488 words • 3 min read • Abstract
Five ML Concepts - #12

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #12 |
References
| Concept | Reference |
| --- | --- |
| Precision/Recall | The Truth of the F-Measure (Sasaki 2007) |
| OOD Detection | A Baseline for Detecting Misclassified and Out-of-Distribution Examples (Hendrycks & Gimpel 2017) |
| Batch Size | On Large-Batch Training for Deep Learning (Keskar et al. 2017) |
| Inductive Bias | Relational Inductive Biases, Deep Learning, and Graph Networks (Battaglia et al. 2018) |
| Latency/Throughput | Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al. 2021) |

Today’s Five
1. Precision vs Recall
Precision measures how often positive predictions are correct. Recall measures how many actual positives are successfully found. Improving one often reduces the other.
The tradeoff depends on your application: spam filters favor precision, medical screening favors recall.
Like a search party: you can find everyone but raise false alarms, or be very certain and miss some people.
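Both metrics come straight from the confusion-matrix counts (a minimal sketch for binary labels):

```python
def precision_recall(y_true, y_pred):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred))  # (2/3, 2/3): one false positive, one miss
```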
2. OOD Inputs (Out-of-Distribution)
Data that differs significantly from what the model saw during training. Models may fail silently or produce confident but wrong answers.
Detecting OOD inputs is an active research area for safer AI deployment.
Like asking a chef trained only in Italian food to make sushi.
3. Batch Size
The number of training examples processed before updating model weights. Larger batches can be more efficient computationally, but may generalize worse.
Finding the right batch size involves balancing speed, memory, and model quality.
Like grading tests one at a time or waiting to grade a full stack.
4. Inductive Bias
The assumptions built into a model that guide how it learns from data. Without inductive bias, models cannot generalize beyond training examples.
CNNs assume spatial locality; transformers assume tokens can attend to any position.
Like expecting nearby houses to have similar prices before looking at the data.
5. Latency vs Throughput
Latency is how long a single request takes. Throughput is how many requests can be handled per second. Optimizing one often comes at the expense of the other.
Batching improves throughput but increases latency for individual requests.
Like a restaurant serving one table quickly or many tables at once.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Precision vs Recall | Correct positives vs finding all positives |
| OOD Inputs | Data unlike training distribution |
| Batch Size | Examples per weight update |
| Inductive Bias | Built-in learning assumptions |
| Latency vs Throughput | Speed per request vs total capacity |
Short, accurate ML explainers. Follow for more.
-
1048 words • 6 min read • Abstract
Neural-Net-RS: An Educational Neural Network Platform

I wanted a neural network implementation where every step is visible—no framework magic hiding the math. Something I could use to teach the fundamentals, with a CLI for quick experiments and a web UI for visual demonstrations. Claude Code built it.
This is Personal Software for education: a complete neural network training platform with multiple interfaces, all from a single Rust codebase.
| Resource | Link |
| --- | --- |
| Repo | neural-net-rs |
| Video | Neural-Net-RS Explainer |
Why Build Your Own Neural Network?
Frameworks like PyTorch and TensorFlow are production-ready, but they hide the fundamentals. When teaching or learning, you want to see:
- How weights and biases actually change during training
- Why XOR needs a hidden layer when AND doesn’t
- What backpropagation really computes
Neural-Net-RS exposes all of this. No autograd magic—every gradient is computed explicitly. No tensor abstractions—just matrices with clear row-major storage.
What Got Built
A modular Rust workspace with multiple interfaces to the same core:
```
neural-net-rs/
├── matrix/            # Linear algebra foundation
├── neural-network/    # Core ML implementation
├── neural-net-cli/    # Command-line interface
├── neural-net-server/ # REST API with SSE streaming
└── neural-net-wasm/   # WebAssembly for browser
```

One codebase, three ways to interact:
- CLI: Train from terminal with progress bars
- Web UI: Visual training with real-time loss charts
- WASM: Run entirely in browser, no server needed
The Classic Problems
The platform includes 8 built-in examples that teach ML concepts progressively:
| Problem | Architecture | Key Concept |
| --- | --- | --- |
| AND, OR | 2→2→1 | Linear separability |
| XOR | 2→3→1 | Why hidden layers matter |
| Parity3 | 3→6→1 | Scaling non-linearity |
| Quadrant | 2→8→4 | Multi-class classification |
| Adder2 | 4→8→3 | Learning arithmetic |
| Iris | 4→8→3 | Real-world dataset |
| Pattern3x3 | 9→6→4 | Visual pattern recognition |

The XOR Problem
XOR is the canonical neural network problem. AND and OR are linearly separable—a single line can divide the outputs. XOR isn’t. You need a hidden layer.
```
AND: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1  ← One line separates
XOR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0  ← No line works
```

Watch XOR training and you see why neural networks are powerful: they learn to create intermediate representations that make non-linear problems separable.
Implementation Details
Feed-Forward with Backpropagation
```rust
pub struct Network {
    pub layers: Vec<usize>,   // [input, hidden..., output]
    pub weights: Vec<Matrix>, // Learned connections
    pub biases: Vec<Matrix>,  // Per-neuron offsets
    pub learning_rate: f64,   // Training step size
}
```

Forward pass: Each layer computes activation(weights × input + bias).
Backward pass: Gradients flow backward using the chain rule, updating weights to reduce error.
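The same forward pass in a few lines of Python (illustrative only; the network shape and weights here are arbitrary, not taken from the Rust code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layer_weights, layer_biases, inputs):
    """One forward pass: each layer computes activation(weights x input + bias)."""
    activations = inputs
    for weights, biases in zip(layer_weights, layer_biases):
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

# A tiny 2 -> 2 -> 1 network with hand-picked weights
w = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, -2.0]]]
b = [[0.0, -1.0], [0.0]]
print(forward(w, b, [1.0, 0.0]))
```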
The sigmoid activation function maps any input to (0, 1):
```
σ(x) = 1 / (1 + e^(-x))
```

Custom Matrix Library
Educational clarity over maximum performance:
```rust
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f64>, // Row-major storage
}
```

Operations: dot product, transpose, element-wise multiply, map. Everything visible, nothing hidden.
Checkpoint System
Training can be interrupted and resumed:
```shell
# Train for 5000 epochs, save checkpoint
neural-net-cli train xor --epochs 5000 --checkpoint model.json

# Resume from checkpoint
neural-net-cli train xor --epochs 10000 --resume model.json
```

Checkpoints include version metadata to prevent loading incompatible models.
CLI Usage
```shell
# List available examples
neural-net-cli examples

# Train XOR with progress bar
neural-net-cli train xor --epochs 10000 --learning-rate 0.5

# Predict with trained model
neural-net-cli predict model.json --input "0,1"

# Run web UI
neural-net-cli serve --port 8080
```

The CLI uses indicatif for real-time progress bars:

```
Training XOR [=========> ] 7500/10000 (75%) Loss: 0.0023
```

Web Interface
The server embeds all assets at compile time—one binary serves everything:
- Training panel: Select problem, set hyperparameters, watch loss decrease
- Network visualization: See layer structure and connection strengths
- Prediction panel: Test the trained model interactively
- Loss chart: Real-time plotting via Server-Sent Events
Two training modes:
- Local (WASM): Runs entirely in browser
- Remote (API): Server-side with streaming progress
Technology Choices
| Component | Purpose |
| --- | --- |
| Rust | Performance, safety, single-binary distribution |
| Axum | Lightweight async web framework |
| wasm-bindgen | Rust → WebAssembly compilation |
| Indicatif | Terminal progress bars |
| Serde | JSON serialization for checkpoints |

The WASM module is ~248KB after optimization.
Test Coverage
136+ tests across the workspace:
- Matrix operations (unit tests)
- Network training (integration tests)
- CLI commands (integration tests)
- Server endpoints (integration tests)
- WASM bindings (unit tests)
Zero clippy warnings. Reproducible results via seeded RNG.
References
| Resource | Link |
| --- | --- |
| Backpropagation | Learning representations by back-propagating errors (Rumelhart et al. 1986) |
| Multi-Layer Perceptron | Multilayer perceptron (Wikipedia) |
| XOR Problem | Perceptrons (Minsky & Papert 1969) |
| Weight Initialization | Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010) |

Inspired by codemoonsxyz/neural-net-rs

The Vibe Coding Process
This project grew through iterative conversation with Claude Code:
- “Build a basic neural network in Rust with backpropagation”
- “Add a CLI with progress bars”
- “Add a web UI with real-time training visualization”
- “Compile to WASM so it runs in the browser”
- “Add checkpoint save/resume”
- “Include classic ML examples with educational documentation”
Each request built on the previous. The AI handled architecture decisions, chose appropriate crates, and maintained test coverage throughout.
When you want to understand how neural networks actually work, sometimes you need to see every weight update. That’s what this platform provides—education through transparency.
-
914 words • 5 min read • Abstract
Cat Finder: Personal Software via Vibe Coding

I needed to find cat photos scattered across my system. Instead of searching the app store, signing up for a cloud service, or uploading my personal photos to someone else’s servers, I asked Claude Code to build me the tool I needed. An hour later, I had it.
This is Personal Software—software that exists because you needed it, built the way you want it, running entirely under your control.
| Resource | Link |
| --- | --- |
| Repo | cat-finder |
| Video | Cat Finder Explainer |
The Vibe Coding Approach
Vibe Coding is about describing what you want and letting AI handle the implementation details. No boilerplate, no Stack Overflow rabbit holes, no fighting with build systems. You focus on the what, the AI handles the how.
For Cat Finder, the conversation went something like:
“I want a CLI tool that scans directories for images containing cats. Run locally, no cloud. Use YOLO for detection. Output just the file paths so I can pipe them to other commands.”
Claude Code chose the tech stack (Rust, YOLOv8n, ONNX Runtime), handled the tensor math, figured out the COCO class IDs, and produced a working tool. I guided the direction; the AI wrote the code.
Why Personal Software?
The traditional options for “find cat photos” would be:
- Cloud service: Upload photos to Google/Apple/Amazon, let them scan everything, hope they respect your privacy
- Desktop app: Find something in an app store, hope it does what you want, deal with subscription fees or ads
- Write it yourself: Spend days learning YOLO integration, tensor formats, image preprocessing
Personal Software offers a fourth path: describe what you need, get exactly that, own the result completely.
Cat Finder runs entirely on my machine. No accounts, no uploads, no subscriptions, no ads. The code is mine to modify, extend, or share.
What Got Built
A Rust CLI tool using YOLOv8n (the nano variant) through ONNX Runtime:
```
Directory Traversal → Image Preprocessing → YOLO Inference → Cat Detection → Output
```

The Detection Pipeline
- Walk directories recursively, finding image files (jpg, png, gif, webp, etc.)
- Preprocess each image: resize to 640×640, normalize to 0.0-1.0, convert to NCHW tensor format
- Run inference through the YOLOv8n ONNX model
- Parse output for class ID 15 (cat in COCO ordering) above confidence threshold
- Print matching paths to stdout for easy piping to other tools
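The tensor-layout part of step 2 can be sketched in numpy (assumes an already-resized 640×640 HWC image; the real tool also handles resizing and file I/O):

```python
import numpy as np

def preprocess(image_hwc_uint8):
    """Convert an HWC uint8 image into a normalized NCHW float32 tensor."""
    x = image_hwc_uint8.astype(np.float32) / 255.0  # normalize to 0.0-1.0
    x = np.transpose(x, (2, 0, 1))                  # HWC -> CHW
    return x[np.newaxis, ...]                       # add batch dim -> NCHW

img = np.zeros((640, 640, 3), dtype=np.uint8)       # stand-in for a real photo
tensor = preprocess(img)
print(tensor.shape)  # (1, 3, 640, 640)
```

NCHW (batch, channels, height, width) is the layout the exported YOLOv8 ONNX model expects.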
Unix Philosophy
```shell
# stdout: just paths (machine-parseable)
# stderr: logging and progress
cat-finder ~/Photos | xargs -I {} cp {} ~/CatPhotos/
```

This separation enables composable Unix pipelines. The tool does one thing well and plays nicely with others.
Technology Stack
| Component | Purpose |
| --- | --- |
| Rust | Memory-safe, high-performance core |
| YOLOv8n | Lightweight object detection (12MB model) |
| ONNX Runtime | Cross-platform inference engine |
| clap | CLI argument parsing |
| ndarray | Tensor operations |
| walkdir | Recursive directory traversal |

Total footprint: ~80MB (runtime + model + binary)
I didn’t choose this stack—Claude Code did, based on the requirements. It made good choices.
Usage
```shell
# Basic usage
cat-finder ~/Photos

# Adjust confidence threshold
cat-finder --confidence 0.5 ~/Photos

# Verbose output with timestamps
cat-finder -v -t ~/Photos

# Copy all cat photos to a new folder
cat-finder ~/Photos | xargs -I {} cp {} ~/CatAlbum/
```

Honest About Limitations
The README documents failure cases transparently:
| Image Type | Detection Success |
| --- | --- |
| Clear photographs | High |
| Artistic/stylized images | Low |
| Cats in clothing | Low |
| Small/partial cats | Variable |
| Low quality/blurry | Variable |

Test results: 7 of 9 cat images detected (77.8% recall). Oil paintings and anthropomorphized cats confuse models trained on photographs. This is documented, not hidden.
Bonus Features
The project grew organically based on related needs:
Duplicate Finder: A second binary for finding duplicate images using size-based filtering followed by SHA-256 checksums.
```shell
find-duplicates ~/Photos
```

Web Demo: A Flask-based interface for visual feedback with real-time progress via Server-Sent Events.
These emerged from “while you’re at it…” requests during development. Vibe coding makes feature additions nearly frictionless.
Setup
```shell
git clone https://github.com/sw-ml-study/cat-finder
cd cat-finder
./scripts/setup.sh  # Downloads model, builds project
./cat-finder ~/Photos
```

The Personal Software Philosophy
Privacy-first: All processing happens locally. No cloud APIs, no external services, no data leaving your machine.
Ownership: The code is yours. Modify it, extend it, share it, delete it.
Fit-for-purpose: Built for exactly what you need, nothing more, nothing less.
Transparency: Known limitations documented. No marketing spin.
References
| Resource | Link |
| --- | --- |
| YOLOv8 | Ultralytics YOLOv8 - State-of-the-art object detection |
| ONNX Runtime | ONNX Runtime - Cross-platform inference engine |
| ort crate | ort - Rust bindings for ONNX Runtime |
| COCO Dataset | COCO Classes - Class ID 15 = cat |
You don’t always need an app store or a cloud service. Sometimes you just need to describe what you want and let an AI build it for you. That’s vibe coding. That’s personal software.
-
503 words • 3 min read • Abstract
Five ML Concepts - #11

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #11 |
References
| Concept | Reference |
| --- | --- |
| RNN | Learning representations by back-propagating errors (Rumelhart et al. 1986) |
| Chain of Thought | Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022) |
| Softmax | Deep Learning (Goodfellow et al. 2016), Chapter 6 |
| MoE | Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al. 2017) |
| Distribution Shift | Dataset Shift in Machine Learning (Quiñonero-Candela et al. 2009) |

Today’s Five
1. RNN (Recurrent Neural Network)
Networks designed for sequential data that maintain a hidden state carrying information across time steps. This makes them useful for language, time series, and audio.
LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are improved variants that better handle long-range dependencies.
Like reading a story while keeping mental notes about characters and plot as you go.
2. Chain of Thought
A prompting technique that encourages step-by-step reasoning in language models. Instead of producing an answer immediately, the model generates intermediate steps.
This can improve performance on math, logic, and multi-step problems.
Like showing your work on a math test instead of just writing the final answer.
3. Softmax
Converts a vector of scores into a probability distribution where each output falls between zero and one, and all outputs sum to one. It is commonly used in classification models.
Softmax makes raw scores easier to interpret as probabilities.
Like turning test scores into percentages that add up to 100%.
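As a concrete sketch, softmax is a few lines of code. This version subtracts the maximum score before exponentiating, a standard trick to avoid overflow (the result is mathematically identical):

```rust
// Softmax: exponentiate shifted scores, then normalize so outputs sum to 1.
// Subtracting the max first keeps exp() from overflowing on large scores.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Feeding in scores like `[2.0, 1.0, 0.1]` yields probabilities that preserve the ordering of the raw scores and sum to one.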
4. MoE (Mixture of Experts)
Instead of one large network, the model contains many smaller expert networks with a routing mechanism that selects which experts process each input. This allows models to scale capacity while keeping computation efficient.
Only a subset of experts activates for any given input.
Like a hospital with specialists where a receptionist directs you to the right doctor.
5. Distribution Shift
Occurs when deployment data differs from training data, causing a model trained on one environment to perform poorly in another. Common causes include seasonal changes, user behavior shifts, or new populations.
Monitoring for drift and retraining helps maintain performance.
Like a weather model trained on summer data struggling to predict winter storms.
Quick Reference
| Concept | One-liner |
| --- | --- |
| RNN | Sequential processing with memory across time |
| Chain of Thought | Step-by-step reasoning in prompts |
| Softmax | Scores to normalized probabilities |
| MoE | Route inputs to specialized experts |
| Distribution Shift | Training vs deployment data mismatch |
Short, accurate ML explainers. Follow for more.
-
995 words • 5 min read • Abstract
RLM: Recursive Language Models for Massive Context

What happens when your data won’t fit in a context window? RLM expands the workspace instead of cramming everything into limited memory. This post covers the MIT paper, my Rust implementation, and six video demonstrations.
| Resource | Link |
| --- | --- |
| Paper | arXiv:2512.24601 |
| Code | rlm-project |
| Playlist | RLM Implementations |

The Problem: Context Limits
Large language models have a hard limit. They can only process so much text at once.
Imagine a cookie jar that holds 100 cookies. What if you need to search through ten thousand? When you force too much in, the model forgets things—this is called context rot.
Bigger models help, but the limit always exists. We need a different approach.
The RLM Solution
Recursive Language Models flip the problem. Instead of bigger jars, use better tools.
The data stays in a context box. The model gets tools to peek inside:
| Tool | Purpose |
| --- | --- |
| slice | Get a character range |
| find | Search for text |
| regex | Pattern matching |
| count | Count occurrences |
| llm_query | Ask a sub-LLM to analyze a chunk |

Small, focused, deliberate. The model thinks about what it needs, then asks for just that.
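To make the tool idea concrete, here is a minimal sketch of three of these operations as plain Rust functions. The signatures are assumptions for illustration (the real server exposes these as commands over a context box), and the byte indexing in `slice` assumes ASCII text:

```rust
// The model never sees `text` in full; it only receives the small
// results these calls return.

// Return a character range (ASCII assumed; byte == char offset).
fn slice(text: &str, start: usize, end: usize) -> &str {
    &text[start.min(text.len())..end.min(text.len())]
}

// Byte offset of the first match, if any.
fn find(text: &str, needle: &str) -> Option<usize> {
    text.find(needle)
}

// Number of non-overlapping occurrences.
fn count(text: &str, needle: &str) -> usize {
    text.matches(needle).count()
}
```

Each call returns a few bytes or a number, so a 3 MB document costs the model only as many tokens as the answers it asks for.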
The Results
From the MIT paper—on tasks that don’t fit in context:
| Approach | Accuracy |
| --- | --- |
| Standard prompting | 0% |
| RLM | 87-91% |

Results hold across GPT-4, Claude, Llama, Mistral, and Gemini.
My Implementation: Four Capability Levels
I built a Rust implementation with four capability levels:
| Level | Name | Description |
| --- | --- | --- |
| L1 | DSL | Built-in commands (find, regex, count) |
| L2 | WASM | LLM generates Rust → compiles to WebAssembly sandbox |
| L3 | CLI | LLM generates Rust → compiles to native binary |
| L4 | LLM | Recursive delegation to sub-LLMs |

Each level trades off safety for capability:
- L1 is instant but limited to predefined operations
- L2 runs custom code but in a sandboxed environment
- L3 breaks free for large datasets that would timeout in WASM
- L4 uses LLM reasoning for semantic analysis
The Video Series
Six videos demonstrate RLM in action:
1. RLM Explained
The foundational video. Covers the MIT paper, the cookie jar analogy, and benchmark results showing 0% → 91% accuracy improvement.
Key insight: Expand the workspace, not the context.
2. War and Peace Demo
Can AI read all of War and Peace to find a hidden secret? The full text is 3.2 MB with 65,666 lines—way too big for any context window.
RLM finds “the password to Prince Andrei’s secret vault” in just 2 iterations using only 3,000 tokens. That’s a savings of over 99% compared to sending the full document.
3. WASM Sandboxing
What if your LLM could write custom analysis code on the fly? Level 2 demonstrates WebAssembly sandboxing.
The LLM writes Rust code that compiles to WASM and runs in a secure sandbox. Demos include:
- Error ranking in logs
- Response time percentiles
- Unique IP counting
Trade-offs: ASCII only, 64MB memory limit, subset of Rust.
4. Native CLI Binaries
When 5,000 lines would timeout in WASM, Level 3 breaks free. Native Rust binaries process massive datasets with no limits.
Four CLI demos:
- Error ranking: Hash map counts error types
- Unique IPs: Hash set finds distinct addresses
- Percentiles: Sort and index for p50/p95/p99
- Word frequency: Tokenize, filter stop words, count
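The "sort and index" step from the percentiles demo can be sketched in a few lines. This uses the nearest-rank convention (one of several common percentile definitions; the actual demo may differ):

```rust
// Nearest-rank percentile over a pre-sorted slice: scale the percentile
// to an index into the sorted data and read the value there.
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let idx = ((p / 100.0) * (sorted.len() as f64 - 1.0)).round() as usize;
    sorted[idx]
}
```

With a sorted vector of response times, `percentile(&times, 50.0)`, `percentile(&times, 95.0)`, and `percentile(&times, 99.0)` give p50/p95/p99 directly.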
5. Detective Mystery Demo
A murder at the manor. Seven suspects. Dozens of clues. Can an LLM solve it?
Level 4 delegates reasoning to sub-LLMs. Instead of code execution, the model calls other models to:
- Analyze witness statements
- Compare alibis
- Draw conclusions
Watch as L4 examines each suspect and identifies the killer.
6. Large Context Processing
War and Peace is 3MB—far too large for any context window. This video shows Level 4 extracting noble family relationships from the entire novel.
The process:
- L3 extracts relationship sentences (father, mother, son, daughter…)
- L4 analyzes filtered data with sub-LLMs
- Final output: structured family trees
Three million characters → structured family trees in ~90 seconds.
Architecture
```
┌─────────────┐      ┌─────────────────┐      ┌─────────────┐
│   Client    │────▶ │   RLM Server    │────▶ │  Root LLM   │
│ /visualize  │      │  (Rust/Axum)    │      │ (DeepSeek)  │
└─────────────┘      └────────┬────────┘      └─────────────┘
                              │
                     ┌────────▼────────┐
                     │ Command Executor│
                     │  slice, find,   │
                     │  regex, count,  │
                     │  llm_query...   │
                     └────────┬────────┘
                              │
               ┌──────────────┼──────────────┐
               ▼              ▼              ▼
         ┌──────────┐   ┌──────────┐   ┌──────────┐
         │  Ollama  │   │  Ollama  │   │  Ollama  │
         │ (local)  │   │ (remote) │   │ (other)  │
         └──────────┘   └──────────┘   └──────────┘
                Sub-LM Pool (for llm_query)
```

Quick Start
```shell
cd rlm-orchestrator
# Configure providers in config.toml
export DEEPSEEK_API_KEY="your-key"
# Run the server
cargo run --bin rlm-server
# Open visualizer
open http://localhost:8080/visualize
```

The Cookie Jar Analogy
Think of it like this:
- Old way: Dump everything on the table, then dig through the mess
- RLM way: Use a scoop—grab just the cookies you need
The key insight is simple: expand the workspace, not the context.
Resources
- RLM Paper (arXiv:2512.24601) - Zhang, Kraska, Khattab (MIT CSAIL)
- rlm-project Repository
- rlm-project Wiki
- RLM Implementations Playlist
- ELI5: What is RLM?
When context windows aren’t enough, RLM gives your LLM tools to explore. Six videos, four capability levels, one insight: expand the workspace, not the context.
-
499 words • 3 min read • Abstract
Five ML Concepts - #10

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #10 |
References
| Concept | Reference |
| --- | --- |
| CNN | ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al. 2012) |
| Encoder-Decoder | Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014) |
| RAG | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020) |
| Few-shot Learning | Language Models are Few-Shot Learners (Brown et al. 2020) |
| Distillation | Distilling the Knowledge in a Neural Network (Hinton et al. 2015) |

Today's Five
1. CNN (Convolutional Neural Network)
Networks designed for image data that use small filters sliding across an image to detect edges, textures, and shapes. Early layers find simple patterns, while deeper layers recognize complex objects.
CNNs are a foundation of modern computer vision.
Like scanning a photo with a magnifying glass that learns to recognize patterns at different scales.
2. Encoder-Decoder
A model architecture with two parts: the encoder compresses input into a representation, and the decoder generates an output from that representation. This pattern is common in translation, summarization, and speech systems.
The representation acts as a bottleneck that captures essential information.
Like summarizing a book into notes, then writing a new version from those notes.
3. RAG (Retrieval-Augmented Generation)
Instead of relying only on learned parameters, the model retrieves relevant documents and uses them during generation. This helps ground responses in external information and can reduce hallucinations.
RAG combines the strengths of retrieval systems and generative models.
Like an open-book exam where you can look up facts instead of relying purely on memory.
4. Few-shot Learning
Adapting behavior from just a few examples provided directly in the prompt. Instead of retraining, the model infers the pattern and applies it to new inputs.
Zero-shot learning relies only on instructions, without examples.
Like learning a card game by watching a few hands before playing.
5. Distillation
Transferring knowledge from a large teacher model to a smaller student. The student learns to match the teacher’s outputs, not its internal weights.
This produces models that are smaller and cheaper while retaining much of the original capability.
Like an apprentice learning by imitating a master’s finished work, not by copying their brain.
Quick Reference
| Concept | One-liner |
| --- | --- |
| CNN | Sliding filters for hierarchical image features |
| Encoder-Decoder | Compress input, then generate output |
| RAG | Retrieve context before generating |
| Few-shot Learning | Learn from examples in the prompt |
| Distillation | Small student mimics large teacher |
Short, accurate ML explainers. Follow for more.
-
1633 words • 9 min read • Abstract
TBT (3/?): Vector Graphics Games

Before pixels, there were vectors. This Throwback Thursday explores the evolution of vector graphics gaming—from military radar displays to arcade classics—and my attempt to recreate them in Rust and WebAssembly.
| Resource | Link |
| --- | --- |
| Live Demo | Play in Browser |
| Video | TBT Vector Graphics Games |
| Games | vectorcade-games |
| Shared | vectorcade-shared |
| Fonts | vectorcade-fonts |
| Renderer | vectorcade-render-wgpu |
| Web | vectorcade-web-yew |

My First Vector Display: The IBM 2250
IBM 2250 at Brown University, 1969 (photo credit in References). My first encounter with vector graphics was an IBM 2250 Graphics Display Unit—introduced in 1964, costing around $280,000 in period dollars. It connected to an IBM 1130 that acted as a graphics controller for an IBM S/370 mainframe where the graphical applications ran. At that price, nobody was playing games on it—Computer Aided Design was the killer app.
The 2250’s specifications were impressive for its era:
| Specification | Value |
| --- | --- |
| Display | 21-inch P39 phosphor CRT |
| Resolution | 1024 × 1024 addressable points |
| Usable area | 12” × 12” (square aspect) |
| Refresh rate | ~40 frames/second |
| Input | Light pen for direct interaction |
| Vector drawing | Hardware character generator optional |

The CRT drew lines by steering an electron beam directly—no pixel grid, no rasterization. Just pure geometry traced in phosphor glow. The green P39 phosphor had long persistence, reducing flicker but creating ghostly trails on moving objects.
The light pen was revolutionary: you could point directly at displayed geometry and the system knew which vector you were touching. Interactive graphics in 1964.
The Arcade Era
Vector displays found their way into arcades, where they defined a visual style that’s still recognizable today:
| Game | Year | Innovation |
| --- | --- | --- |
| Lunar Lander | 1979 | Physics simulation, thrust/gravity |
| Asteroids | 1979 | Wrap-around space, particle effects |
| BattleZone | 1980 | Green wireframe 3D, first-person tanks |
| Tempest | 1981 | Multi-colored vectors, pseudo-3D depth |

(Note: Pong (1972) was actually a raster game using discrete logic, but its simple geometry makes it a natural fit for vector recreation.)
Each generation built on the last. White vectors on black screens gave way to green wireframes, then full color. The hardware pushed boundaries that feel primitive now but were revolutionary then.
The Vectorcade Project
Vectorcade recreates these mechanics using modern tools:
- Rust for game logic and rendering
- WebAssembly for browser deployment
- wgpu for GPU-accelerated vector rendering
- Yew for the web frontend
Multi-Repo Architecture
The project architecture emerged from a design session with ChatGPT, exploring how to structure a multi-agent development workflow. The result: a DAG of repositories, each with clear ownership boundaries:
```
vectorcade-shared/       (Pure Rust API contracts)
        ↓
vectorcade-fonts/        (Vector font styles)
        ↓
vectorcade-games/        (Game logic: Pong, Asteroids, etc.)
        ↓
vectorcade-render-wgpu/  (wgpu + lyon tessellation)
        ↓
vectorcade-web-yew/      (Yew web shell)
```

This DAG structure allows parallel development with assigned agent roles:
| Agent | Repo | Focus |
| --- | --- | --- |
| A | vectorcade-shared | Core API steward: minimal, stable, pure |
| B | vectorcade-fonts | Font stylist: 3-5 distinct vector styles |
| C | vectorcade-games | Game logic: Pong → Asteroids → Lunar Lander |
| D | vectorcade-render-wgpu | Renderer: lyon tessellation → wgpu triangles |
| E | vectorcade-web-yew | Integrator: UI, mobile controls, PWA |

Each agent works against stable interfaces—the `DrawCmd` display list and `Game` trait—so they don't step on each other.

The Display List Model
Games don’t render directly. They emit draw commands that the renderer interprets:
```rust
pub enum DrawCmd {
    Clear { color: Rgba },
    Line(Line2),
    Polyline { pts: Vec<[f32; 2]>, closed: bool, stroke: Stroke },
    Text { pos: [f32; 2], s: String, size_px: f32, color: Rgba },
    PushTransform(Transform2),
    PopTransform,
}
```

This keeps game logic portable. The same Asteroids code can render through wgpu on desktop, WebGPU in browsers, or even a software rasterizer.
Vector Fonts
Classic arcade games had distinctive lettering. Vectorcade includes multiple font styles to match:
| Style | Look | Games |
| --- | --- | --- |
| ATARI | Boxy, utilitarian | Asteroids, Lunar Lander |
| CINEMATRONICS | Thin, angular | Star Castle |
| MIDWAY | Slightly rounded | Defender |
| VECTOR_SCANLINE | Broken segments | “Beam jitter” effect |

Each font is pure vector geometry—no bitmaps, no texture atlases.
3D Projection
BattleZone and Tempest need 3D-to-2D projection. Instead of a full 3D renderer, Vectorcade uses a “2.5D pipeline”:
```rust
pub struct Camera3 {
    pub pos: [f32; 3],
    pub yaw: f32,
    pub pitch: f32,
    pub fov_y_rad: f32,
}

pub fn project_polyline(cam: &Camera3, pts3: &[[f32; 3]]) -> Vec<[f32; 2]>;
```

Games maintain 3D geometry; the core projects it to 2D lines. Depth-based brightness gives the classic “farther = dimmer” effect.
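For intuition, the core of such a pipeline is a perspective divide. This `project` helper is a hypothetical standalone sketch, not the actual `project_polyline` from vectorcade-shared: it assumes a camera at the origin looking down -z, ignoring yaw and pitch:

```rust
// Minimal perspective projection: scale x/y by focal length, divide by depth.
// Points at or behind the camera plane (z >= 0) are culled.
fn project(pt: [f32; 3], fov_y_rad: f32, aspect: f32) -> Option<[f32; 2]> {
    let [x, y, z] = pt;
    if z >= 0.0 {
        return None; // behind the camera
    }
    // Focal length from the vertical field of view.
    let f = 1.0 / (fov_y_rad / 2.0).tan();
    Some([f * x / (aspect * -z), f * y / -z])
}
```

The full pipeline would first rotate points into camera space using yaw and pitch, then apply this divide; brightness can then be scaled by `1.0 / -z` for the farther-is-dimmer effect.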
Why Rust + WASM?
The combination solves several problems:
- Performance: Games need consistent frame rates; Rust delivers
- Portability: Same code runs native and in browsers
- Safety: No dangling pointers in the game loop
- Modern tooling: Cargo, wasm-pack, Trunk make deployment straightforward
The wgpu + lyon stack provides cross-platform GPU rendering with proper thick-line support (WebGL's `lineWidth` is notoriously inconsistent).

Current Status
| Component | Status |
| --- | --- |
| vectorcade-shared | Functional |
| vectorcade-fonts | Functional |
| vectorcade-games | Playable (5 demos) |
| vectorcade-render-wgpu | Functional |
| vectorcade-web-yew | Functional |

The core architecture works. All five demos are playable in the browser. Polish and audio remain.
The Demos
The video showcases five demonstrations, progressing from static display to full gameplay:
1. IBM 2250 Chessboard
A static image rendered in the style of the original IBM 2250. The 2250 was mainly used for Computer Aided Design, but programmers did create games on it—this chessboard pays tribute to that era.
2. Pong (Playable)
A vector implementation of the classic. The original Pong (1972) wasn’t actually a vector game—it used discrete logic and a raster display—but some clones used vector hardware. This recreation captures the pure-geometry aesthetic.
3. Asteroids (Playable)
One of the most popular vector arcade games. Rotate, thrust, and shoot to survive. The ship and asteroids wrap around screen edges, creating the classic “infinite space” feel.
4. BattleZone (Playable)
Green wireframe 3D tanks. Drive through a battlefield, shooting enemies and dodging missiles. One of the first games with true 3D perspective—rendered entirely with vectors.
5. Tempest (Playable)
The pinnacle of vector arcade hardware. Move around the edge of geometric tubes, shooting enemies that climb up from the depths. Each level changes the tube shape and color scheme.
Implementation
Each game implements the same `Game` trait:

```rust
pub trait Game {
    fn metadata(&self) -> GameMeta;
    fn reset(&mut self, ctx: &mut GameCtx);
    fn update(&mut self, ctx: &mut GameCtx, dt: f32);
    fn render(&mut self, ctx: &mut GameCtx, out: &mut Vec<DrawCmd>);
}
```

This makes games drop-in replaceable in the web shell—no renderer changes needed.
TODO
The demos are playable but not finished. Remaining work:
- GPU rendering: Switch from Canvas 2D emulation to actual wgpu GPU rendering [Ed. Completed 2/13]
- Music and sound effects: Authentic arcade audio
- More aggressive opponents: AI improvements for challenge
- Additional levels/difficulties: Progression and replay value
- More animations: Explosions, transitions, effects
Resources
Before pixels, there were vectors. Vectorcade brings them back—in Rust, for the browser, with phosphor glow optional.
Credits
| Role | Credit |
| --- | --- |
| Director | Mike Wright |
| Research & Architecture | ChatGPT |
| vectorcade-shared | Claude Code CLI agent |
| vectorcade-fonts | Claude Code CLI agent |
| vectorcade-games | Claude Code CLI agent |
| vectorcade-render-wgpu | Claude Code CLI agent |
| vectorcade-web-yew | Claude Code CLI agent |
| Explainer Video | Claude Code |
| Blog Post | Claude Code |

Timeline: First pass vibe coded in one day (February 12, 2026)
- First commit: 11:08 AM PST
- Last commit: 5:08 PM PST
- Total commits: 52 across 4 repositories
- WGPU support added February 13, 2026
References
IBM 2250 Photo: “HES IBM 2250 Console grlloyd Oct1969” by Gregory Lloyd, October 1969. Brown University Hypertext Editing System (HES) demonstration. Licensed under CC BY-SA 4.0. Used with attribution.
-
781 words • 4 min read • Abstract
DyTopo: Dynamic Topology for Multi-Agent AI

When multiple AI agents work together, how should they communicate? Fixed patterns fail at scale. DyTopo rebuilds the communication graph each round based on what agents need and what they can offer.
| Resource | Link |
| --- | --- |
| Video | DyTopo |
| Paper | arXiv:2505.16128 |
| Code | dytopo-rs |

The Problem: Fixed Topologies Don't Scale
Multi-agent systems need communication patterns. The obvious approaches have problems:
| Topology | Problem |
| --- | --- |
| All-to-all | Context explosion—every agent reads every message |
| Chain | Bottlenecks—one slow agent blocks everyone |
| Star | Single point of failure at the hub |

As agent count grows, fixed topologies either explode in messages or create chokepoints.
The DyTopo Solution: Dynamic Routing
DyTopo (Dynamic Topology) solves this by reconstructing the communication graph each round. The key insight: agents know what they need and what they can offer.
Each round, every agent emits:
- Query: What information do I need?
- Key: What can I contribute?
The router computes semantic similarity between all keys and queries, then builds a sparse directed graph:
```
score(sender → receiver) = cosine(sender.key, receiver.query)
```

High-scoring pairs connect. Low-scoring pairs are ignored. The result: efficient, adaptive communication.
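The scoring rule can be sketched in a few lines. The `cosine` and `route` helpers below are illustrative, not the actual dytopo-router API, and for brevity self-edges are not excluded:

```rust
// Cosine similarity between a sender's key and a receiver's query.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

// keys[i] = sender i's key; queries[j] = receiver j's query.
// Returns, for each receiver, the indices of its top-K senders.
fn route(keys: &[Vec<f32>], queries: &[Vec<f32>], k: usize) -> Vec<Vec<usize>> {
    queries
        .iter()
        .map(|q| {
            let mut scored: Vec<(usize, f32)> = keys
                .iter()
                .enumerate()
                .map(|(i, key)| (i, cosine(key, q)))
                .collect();
            // Highest similarity first; cosine scores here are always finite.
            scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
            scored.into_iter().take(k).map(|(i, _)| i).collect()
        })
        .collect()
}
```

The output is exactly the sparse directed graph: an edge from sender `i` to receiver `j` whenever `i` appears in receiver `j`'s top-K list.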
How It Works
```
Round N:
1. Manager broadcasts goal
2. Each agent produces:
   - Query (what I need)
   - Key (what I offer)
   - Draft (my current contribution)
3. Router embeds keys and queries
4. Similarity matrix → sparse graph (top-K per receiver)
5. Messages flow along edges
6. Trace written to JSONL
```

The topology adapts every round. An agent working on parsing might connect to the syntax expert in round 1, then the error-handling expert in round 2.
The Implementation: Rust, Zero Python
dytopo-rs is a fully Rust implementation with no Python dependencies:
| Crate | Purpose |
| --- | --- |
| dytopo-core | Shared types (AgentId, Topology, TraceEvent) |
| dytopo-embed | Text embedding (hash-based baseline, semantic planned) |
| dytopo-router | Sparse graph construction |
| dytopo-agents | Agent implementations |
| dytopo-orchestrator | Main execution loop |
| dytopo-viz | DOT export for visualization |
| dytopo-cli | Command-line interface |

Why Rust?
- Zero-cost abstractions for performance-critical embedding/routing
- Strong type system catches protocol mismatches at compile time
- No Python dependency for baseline demos
- Fearless concurrency for future parallelization
Running the Demo
```shell
cargo run -p dytopo-cli -- demo --rounds 3 --agents 5 --topk 2
```

This produces:
- Per-round topology printed to stdout
- `./traces/trace_*.jsonl` for machine-readable analysis
- DOT files for graph visualization
Current Status
Milestone 0 is complete—the system runs end-to-end with stub agents and hash-based embeddings.
| Feature | Status |
| --- | --- |
| Core types and traits | Done |
| Hash embedder (deterministic) | Done |
| Top-K sparse routing | Done |
| Stub agents with templates | Done |
| Orchestrator loop | Done |
| JSONL tracing | Done |
| DOT visualization | Done |

Planned
- Semantic embeddings (fastembed/candle)
- LLM-backed agents (Ollama integration)
- Inbox summarization for long conversations
- Evaluation harness comparing topologies
Key Design Decisions
Why Hash Embeddings First?
The baseline uses deterministic hash-based embeddings:
- Reproducible demos for debugging
- No external dependencies to download
- Validates the full pipeline before adding ML complexity
Semantic embeddings are planned as drop-in replacements.
Why Sparse Graphs?
Each agent receives at most `topk` messages per round:

- Prevents context explosion as agent count grows
- Makes communication interpretable—you can trace why agents connected
- Matches the paper’s approach
Why JSONL Traces?
Every event is logged to JSONL:
- Append-only for streaming
- Line-based for grep/filtering
- Machine-parseable for analysis tools
- Human-readable for debugging
Topology Comparison
The system supports multiple topology strategies for comparison:
| Strategy | Description | Use Case |
| --- | --- | --- |
| dynamic | DyTopo routing | Adaptive, sparse |
| fully_connected | All-to-all | Baseline comparison |
| chain | Sequential | Pipeline tasks |
| star | Hub-and-spoke | Centralized coordination |

What's Next
- LLM Agent Support (Milestone 2)—Replace stubs with real reasoning
- Semantic Embeddings (Milestone 1)—Meaningful routing decisions
- Evaluation Harness (Milestone 4)—Quantify DyTopo advantages
Resources
- DyTopo Paper (arXiv:2505.16128) - Li et al., 2025
- dytopo-rs Repository
Dynamic topology lets agents find the right collaborators each round. No context explosion. No bottlenecks. Just efficient, adaptive communication.
-
1211 words • 7 min read • Abstract
Towards Continuous LLM Learning (1): Sleepy Coder - When Fine-Tuning Fails

What if your AI coding assistant could learn from its mistakes? Not just for one session, but across training cycles. We built exactly that—and fifty-one adapters later, learned the mistake was trying to teach it at all.
| Resource | Link |
| --- | --- |
| Video | Sleepy Coder |
| Code | sleepy-coder |
| Share Paper | arXiv:2602.06043 |
| UWSH Paper | arXiv:2512.05117 |
| Part 2 | Routing Prevents Forgetting |

The Dream: Day/Night Learning
AI coding agents have a memory problem. They fix a bug today, then make the same mistake next week. Every session starts from the same frozen model. Nothing carries forward.
The idea was elegant: build an agent that improves overnight.
```
DAY CYCLE (Inference)
  Agent attempts to fix Rust compiler errors
  Successes and failures are logged
        ↓
NIGHT CYCLE (Training)
  Fine-tune on failure patterns using LoRA
  Create specialized adapters
        ↓
EVAL
  Test against benchmark
  Measure improvement
        ↓
    (repeat)
```

During the day, the agent works and we log its failures—the error messages, the broken code, and the fixes that worked. Overnight, we fine-tune the model on those failures. Each morning, a new checkpoint should wake up a little better than before.
We based this on two papers from the Johns Hopkins team (Kaushik, Vaidya, Chaudhari, Chellappa, Yuille):
- Share LoRA Subspaces (arXiv:2602.06043) — Learn a shared low-rank basis across tasks, then train only coefficients (76x fewer parameters per task)
- UWSH (arXiv:2512.05117) — The Universal Weight Subspace Hypothesis suggests neural networks converge to shared spectral subspaces
The theory was sound. The implementation worked. The results were devastating.
The System
The Sleepy Coder agent runs in a Rust runtime, fixing compiler errors on 30 “koans” (small coding exercises) across 5 error families:
- Borrow Checker: Ownership and lifetime errors
- Type Bounds: Missing trait implementations
- Result Handling: Option/Result conversions
- Type Mismatches: Incompatible types
- Missing Items: Undefined functions or modules
The base model: Qwen2.5-Coder-1.5B-Instruct — small enough to train on a single GPU, capable enough to pass most koans without any fine-tuning.
The Journey: From Hope to Reality
Chapter 1: Naive LoRA
First attempt: standard fine-tuning on failure patterns.
| Metric | Before | After |
| --- | --- | --- |
| Pass Rate | 73.3% | 60.0% |
| Change | — | -13.3% |

Catastrophic forgetting. The model learned the new patterns but forgot how to do everything else.
Chapter 2: The Paper Chase
We found the Share paper promising “continual learning without forgetting.” The UWSH paper provided theoretical backing: neural networks naturally converge to shared low-rank subspaces.
Key insight from Share:
Train ONLY the coefficients. Keep the basis FROZEN.
This meant ~21,000 trainable parameters instead of ~1.6 million. A 76x reduction.
Chapter 3: The Proper Implementation
SVD: Singular Value Decomposition breaks a matrix into components that reveal its underlying structure. In Share, SVD finds the common “directions” that multiple LoRA adapters share—a compressed basis that captures what they have in common.
We rebuilt everything:
- Phase 1: Extract shared basis from 51 adapters via SVD
- Phase 2: Train only coefficient vectors (frozen basis)
- Phase 3: Merge and update basis periodically
We trained 51 pattern-specific adapters. We followed the algorithm precisely.
Chapter 4: The Stubborn Seven
No matter what we tried, 7 tasks kept failing:
| Task | The Problem |
| --- | --- |
| bc_003 | Mutable borrow while immutable exists |
| bc_005 | Double mutable borrow |
| bc_010 | Returning reference to local data |
| tb_002 | Missing Clone trait |
| tb_007 | Missing Hash trait |
| tb_008 | Missing Ord trait |
| rh_004 | Option to Result conversion |

These require deep understanding of Rust's ownership system—something a 1.5B model can't reliably learn.
Chapter 5: The Final Score
| Approach | Pass Rate | vs Baseline | Regressions |
| --- | --- | --- | --- |
| Baseline (no training) | 73.3% | — | 0 |
| Naive LoRA | 60.0% | -13.3% | Many |
| Targeted LoRA (7 patterns) | 63.3% | -10% | 4+ |
| Replay buffer | 70.0% | -3.3% | 2 |
| Phase 2 coef-only (10K params) | 66.7% | -6.6% | 2 |
| Share Full (Ph2+Ph3) | 73.3% | 0% | 0 |

The Share algorithm did exactly what it claimed: it prevented forgetting. But it couldn't improve beyond baseline because there was nothing to improve.
What Went Wrong
1. The Model Already Knows
The base model already passes 73% of patterns. Training on these patterns doesn’t add knowledge—it dilutes what’s there.
2. Training Causes Forgetting
Even training only on the 7 failure patterns (44 examples) caused 4 new regressions. The model’s knowledge is interconnected.
3. Averaging Destroys Specialization
The Share paper assumes task routing at inference—selecting the right coefficients for each task. We averaged coefficients, which negated any specialization.
4. More Adapters Made It Worse
| Adapter Count | Pass Rate |
| --- | --- |
| 6 adapters | 73.3% |
| 51 adapters | 70.0% |

More adapters meant more subspace dilution when averaging. The signal got lost in the noise.
The Critical Insight
LoRA fine-tuning cannot improve a capable base model for tasks it already handles reasonably well.
The model’s knowledge is interconnected. Even 10,000 trainable parameters (0.0007% of the model) can break things. The baseline represents the ceiling, not the floor.
What We Learned
- Read the room. If your base model passes 73%, maybe it doesn't need fine-tuning. Maybe it needs better prompts.
- Negative results are results. 51 failed experiments taught us more than a successful one would have.
- Catastrophic forgetting is real. Small models especially can't absorb new knowledge without losing old.
- Share prevents forgetting, not ignorance. The algorithm does what it claims—it just can't create knowledge from nothing.
- Sometimes the answer is "don't." The best LoRA adapter for this task is no adapter.
- Task routing vs averaging matters. The Share paper assumes you select coefficients based on task type, not blend them together.
- AI coding agents cut corners. When implementing research papers, AI agents repeatedly stopped before completing all phases of the algorithm. I had to direct the agent to re-read the papers many times before it implemented them correctly.
Paths Forward
Since fine-tuning doesn’t work here, alternatives:
| Approach | Tradeoff |
| --- | --- |
| Prompt engineering | No weight changes, limited by context |
| Multi-turn repair | Uses base model reasoning, slower |
| Larger model (7B+) | More capacity to absorb knowledge |
| Task routing with Share | Select coefficients, don't average |
| Model ensemble | Multiple models, pick best output |
| Accept baseline | 73% may be good enough |

The Numbers
- Experiments run: 51 adapters, multiple algorithms
- Parameters trained: From 10K to 1.6M per adapter
- Best achieved: 73.3% (matches baseline)
- Target: ≥76.7%
- Conclusion: Target not achievable with LoRA

Resources
Sometimes the most valuable research shows what doesn’t work. Fifty-one adapters later, we know: let sleeping models lie.
-
470 words • 3 min read • Abstract
Five ML Concepts - #9

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #9 |
References
| Concept | Reference |
| --- | --- |
| Dropout | Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014) |
| RLHF | Training language models to follow instructions with human feedback (Ouyang et al. 2022) |
| Inference | Deep Learning (Goodfellow et al. 2016), Chapter 5 |
| Quantization | A Survey of Quantization Methods for Efficient Neural Network Inference (Gholami et al. 2021) |
| Flash Attention | FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al. 2022) |

Today's Five
1. Dropout
A regularization technique that randomly disables units during training. This encourages the network to rely on multiple pathways instead of memorizing patterns.
It helps reduce overfitting, especially in large models.
Like training a team where random members sit out each practice, so no one becomes a single point of failure.
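As a sketch, here is inverted dropout: drop each activation with probability `p` and scale survivors by `1/(1-p)` so the expected activation is unchanged at inference time. The tiny LCG stands in for a real RNG just to keep the sketch dependency-free:

```rust
// Inverted dropout over a slice of activations.
// `p` is the drop probability; `seed` is mutated so repeated calls differ.
fn dropout(activations: &[f32], p: f32, seed: &mut u64) -> Vec<f32> {
    activations
        .iter()
        .map(|&a| {
            // Minimal LCG in place of a proper RNG (illustration only).
            *seed = seed
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let u = (*seed >> 33) as f32 / (1u64 << 31) as f32; // uniform [0,1)
            if u < p { 0.0 } else { a / (1.0 - p) }
        })
        .collect()
}
```

With `p = 0.0` the input passes through unchanged; at inference time dropout is simply disabled.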
2. RLHF (Reinforcement Learning from Human Feedback)
A training approach where humans rank or compare model outputs to produce a reward signal. The model is then optimized to better match human preferences.
This technique is central to aligning language models with human intent.
Like teaching by grading essays instead of dictating every word.
3. Inference
The process of running a trained model to make predictions on new data. Training updates the model’s parameters; inference uses them.
The distinction matters for optimization, deployment, and cost.
Like the difference between studying for an exam and actually taking it.
4. Quantization
Reducing the numerical precision used to store and compute model weights. This can shrink model size and speed up inference, sometimes with a small accuracy tradeoff.
Essential for deploying large models on limited hardware.
Like compressing a high-resolution photo into a smaller file that still looks good.
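One common scheme is symmetric per-tensor int8 quantization: scale weights so the largest magnitude maps to 127. This is a sketch of the idea, not any particular library's implementation:

```rust
// Map f32 weights to i8 plus a scale factor.
fn quantize(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        return (vec![0; weights.len()], 1.0); // all-zero tensor
    }
    let scale = max_abs / 127.0;
    let q = weights.iter().map(|w| (w / scale).round() as i8).collect();
    (q, scale)
}

// Recover approximate f32 values for computation.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

Each weight shrinks from 4 bytes to 1, and the round trip introduces at most half a quantization step of error per weight.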
5. Flash Attention
An optimized attention algorithm designed to reduce memory usage. It avoids materializing the full attention matrix by computing attention in blocks.
This enables longer sequences and faster training.
Like reading a book chapter by chapter instead of photocopying the whole thing first.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Dropout | Random disabling to prevent overfitting |
| RLHF | Learn from human preference comparisons |
| Inference | Using a trained model for predictions |
| Quantization | Lower precision for smaller, faster models |
| Flash Attention | Block-wise attention for memory efficiency |
Short, accurate ML explainers. Follow for more.
-
477 words • 3 min read • Abstract
Five ML Concepts - #8

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #8 |
References
| Concept | Reference |
| --- | --- |
| Bias-Variance | The Elements of Statistical Learning (Hastie et al. 2009), Chapter 7 |
| Diffusion | Denoising Diffusion Probabilistic Models (Ho et al. 2020) |
| KV Cache | Fast Transformer Decoding (Pope et al. 2022) |
| Mixed Precision | Mixed Precision Training (Micikevicius et al. 2017) |
| MLA | DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek-AI 2024) |

Today’s Five
1. Bias-Variance Tradeoff
A fundamental tension where simpler models tend to underfit (high bias), and more flexible models can overfit (high variance). The goal is finding a balance that generalizes well to unseen data.
One of the oldest ideas in machine learning, still relevant today.
Like choosing between a ruler that only draws straight lines and one so flexible it traces every bump.
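A small NumPy experiment with illustrative polynomial degrees: an underfit line, a well-matched quadratic, and an overfit high-degree polynomial on the same noisy data:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 15)
y_train = x_train**2 + rng.normal(scale=0.1, size=15)  # noisy quadratic
x_test = np.linspace(-0.95, 0.95, 50)
y_test = x_test**2

def fit_and_eval(degree):
    """Fit a polynomial of the given degree and return test MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    pred = np.polyval(coeffs, x_test)
    return np.mean((pred - y_test) ** 2)

for d in (1, 2, 12):
    print(f"degree {d}: test MSE = {fit_and_eval(d):.4f}")
# degree 1 underfits (high bias); degree 12 typically overfits (high variance);
# degree 2 matches the true complexity
```

The line cannot represent the curvature at all, while the high-degree fit chases the noise; the middle ground generalizes best.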
2. Diffusion Models
A generative approach that trains a model to reverse a gradual noising process. During generation, the model starts from noise and removes it step by step.
The foundation of image generators like Stable Diffusion and DALL-E.
Like learning to restore a photo by practicing on progressively more damaged versions.
3. KV Cache
A technique that stores attention key and value tensors from earlier tokens so they don’t need to be recomputed during generation. This significantly speeds up autoregressive inference.
Essential for efficient LLM serving.
Like keeping notes from earlier in a conversation instead of rereading everything.
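A counting sketch, with made-up projection matrices, of why caching matters: each token's key/value is projected once rather than once per decoding step:

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 8, 6
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))
hidden = rng.normal(size=(steps, d))   # hidden state of each generated token

proj_calls = 0
def project(h, W):
    """Count every key/value projection we perform."""
    global proj_calls
    proj_calls += 1
    return h @ W

# With a KV cache: each token's key/value is projected exactly once,
# then appended and reused at every later step.
K_cache, V_cache = [], []
for t in range(steps):
    K_cache.append(project(hidden[t], Wk))
    V_cache.append(project(hidden[t], Wv))
    # the new token's attention would read all of K_cache / V_cache here

print(proj_calls)  # 12 projections: 2 per token
# Without a cache, step t would re-project all t+1 tokens seen so far:
no_cache_calls = sum(2 * (t + 1) for t in range(steps))
print(no_cache_calls)  # 42: quadratic growth in sequence length
```

The memory cost of the cache grows linearly with sequence length, which is exactly the tradeoff that techniques like MLA (below) compress.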
4. Mixed Precision
A training strategy that uses lower-precision math for most operations, while keeping some calculations in higher precision for stability. This reduces memory use and often speeds up training with little accuracy loss.
Standard practice for modern deep learning.
Like drafting in pencil and only using ink for the final signature.
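A small NumPy illustration of the stability half of the story: tiny gradients underflow in float16, which is why mixed-precision recipes keep some values in higher precision and apply loss scaling (the scale value below is illustrative):

```python
import numpy as np

grad = np.float32(1e-8)              # a tiny gradient value
print(np.float16(grad))              # 0.0: underflows in half precision

scale = np.float32(1024.0)           # loss scaling keeps it representable
scaled = np.float16(grad * scale)    # scale *before* casting down
print(float(scaled) / float(scale))  # approximately 1e-8, recovered after unscaling
```

Scaling the loss (and hence all gradients) up before the half-precision step, then dividing back out in the float32 optimizer state, preserves small gradients that would otherwise vanish.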
5. MLA (Multi-head Latent Attention)
An attention variant that compresses key and value information into a lower-dimensional latent space. This reduces memory usage for long sequences while retaining useful context.
Used in DeepSeek-V2 and related architectures.
Like summarizing meeting notes instead of recording every word verbatim.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Bias-Variance | Balance underfitting vs overfitting |
| Diffusion | Generate by learning to denoise |
| KV Cache | Store past keys/values for fast inference |
| Mixed Precision | Lower precision for speed, higher for stability |
| MLA | Compress attention into latent space |
Short, accurate ML explainers. Follow for more.
-
1033 words • 6 min read • Abstract
Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation

We started by training models to act like they had memory. Then we found an open source implementation that does it for real. This is what we learned.
| Resource | Link |
| --- | --- |
| Paper | arXiv:2601.07372 |
| Our Code | engram-poc |
| Reference | weagan/Engram |
| Video | Engram Revisited |
| Playlist | All Engram Videos |

The Journey
Phase 1: Behavioral Emulation
Part 2 described our first approach: LoRA fine-tuning to make a model behave like it has memory. Train on patterns, and the model learns to respond consistently.
| Metric | Baseline | LoRA-tuned |
| --- | --- | --- |
| Accuracy | 8.6% | 14.1% |
| Improvement | - | +63% relative |

It worked, but the architecture was unchanged. We were approximating Engram benefits, not implementing them.
Phase 2: The Discovery
Then we found weagan/Engram on GitHub—real hash-based memory in ~300 lines of Python:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedEngramModule(nn.Module):
    def __init__(self, table_size=50000, d_model=512):
        super().__init__()
        # Large learnable memory table
        self.memory_table = nn.Parameter(torch.zeros(table_size, d_model))
        # Gate decides when to trust memory
        self.gate = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid()
        )

    def forward(self, hidden_states, input_ids):
        # O(1) hash lookup
        indices = self.multi_head_hash(input_ids)
        retrieved = F.embedding(indices, self.memory_table)
        # Gated injection
        gate_score = self.gate(torch.cat([hidden_states, retrieved], dim=-1))
        return hidden_states + gate_score * retrieved
```

The key insight: the gate decides when to trust the lookup. Not every token needs memory.
Phase 3: Integration with HuggingFace
We ported the module to work with HuggingFace models:
```
SmolLM-135M (frozen)
    ↓
EnhancedEngramModule (per layer)
  - 50K slot memory table
  - O(1) hash-based lookup
  - Learned gating
    ↓
Output
```

The proof it works—O(1) lookup regardless of sequence length:
| Sequence Length | Lookup Time | Expected if O(n) |
| --- | --- | --- |
| 64 tokens | 0.15 ms | - |
| 2048 tokens | 2.77 ms | 4.8 ms |

Sub-linear scaling is consistent with constant-time hash lookup.
The Reality Check
Here’s where it gets interesting. Real Engram memory excels at some tasks and hurts others.
Where Engram Helps
| Task Type | Baseline | Engram | Change |
| --- | --- | --- | --- |
| Acronym expansion | 25% | 75% | +200% |
| Element symbols | 33% | 67% | +103% |
| Long-term fact recall | 90% | 100% | +11% |

For exact-match lookups with structured keys, Engram dominates.
Where Engram Hurts
| Task Type | Baseline | Engram | Change |
| --- | --- | --- | --- |
| World capitals | 83% | 67% | -19% |
| Pattern completion | 14% | 11% | -21% |

For tasks where the base model already knows the answer, Engram’s hash collisions add noise.
The Key Insight
Engram is a specialized tool, not a general enhancement.
| Use Engram For | Don’t Use Engram For |
| --- | --- |
| FAQ responses | Creative generation |
| Terminology lookup | Novel combinations |
| Entity facts | Context-dependent answers |
| Code boilerplate | Reasoning tasks |

The gating mechanism is critical: it must learn to suppress memory when it doesn’t help. Without proper gating, hash collisions inject noise into every token.
Obstacles Encountered
1. Hash Collisions
Different inputs can map to the same memory slot. The gate must learn to ignore irrelevant retrievals.
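A back-of-the-envelope illustration (the multiplier hash below is made up, not the repo's `multi_head_hash`): with more distinct keys than table slots, collisions are unavoidable by pigeonhole:

```python
import numpy as np

table_size = 50_000
rng = np.random.default_rng(0)
# Hash 200,000 random token-id bigrams into a 50K-slot memory table.
bigrams = rng.integers(0, 50_000, size=(200_000, 2))
indices = (bigrams[:, 0] * 1_000_003 + bigrams[:, 1]) % table_size
unique_slots = len(np.unique(indices))
print(unique_slots, "slots used by 200,000 keys")
# At most 50,000 slots exist, so on average ~4 distinct bigrams share each slot;
# the gate has to learn which retrievals to ignore.
```

This is why the gate, not the table, is the load-bearing component.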
2. Parameter Explosion
50K slots × 768 dimensions × 30 layers = 1.2B additional parameters. We had to inject selectively (every 4th layer) to stay practical.
3. Training Dynamics
Memory tables start at zero. They need higher learning rates (10x) to develop meaningful representations before the model learns to use them.
4. Evaluation Mismatch
Our pattern completion task wasn’t ideal for hash-based memory. Engram shines on exact-match retrieval, not generalization.
Combined Approach
The best results came from combining both methods:
```
Base Model (SmolLM-135M)
    ↓
EnhancedEngramModule
  - Long-term fact storage
  - O(1) lookup for known patterns
    ↓
LoRA Adapters
  - Pattern completion
  - Domain-specific behaviors
    ↓
Output
```

This gives you:
- Long-term memory from hash tables
- Pattern consistency from behavioral training
- Flexibility to disable either component
What We Learned
-
Emulation vs Implementation: LoRA fine-tuning approximates memory behavior; hash tables implement it. Both have their place.
-
Gating is Essential: The learned gate prevents hash collisions from degrading performance. Never use Engram without gating.
-
Match Task to Tool: Hash-based memory excels at exact lookups, not pattern generalization. Use it where applicable.
-
Selective Application: Don’t inject Engram everywhere. Target layers and use cases where it helps.
-
The Gate as a Safety Valve: When the gate learns to output near-zero for a task, that’s the model telling you Engram doesn’t help there. Listen to it.
Resources
- Engram Paper (arXiv:2601.07372)
- engram-poc Repository - Our implementation
- weagan/Engram - Reference implementation
- Engram Revisited Video
- Engram Video Playlist
- Part 1: mHC
- Part 2: Engram Introduction
Series Recap
| Part | Topic | Key Insight |
| --- | --- | --- |
| 1 | mHC | Doubly-stochastic constraints bound signal amplification |
| 2 | Engram Intro | O(1) lookup beats recomputing through attention |
| 3 | Engram Revisited | Use Engram where applicable; gate to avoid worse results |
Hash-based memory is powerful but specialized. The gate decides when to use it—and when not to.
-
469 words • 3 min read • Abstract
Five ML Concepts - #7

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #7 |
References
| Concept | Reference |
| --- | --- |
| Cross-Validation | A Study of Cross-Validation and Bootstrap (Kohavi 1995) |
| GPT | Language Models are Unsupervised Multitask Learners (Radford et al. 2019) |
| GQA | GQA: Training Generalized Multi-Query Transformer Models (Ainslie et al. 2023) |
| Context Window | Attention Is All You Need (Vaswani et al. 2017) |
| Self-Attention | Attention Is All You Need (Vaswani et al. 2017) |

Today’s Five
1. Cross-Validation
A technique that splits data into multiple folds to evaluate model performance on data it wasn’t trained on. By rotating which data is held out, it gives a more reliable estimate of generalization.
Essential for honest model evaluation.
Like practicing with different sets of flashcards to see if you actually learned the material.
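The rotation can be sketched in a few lines of pure Python (the `kfold_indices` helper is illustrative):

```python
def kfold_indices(n, k):
    """Yield (train_idx, val_idx) for k rotating folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds))       # 5 folds
print(folds[0][1])      # [0, 1], the first held-out fold
# Every sample is held out exactly once across the k folds,
# so averaging the k validation scores uses all the data.
```

In practice you would train a fresh model on each `train` split and average the scores on each `val` split.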
2. GPT
Generative Pre-trained Transformer. A family of autoregressive language models trained to predict the next token in a sequence.
Many AI assistants and chatbots are built on this approach.
Like autocomplete, but scaled up and trained on vast text data.
3. GQA (Grouped Query Attention)
An attention variant where multiple query heads share key and value projections. This reduces memory usage and can speed up inference compared to standard multi-head attention.
Widely adopted in efficient transformer architectures.
Like several students sharing one set of notes instead of copying everything separately.
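A NumPy sketch of the sharing pattern, with made-up head counts (8 query heads, 2 shared K/V heads):

```python
import numpy as np

def gqa(Q, K, V, n_groups):
    """Grouped-query attention: h query heads share n_groups K/V heads."""
    h, n, d = Q.shape                  # (query heads, seq len, head dim)
    heads_per_group = h // n_groups
    out = np.empty_like(Q)
    for i in range(h):
        g = i // heads_per_group       # which shared K/V head this query head uses
        s = Q[i] @ K[g].T / np.sqrt(d)
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[i] = w @ V[g]
    return out

rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 6, 16))        # 8 query heads
K = rng.normal(size=(2, 6, 16))        # only 2 K/V heads: 4 query heads per group
V = rng.normal(size=(2, 6, 16))
print(gqa(Q, K, V, n_groups=2).shape)  # (8, 6, 16)
```

The KV cache stores 2 heads instead of 8 here, a 4x memory reduction, which is the whole point for inference.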
4. Context Window
The maximum number of tokens a model can process in a single forward pass. Larger context windows allow longer inputs, but increase memory and compute costs.
A key constraint in language model design.
Like the size of a desk that limits how many papers you can spread out at once.
5. Self-Attention
A mechanism where each token computes attention scores with other tokens in the same sequence. This lets the model weigh which parts of the input are most relevant to each position.
The core operation inside transformers.
Like everyone in a meeting deciding who to listen to based on the conversation.
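A single-head sketch in NumPy (the weight matrices are random, for illustration only):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: each token attends to every token."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])     # (n, n) pairwise relevance
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # each row is a softmax distribution
    return w @ V                               # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                    # 5 tokens, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one updated vector per token
```

Row i of the softmax matrix is token i's answer to "who should I listen to?"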
Quick Reference
| Concept | One-liner |
| --- | --- |
| Cross-Validation | Rotate held-out data for reliable evaluation |
| GPT | Predict next token, at scale |
| GQA | Shared keys/values for efficient attention |
| Context Window | How much the model sees at once |
| Self-Attention | Each token attends to all others |
Short, accurate ML explainers. Follow for more.
-
491 words • 3 min read • Abstract
Five ML Concepts - #6

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #6 |
References
| Concept | Reference |
| --- | --- |
| Regularization | Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014) |
| BERT | BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018) |
| RoPE | RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al. 2021) |
| Prompting | Language Models are Few-Shot Learners (Brown et al. 2020) |
| Positional Encoding | Attention Is All You Need (Vaswani et al. 2017) |

Today’s Five
1. Regularization
Techniques that reduce overfitting by adding constraints or penalties during training. Common examples include L2 weight decay, L1 sparsity, dropout, and early stopping.
The goal is better generalization, not just fitting the training set.
Like adding friction so a model can’t take the easiest overfit path.
2. BERT
Bidirectional Encoder Representations from Transformers. A transformer encoder trained with masked language modeling: predicting hidden tokens using context from both sides.
It was a major step forward for many NLP tasks after its 2018 release.
Like filling in blanks by reading the whole sentence, not just the words before it.
3. RoPE (Rotary Positional Embeddings)
A way to represent token position inside attention by rotating query and key vectors as a function of position. This gives attention information about relative order and distance.
It’s widely used in modern transformer models.
Like turning a dial differently for each position so the model can tell where tokens are.
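A NumPy sketch of the core property, assuming the usual paired-dimension rotation with base theta = 10000 as in the paper: dot products between rotated queries and keys depend only on relative distance:

```python
import numpy as np

def rotate(v, pos, theta=10000.0):
    """Apply a rotary embedding to an even-dimensional vector at a position."""
    d = v.shape[0]
    freqs = theta ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D plane
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    v1, v2 = v[0::2], v[1::2]
    out = np.empty_like(v)
    out[0::2] = v1 * cos - v2 * sin              # 2-D rotation in each plane
    out[1::2] = v1 * sin + v2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
a = rotate(q, 3) @ rotate(k, 7)     # positions 3 and 7: distance 4
b = rotate(q, 10) @ rotate(k, 14)   # positions 10 and 14: same distance 4
print(np.isclose(a, b))             # True
```

Because a rotation by angle p1 composed with one by p2 only differs by the angle p2 minus p1, attention scores see relative position for free.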
4. Prompting
Crafting inputs to steer a model toward the output you want. Small changes in instructions, examples, or format can change behavior significantly.
A key skill for working effectively with language models.
Like asking a question in just the right way to get a useful answer.
5. Positional Encoding
Transformers need a way to represent token order, because attention alone doesn’t include sequence position. Different methods do this, including learned embeddings and rotary approaches like RoPE.
Without it, “the cat sat on the mat” would be indistinguishable from “mat the on sat cat the.”
Like numbering the pages of a shuffled book so you can read them in order.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Regularization | Add constraints to prevent overfitting |
| BERT | Bidirectional masked language modeling |
| RoPE | Position info via rotation in attention |
| Prompting | Craft inputs to steer model outputs |
| Positional Encoding | Tell the model where tokens are in sequence |
Short, accurate ML explainers. Follow for more.
-
493 words • 3 min read • Abstract
Five ML Concepts - #5

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #5 |
References
| Concept | Reference |
| --- | --- |
| Perceptron | The Perceptron: A Probabilistic Model (Rosenblatt 1958) |
| Pre-training | BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018) |
| Speculative Decoding | Fast Inference from Transformers via Speculative Decoding (Leviathan et al. 2022) |
| ICL | Language Models are Few-Shot Learners (Brown et al. 2020) |
| Latent Space | Auto-Encoding Variational Bayes (Kingma & Welling 2013) |

Today’s Five
1. Perceptron
The simplest neural network: a single linear unit with weights and a bias. It computes a weighted sum and applies a threshold or activation.
It inspired modern neural networks, even though today’s models are far more complex.
Like a single voter weighing inputs before deciding yes or no.
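The classic mistake-driven update rule fits in a few lines of pure Python (the data and hyperparameters are toy values for illustration):

```python
def perceptron_train(data, epochs=10, lr=0.1):
    """Perceptron learning rule on 2-D points with labels in {-1, +1}."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1
            if pred != y:                 # update weights only on mistakes
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

# Linearly separable toy data: an AND-like decision
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = perceptron_train(data)
preds = [1 if w[0] * x1 + w[1] * x2 + b > 0 else -1 for (x1, x2), _ in data]
print(preds)  # [-1, -1, -1, 1]: converges on separable data
```

On linearly separable data the rule is guaranteed to converge in a finite number of mistakes; on non-separable data it never settles, which is part of why multi-layer networks were needed.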
2. Pre-training
Training a model on a large, general dataset before adapting it to a specific task. This gives the model broad patterns that later training can refine.
BERT, GPT, and most modern LLMs use this approach.
Like going to medical school before choosing a specialty.
3. Speculative Decoding
A technique where a small, fast model proposes tokens, and a larger model verifies or rejects them in parallel. This can speed up inference without changing final outputs.
A key optimization for production LLM deployments.
Like a junior writer drafting text for a senior editor to approve in batches.
4. In-Context Learning (ICL)
When a model adapts its behavior using examples in the prompt, without updating its weights. It allows flexible task behavior at inference time.
This emergent capability surprised researchers when GPT-3 demonstrated it.
Like solving a new puzzle after seeing a few worked examples.
5. Latent Space
The internal representations a model learns as it processes data. In this space, similar inputs tend to be located near each other.
It’s not a literal place, but a useful way to think about how models organize information.
Like a map where cities are arranged by similarity instead of geography.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Perceptron | Single linear unit—the neural network ancestor |
| Pre-training | Learn general patterns before specializing |
| Speculative Decoding | Draft fast, verify in parallel |
| ICL | Adapt from prompt examples without training |
| Latent Space | Internal representations where similar things cluster |
Related Posts
- In-Context Learning Revisited: From Mystery to Engineering — A deeper exploration of how ICL evolved from emergent surprise to engineered capability.
Short, accurate ML explainers. Follow for more.
-
453 words • 3 min read • Abstract
Five ML Concepts - #4

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #4 |
References
| Concept | Reference |
| --- | --- |
| Activation Functions | Deep Learning (Goodfellow et al. 2016), Chapter 6 |
| Transfer Learning | A Survey on Transfer Learning (Pan & Yang 2010) |
| VLM | Learning Transferable Visual Models (CLIP) (Radford et al. 2021) |
| Adam | Adam: A Method for Stochastic Optimization (Kingma & Ba 2014) |
| Superposition | Toy Models of Superposition (Elhage et al. 2022) |

Today’s Five
1. Activation Functions
Functions like ReLU, sigmoid, and tanh that transform neuron outputs. They introduce nonlinearity, allowing networks to learn complex patterns beyond simple linear relationships.
Without them, stacking layers would just be matrix multiplication.
Like an on-off switch that can also dim the lights.
2. Transfer Learning
Using knowledge a model learned on one task to improve performance on a related task. This often reduces training time and data requirements dramatically.
Pre-trained models can be fine-tuned for specific applications.
Like a chef who already knows French cooking learning Japanese cuisine faster.
3. VLM (Vision-Language Models)
Models trained to work with both images and text. They learn shared representations that connect visual and language understanding.
CLIP, GPT-4V, and LLaVA are examples of this approach.
Like someone who can look at a photo and describe what’s happening.
4. Adam
An optimizer that adapts learning rates for each parameter using information from past gradients. It combines ideas from momentum and adaptive learning-rate methods.
One of the most popular optimizers in deep learning.
Like a hiker who adjusts step size for each part of the trail, steep or flat.
5. Superposition
A way neural networks represent many concepts using overlapping directions in the same space. This allows models to pack more information into fewer neurons than expected.
It’s why interpretability is hard—features aren’t neatly separated.
Like discovering a painting has hidden layers that appear under the right light.
Quick Reference
| Concept | One-liner |
| --- | --- |
| Activation Functions | Introduce nonlinearity to enable complex patterns |
| Transfer Learning | Reuse knowledge from one task for another |
| VLM | Joint understanding of images and text |
| Adam | Adaptive per-parameter learning rates |
| Superposition | Many concepts packed into overlapping representations |
Short, accurate ML explainers. Follow for more.
-
524 words • 3 min read • Abstract
Five ML Concepts - #3

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
| --- | --- |
| Papers | Links in References section |
| Video | Five ML Concepts #3 |
References
| Concept | Reference |
| --- | --- |
| Loss Function | A Survey of Loss Functions for Deep Neural Networks (Janocha & Czarnecki 2017) |
| Overfitting | Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014) |
| Fine-tuning | A Survey on Transfer Learning (Zhuang et al. 2020) |
| LoRA | LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021) |
| Tokenization | Neural Machine Translation of Rare Words with Subword Units (Sennrich et al. 2015) |

Today’s Five
1. Loss Function
A formula that measures how far off the model’s predictions are from the correct answers. It quantifies the gap between what the model predicted and what it should have predicted.
Training a neural network means minimizing this function.
Like a scorecard that tells the model how badly it messed up.
2. Overfitting
When a model learns the training data too well, including noise and outliers, and fails on new data. The model performs great on examples it has seen but poorly on anything new.
One of the most common pitfalls in machine learning.
Like memorizing the answers to a test instead of understanding the subject.
3. Fine-tuning
Taking a pre-trained model and training it further on a specific task or dataset. Instead of training from scratch, you start from a model that already understands language or images, then specialize it.
This makes powerful models accessible without massive compute budgets.
Like teaching a chef who already knows cooking to specialize in sushi.
4. LoRA (Low-Rank Adaptation)
An efficient fine-tuning method that trains a small number of added parameters instead of the full model. It inserts small trainable matrices into each layer while keeping the original weights frozen.
This dramatically reduces the memory and compute needed for fine-tuning.
Like adding sticky notes to a textbook instead of rewriting the whole thing.
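A NumPy sketch of the arithmetic, with illustrative sizes: the frozen weight plus a rank-r product, and the resulting parameter savings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                       # hidden size 512, rank-8 adapter
W = rng.normal(size=(d, d))         # frozen pre-trained weight

A = rng.normal(size=(d, r)) * 0.01  # small trainable factor
B = np.zeros((r, d))                # second factor starts at zero

x = rng.normal(size=d)
y = x @ (W + A @ B)                 # adapted forward pass
print(np.allclose(y, x @ W))        # True: zero-init makes the adapter a no-op at start

full = d * d                        # 262,144 params if we fine-tuned W itself
lora = d * r * 2                    # 8,192 trainable params, ~3% of full
print(full, lora)
```

Initializing one factor to zero means training starts exactly at the pre-trained model and only drifts as the adapter learns; at the end, A @ B can be merged back into W for zero inference overhead.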
5. Tokenization
The process of breaking text into smaller units called tokens that a model can process. Most modern models use subword tokenization, splitting words into common pieces rather than individual characters or whole words.
It determines what the model actually “sees” and affects everything from vocabulary size to multilingual performance.
Like chopping sentences into bite-sized pieces a model can digest.
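A toy byte-pair-encoding loop in pure Python (`bpe_merges` is illustrative, not a real tokenizer): repeatedly merge the most frequent adjacent pair into a new subword:

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)   # replace the pair with the new subword
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_merges("low lower lowest", 4)
print(merges)   # frequent pairs like "lo" and "low" become subwords
print(tokens)
```

The learned merge list is the vocabulary: frequent words end up as single tokens while rare words decompose into common pieces.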
Quick Reference
| Concept | One-liner |
| --- | --- |
| Loss Function | How far off the model’s predictions are |
| Overfitting | Memorizing the test instead of learning the subject |
| Fine-tuning | Specializing a pre-trained model for a new task |
| LoRA | Efficient fine-tuning with small added matrices |
| Tokenization | Breaking text into the pieces a model actually reads |
Short, accurate ML explainers. Follow for more.
-
1779 words • 9 min read • Abstract
TBT (2/?): Pipelines on OS/390

Unix invented pipes. Mainframes reinvented them—for records, not bytes.
This is the second Throwback Thursday post—revisiting technologies that shaped how I think about programming. This time: CMS/TSO Pipelines, and a vibe coding project that brings them back to life in Rust for education, fun, and nostalgic reasons.
| Resource | Link |
| --- | --- |
| Code | pipelines-rs |
| Demo | Live Demo |
| Video | Pipelines on OS/390 #TBT |

The 1996 Olympics and a Pair of Mainframes
In 1996, IBM hosted the Olympics Web Server—one of the largest public web properties at the time. Many distributed IBM systems in different regions served dynamic web pages. The logs from all of them were funneled to a pair of IBM S/390 mainframes I was in charge of, running OS/390 (formerly MVS).
When you’re processing millions of log records for statistics and forensics, you need tools that think in records, not lines. That’s where Pipelines for TSO/E came in.
Pipelines for TSO/E was the MVS/ESA port of CMS Pipelines, which ran on VM/ESA. Both let you chain stages together to filter, transform, and aggregate record-oriented data—record-oriented pipelines that evolved in parallel with Unix’s byte-stream pipes.
Two Traditions of Piping
Unix pipes came first—Thompson and McIlroy at Bell Labs, 1969–1974. Byte streams, file descriptors, the `|` operator. Brutally simple. Explosively powerful. POSIX.1-1988 standardized `pipe(2)` and shell pipelines, though POSIX work began in the mid-1980s.

CMS Pipelines emerged on IBM mainframes in the mid-to-late 1980s. They weren’t a Unix clone—they were convergent evolution under different pressures. Where Unix piped bytes between small programs, CMS piped records through declarative stages. Pipelines for TSO/E followed in the late 1980s and early 1990s, porting CMS concepts to the MVS multi-user environment. Unlike CMS Pipelines (which ships with z/VM), the TSO/E port is typically installed separately on z/OS.
Neither tradition was “behind.” They were optimizing different dimensions:
| | Unix Pipes | CMS/TSO Pipelines |
| --- | --- | --- |
| Era | 1969–1974 | Mid-to-late 1980s |
| Data unit | Byte stream | Records (fixed or variable length) |
| Stage input | stdin (bytes) | Record buffer |
| Field access | `awk`, `cut` (text parsing) | Column positions (direct) |
| Execution | Typically a process per stage | Stages in one address space |
| Topology | Linear by default; fan-out/fan-in via `tee`, FIFOs, or process substitution | Multi-stream, fan-out/fan-in built in |
| Philosophy | Small tools, ad hoc composition | Declarative data transformation |

Many datasets on mainframes are record-structured. Records can be fixed-length or variable-length. CMS and TSO/E Pipelines treat records as byte arrays—character-oriented stages assume EBCDIC text, while position/length stages are binary-safe. A fixed-length 80-byte record isn’t arbitrary text—columns 1-8 are the name, 9-18 are the department, 19-26 are the salary. You don’t parse. You just read the right columns.
Unix won culturally—cheap hardware, academic distribution, C portability. But IBM’s record-oriented pipelines were better at structured dataflow, and they anticipate or parallel patterns seen in ETL frameworks like Spark and Beam.
CMS Pipelines ships with z/VM and is still used; Pipelines for TSO/E exists for z/OS but isn’t universally installed. These are not historical curiosities—mainframes continue to process a significant share of high-value transactions, and pipelines remain an available tool for data transformation on those systems.
What a Pipeline Looks Like
CMS Pipelines uses a DSL with `PIPE` as the command, `|` to chain stages, and `?` as a command terminator (it suppresses the console from being used as implicit input):

```
PIPE CONSOLE | FILTER 18,10 = "SALES" | SELECT 0,8,0; 8,10,8 | CONSOLE ?
```

This reads input records, keeps only those where columns 18–27 equal “SALES”, extracts the name fields, and writes the result. No regex. No string splitting. Just column positions.
Note: pipelines-rs uses 0-based offsets (e.g., `SELECT 0,8,0`). Historical CMS Pipelines uses 1-based column positions.

Compare with the Unix equivalent:
```shell
cat input.txt | awk '$3 == "SALES" {print $1, $2}'
```

The Unix version looks simpler—until your fields contain spaces, or your records contain non-text bytes, or you need to chain 15 stages without spawning 15 processes.
Bringing It Back in Rust (Vibe Coding)
pipelines-rs is a nostalgia-driven vibe coding project—my attempt to emulate Pipelines for TSO/E in Rust, not because it’s practical, but because these ideas deserve to be celebrated. It supports a subset of stages and features two execution models:
The Two Executors
Batched processes all records through one stage before moving to the next:
```
All records → Stage 1 → All records → Stage 2 → All records → Stage 3
```

This emulates the correct output and is faster, but doesn’t demonstrate record-oriented dataflow well.
Record-At-a-Time (RAT) sends each record through the entire pipeline before reading the next:
```
Record 1 → Stage 1 → Stage 2 → Stage 3 → Output
Record 2 → Stage 1 → Stage 2 → Stage 3 → Output
Record 3 → Stage 1 → Stage 2 → Stage 3 → Output
```

RAT is the implementation shown in the video. It’s a naive approach—more buffers, more copying—but it shows the dataflow concepts clearly and enables the visual debugger. Both run in linear time (records × stages) and produce identical output for all 23 test specifications.
A future version will aim for fewer buffers and fewer copy operations. Whether it’s faster than Batched remains to be seen.
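The two execution models can be sketched in a few lines of Python (the stage helpers and names are illustrative, not the Rust code): both orders visit every record/stage pair and agree on the output, even for a stateful stage like TAKE:

```python
def make_filter(pred):
    """Stage that keeps records matching a predicate."""
    return lambda rec: [rec] if pred(rec) else []

def make_take(n):
    """TAKE n: keep only the first n records (stateful stage)."""
    state = {"seen": 0}
    def step(rec):
        state["seen"] += 1
        return [rec] if state["seen"] <= n else []
    return step

def run_batched(stages, records):
    """Run all records through stage 1, then all through stage 2, ..."""
    for stage in stages:
        records = [out for rec in records for out in stage(rec)]
    return records

def run_rat(stages, records):
    """Send each record through the whole pipeline before reading the next."""
    out = []
    def push(rec, i):
        if i == len(stages):
            out.append(rec)
        else:
            for r in stages[i](rec):
                push(r, i + 1)
    for rec in records:
        push(rec, 0)
    return out

data = ["SMITH SALES", "JONES ENG", "DOE SALES", "GARCIA SALES"]
a = run_batched([make_filter(lambda r: "SALES" in r), make_take(2)], data)
# fresh stage state for the second run, since TAKE is stateful
b = run_rat([make_filter(lambda r: "SALES" in r), make_take(2)], data)
print(a == b, a)  # True ['SMITH SALES', 'DOE SALES']
```

Note that stage state must be created per run; that bookkeeping is exactly what the equivalence tests in pipelines-rs exercise.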
The 80-Byte Record
The Rust implementation supports fixed-length records only. The fundamental data type is the `Record`—exactly 80 bytes, matching historical punch card width. Variable-length input lines are accepted and padded to 80 bytes:

```rust
pub const RECORD_WIDTH: usize = 80;

pub struct Record {
    data: [u8; RECORD_WIDTH],
}
```

Fields are accessed by column position and length. No parsing, no delimiters. The data is always right where you expect it.
Supported Stages
The current implementation supports 14 stages:
| Stage | Purpose | Example |
| --- | --- | --- |
| FILTER | Keep/reject records by field value | `FILTER 18,10 = "SALES"` |
| LOCATE | Keep records containing a pattern | `LOCATE "ERROR"` |
| NLOCATE | Keep records NOT containing a pattern | `NLOCATE "DEBUG"` |
| SELECT | Extract and reposition fields | `SELECT 0,8,0; 8,10,8` |
| CHANGE | Text replacement | `CHANGE "SALES" "MKTG"` |
| COUNT | Count records | `COUNT` |
| TAKE | Keep first N records | `TAKE 5` |
| SKIP | Skip first N records | `SKIP 2` |
| DUPLICATE | Repeat each record N times | `DUPLICATE 3` |
| LITERAL | Append a literal record | `LITERAL "--- END ---"` |
| UPPER/LOWER | Case conversion | `UPPER` |
| REVERSE | Reverse record text | `REVERSE` |
| HOLE | Discard all input | `HOLE` |
| CONSOLE | Driver stage: source or sink depending on position | `CONSOLE` |

The CLI
Both executors have identical CLIs:
```shell
# Batch executor
pipe-run specs/filter-sales.pipe specs/input-fixed-80.data -v

# Record-at-a-time executor
pipe-run-rat specs/filter-sales.pipe specs/input-fixed-80.data -v
```

Given this input data:
```
SMITH   JOHN      SALES     00050000
JONES   MARY      ENGINEER  00075000
DOE     JANE      SALES     00060000
WILSON  ROBERT    MARKETING 00055000
CHEN    LISA      ENGINEER  00080000
GARCIA  CARLOS    SALES     00045000
TAYLOR  SUSAN     MARKETING 00065000
BROWN   MICHAEL   ENGINEER  00090000
```

And this pipeline:
```
PIPE CONSOLE | FILTER 18,10 = "SALES" | CONSOLE ?
```

The output is:
```
SMITH   JOHN      SALES     00050000
DOE     JANE      SALES     00060000
GARCIA  CARLOS    SALES     00045000
Records: 8 in -> 3 out
```

Exactly what I’d have gotten on OS/390 in 1996, but with Web Server log data showing client IP address, OS, browser type/version, user cookies, timestamps, URLs, and more, instead of accounting data. 😊
The Web UI for Two pipelines-rs Implementations
The web interface runs entirely in the browser via WebAssembly. It has three panels: input records with an 80-column ruler, the pipeline editor, and the output.
Tutorial Mode
The tutorial walks through each stage with examples, running pipelines automatically to show results. You can step through manually or let it auto-advance.
The Visual Debugger
The debugger is the reason RAT exists. It lets you:
- Step through execution one pipe point at a time
- Watch data at specific pipe points between stages
- Set breakpoints to pause at specific stages
- See stage state for stateful stages like COUNT
You load a pipeline, click Run, then Step to watch each record flow through each stage. The debugger highlights which stages have been reached with a green border. For COUNT and other aggregation stages, you can watch the flush phase where accumulated state becomes output.
What’s Next
The current RAT executor is intentionally naive—it uses a buffer at every pipe point and copies each record between them. A better implementation would minimize buffers and copy operations while preserving the record-at-a-time semantics.
Multi-pipe features are also planned—CMS Pipelines supported fan-out (one input, multiple output streams) and fan-in (multiple inputs merged), which enabled complex processing topologies beyond simple linear chains.
How pipelines-rs Differs from IBM Pipelines
| | IBM CMS/TSO/E Pipelines | pipelines-rs |
| --- | --- | --- |
| Indexing | 1-based column positions | 0-based offsets |
| Record format | Fixed or variable length, EBCDIC | Fixed 80-byte ASCII only (variable-length input padded) |
| Stages | Hundreds of built-in stages | 14 implemented so far |
| Topology | Multi-stream: fan-out, fan-in, multi-pipe | Linear only (multi-pipe planned) |
| Environment | z/VM, z/OS mainframes | CLI (native) and browser (WASM) |
| Character set | EBCDIC | ASCII/UTF-8 |

This is a teaching tool and nostalgia project, not a production replacement.
Implementation Details
| Metric | Value |
| --- | --- |
| Language | Rust (2024 edition) |
| Web UI | Yew framework, compiled to WASM |
| Stages | 14 implemented |
| Test Specs | 23 pipeline specifications |
| Tests | 60+ (including batch/RAT equivalence) |
| License | MIT |
| Live Demo | sw-comp-history.github.io/pipelines-rs |

Resources
Credits
| Role | Who |
| --- | --- |
| Concept & direction | Mike Wright |
| Content creation | Claude (Anthropic) |
| Editorial review | ChatGPT (OpenAI) |
Mainframe ideas, modern tools. Follow for more.
-
985 words • 5 min read • Abstract
Small Models (6/6): Which Small AI Fits YOUR Laptop?

Maximum AI capability on minimum hardware. The 2-3B efficient frontier.
This is Part 6 (the finale) of the Small Models, Big Brains series. We’re benchmarking the best small models to help you choose the right one for your laptop.
| Resource | Link |
| --- | --- |
| Code | efficient-llm |
| Phi-2 | microsoft/phi-2 |
| Gemma | ai.google.dev/gemma |
| SmolLM | HuggingFace Blog |
| Video | Which Small AI Fits YOUR Laptop? |
The Efficient Frontier
In economics, the “efficient frontier” is the set of optimal portfolios offering the highest return for a given level of risk.
In AI, it’s the models offering the best capability for a given size.
The Contenders
| Model | Params | Source | Key Strength |
| --- | --- | --- | --- |
| Phi-2 | 2.7B | Microsoft | Reasoning, synthetic data |
| Gemma-2B | 2B | Google | Distillation, multilingual |
| SmolLM2-1.7B | 1.7B | HuggingFace | 11T tokens, fast inference |
| SmolLM3-3B | 3B | HuggingFace | Dual reasoning, 6 languages |

Benchmark Results
Actual measurements on Apple Silicon (M-series) from efficient-llm:
| Model | MMLU | GSM8K | HumanEval | Speed (CPU) | Memory |
| --- | --- | --- | --- | --- | --- |
| Phi-2 | 61.7% | 57.0% | 50.0% | 7.1 tok/s | 5.2GB |
| Gemma-2B | 38.9% | 18.0% | 90.0% | 8.5 tok/s | 4.7GB |
| SmolLM2 | 55.6% | * | * | 3.7 tok/s | 3.2GB |

*SmolLM2 GSM8K/HumanEval scores reflect prompt format incompatibility, not capability.
The Key Insight: Data Quality Beats Parameters
Phi-2 achieves 61.7% MMLU with only 2.7B parameters.
For comparison:
- Llama-2-7B: ~46% MMLU
- Llama-2-13B: ~55% MMLU
Phi-2 beats models 5x its size. The secret? Synthetic textbook training.
Microsoft generated high-quality educational content specifically designed to teach reasoning. Quality data > quantity data > model size.
Model Profiles
Phi-2: The Reasoning Champion
Strengths: Math, logic, code understanding
Weakness: Less conversational
Best for: Technical tasks, chain-of-thought

Phi-2 was trained on “textbook quality” synthetic data. It thinks like a textbook explains.
Gemma-2B: The Distillation Expert
Strengths: Multilingual, edge deployment
Weakness: Lower benchmark scores
Best for: Production apps, Google ecosystem

Google distilled knowledge from larger models into this compact package. Great tooling and documentation.
SmolLM2-1.7B: The Speed Demon
Strengths: Fastest inference, smallest footprint
Weakness: Prompt format sensitivity
Best for: Memory-constrained environments

HuggingFace trained on 11T tokens—massive overtraining like TinyLlama but at a slightly larger scale.
SmolLM3-3B: The Balanced Choice
Strengths: Dual reasoning modes, 6 languages
Weakness: Newest, less battle-tested
Best for: General-purpose small model needs

The latest from HuggingFace, designed to be the go-to small model.
Decision Framework
```
├── Need best reasoning? → Phi-2
├── Need instruction following? → SmolLM2 or SmolLM3
├── Need multilingual? → Gemma-2B or SmolLM3
├── Memory constrained (<4GB)? → SmolLM2 + INT4
├── Need Google ecosystem? → Gemma-2B
├── General purpose? → SmolLM3
└── Maximum quality per byte? → Phi-2
```

Running the Benchmarks
```shell
git clone https://github.com/softwarewrighter/efficient-llm
cd efficient-llm

# Setup
uv venv && source .venv/bin/activate
uv pip install torch transformers accelerate bitsandbytes datasets tqdm

# HuggingFace login (required for Gemma)
huggingface-cli login

# Download and benchmark
python download_models.py
python benchmark_quality.py
python benchmark_speed.py
python benchmark_memory.py

# Interactive demos
python demo_reasoning.py
python demo_code.py
python demo_chat.py
```

Hardware Requirements
| Setup | Models You Can Run |
|---|---|
| 4GB RAM | SmolLM2 (INT4) |
| 8GB RAM | All models (INT4) |
| 16GB RAM | All models (FP16) |
| Apple Silicon | All models (MPS) |

Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | 7 .py files |
| Estimated Size | ~1.4 KLOC |
| Framework | Transformers, PyTorch |
| Build System | uv / pip |
| Key Features | MMLU/GSM8K/HumanEval benchmarks, demos |

Good for you if: You want to benchmark 2-3B models, compare quality vs speed tradeoffs, or run interactive comparisons between Phi-2, Gemma, and SmolLM.
Complexity: Low. Similar structure to billion-llm. Standalone Python scripts for each benchmark and demo. Requires HuggingFace authentication for Gemma access.
Series Recap
Over six parts, we’ve explored the cutting edge of small model research:
| Part | Model/Topic | Key Insight |
|---|---|---|
| 1 | TRM (<1K params) | Iteration beats scale |
| 2 | MobileLLM (350M) | Offline AI is practical |
| 3 | HRM (27M) | Hierarchy enables reasoning |
| 4 | BDH | Sparsity enables interpretability |
| 5 | 1B models | The efficiency sweet spot |
| 6 | 2-3B models | Data quality beats parameters |

Key Takeaways
- Data quality beats parameter count. Phi-2 proves careful curation outperforms brute scaling.
- The 2-3B range is remarkably capable. These models handle real tasks, not just demos.
- Each model has its niche. Match the model to your use case.
- Quantization makes everything accessible. INT4 lets you run 3B models on 4GB RAM.
- The frontier keeps moving. SmolLM3 is weeks old. Better models are coming.
What We’ve Learned
Small models aren’t a compromise—they’re a different optimization target. When you can’t throw compute at a problem, you’re forced to be clever:
- Recursive reasoning (TRM)
- Mobile-optimized architectures (MobileLLM)
- Hierarchical decomposition (HRM)
- Sparse interpretable activations (BDH)
- Overtraining on quality data (TinyLlama, Phi-2)
These techniques will eventually feed back into large models too. Small model research isn’t a dead end—it’s the frontier.
Resources
Part 6 of 6 in the Small Models, Big Brains series. Thanks for following along!
Have questions? Find me on YouTube @SoftwareWrighter or Discord.
-
446 words • 3 min read • Abstract
Five ML Concepts - #2

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #2 |
References
| Concept | Reference |
|---|---|
| Gradient Descent | An overview of gradient descent optimization algorithms (Ruder 2016) |
| Attention | Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2014) |
| DPO | Direct Preference Optimization (Rafailov et al. 2023) |
| Learning Rate | Cyclical Learning Rates (Smith 2015) |
| Temperature | On the Properties of Neural Machine Translation (Cho et al. 2014) |

Today’s Five
1. Gradient Descent
A general optimization method used across machine learning. It improves a model by taking small steps in the direction that reduces error the most.
Many learning algorithms rely on it, especially neural networks.
Like walking downhill in fog, adjusting each step based on the slope beneath your feet.
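The "walking downhill" idea fits in a few lines of code. This is a minimal illustrative sketch (not from the video): minimize f(x) = (x - 3)², whose gradient is 2(x - 3).

```python
# Minimal gradient descent sketch: repeatedly step against the gradient.

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Take `steps` small steps downhill from x0 with step size `lr`."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)  # move in the direction that reduces error
    return x

x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # converges toward the minimum at x = 3
```

Too large an `lr` here would overshoot and diverge; too small would crawl — exactly the learning-rate tradeoff described below.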
2. Attention
A mechanism that lets models weigh different parts of the input by importance. Instead of treating everything equally, attention highlights what matters most.
This was key to breakthroughs in translation and language models.
Like reading a sentence and focusing more on the important words.
3. DPO (Direct Preference Optimization)
A method for aligning language models with human preferences. Unlike RLHF, it trains directly on preference comparisons and avoids an explicit reward model.
This simplifies training while achieving comparable alignment.
Like learning preferences by observing choices, not by designing a scoring system.
4. Learning Rate
Controls how large each update step is during training. Too large and learning becomes unstable. Too small and training is slow or gets stuck.
One of the most important hyperparameters to tune.
Like choosing how fast to walk downhill without losing balance.
5. Temperature
A parameter that controls randomness during text generation. Low temperature favors predictable, high-probability outputs. Higher temperature increases variety and surprise.
A tradeoff between consistency and creativity.
Like adjusting a dial from cautious to adventurous.
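Concretely, temperature divides the logits before softmax. A small self-contained sketch (illustrative values, not from any particular model):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature: low T sharpens the distribution, high T flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)  # peaked: top token dominates
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter: more variety
print(max(cold) > max(hot))  # True
```

At T → 0 sampling approaches greedy decoding; at high T it approaches uniform randomness.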
Quick Reference
| Concept | One-liner |
|---|---|
| Gradient Descent | Walk downhill to minimize error |
| Attention | Focus on what matters in the input |
| DPO | Align models from preference pairs directly |
| Learning Rate | Step size that balances speed and stability |
| Temperature | Dial between predictable and creative |
Short, accurate ML explainers. Follow for more.
-
839 words • 5 min read • Abstract
Small Models (5/6): Max AI Per Watt

One billion parameters. The sweet spot for AI.
Big enough to reason. Small enough to run anywhere. Maximum capability per watt.
This is Part 5 of the Small Models, Big Brains series, comparing four models at the 1B parameter point.
| Resource | Link |
|---|---|
| Code | billion-llm |
| TinyLlama | jzhang38/TinyLlama |
| Llama 3.2 | ai.meta.com/llama |
| Pythia | EleutherAI/pythia |
| Video | Max AI Per Watt |
Why One Billion?
| Range | Reality |
|---|---|
| Below 1B | Models struggle with complex reasoning |
| Above 1B | Hardware requirements increase significantly |
| At 1B | Maximum capability per watt |

1B parameters is where you get:
- Real language understanding
- Ability to follow instructions
- Fine-tuning in minutes on a laptop
- Deployment anywhere (phone, Raspberry Pi, browser)
The Contenders
| Model | Params | Key Strength | Training Data |
|---|---|---|---|
| TinyLlama | 1.1B | Overtrained on 3T tokens | Community |
| Llama-3.2-1B | 1B | Official Meta ecosystem | Meta |
| StableLM-1.6B | 1.6B | Multilingual, 2T tokens | Stability AI |
| Pythia-1B | 1.08B | 154 research checkpoints | EleutherAI |

TinyLlama: The Overtraining Champion
TinyLlama breaks the rules. The Chinchilla scaling laws suggest training tokens should scale with parameters. TinyLlama uses 100x more data than optimal.
```
Chinchilla-optimal for 1B: ~30B tokens
TinyLlama actual:          3T tokens (3,000B)
```

The result? A tiny model that punches well above its weight.
Benchmarks
From the billion-llm repository:
| Model | MMLU | HumanEval | Speed | Memory |
|---|---|---|---|---|
| TinyLlama | 25.3% | 12.2% | Fast | 2.2GB |
| Llama-3.2-1B | 32.1% | 18.5% | Fast | 2.4GB |
| StableLM-1.6B | 30.8% | 15.1% | Medium | 3.2GB |
| Pythia-1B | 26.4% | 10.3% | Fast | 2.2GB |

Llama-3.2-1B leads on quality. TinyLlama offers the best value when you factor in the open training recipe.
LoRA Fine-Tuning in Minutes
All these models can be fine-tuned on a laptop using LoRA:
```shell
cd billion-llm
python finetune_demo.py --model tinyllama --epochs 3
```

LoRA adds small trainable adapters without modifying base weights:
```
Base Model (frozen): 1.1B parameters
LoRA Adapters:       ~4M parameters (0.4%)
Training time:       5-10 minutes on M1 Mac
```

Speculative Decoding: 2-3x Speedup
Use a fast 1B model to draft tokens, verify with a slower 7B model:
```
Draft (1B):  "The quick brown fox" → [jumps, over, the, lazy]
Verify (7B): Accept [jumps, over, the] → Reject [lazy] → Generate [sleepy]
```

The 1B model generates candidates quickly. The 7B model only needs to verify, not generate from scratch.
```shell
python speculative_demo.py
```

Results: 2-3x speedup on autoregressive generation.
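The draft/verify loop can be sketched with toy lookup-table "models". This is purely illustrative (the repo's demo uses real HuggingFace models); the `draft`/`expert` tables below are hypothetical stand-ins:

```python
def speculative_step(prefix, draft_model, expert_model, k=4):
    """One round: draft k tokens cheaply, keep the prefix the expert agrees with,
    and let the expert supply one corrected token at the first disagreement."""
    # Draft phase: the cheap model proposes k tokens.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Verify phase: the expert checks each draft token in order.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expert_t = expert_model(ctx)
        if expert_t == t:
            accepted.append(t)   # agreement: keep the free draft token
            ctx.append(t)
        else:
            accepted.append(expert_t)  # correction replaces the rejected token
            break
    return accepted

# Toy "models" reproducing the article's example sequence.
draft = {"fox": "jumps", "jumps": "over", "over": "the", "the": "lazy"}.get
expert = {"fox": "jumps", "jumps": "over", "over": "the", "the": "sleepy"}.get
step = speculative_step(["fox"], lambda c: draft(c[-1]), lambda c: expert(c[-1]))
print(step)  # ['jumps', 'over', 'the', 'sleepy']
```

One expert pass per accepted token would be needed anyway; the win is that several draft tokens are often accepted per round, so the slow model is invoked on longer strides.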
Hardware Requirements
| Setup | What You Can Run |
|---|---|
| CPU only | All models (slower, INT4 quantized) |
| 4GB VRAM | All models (INT4 quantized) |
| 8GB VRAM | All models (FP16) |
| Apple Silicon | All models (MPS acceleration) |

Quick Start
```shell
git clone https://github.com/softwarewrighter/billion-llm
cd billion-llm

# Setup
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Download models
python download_models.py

# Run benchmarks
python benchmark.py

# Interactive comparison
python demo_chat.py --compare tinyllama llama3.2-1b
```

Which Model Should You Choose?
```
├── Need Meta ecosystem compatibility? → Llama-3.2-1B
├── Need multilingual support? → StableLM-1.6B
├── Need research reproducibility? → Pythia-1B (154 checkpoints)
├── Need maximum performance/size? → TinyLlama
└── Just getting started? → Any of them work!
```

Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | 8 .py files |
| Estimated Size | ~1.4 KLOC |
| Framework | Transformers, PyTorch |
| Build System | uv / pip |
| Key Features | Benchmarking, LoRA fine-tuning, speculative decoding |

Good for you if: You want to benchmark small LLMs, learn LoRA fine-tuning, experiment with speculative decoding, or compare models head-to-head.
Complexity: Low. Clean Python scripts with HuggingFace Transformers. Each script is standalone—run benchmarks, chat demos, or fine-tuning independently. Well-documented with shell scripts for common tasks.
Key Takeaways
- 1B is the efficiency sweet spot. Below this, capability drops. Above, hardware costs rise.
- Overtraining works. TinyLlama proves you can compensate for size with data.
- LoRA makes fine-tuning accessible. Customize models on consumer hardware.
- Speculative decoding is free speed. Use small models to accelerate large ones.
- All roads lead to open weights. Every model here is fully open.
What’s Next
Part 6 explores the 2-3B efficient frontier—Phi-2, Gemma, and SmolLM pushing the limits of small model capability.
Resources
- billion-llm Repository
- TinyLlama
- Llama 3.2
- Pythia
- LoRA Paper
- Speculative Decoding Paper
- Video: Max AI Per Watt
-
411 words • 3 min read • Abstract
Five ML Concepts - #1

5 machine learning concepts. Under 30 seconds each.
| Resource | Link |
|---|---|
| Papers | Links in References section |
| Video | Five ML Concepts #1 |
References
| Concept | Reference |
|---|---|
| Backprop | Learning representations by back-propagating errors (Rumelhart, Hinton & Williams 1986) |
| Transformer | Attention Is All You Need (Vaswani et al. 2017) |
| Mamba | Mamba: Linear-Time Sequence Modeling (Gu & Dao 2023) |
| Hallucination | Survey of Hallucination in NLG (Ji et al. 2023) |
| Embedding | Word2Vec (Mikolov et al. 2013) |

Today’s Five
1. Backpropagation
Short for “backward propagation of errors.” It’s how neural networks learn—flowing error backward through the network to adjust each weight.
Without it, modern deep learning wouldn’t be practical.
Think of it like retracing your steps to see which earlier choices caused the mistake.
2. Transformer
The architecture behind GPT, Claude, and most modern language models. Instead of processing words one at a time, transformers use attention to weigh relationships between all tokens.
This enables parallel training and rich context awareness.
Like understanding a sentence by seeing how every word relates to every other.
3. Mamba (State Space Models)
A newer alternative to transformers that processes sequences in linear time instead of quadratic.
This allows scaling to very long documents with much lower memory use.
Like a smart conveyor belt that carries forward only what matters.
4. Hallucination
When a model generates confident-sounding nonsense. It happens because language models predict plausible next words, not true facts.
They optimize for likelihood, not correctness.
Like a student who writes confidently without verifying sources.
5. Embedding
Turning words, images, or concepts into vectors of numbers. Similar meanings end up close together in this space.
This lets math capture semantic relationships.
Think of it as a coordinate system for meaning.
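"Close together in this space" is usually measured with cosine similarity. A tiny sketch with made-up 3-d vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 = same direction, ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: "cat" and "dog" point in similar directions, "car" doesn't.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```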
Quick Reference
| Concept | One-liner |
|---|---|
| Backprop | Learn by flowing error backward |
| Transformer | Attention over all tokens at once |
| Mamba | Linear-time sequence modeling |
| Hallucination | Confident nonsense from likelihood optimization |
| Embedding | Meaning as coordinates in vector space |
Short, accurate ML explainers. Follow for more.
-
842 words • 5 min read • Abstract
Small Models (4/6): This AI Has a Visible Brain

LLMs are black boxes. Baby Dragon Hatchling (BDH) is different—a brain-inspired language model with sparse, interpretable activations.
Train it on Shakespeare and actually see what’s happening inside.
This is Part 4 of the Small Models, Big Brains series, exploring interpretability through sparsity.
| Resource | Link |
|---|---|
| Paper | Pathway (Sparse Coding) |
| Original Code | pathwaycom/bdh |
| Fork (with tools) | softwarewrighter/bdh |
| Video | This AI Has a Visible Brain |
The Black Box Problem
Modern neural networks are opaque:
- Billions of parameters
- Dense activations everywhere
- No clear mapping from neurons to concepts
- “It works, but we don’t know why”
This isn’t just an academic concern. We’re deploying AI systems we don’t understand.
Baby Dragon Hatchling: A Different Approach
BDH takes inspiration from biological brains, which use sparse coding:
| Biological Brains | Dense Neural Networks |
|---|---|
| ~1-5% neurons active | ~100% neurons active |
| Energy efficient | Computationally expensive |
| Interpretable patterns | Distributed, opaque |
| Robust to noise | Brittle |

Sparse Activations
BDH enforces 80% sparsity—only 20% of neurons are active for any given token.
```
Dense Network: [████████████████████] 100% active
BDH:           [████░░░░░░░░░░░░░░░░]  20% active
```

This constraint forces the network to learn meaningful, localized representations.
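One common way to enforce this kind of constraint is top-k sparsification: keep only the strongest activations and zero the rest. A minimal sketch of the idea (BDH's actual mechanism may differ):

```python
def sparsify(activations, keep_fraction=0.2):
    """Keep only the top keep_fraction of activations by magnitude; zero the rest."""
    k = max(1, int(len(activations) * keep_fraction))
    # Threshold = magnitude of the k-th strongest activation.
    threshold = sorted((abs(a) for a in activations), reverse=True)[k - 1]
    return [a if abs(a) >= threshold else 0.0 for a in activations]

acts = [0.9, -0.1, 0.05, 0.7, -0.02, 0.3, 0.01, 0.2, -0.6, 0.08]
sparse = sparsify(acts)  # 20% of 10 neurons survive
print(sparse)  # [0.9, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With only two survivors per forward pass, asking "which neurons fired, and on which inputs?" becomes a tractable question.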
Training on Shakespeare
The demo trains BDH on Shakespeare’s works:
```
Training Progress:
Epoch 1:   Loss 0.86
Epoch 50:  Loss 0.54
Epoch 100: Loss 0.38
Epoch 200: Loss 0.22
```

Loss drops from 0.86 to 0.22—the architecture works.
Seeing Inside the Model
With sparse activations, you can actually inspect what neurons mean:
```python
# Which neurons fire for "love"?
activations = model.forward("love")
active_neurons = activations.nonzero()

# Neuron 47: fires for emotional words
# Neuron 112: fires for abstract nouns
# Neuron 203: fires for relationship terms
```

When only 20% of neurons fire, each one carries interpretable meaning.
Running the Code
The bdh repository is a fork of Pathway’s original with added inspection tools:
```shell
git clone https://github.com/softwarewrighter/bdh
cd bdh
pip install -r requirements.txt

# Train on Shakespeare
python train.py --dataset shakespeare --sparsity 0.8

# Inspect activations
python inspect.py --model checkpoint.pt --text "To be or not to be"
```

GPU recommended (Nvidia or Apple Silicon) for reasonable training times.
Why Sparsity Enables Interpretability
Dense Networks
Every neuron participates in every computation. The “meaning” of any single neuron is distributed across all inputs it ever sees.
Input: "cat" → All neurons contribute → Output Input: "dog" → All neurons contribute → Output Input: "love" → All neurons contribute → OutputTrying to understand one neuron means understanding everything.
Sparse Networks
Only a small subset of neurons fire for each input. Neurons develop specialization.
Input: "cat" → Neurons [12, 47, 89] fire → Output Input: "dog" → Neurons [12, 52, 89] fire → Output Input: "love" → Neurons [47, 112, 203] fire → OutputNeuron 12 might mean “animal.” Neuron 47 might mean “emotional/living.” You can actually trace meaning.
Comparison with Other Sparse Architectures
| Model | Sparsity Type | Purpose |
|---|---|---|
| Mixture of Experts | Routing sparsity | Efficiency |
| Top-k attention | Attention sparsity | Memory |
| BDH | Activation sparsity | Interpretability |

BDH’s sparsity is specifically designed for understanding, not just efficiency.
Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | 9 .py files |
| Estimated Size | ~1.5 KLOC |
| Framework | PyTorch |
| Build System | pip / requirements.txt |
| GPU Support | CUDA, MPS (Apple Silicon) |

Good for you if: You want to experiment with sparse neural architectures, study interpretability techniques, or train small language models with visible internals.
Complexity: Low-Moderate. Standard PyTorch project structure. The sparse activation mechanism is well-documented. Fork includes additional inspection tools not in the original.
Key Takeaways
- Sparsity enables interpretability. When fewer neurons fire, each one means more.
- Brain-inspired design works. Biological neural coding principles transfer to AI.
- Interpretability doesn’t require sacrifice. BDH learns effectively despite constraints.
- We can build AI we understand. Black boxes aren’t inevitable.
Current Limitations
- Early research stage
- Smaller scale than production models
- Training requires more epochs
- Not yet competitive with dense models on benchmarks
But the principle is sound: constraint breeds clarity.
What’s Next
Part 5 dives into the 1B parameter sweet spot—comparing TinyLlama, Llama 3.2, StableLM, and Pythia.
Resources
- Pathway Paper
- Original Pathway Code
- bdh Repository (with inspection tools)
- Video: This AI Has a Visible Brain
-
1473 words • 8 min read • Abstract
Solving Sparse Rewards with Many Eyes

Single explorer: 0% success. Five explorers: 60% success.
Learning often fails not because models are slow, but because they see too little. In sparse-reward environments, a single explorer is likely to miss the rare feedback entirely. The solution? Put many eyes on the problem.
| Resource | Link |
|---|---|
| Related | Intrinsic Rewards and Diversity |
| Papers | IRPO · Reagent |
| Code | many-eyes-learning |
| ELI5 | eli5.md |
| Video | Given enough eyeballs… |
The Problem: Sparse Rewards Create Blindness
As IRPO formalizes: in sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal.
A 7x7 grid with a single goal demonstrates this perfectly:
- Random agent success rate: ~9%
- With limited training (75 episodes), a single learner exploring alone never finds the goal
This isn’t a compute problem. It’s an information problem.
| Challenge | Effect | Paper Connection |
|---|---|---|
| Rare rewards | Weak gradient signal | IRPO’s core problem statement |
| Single explorer | Limited coverage | Why multiple scouts help |
| Random exploration | Misses valuable states | Why intrinsic rewards matter |
| No feedback structure | Can’t distinguish “almost right” from “nonsense” | Reagent’s motivation |

The Solution: Many Eyes
Instead of one explorer, use multiple scouts—independent exploratory agents that gather diverse information.
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Scout 1   │   │   Scout 2   │   │   Scout N   │
│ (strategy A)│   │ (strategy B)│   │ (strategy N)│
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       v                 v                 v
┌─────────────────────────────────────────────────┐
│                Experience Buffer                │
└─────────────────────────────────────────────────┘
                        │
                        v
┌─────────────────────────────────────────────────┐
│                 Shared Learner                  │
└─────────────────────────────────────────────────┘
```

Each scout explores with its own strategy. Their discoveries are aggregated to improve a shared learner.
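The architecture above boils down to experience pooling. A toy sketch (a 1-d random walk instead of the repo's 7x7 grid and DQN learner; entirely illustrative):

```python
import random

def run_scout(seed, goal=5, steps=20):
    """A scout random-walks from 0 and records (state, reward) experience.
    The reward is sparse: 1 only when the walk lands exactly on the goal."""
    rng = random.Random(seed)  # each scout gets its own seed, hence its own path
    state, experience = 0, []
    for _ in range(steps):
        state += rng.choice([-1, 1])
        reward = 1 if state == goal else 0
        experience.append((state, reward))
    return experience

# Five eyes instead of one: all transitions land in one shared buffer,
# which is what a single shared learner would train on.
shared_buffer = []
for seed in range(5):
    shared_buffer.extend(run_scout(seed))

hits = sum(r for _, r in shared_buffer)
print(f"{len(shared_buffer)} pooled transitions, {hits} goal discoveries")
```

The point is structural, not statistical: one learner sees the union of five different trajectories for the same per-scout step budget.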
Results
On a 7x7 sparse grid with 75 training episodes:
| Method | Success Rate |
|---|---|
| Random baseline | 9% |
| Single scout | 0% |
| Many eyes (3 scouts) | 40% |
| Many eyes (5 scouts) | 60% |

Same total environment steps. Dramatically better outcomes.
Why It Works
Single Scout Fails Because:
In IRPO terms: sparse reward → sparse gradient signal → no learning.
- Random exploration rarely reaches the goal (~9%)
- Insufficient successful trajectories
- DQN can’t learn from sparse positive examples
- The policy gradient has near-zero magnitude
Many Eyes Succeeds Because:
IRPO’s key insight: multiple exploratory policies manufacture signal.
- More coverage: Different scouts explore different regions (intrinsic rewards drive novelty-seeking)
- More discoveries: Higher probability of reaching goal (scouts find extrinsic reward)
- Signal routing: Scout discoveries update the shared learner (surrogate gradient in IRPO, experience pooling in many-eyes)
- Better gradients: Aggregated experience provides meaningful learning signal
Scout Strategies (Intrinsic Rewards)
IRPO uses intrinsic rewards to drive exploration. The many-eyes-learning project implements several strategies:
| Strategy | Intrinsic Motivation | IRPO Connection |
|---|---|---|
| Epsilon-greedy | Random action with probability ε | Simple exploration noise |
| Curious | Bonus for novel states: 1/√(count+1) | Count-based intrinsic reward |
| Optimistic | High initial Q-values | Optimism under uncertainty |
| Random | Pure random baseline | Maximum entropy exploration |

```python
# CuriousScout intrinsic reward (simplified)
def intrinsic_reward(self, state):
    count = self.state_counts[state]
    return self.bonus_scale / sqrt(count + 1)
```

Scouts can be homogeneous (same strategy, different seeds) or heterogeneous (different strategies). IRPO supports swapping intrinsic reward functions—many-eyes makes this concrete with pluggable scout types.
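The simplified count-based bonus above assumes a surrounding scout class; a self-contained, runnable version looks like this (a sketch; the repo's actual class interface may differ):

```python
from collections import defaultdict
from math import sqrt

class CuriousScout:
    """Scout that rewards itself for visiting novel states."""

    def __init__(self, bonus_scale=1.0):
        self.bonus_scale = bonus_scale
        self.state_counts = defaultdict(int)

    def intrinsic_reward(self, state):
        """Novel states earn a large bonus; the bonus decays with repeat visits."""
        count = self.state_counts[state]
        self.state_counts[state] += 1
        return self.bonus_scale / sqrt(count + 1)

scout = CuriousScout()
first = scout.intrinsic_reward("cell_3_4")   # 1.0 on the first visit
later = scout.intrinsic_reward("cell_3_4")   # ~0.707 on the second
print(first > later)  # True: familiarity reduces curiosity
```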
Running the Demo
```shell
git clone https://github.com/softwarewrighter/many-eyes-learning
cd many-eyes-learning

# Setup
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Interactive CLI demo
python experiments/cli_demo.py

# Full experiment
python experiments/run_experiment.py --episodes 75 --scouts 1 3 5

# Generate plots
python experiments/plot_results.py
```

Results appear in ~5-10 minutes on a laptop.
Diversity Experiment
Does diversity of strategies matter, or just number of scouts?
| Configuration | Success Rate |
|---|---|
| 5 random scouts | 20% |
| 5 epsilon-greedy scouts | 40% |
| 5 diverse scouts (mixed strategies) | 40% |

Finding: In simple environments, strategy quality matters more than diversity. Epsilon-greedy beats random regardless of diversity.
Key Insight
The problem isn’t that learning is slow. The problem is that learning is blind.
Many eyes make learning better.
Implementation Details
| Metric | Value |
|---|---|
| Primary Language | Python |
| Source Files | ~12 .py files |
| Estimated Size | ~1.5 KLOC |
| Framework | PyTorch, NumPy |
| Platform | CPU (no GPU required) |

Good for you if: You want to understand exploration in RL, experiment with sparse-reward environments, or see a clean implementation of scout-based learning.
Complexity: Low-Moderate. Clean codebase with CLI demos. Runs on a laptop in minutes.
Design Philosophy
The project prioritizes clarity over performance:
- Single-file implementations where practical
- Minimal dependencies
- Sequential mode is first-class (parallel optional)
- Reproducible experiments with fixed seeds
Simplifications from IRPO
Full IRPO computes Jacobians to route gradients from exploratory policies back to the base policy. Many-eyes-learning simplifies this:
| IRPO | Many-Eyes-Learning |
|---|---|
| Jacobian chain rule | Experience pooling |
| Surrogate gradient | Standard DQN updates |
| Learned intrinsic rewards | Hand-designed strategies |

The core insight remains: scouts explore with intrinsic motivation, discoveries benefit the shared learner. The math is simpler, the demo runs on a laptop, and the concept is clear.
Key Takeaways
- Sparse rewards create information bottlenecks. Learning fails not from lack of compute, but lack of signal.
- More eyes = more information. Multiple scouts increase coverage and discovery rate.
- Diversity helps, but quality matters more. In simple environments, good exploration strategy beats diversity.
- Same compute, better outcomes. Many-eyes improves sample efficiency, not wall-clock speed.
The Papers Behind Many-Eyes
This project builds on two recent papers that attack the same fundamental problem: sparse rewards starve learning of signal.
IRPO: Intrinsic Reward Policy Optimization
IRPO (Cho & Tran, UIUC) formalizes the scouts concept mathematically.
The core insight: In sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal. Learning stalls.
IRPO’s solution:
```
┌─────────────────────────────────────────────────┐
│ 1. Train exploratory policies (scouts)          │
│    using INTRINSIC rewards                      │
├─────────────────────────────────────────────────┤
│ 2. Scouts discover EXTRINSIC rewards            │
│    through exploration                          │
├─────────────────────────────────────────────────┤
│ 3. Route extrinsic signal back to base policy   │
│    via surrogate gradient (Jacobian chain)      │
└─────────────────────────────────────────────────┘
```

| IRPO Concept | What It Means |
|---|---|
| Intrinsic rewards | “Explore what’s new” - reward novelty |
| Exploratory policies | Scouts driven by intrinsic motivation |
| Surrogate gradient | Trade bias for signal - an approximate gradient that actually has magnitude |
| Base policy | The learner that benefits from scout discoveries |

How many-eyes-learning demonstrates this:
- Scouts implement intrinsic motivation (CuriousScout uses count-based novelty bonuses)
- Multiple exploration strategies create diverse coverage
- Aggregated experience routes discoveries to the shared DQN learner
- Simplified gradient routing - we pool experiences rather than compute full Jacobians
Reagent: Reasoning Reward Models for Agents
Reagent (Fan et al., CUHK/Meituan) takes a different approach: make feedback richer and more structured.
The problem with sparse rewards: They can’t distinguish “almost right, failed at the end” from “complete nonsense.” Both get the same zero reward.
Reagent’s solution: Build a Reasoning Reward Model that emits:
| Signal | Purpose |
|---|---|
| `<think>` | Explicit reasoning trace |
| `<critique>` | Targeted natural-language feedback |
| `<score>` | Overall scalar reward |

This provides dense-ish supervision without hand-labeling every step.
How many-eyes-learning relates:
- Both papers recognize sparse rewards as an information problem
- Reagent enriches the reward signal; IRPO multiplies the exploration
- Many-eyes takes the IRPO path: more explorers finding the sparse signal
- Future work could combine both: scouts + richer feedback per trajectory
The Shared Meta-Lesson
Both papers are saying the same thing:
Sparse signals are a tragedy. Let’s smuggle in richer ones.
- IRPO: via intrinsic-reward exploration gradients
- Reagent: via language-based reward feedback
Many-eyes-learning demonstrates the IRPO intuition in a simple, visual, reproducible way.
Resources
- IRPO Paper (arXiv:2601.21391)
- Reagent Paper (arXiv:2601.22154)
- many-eyes-learning Repository
- Results Documentation
- Architecture Documentation
Sparse rewards are an information problem. Many eyes provide the solution.
-
661 words • 4 min read • Abstract
MCP: Teaching Claude to Play (and Trash Talk)

Claude learned to play tic-tac-toe. And trash talk. Using one protocol that works with any language model.
| Resource | Link |
|---|---|
| Code | game-mcp-poc |
| MCP Spec | modelcontextprotocol.io |
| Video | Claude Plays Tic-Tac-Toe |
The Problem
Language models are stuck in text. They can’t click buttons, make moves, or interact with real systems. Every integration is custom—different for Claude, GPT, Gemini.
The Solution: MCP
Model Context Protocol is a standard way for models to use tools. Define your tools once, they work with Claude, GPT, or any MCP-compatible agent.
The protocol is simple:
- JSON-RPC 2.0 over stdio
- No HTTP needed
- Clean request/response cycle
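The request/response cycle above can be sketched in a few lines. The payload shapes below are an illustration of JSON-RPC 2.0 framing with this demo's `make_move` tool; consult modelcontextprotocol.io for the exact MCP message schema:

```python
import json

# A JSON-RPC 2.0 tool-call request (fields per the JSON-RPC spec: jsonrpc, id,
# method, params). The tool name and arguments come from this demo.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "make_move", "arguments": {"row": 1, "col": 1}},
}

# The matching response carries the same id, so the client can pair them up.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "X played center"}]},
}

# Over stdio, each side just writes one JSON object per message -- no HTTP.
line = json.dumps(request)
print(json.loads(line)["method"])
```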
The Demo: Trash Talkin’ Tic Tac Toe
This proof-of-concept implements 6 MCP tools:
| Tool | Purpose |
|---|---|
| view_game_state | See the board, players, status |
| get_turn | Whose turn is it? |
| make_move | Play a square (row, col) |
| taunt_player | Send trash talk to opponent |
| restart_game | Start a new game |
| get_game_history | All moves with timestamps |

The AI calls tools, the server responds. Claude can play a full game AND talk trash—all through the same protocol.
Architecture
```
┌─────────────────────────────────────────────┐
│              Claude Code (AI)               │
│               (MCP Client)                  │
└──────────────────┬──────────────────────────┘
                   │ JSON-RPC 2.0 via stdio
                   ▼
┌─────────────────────────────────────────────┐
│          MCP Server (Rust Binary)           │
│  ┌───────────────────────────────────────┐  │
│  │ 6 Tools: view, turn, move, taunt,     │  │
│  │          restart, history             │  │
│  └───────────────────────────────────────┘  │
│                    ▼                        │
│  ┌───────────────────────────────────────┐  │
│  │          SQLite (game.db)             │  │
│  │   • Games   • Moves   • Taunts        │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
        ▲                        ▲
        │ REST API               │ MCP
        │                        │
   Browser (UI)             AI Agent
   (Yew/WASM)              (Claude Code)
```

Running It
```shell
git clone https://github.com/sw-game-dev/game-mcp-poc
cd game-mcp-poc

# Development mode (with hot-reload)
./scripts/dev.sh

# Or production build
./scripts/build.sh
./scripts/serve.sh
```

The server runs on http://localhost:7397, serving:

- REST API for UI interactions
- MCP endpoint for AI agents
- SSE for real-time updates
- Yew/WASM frontend
Configuring Claude Code
Add to ~/.config/claude-code/mcp.json:

```json
{
  "mcpServers": {
    "tic-tac-toe": {
      "command": "/path/to/game-mcp-poc/target/release/game-mcp-server",
      "args": [],
      "env": { "GAME_DB_PATH": "/path/to/game.db" }
    }
  }
}
```

Restart Claude Code, then:
```
You: "Let's play tic-tac-toe! Show me the board."
You: "I'll take the center."
You: "Your turn!"
You: "Can you taunt me?"
```

Implementation Details
| Metric | Value |
|---|---|
| Language | Rust 2024 Edition |
| Frontend | Yew + WebAssembly |
| Database | SQLite |
| Tests | 175+ passing |
| LOC | ~2,500 (backend) + ~1,500 (tests) |
| Binary Size | ~8 MB |

Good for you if: You want to learn MCP, build AI-tool integrations, or see a production-quality Rust game server.
Complexity: Moderate. Clean architecture with TDD. Requires Rust toolchain and understanding of JSON-RPC.
Key Takeaways
- MCP standardizes AI tools. Define once, works with any compatible model.
- JSON-RPC over stdio is elegant. No HTTP complexity for local tools.
- Rust + WASM = fast everywhere. Same language for server and (via Yew) frontend.
- Trash talk is essential. Games without taunting are just… exercises.
Resources
MCP turns language models into tool users. This demo proves it works—and that AI can talk trash.
-
789 words • 4 min read • Abstract
Small Models (3/6): Planner + Doer = Genius

27 million parameters beats o3-mini on ARC.
The hardest reasoning benchmark. Most LLMs score under 5 percent. This tiny model scores 40 percent.
This is Part 3 of the Small Models, Big Brains series, exploring the Hierarchical Reasoning Model (HRM)—a brain-inspired architecture that separates planning from execution.
| Resource | Link |
|---|---|
| Paper | Hierarchical Reasoning Model |
| Original Code | sapientinc/HRM |
| Visualization | viz-hrm-ft |
| Video | Planner + Doer = Genius |
The ARC Challenge
The Abstraction and Reasoning Corpus (ARC) tests:
- Abstract reasoning
- Pattern matching
- Spatial logic
- Puzzles requiring real thinking
These aren’t problems you can memorize. Each puzzle is unique, requiring genuine understanding of the underlying pattern.
Why LLMs Struggle
| Challenge | LLM Limitation |
|---|---|
| Novel patterns | Can’t rely on training data |
| Spatial reasoning | Text-based thinking is linearized |
| Multi-step logic | Each step compounds errors |
| Abstraction | Pattern matching isn’t generalization |

Meet HRM: The Hierarchical Reasoning Model
HRM uses just 27 million parameters but achieves remarkable results by mimicking how the brain thinks: plan first, then act.
Two-Module Architecture
```
┌─────────────────────────────────────┐
│              PLANNER                │
│      Thinks slow and abstract       │
│      Sets goals and strategies      │
└─────────────┬───────────────────────┘
              │ Goals
              ▼
┌─────────────────────────────────────┐
│               DOER                  │
│            Moves fast               │
│       Takes concrete actions        │
└─────────────────────────────────────┘
```

| Module | Speed | Function |
|---|---|---|
| Planner | Slow | Abstract thinking, goal setting |
| Doer | Fast | Concrete actions, execution |

This mirrors the brain’s dual-process theory: System 1 (fast, intuitive) and System 2 (slow, deliberate).
Results
| Benchmark | HRM (27M) | o3-mini | GPT-4 |
| --- | --- | --- | --- |
| ARC | 40% | <40% | <5% |
| Hard Mazes | 99% | - | ~0% |
| Complex Sudoku | 99% | - | - |

A 27M parameter model outperforming models 1000x larger on reasoning tasks.
The Visualization Tool
The viz-hrm-ft repository provides a React app to visualize HRM’s reasoning process:
- Watch the Planner form strategies
- See the Doer execute actions
- Visualize the feedback loop between modules
- Simulate fine-tuning on BabyAI tasks
```bash
git clone https://github.com/softwarewrighter/viz-hrm-ft
cd viz-hrm-ft
npm install
npm start
```

Why Hierarchy Works
Traditional Flat Models
```
Input → [Single Network] → Output
```

Everything happens in one pass. Complex problems overwhelm the network.
Hierarchical Models
```
Input → [Planner] → Strategy
             ↓
Strategy → [Doer] → Action
             ↓
Action → [Environment] → Feedback
             ↓
Feedback → [Planner] → Refined Strategy
             ↓
            ...
```

The Planner doesn't worry about details. The Doer doesn't worry about strategy. Each module focuses on what it does best.
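The loop above can be sketched in a few lines of Python. The `planner` and `doer` functions here are hand-written stand-ins for HRM's learned modules, purely to illustrate the division of labor:

```python
# Toy Planner-Doer loop on a grid navigation task.
# Illustrative only -- HRM's real modules are learned recurrent networks.

def planner(position, goal):
    """Slow, abstract: decide which axis to work on next."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    return "move_x" if abs(dx) >= abs(dy) else "move_y"

def doer(position, strategy, goal):
    """Fast, concrete: take one step that serves the current strategy."""
    x, y = position
    if strategy == "move_x":
        x += 1 if goal[0] > x else -1
    else:
        y += 1 if goal[1] > y else -1
    return (x, y)

position, goal = (0, 0), (3, 2)
while position != goal:
    strategy = planner(position, goal)        # slow loop: replan
    position = doer(position, strategy, goal) # fast loop: act

print(position)  # (3, 2)
```

The Planner never touches coordinates directly, and the Doer never looks more than one step ahead; the feedback loop between them does the rest.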
Key Insights
-
Separation of concerns scales. Splitting planning from execution lets each module specialize.
-
Iteration enables refinement. The Planner-Doer loop allows course correction.
-
Small can beat big. 27M parameters with good architecture beats 100B+ with brute force.
-
Brain-inspired design works. Mimicking cognitive architecture yields better results.
Comparison with Part 1 (TRM)
| Aspect | TRM | HRM |
| --- | --- | --- |
| Parameters | <1,000 | 27M |
| Architecture | Think-Act cycles | Planner-Doer hierarchy |
| Strength | Maze solving | Abstract reasoning |
| Key insight | Iteration | Hierarchical decomposition |

Both use recursive reasoning, but HRM adds hierarchical structure for more complex tasks.
Implementation Details
| Metric | Value |
| --- | --- |
| Primary Language | TypeScript |
| Source Files | 26 `.ts`/`.tsx`, 7 `.js` |
| Estimated Size | ~4 KLOC |
| Framework | React |
| Build System | npm / Create React App |
| Visualization | Canvas-based rendering |

Good for you if: You want to visualize neural reasoning processes, build interactive ML demos, or learn React with a real project.
Complexity: Low-Moderate. Standard React/TypeScript project. No ML training code—this is a visualization tool for understanding the HRM architecture. Easy to extend with new visualizations.
Key Takeaways
-
Plan, then act. Separating strategy from execution mirrors effective human thinking.
-
Hierarchy enables complexity. Multi-level reasoning handles problems flat networks can’t.
-
Architecture > Scale for reasoning tasks.
-
ARC remains unsolved by brute-force scaling—clever architectures are the path forward.
What’s Next
Part 4 explores Baby Dragon Hatchling (BDH)—a brain-inspired model with visible, interpretable activations.
Resources
-
705 words • 4 min read • Abstract
Deepseek Papers (2/3): Engram - Conditional Memory for Transformers

Deepseek publishes papers. I implement them. This paper tackles another fundamental transformer problem: redundant computation.
This post covers my implementation of Engram (Conditional Memory via Scalable Lookup)—running on both Apple Silicon and NVIDIA GPUs.
Resource Link Paper arXiv:2601.07372 Code engram-poc Video 1 Engram Part 1 
Video 2 Engram Part 2 
The Problem: Redundant Computation
LLMs waste compute reconstructing patterns they’ve seen before:
- Style rules repeated across files
- Common code idioms re-derived each call
- Boilerplate knowledge injected repeatedly
Attention computes everything from scratch every time. For recurring patterns, this is wasteful.
The Engram Solution: O(1) Lookup
Engram introduces conditional memory as a complementary sparsity axis. Instead of recomputing common patterns through attention, look them up in O(1) time.
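As a toy illustration of the lookup idea (not the paper's actual mechanism, which learns the memory inside the model), a hash table keyed on the prompt gives the O(1) path:

```python
# Toy conditional-memory lookup: check a pattern table before the expensive path.
# PATTERN_MEMORY and its keys are illustrative, not the paper's mechanism.

PATTERN_MEMORY = {
    "for i in range(": "len(items)):",
    "HTTP status for 'Not Found'?": "404",
}

def generate(prompt, expensive_model):
    cached = PATTERN_MEMORY.get(prompt)  # O(1) average-case lookup
    if cached is not None:
        return cached                    # recurring pattern: skip recomputation
    return expensive_model(prompt)       # novel input: full attention path

print(generate("for i in range(", lambda p: "<model output>"))  # len(items)):
```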
Think of it as a cache for the model’s learned patterns:
| Without Engram | With Engram |
| --- | --- |
| Recompute pattern every call | Look up cached result |
| O(n²) attention | O(1) deterministic lookup |
| Implicit knowledge | Explicit, inspectable memory |

The PoC Approach
The full Engram paper describes in-model memory. The engram-poc repo approximates the benefits through behavioral fine-tuning:
- Pattern Injection: Training data encodes lookup-like patterns
- LoRA Adapters: Learn to recognize and consistently respond
- Evaluation: Compare baseline vs tuned model
Pattern Categories
The PoC includes 131 patterns across 4 categories:
| Category | Example |
| --- | --- |
| Code Idioms | `for i in range(` → `len(items)):` |
| Factual Recall | `HTTP status for 'Not Found'?` → `404` |
| Format Transforms | `snake_case: getUserName` → `get_user_name` |
| Error Fixes | `Fix: if x = 5:` → `if x == 5:` |

Results
Training on SmolLM-135M-Instruct:
| Metric | Value |
| --- | --- |
| Training Examples | 337 |
| Training Time | ~10 seconds (M-series Mac) |
| Loss Reduction | 58.2% (4.34 → 1.82) |

Behavioral change:
```
Prompt:       Complete: for i in range(
Baseline:     "Here is a Python function that implements this approach..."
Engram-tuned: "len(items)):"
```

The tuned model produces direct, pattern-completing responses instead of verbose explanations.
Running the Engram Demo
```bash
git clone https://github.com/softwarewrighter/engram-poc
cd engram-poc

# Apple Silicon
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
./scripts/run_all.sh

# NVIDIA GPU (separate directory)
cd unsloth-nvidia
uv venv && source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu124
uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
./scripts/run_all.sh
```

Implementation Details
| Metric | Value |
| --- | --- |
| Primary Language | Python |
| Source Files | 24 `.py`, 10 `.sh`, 6 `.yaml` |
| Estimated Size | ~3.0 KLOC |
| Frameworks | MLX-LM, Unsloth |
| Platforms | Apple Silicon, NVIDIA CUDA |
| Key Features | LoRA fine-tuning, pattern evaluation, interactive demo |

Good for you if: You want to experiment with LoRA fine-tuning, understand behavioral pattern injection, or compare MLX vs Unsloth workflows.
Complexity: Moderate. Includes extensive documentation and video recording guides. Pattern data is human-readable YAML.
Key Takeaways
-
Engram reduces redundant computation. O(1) lookup for recurring patterns beats recomputing through attention.
-
LoRA makes experimentation accessible. Fine-tune small models in seconds on a laptop.
-
Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, with different tooling for each.
-
Deepseek publishes useful research. Their papers address real problems with practical solutions.
What’s Next
Part 3 will cover Engram Revisited—what happened when we moved from behavioral emulation to real hash-based memory implementation. Spoiler: it works, but not everywhere.
Resources
- Engram Paper (arXiv:2601.07372)
- engram-poc Repository
- Engram Video Part 1
- Engram Video Part 2
- Part 1: mHC
Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.
-
692 words • 4 min read • Abstract
Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs

A tiny 135M parameter model goes from 0% to 75% accuracy in 5 minutes of training. The secret? Knowledge graph-guided training with rejection sampling.
Resource Link Paper KG-Guided RAG (arXiv) Code multi-hop-reasoning ELI5 eli5.md Demo Live Demo Video LLM with Training Wheels 
Part 2: The Distribution Trap

The Problem: Multi-Hop Reasoning
LLMs struggle with questions requiring multiple reasoning steps. “What’s the fix for a crash caused by a corrupted config file on a system running outdated firmware?” requires connecting several facts:
- Corrupted config → need config reset
- Outdated firmware → need firmware update
- Crash context → check dependencies between these fixes
Standard fine-tuning teaches pattern matching. Multi-hop reasoning requires following logical chains.
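To make "following logical chains" concrete, here is a toy knowledge graph and multi-hop search in Python. The entities and edges are invented for illustration; the repo's actual graph has ~200 entities and ~600 edges:

```python
# Toy troubleshooting knowledge graph: a multi-hop answer is a chain of edges.
# Entities and edges are made up for illustration.

EDGES = {
    "crash": ["corrupted_config", "outdated_firmware"],
    "corrupted_config": ["config_reset"],
    "outdated_firmware": ["firmware_update"],
}

def multi_hop(start, target, path=()):
    """Depth-first search for a reasoning chain from symptom to fix."""
    path = path + (start,)
    if start == target:
        return path
    for nxt in EDGES.get(start, []):
        found = multi_hop(nxt, target, path)
        if found:
            return found
    return None

print(multi_hop("crash", "firmware_update"))
# ('crash', 'outdated_firmware', 'firmware_update')
```

Each hop is one edge traversal; a 4-5 hop question requires the model to chain several such steps without ever seeing the graph at inference time.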
The Paper’s Approach
Learn with training wheels, remove them after learning completes.
Knowledge Graph-Guided RAG from Princeton proposes using knowledge graphs during training to score reasoning quality—then removing the graph at inference.
The key insight: train with scaffolding, test without it.
My Implementation
The repo implements this for a software troubleshooting domain:
| Component | Details |
| --- | --- |
| Knowledge Graph | ~200 entities, ~600 edges (symptoms, causes, fixes) |
| Training Data | MCQs with 1-3 hop paths |
| Eval Data | MCQs with 4-5 hop paths (harder) |
| Model | SmolLM-135M-Instruct |
| Framework | MLX (Apple Silicon native) |

The Training Pipeline
```
┌─────────────────────────────────────────┐
│ 1. SFT: Learn output format             │
│    TRACE: <reasoning>                   │
│    ANSWER: A|B|C|D                      │
├─────────────────────────────────────────┤
│ 2. RSFT: Rejection Sampling FT          │
│    - Generate multiple answers          │
│    - Score with knowledge graph         │
│    - Keep only correct traces           │
│    - Train on winners                   │
└─────────────────────────────────────────┘
```

The Reward Function
The knowledge graph scores outputs during training:
- R_corr: +1.0 correct answer, -2.0 incorrect
- R_path: Entity coverage (did the trace mention relevant nodes?)
- P_spam: -0.5 penalty for repeating entities (prevents gaming)
At inference, the graph is removed. The model must reason from learned patterns.
Results
| Phase | Accuracy | Training Time |
| --- | --- | --- |
| Base model | 0% | - |
| After SFT | 30% | ~2 min |
| After RSFT | 75% | ~3 min |

The critical finding: distribution matching matters.
Training on easy examples (1-2 hops) hurt performance on hard eval (4-5 hops). Training on examples matching the eval distribution achieved 75%.
Running It
```bash
git clone https://github.com/softwarewrighter/multi-hop-reasoning
cd multi-hop-reasoning

# Setup (Apple Silicon)
make setup-mlx

# Full pipeline
make train
```

Results appear in ~5 minutes on an M-series Mac.
Implementation Details
| Metric | Value |
| --- | --- |
| Primary Language | Python |
| Source Files | 12 `.py` files |
| Estimated Size | ~1.5 KLOC |
| Framework | MLX, Transformers |
| Platform | Apple Silicon (MLX native) |

Good for you if: You want to understand knowledge graph-guided training, experiment with rejection sampling fine-tuning, or see how small models can learn reasoning patterns.
Complexity: Moderate. Clean codebase with Make targets for each step. Requires understanding of fine-tuning concepts.
Key Takeaways
-
Scaffolded training works. Use structured feedback during training, remove it at inference.
-
Distribution matching matters. Train on examples that match your eval distribution.
-
Small models can reason. 135M parameters is enough for 75% accuracy on 4-5 hop questions.
-
MLX makes iteration fast. Full pipeline runs in 5 minutes on a MacBook.
Resources
- Paper: Knowledge Graph-Guided RAG
- Repository: multi-hop-reasoning
- Live Demo
- Video: LLM with Training Wheels
Knowledge graphs as training wheels—helping small models learn to reason, then letting go.
-
765 words • 4 min read • Abstract
Small Models (2/6): AI in Your Pocket

AI on your phone. All day. No internet required.
This is Part 2 of the Small Models, Big Brains series. Today we’re putting a language model in your pocket with Pocket Eliza++—a modern AI therapist that runs completely offline on Android.
Resource Link Paper MobileLLM (ICML 2024) Code pocket-llm Runtime llama.cpp Video AI in Your Pocket 
Why Offline Matters
| Benefit | Description |
| --- | --- |
| Privacy | Data never leaves your device |
| Speed | No network latency |
| Cost | No API fees |
| Offline | Works without internet |
| Battery | Efficient on-device inference |

Cloud AI is convenient, but sometimes you want a conversation that stays on your device.
MobileLLM: Meta’s Edge Champion
MobileLLM is Meta’s sub-500M parameter model optimized specifically for on-device inference.
Architecture Optimizations
| Technique | Benefit |
| --- | --- |
| Deep-thin design | More layers, fewer parameters per layer |
| SwiGLU activation | Better performance than ReLU |
| Embedding sharing | Saves 30% of parameters |
| Grouped-query attention | Faster inference |

The result: a 260MB quantized model (Q4_K_M) that runs smoothly on phones.
Pocket Eliza++

The original ELIZA (1966) used pattern matching to simulate a Rogerian therapist. Pocket Eliza++ uses the same therapeutic approach but with actual language understanding.
Therapeutic Design
The system prompt instructs the model to:
- Ask one short question at a time
- Never repeat questions
- Vary question types (feelings, motivations, specifics)
- Never give advice or explanations
It’s a reflective listener, not a problem solver.
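A sketch of what such a system prompt might look like in code; the exact wording in the repo differs, and this is only a hypothetical rendering of the four constraints above:

```python
# Hypothetical system prompt capturing the four constraints above.
SYSTEM_PROMPT = """You are a reflective listener in the Rogerian style.
Rules:
- Ask exactly one short question per reply.
- Never repeat a question you have already asked.
- Vary question types: feelings, motivations, specifics.
- Never give advice or explanations."""

# A chat-style message list like this would be fed to the on-device model.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "I've been stressed about work."},
]
print(messages[0]["role"])  # system
```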
Technical Stack
```
┌─────────────────────────────────┐
│   Kotlin + Jetpack Compose      │  UI Layer
├─────────────────────────────────┤
│          JNI Bridge             │
├─────────────────────────────────┤
│          llama.cpp              │  Inference Engine
├─────────────────────────────────┤
│   MobileLLM-350M (Q4_K_M)       │  Model (260MB)
└─────────────────────────────────┘
```

- Model: MobileLLM-350M quantized to Q4_K_M (260MB GGUF)
- Runtime: llama.cpp compiled for Android via NDK
- Interface: Kotlin + Jetpack Compose
- Bridge: JNI bindings connect Kotlin to native llama.cpp
Building the App
```bash
# Clone the repository
git clone https://github.com/softwarewrighter/pocket-llm
cd pocket-llm/android-demo

# Clone llama.cpp into native source
git clone https://github.com/ggerganov/llama.cpp.git \
  app/src/main/cpp/llama.cpp

# Download the model (260MB)
mkdir -p app/src/main/assets
curl -L -o app/src/main/assets/MobileLLM-376M-Q4_K_M.gguf \
  "https://huggingface.co/pjh64/MobileLLM-350M-GGUF/resolve/main/MobileLLM-376M-Q4_K_M.gguf"

# Build and install
./gradlew assembleDebug
adb install -r app/build/outputs/apk/debug/app-debug.apk
```

Build Requirements
| Requirement | Value |
| --- | --- |
| Target SDK | 35 (Android 15) |
| Min SDK | 28 (Android 9.0) |
| ABI | arm64-v8a |
| NDK | CMake for native build |
| Kotlin | 2.0.0 |

Quick CLI Demo
Don’t want to build the Android app? Test with Ollama:
```bash
pip install -r requirements.txt
ollama pull smollm:360m
python3 eliza.py
```

Performance
On a mid-range Android phone (Snapdragon 7 series):
- First token: ~500ms
- Generation: ~10 tokens/second
- Memory: ~400MB RAM
- Battery: Minimal impact for short sessions
Implementation Details
| Metric | Value |
| --- | --- |
| Languages | Kotlin (UI), Python (CLI), C++ (JNI) |
| Source Files | 6 `.kt`, 4 `.py`, 2 `.cpp` |
| Estimated Size | ~1.3 KLOC |
| Android Target | SDK 28+ (Android 9.0) |
| Build System | Gradle + CMake (NDK) |
| Key Dependency | llama.cpp (vendored) |

Good for you if: You want to deploy LLMs on Android, learn JNI/NDK integration, or build privacy-focused mobile AI apps.
Complexity: Moderate-High. Requires Android Studio, NDK setup, and understanding of JNI bridges. The llama.cpp integration is the tricky part; the Kotlin UI is straightforward Jetpack Compose.
Key Takeaways
-
Sub-500M models are phone-ready. MobileLLM proves useful AI fits in your pocket.
-
llama.cpp is the universal runtime. Same engine runs on Mac, Linux, Windows, and Android.
-
Privacy doesn’t require sacrifice. Offline AI can still be conversational and helpful.
-
Quantization is essential. Q4_K_M brings 350M parameters down to 260MB with minimal quality loss.
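As a sanity check on that last claim, the effective storage per weight works out to roughly six bits, using only the numbers from this post:

```python
# Effective bits per parameter for a ~350M-param model in a 260MB GGUF file.
# Q4_K_M stores 4-bit weights plus per-block scales and some higher-precision
# tensors, which is plausibly why the effective rate lands above 4 bits.
file_bytes = 260e6
params = 350e6
bits_per_param = file_bytes * 8 / params
print(round(bits_per_param, 1))  # 5.9
```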
What’s Next
Part 3 explores the Hierarchical Reasoning Model (HRM)—a 27M parameter model that beats o3-mini on abstract reasoning.
Resources
- MobileLLM Paper (ICML 2024)
- pocket-llm Repository
- llama.cpp
- MobileLLM GGUF on HuggingFace
- Video: AI in Your Pocket
-
760 words • 4 min read • Abstract
Deepseek Papers (1/3): mHC - Training Stability at Any Depth

Deepseek publishes papers. I implement them. This paper tackles a fundamental transformer problem: training stability in deep networks.
This post covers my implementation of mHC (Manifold-Constrained Hyper-Connections)—running on both Apple Silicon and NVIDIA GPUs.
Resource Link Paper arXiv:2512.24880 Code mHC-poc ELI5 eli5-mHC.md ELI4 eli4-mHC.md Video 1 mHC Demo 
Video 2 mHC Explained 
Video 3 mHC Results 
The Problem: Deep Networks Explode
Residual connections revolutionized deep learning. Skip connections let gradients flow through hundreds of layers. But there’s a catch.
Standard residual connections:
```
output = layer(input) + input
```

This works, but the signal accumulates. With many layers, small amplifications compound into instability.
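A quick back-of-the-envelope shows why depth makes this dangerous: even a modest per-layer gain compounds geometrically. The 10% figure here is an illustrative number, not a measured one:

```python
# If each residual block amplifies the signal by just 10%,
# 48 layers compound that into roughly 97x amplification.
gain_per_layer = 1.1
depth = 48
print(round(gain_per_layer ** depth))  # 97
```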
Hyper-Connections (HC) tried to fix this by learning connection weights:
```
output = α₁ × layer(input) + α₂ × input
```

Better expressiveness, but learned weights can still cause explosion. At 48 layers, HC becomes unstable.
The mHC Solution: Doubly-Stochastic Constraints
mHC constrains the connection weights using Sinkhorn-Knopp iteration—a mathematical technique that ensures weights form a doubly-stochastic matrix.
What does “doubly-stochastic” mean?
- Each row sums to 1
- Each column sums to 1
This bounds the total signal flow. No matter how deep the network, amplification stays controlled.
```python
# Sinkhorn-Knopp iteration (simplified)
def make_doubly_stochastic(weights, iterations=5):
    for _ in range(iterations):
        weights = weights / weights.sum(dim=0, keepdim=True)  # Column normalize
        weights = weights / weights.sum(dim=1, keepdim=True)  # Row normalize
    return weights
```

Results: Stability at Depth
The mHC-poc repo stress-tests this with a depth sweep:
| Depth | Baseline | HC | mHC |
| --- | --- | --- | --- |
| 12 layers | Stable | Stable | Stable |
| 24 layers | Struggling | Stable | Stable |
| 48 layers | Oscillating | Explodes | Stable |

At 48 layers:
- HC gain proxy: 10²⁷ (catastrophic amplification)
- mHC gain proxy: 10⁻⁰·⁶ (bounded, healthy)
HC’s final loss at 48 layers: 5.54 (never learns) mHC’s final loss at 48 layers: 0.0002 (perfect convergence)
Cross-Platform Validation
The implementation runs on both Apple Silicon (MLX) and NVIDIA (PyTorch/CUDA):
| Metric | MLX (Apple) | CUDA (NVIDIA) |
| --- | --- | --- |
| Gain Proxy (24L) | -0.6 | -0.602 |
| Gradient Stability | Stable | Stable |
| NaN Events | 0 | 0 |

Identical results confirm the Sinkhorn-Knopp projection works correctly on both platforms.
Running the mHC Demo
```bash
git clone https://github.com/softwarewrighter/mHC-poc
cd mHC-poc

# Apple Silicon (MLX)
uv venv && source .venv/bin/activate
uv pip install -r mlx/requirements.txt
bash scripts/run_depth_sweep.sh

# NVIDIA (CUDA)
cd cuda
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
bash scripts/run_cuda_depth_sweep.sh
```

Results go to `runs/` with plots showing loss, gradient norms, and gain proxy across depths.

Implementation Details
| Metric | Value |
| --- | --- |
| Primary Language | Python |
| Source Files | 29 `.py`, 3 `.sh`, 10 `.yaml` |
| Estimated Size | ~2.5 KLOC |
| Frameworks | MLX, PyTorch |
| Platforms | Apple Silicon, NVIDIA CUDA |
| Key Features | Depth sweep, cross-platform validation, visualization |

Good for you if: You want to understand mHC's stability benefits, compare MLX vs PyTorch implementations, or experiment with residual connection variants.
Complexity: Moderate. Well-documented with ELI5 explanations in `docs/`. Requires understanding of residual connections and matrix constraints.

Key Takeaways
-
mHC solves deep network instability. Doubly-stochastic constraints bound signal amplification at any depth.
-
Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, validated to produce identical results.
-
Deepseek publishes useful research. Their papers address real problems with practical solutions.
What’s Next
Part 2 covers Engram—Deepseek’s approach to reducing redundant computation through conditional memory.
Resources
Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.
-
703 words • 4 min read • Abstract
Small Models (1/6): 976 Parameters Beat Billions

The best large language models score zero on hard mazes. A model with under 1,000 parameters scores 85 percent.
This is Part 1 of the Small Models, Big Brains series, exploring how tiny models with clever architectures outperform massive ones on specific tasks.
Resource Link Paper Tiny Recursive Model Code train-trm Video 976 parameters is more than billions?! 
Why LLMs Fail at Mazes
Large language models generate one token at a time. They cannot backtrack. One wrong move and the entire solution fails.
Maze solving requires:
- Exploring dead ends
- Backtracking when stuck
- Maintaining spatial awareness
- Planning multiple steps ahead
Autoregressive generation is fundamentally incompatible with these requirements.
Meet TRM: The Tiny Recursive Model
The Tiny Recursive Model uses under 1,000 parameters. Instead of being bigger, it thinks in loops.
```
Input → Think → Act → Think → Act → ... → Output
```

A simple two-layer network that iterates until the solution emerges.
The Architecture
TRM alternates between two phases:
| Phase | Purpose |
| --- | --- |
| Think | Update internal latent state by processing input, current answer, and previous state |
| Act | Update the answer based on the refined latent state |

This process repeats for multiple cycles, progressively improving the output.
```rust
TRMConfig {
    input_dim: 5,
    output_dim: 5,
    hidden_dim: 16,
    latent_dim: 16,
    l_layers: 2,  // Network depth
    h_cycles: 3,  // Outer think-act cycles
    l_cycles: 4,  // Inner think cycles
}
```

The Secret: Deep Supervision
The key insight isn’t just recursion—it’s supervising every step, not just the final answer.
Traditional training:
Traditional training:

```
Input → [black box] → Final Output → Loss
```

TRM training:

```
Input → Step 1 → Loss₁ → Step 2 → Loss₂ → Step 3 → Loss₃ → ... → Final → Lossₙ
```

Every iteration gets feedback. The model learns to make progress at each step.
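Deep supervision can be sketched as summing a loss over every think-act step instead of only the last one. The `model_step` function and the quadratic loss below are illustrative stand-ins, not TRM's actual Rust implementation:

```python
# Deep supervision sketch: accumulate loss at every refinement step.
# model_step and the quadratic loss are illustrative stand-ins.

def model_step(answer, target):
    """One think-act cycle: move the answer halfway toward the target."""
    return answer + 0.5 * (target - answer)

def deep_supervised_loss(answer, target, steps=3):
    total = 0.0
    for _ in range(steps):
        answer = model_step(answer, target)
        total += (answer - target) ** 2  # loss at EVERY step, not just the last
    return total, answer

loss, final = deep_supervised_loss(0.0, 1.0)
print(round(loss, 4), round(final, 3))  # 0.3281 0.875
```

Because every intermediate answer is penalized, the gradient signal rewards steady progress rather than only the end state.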
Results
| Model | Maze Accuracy |
| --- | --- |
| GPT-4 | ~0% on hard mazes |
| Claude | ~0% on hard mazes |
| TRM (976 params) | 85% |

Iteration beats scale.
Running the Code
The train-trm repo provides a complete Rust implementation:
```bash
# Clone and build
git clone https://github.com/softwarewrighter/train-trm
cd train-trm
./scripts/build.sh --release

# Train a model
./scripts/train.sh --epochs 1000 --lr 0.01

# Evaluate
./scripts/eval.sh

# Or launch the web UI
cargo install --locked trunk
./scripts/web-serve.sh
```

The web UI includes interactive maze visualization with solution paths and real-time training charts.
Implementation Details
| Metric | Value |
| --- | --- |
| Primary Language | Rust |
| Source Files | 21 `.rs` files |
| Estimated Size | ~2.5 KLOC |
| Also Includes | HTML (web UI), Shell scripts |
| Build System | Cargo + Trunk (WASM) |
| Dependencies | ndarray, serde, clap, wasm-bindgen |

Good for you if: You want to learn Rust ML from scratch, experiment with recursive architectures, or need a web-based training visualization.
Complexity: Moderate. Clean Rust code with good documentation. The neural network is implemented from scratch (no PyTorch/TensorFlow), making it educational but requiring Rust familiarity.
Key Takeaways
-
Parameter count isn’t everything. Architecture and training strategy matter more for certain tasks.
-
Recursion enables backtracking. By iterating, TRM can explore and refine solutions.
-
Deep supervision accelerates learning. Feedback at every step, not just the end.
-
Task-specific models excel. TRM isn’t a general-purpose LLM—it’s optimized for maze-like reasoning.
What’s Next
Part 2 explores MobileLLM and running AI completely offline on your Android phone.
Resources
-
1013 words • 6 min read • Abstract
Welcome to Software Wrighter Lab

Welcome to Software Wrighter Lab—a blog, YouTube channel, Discord server, and GitHub repos for exploring the intersection of AI coding agents, systems programming, and practical machine learning.
I’m Mike Wright, a software engineer with over four decades of experience, currently focused on AI-assisted development with Rust and WebAssembly.
Quick Links:
- YouTube: @SoftwareWrighter
- GitHub: softwarewrighter
- Discord: SW Lab

Contents:
- About Me
- Programming Languages
- What This Blog Covers
- Why “Software Wrighter”?
- What to Expect
- Current Projects
- Technology Stack
- Get Involved
- What’s Next
About Me
I’ve been writing code professionally for over 35 years—an Emacs user since 1989, still going strong.
My background spans mainframes to startups:
- IBM Data Processing Division - MVS Dynamic Reconfiguration and Standalone Dump (SADUMP)
- IBM T.J. Watson Research - Advisory Programmer on MVS Batch Pipes, Automatic Restart Manager, Java Record I/O, and IMS Data Sharing
- Forte Software / Sun Microsystems - Senior Programmer on Forte 4GL/Conductor/Fusion, Open Enterprise Service Bus, and Glassfish
- Startups - Individual contributor and management roles including LogiCoy (Open ESB), Likestream (Facebook Clojure App), Guidewire (Platform), Illumio (Network Security Web UI), and Signifyd (Gradle/Docker performance tuning)
Areas I’ve worked in: mainframe O/S development, EAI/B2B middleware, platform engineering, build/release engineering, and embedded programming.
Programming Languages
Over the years, I’ve written production code in:
| Era | Languages |
| --- | --- |
| Mainframe | APL, Assembler (S/370, S/390), IBM PL/S, PL/AS, PL/X, CMS/TSO Pipelines |
| Systems | C, C++ |
| Enterprise | Java, Forte 4GL, Guidewire Gosu, Groovy |
| Web/Modern | JavaScript, TypeScript, Go, Clojure, ClojureScript |
| Current | Elisp, JavaScript, Kotlin, Python, Rust, WebAssembly |

Each language taught me something different about how to think about problems. APL taught me array thinking. Assembler taught me what the machine is actually doing. CMS/TSO Pipelines taught me dataflow composition (an area I plan to revisit in Throwback Thursday posts). Lisp (via Clojure) taught me functional composition. Rust is teaching me ownership and fearless concurrency.
I’m a lifelong learner. When Rust emerged as a modern systems language, I dove in. When AI coding agents became capable enough to be genuine collaborators, I started exploring how they change the way we build software.
This blog and the accompanying YouTube channel document that exploration.
What This Blog Covers
Software Wrighter Lab focuses on three main areas:
1. AI Coding Agents
How do tools like Claude Code, Cursor, and other AI assistants actually perform on real projects? I build the same applications with different agents to compare their strengths and weaknesses.
- Vibe coding comparisons (Claude vs GLM, different models)
- Practical workflows (parallel coding with git worktrees, hooks, custom scripts)
- Tool development (guardian-cli, proact, ralphy)
2. Machine Learning Research Implementation
When interesting ML papers come out, I implement them to understand how they work. The goal isn’t to compete with research labs—it’s to learn by building.
Recent implementations include:
- Tiny Recursive Model (TRM) - Under 1,000 parameters solving mazes
- Hierarchical Reasoning Model (HRM) - Planner-Doer architecture for abstract reasoning
- MobileLLM - Running LLMs offline on Android
- Deepseek papers (mHC, Engram) - Novel architectures for efficient inference
- MIT’s Recursive Language Model - Implemented in Rust with WASM
3. Rust, WebAssembly, and Practical Tools
Rust is my language of choice for new projects. Combined with WebAssembly, it enables building tools that run anywhere—CLI, browser, or embedded.
Topics include:
- Rust/Yew/WASM web applications
- Visualization (Three.js, d3.js, pure CSS approaches)
- Video production tools (TTS, lip sync, explainer generation)
- Developer utilities (installation scripts, repo assistants, modularizers)
Why “Software Wrighter”?
A “wright” is a craftsperson—someone who builds things. A wheelwright builds wheels. A playwright builds plays.
A Software Wrighter builds software, with attention to craft.
The name reflects my belief that good software comes from treating programming as a craft: learning continuously, choosing tools deliberately, and building things that work well and last.
What to Expect
Posts on this blog will typically include:
- Links to papers, repos, and videos (above the fold)
- Implementation details (language, LOC, complexity assessment)
- Working code you can clone and run
- Honest assessments of what works and what doesn’t
I’m not trying to sell you anything. This is a lab notebook—a record of experiments, some successful, some not.
Current Projects
As of February 2026, I’m actively working on:
| Project | Description | Status |
| --- | --- | --- |
| Small Models, Big Brains | 6-part series on efficient LLMs | Publishing |
| Deepseek papers | mHC and Engram implementations | In progress |
| Explainer pipeline | AI-generated video production | Ongoing |
| RLM implementations | Recursive Language Models in Rust | Complete |

Technology Stack
Most of my current work uses:
| Layer | Technology |
| --- | --- |
| Systems | Rust |
| Web | Yew, WASM, TypeScript |
| ML | Python, PyTorch, HuggingFace |
| AI Agents | Claude Code, Cursor |
| Video | OBS, FFmpeg, TTS tools |

Get Involved
If any of this resonates with you:
- Subscribe to the YouTube channel for video content
- Star repos on GitHub that interest you
- Join the Discord server to discuss
I’m always interested in discussing these topics with other engineers exploring similar territory.
What’s Next
The first content series, Small Models, Big Brains, starts tomorrow. It’s a 6-part deep dive into small language models that outperform much larger ones on specific tasks:
- TRM: 976 parameters beating GPT-4 on mazes
- MobileLLM: AI running offline on your phone
- HRM: 27M parameters beating o3-mini on abstract reasoning
- BDH: A language model with visible, interpretable activations
- Billion-parameter models: The efficiency sweet spot
- The 2-3B efficient frontier: Phi-2, Gemma, SmolLM
Each post maps to a YouTube video, a GitHub repo, and working code you can run yourself.
Thanks for reading. Let’s build something interesting.
Mike Wright Software Wrighter LLC San Francisco Bay Area
-
1138 words • 6 min read • Abstract
TBT (1/?): My First Program Was a Horse Race

My first program was a horse race. Written in APL. On a mainframe. In 1972.
This is the first Throwback Thursday post—a series where I revisit the technologies, languages, and ideas that shaped how I think about programming.
Resource Link Code apl-horse-race Demo Live Demo GNU APL gnu.org/software/apl Video Greek Code, No Lowercase #TBT 
APL: A Programming Language
APL was created by Kenneth Iverson at IBM in the 1960s. The name literally means “A Programming Language”—Iverson was a mathematician who designed it as a notation for describing algorithms before it became an actual programming language.
What made APL special:
Feature Description Array-oriented Operations work on entire arrays, not single values Symbolic notation Greek letters and mathematical symbols as operators Interactive REPL-style development decades before it was common Terse Complex operations in a few characters APL programs look like nothing else:
```apl
POS←POS+?5⍴3
```

This single line adds random values (1-3) to all five horse positions simultaneously. No loops. No iteration. The operation just happens across the entire array.
The IBM 2741 Experience
In 1972, APL\360 ran on IBM mainframes. You accessed it through terminals like the IBM 2741—essentially a modified Selectric typewriter with a special APL typeball.
*APL typeball for IBM Selectric*

The typeball had all the APL glyphs:
```
⍴ ⍳ ∇ ⎕ ← ⌈ ⌊ ⍵ ⍺ ∘ ⊃ ⊂
```

and dozens more. You physically typed these symbols. The keyboard layout was completely different from anything you'd seen before.

When you made an error, there was no backspace in the modern sense. You'd overstrike characters or start the line over. Programs were stored in workspaces, saved to tape or disk.
The terminal printed on paper. Every interaction left a physical record.
The Horse Race Program
Horse race simulations were popular APL demonstrations. They showed off several things:
- Random number generation (the `?` roll operator)
- Character graphics (crude but effective visualization)
- Interactive output (watching the race unfold)
Here’s the verbose version from the repo:
```apl
∇ RACE;HORSES;POS;FINISH;ROUND;_
  HORSES←'LUCKY ' 'THUNDER' 'SHADOW ' 'COMET ' 'BLAZE '
  POS←5⍴0
  FINISH←15
  ROUND←0
  ⎕←'══════════════════════════════════════════'
  ⎕←' THE RACE IS ON!'
  ⎕←'══════════════════════════════════════════'
 LOOP:ROUND←ROUND+1
  ⎕←'--- ROUND ',(⍕ROUND),' ---'
  POS←POS+?5⍴3
  SHOWHORSES
  →DONE×⍳∨/POS≥FINISH
  →LOOP
 DONE:⎕←'WINNER: ',((⊃(POS=⌈/POS)/⍳5)⊃HORSES),'!'
∇
```

Key APL Idioms
Array creation:

```apl
POS←5⍴0   ⍝ Create array of 5 zeros
```

The `⍴` (rho) operator reshapes. `5⍴0` means "reshape 0 into a 5-element array."

Random numbers:

```apl
?5⍴3   ⍝ Roll 5 dice, each 1-3
```

The `?` operator is "roll"—like rolling dice. `?5⍴3` rolls five 3-sided dice.

Finding the winner:
```apl
(⊃(POS=⌈/POS)/⍳5)⊃HORSES
```

This reads right-to-left:
- `⌈/POS` — maximum of all positions
- `POS=⌈/POS` — boolean mask: which horses are at max?
- `/⍳5` — compress: keep only those indices
- `⊃` — take the first one
- `⊃HORSES` — select that horse's name
One line. No loops. Pure array thinking.
The Idiomatic Version
APL programmers pride themselves on terseness. The idiomatic version does the same thing in fewer characters:
```apl
HORSES←'LUCKY ' 'THUNDER' 'SHADOW ' 'COMET ' 'BLAZE '

∇ SHOW;I
  I←1
 N:⎕←(I⊃HORSES),'│',((I⊃POS)⍴'░'),'▓'
  I←I+1
  →N×⍳I≤5
∇

∇ RACE;POS;_
  POS←5⍴0
  ⎕←'THE RACE IS ON!'
 L:_←⎕DL 0.3
  POS←POS+?5⍴3
  SHOW
  ⎕←''
  →L×⍳~∨/POS≥15
  ⎕←'WINNER: ',(⊃(POS=⌈/POS)/⍳5)⊃HORSES
∇
```

The entire program fits on a single screen. This was the APL aesthetic: powerful ideas expressed concisely.
Running It Today
GNU APL implements ISO 13751 (Extended APL) and runs on modern systems:
```sh
# macOS
brew install gnu-apl

# Arch Linux
yay -S gnu-apl

# Run the race
git clone https://github.com/sw-comp-history/apl-horse-race
cd apl-horse-race
apl -f src/race.apl
```

Sample output:
```
══════════════════════════════════════════
 THE RACE IS ON!
══════════════════════════════════════════
--- ROUND 1 ---
LUCKY  │▓▓▓◄
THUNDER│▓▓◄
SHADOW │▓◄
COMET  │▓▓▓◄
BLAZE  │▓▓◄
```

The horses advance randomly until one crosses the finish line.
What APL Taught Me
APL shaped how I think about programming in ways that persist today:
1. Think in arrays, not loops.
When I see a problem, I ask: can this be expressed as an operation on a whole collection? Languages like NumPy, R, and Julia carry this forward.
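A tiny NumPy sketch of that habit, with made-up data, showing the same computation loop-style and array-style:

```python
import numpy as np

prices = np.array([10.0, 12.5, 9.0, 11.0])  # hypothetical data

# Loop style: one element at a time.
discounted_loop = []
for p in prices:
    discounted_loop.append(p * 0.9)

# Array style (the APL-shaped habit): one operation on the whole collection.
discounted = prices * 0.9

assert np.allclose(discounted, discounted_loop)
```

The array version isn't just shorter; the intent ("scale everything by 0.9") is visible at a glance, which is exactly the point the APL folks were making.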
2. Notation matters.
Good notation can make complex ideas simple. Bad notation obscures them. APL’s symbols were controversial, but they made array operations visible in ways that verbose syntax doesn’t.
3. The REPL is powerful.
Interactive development—type an expression, see the result immediately—was central to APL decades before it became fashionable again with Jupyter notebooks and modern REPLs.
4. Terseness has value.
Not obfuscation for its own sake, but the ability to see an entire algorithm at once. When your program fits on one screen, you can reason about the whole thing.
APL’s Legacy
APL influenced many languages:
| Language | Year | APL Influence |
| --- | --- | --- |
| J | 1990 | Iverson's ASCII-only redesign |
| K/Q | 1993 | Powers financial systems at Kx |
| A+ | 1988 | Morgan Stanley's open-source APL |
| BQN | 2020 | Modern APL with cleaner semantics |
| NumPy | 2006 | Array operations in Python |
| R | 1993 | Vector operations for statistics |

The ideas live on, even if the glyphs don't.
Implementation Details
| Metric | Value |
| --- | --- |
| Primary Language | APL |
| Source Files | 2 `.apl` files |
| Lines of Code | ~50 lines total |
| Runtime | GNU APL |
| Also Includes | Documentation, PNG samples for Unicode issues |

Good for you if: You want to understand array programming origins, learn basic APL, or experience what programming felt like in the 1970s.
Complexity: Low. The program is intentionally simple—a teaching example, not production code. The repo includes extensive documentation explaining every line.
Why Throwback Thursday?
Programming didn’t start with Python and JavaScript. Every abstraction we use today was invented by someone solving a real problem.
TBT posts will explore:
- Languages that shaped my thinking (APL, Lisp, Forth)
- Technologies that were ahead of their time (CMS/TSO Pipelines, dataflow)
- Ideas worth revisiting with modern tools
Understanding where we came from helps us see where we’re going.
Resources
- apl-horse-race Repository
- GNU APL
- APL Wiki
- Iverson’s Turing Award Lecture: “Notation as a Tool of Thought”
- Video: Greek Code, No Lowercase #TBT
Have your own “first program” story? Find me on YouTube @SoftwareWrighter.





