• 457 words3 min readAbstract

    Five ML Concepts - #29

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #29
    Video

    References

    Concept Reference
    Neural Collapse Prevalence of Neural Collapse (Papyan et al. 2020)
    Grokking Grokking: Generalization Beyond Overfitting (Power et al. 2022)
    SAM Sharpness-Aware Minimization (Foret et al. 2021)
    Mechanistic Interpretability Transformer Circuits (Anthropic 2021)
    Self-Training Instability Understanding Self-Training (Wei et al. 2020)

    Today’s Five

    1. Neural Collapse

    In overparameterized networks trained to zero loss, class representations converge late in training to a symmetric, maximally separated structure. The last-layer features and classifiers align into a simplex equiangular tight frame.

    This geometric phenomenon appears universally across architectures.

    Like students settling into evenly spaced seats by the end of class.

    2. Grokking

    In some tasks, especially small algorithmic ones, models memorize quickly but only later suddenly generalize. The jump from memorization to understanding can happen long after training loss reaches zero.

    Weight decay and longer training appear necessary for this phase transition.

    Like cramming facts for an exam, then later realizing you truly understand.

    3. SAM (Sharpness-Aware Minimization)

    Instead of minimizing loss at a single point, SAM minimizes loss under small weight perturbations, finding flatter regions. Flatter minima tend to generalize better than sharp ones.

    The optimizer seeks robustness to parameter noise.

    Like choosing a wide hilltop instead of balancing on a sharp peak.

    4. Mechanistic Interpretability

    Researchers analyze activations and internal circuits to understand how specific computations are implemented inside models. The goal is reverse-engineering neural networks into understandable components.

    This reveals attention heads, induction heads, and other interpretable patterns.

    Like mapping the wiring of an unknown machine to see how it works.

    5. Self-Training Instability

    When models train on their own generated data, feedback loops can amplify small errors over time. Each iteration compounds mistakes, causing distributional drift.

    Careful filtering and external grounding help mitigate this.

    Like copying a copy repeatedly until the meaning drifts.

    Quick Reference

    Concept One-liner
    Neural Collapse Late-stage geometric convergence of class representations
    Grokking Sudden generalization after prolonged memorization
    SAM Optimizing for flat loss regions under perturbations
    Mechanistic Interpretability Analyzing internal circuits of neural networks
    Self-Training Instability Feedback loops that amplify errors in self-generated data

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 443 words3 min readAbstract

    Five ML Concepts - #28

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #28
    Video

    References

    Concept Reference
    Lottery Ticket Hypothesis The Lottery Ticket Hypothesis (Frankle & Carlin 2019)
    Sparse Activation Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
    Conditional Computation Sparsely-Gated MoE + Switch Transformers
    Inference Parallelism Megatron-LM (Shoeybi et al. 2019)
    Compute Optimality Chinchilla Scaling Laws (Hoffmann et al. 2022)

    Today’s Five

    1. Lottery Ticket Hypothesis

    Large neural networks contain smaller subnetworks that, when trained from the right initialization, achieve similar performance. These “winning tickets” exist before training begins.

    The key insight: you can find and train just the winning subnetwork.

    Like finding a winning lottery ticket hidden among many losing ones.

    2. Sparse Activation

    Only a subset of neurons activate for each input, even in models with many parameters. This allows large capacity without using everything at once.

    Mixture-of-experts architectures explicitly design for this pattern.

    Like a library where only relevant books light up for each query.

    3. Conditional Computation

    The model dynamically activates only certain components depending on the input. Different inputs route to different experts or pathways.

    This improves efficiency and scalability without proportional compute increase.

    Like routing patients to the right specialist instead of seeing every doctor.

    4. Inference Parallelism

    Model execution can be split across multiple devices to reduce latency or increase throughput. Tensor parallelism splits layers; pipeline parallelism splits stages.

    Essential for serving large models in production.

    Like dividing a puzzle so multiple people work on it simultaneously.

    5. Compute Optimality Hypothesis

    Empirical scaling laws suggest performance improves when model size, data, and compute are balanced. Adding only one resource may not yield optimal gains.

    Chinchilla showed many models were undertrained relative to their size.

    Like baking a cake where proportions matter more than just adding extra ingredients.

    Quick Reference

    Concept One-liner
    Lottery Ticket Hypothesis Small winning subnetworks hidden in large models
    Sparse Activation Using only part of a model per input
    Conditional Computation Dynamically routing inputs for efficiency
    Inference Parallelism Distributing inference across devices
    Compute Optimality Balancing model size, data, and compute

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 894 words5 min readAbstract

    How AI Learns Part 7: Designing a Continuous Learning Agent

    A robust continuous learning agent contains:

    • Core model (rarely updated)
    • Adapters (modular skills)
    • External memory (facts)
    • Context manager (Recursive Language Model (RLM)-style)
    • Logging & evaluation loop
    Resource Link
    Related RLM | Engram | Sleepy Coder

    The Layered Architecture

    Four-layer architecture showing Context Manager, External Memory, Adapters, and Core Weights with feedback and evaluation loops
    Continuous learning is layered coordination.

    Layer by Layer

    Layer 4: Core Weights (Bottom)

    The foundation. Trained once, changed rarely.

    Aspect Details
    Contains General reasoning, language, base knowledge
    Update frequency Months or never
    Update method Full fine-tune or major consolidation
    Risk of change High (forgetting, capability shifts)

    Rule: Don’t touch this unless you have a very good reason.

    Layer 3: Adapters (Parameter-Efficient Fine-Tuning (PEFT) / Low-Rank Adaptation (LoRA))

    Modular skills that plug into the base.

    Aspect Details
    Contains Task-specific capabilities
    Update frequency Weekly to monthly
    Update method Lightweight PEFT training
    Risk of change Medium (isolated, but validate)

    Rule: Train adapters for validated, recurring patterns. Version them. Enable rollback.

    Layer 2: External Memory

    Facts, experiences, and retrieved knowledge.

    Aspect Details
    Contains Documents, logs, structured data
    Update frequency Continuous
    Update method Database writes
    Risk of change Low (doesn’t affect weights)

    Rule: Store experiences here first. Memory is cheap and safe.

    Layer 1: Context Manager (Top)

    The RLM-style interface that rebuilds focus each step.

    Aspect Details
    Contains Current context, retrieved data, active state
    Update frequency Per call
    Update method Reconstruction from memory + query
    Risk of change None (ephemeral)

    Rule: Don’t drag context forward. Rebuild it.

    The Feedback Loop

    Logging

    Capture everything the agent does:

    • Prompts received
    • Actions taken
    • Tool calls made
    • Errors encountered
    • User signals

    This is your training data.

    Evaluation

    Before any update reaches production:

    Check Purpose
    Retention tests Did old skills degrade?
    Forward transfer Did new skills improve?
    Regression suite Known failure cases
    Safety checks Harmful outputs?

    Without evaluation, you’re updating blind.

    Deployment

    Updates should be:

    • Modular: Can isolate and rollback
    • Versioned: Know what changed when
    • Staged: Test before full rollout
    • Monitored: Track post-deployment metrics

    The Error Flow

    Where do errors go?

    Error occurs
        ↓
    Log it (immediate)
        ↓
    Store in memory (same day)
        ↓
    Pattern emerges over multiple occurrences
        ↓
    Train adapter update (weekly/monthly)
        ↓
    Validate update (before deployment)
        ↓
    Deploy with rollback capability
    

    Errors feed into memory first. Only validated, recurring improvements reach adapters. Core weights almost never change.

    What This Architecture Achieves

    Problem Solution
    Catastrophic forgetting Core weights frozen; adapters isolated
    Context rot RLM rebuilds focus each step
    Hallucination Memory grounds responses
    Slow adaptation Memory updates continuously
    Unsafe changes Evaluation before deployment

    Design Principles

    1. Separate Storage from Reasoning

    Facts belong in memory. Reasoning belongs in weights. Don’t blur them.

    2. Separate Speed from Permanence

    Fast learning (memory) is temporary. Slow learning (weights) is permanent. Match the update speed to the desired permanence.

    3. Evaluate Before Consolidating

    Every update to adapters or weights must be validated. Regressions are silent killers.

    4. Enable Rollback

    Version everything. If an update causes problems, you must be able to undo it.

    5. Log Everything

    You cannot improve what you cannot measure. Structured logging is the foundation of continuous learning.

    The Big Picture

    AI does not learn in one place.

    It learns in layers:

    • Permanent (weights)
    • Modular (adapters)
    • External (memory)
    • Temporary (context)

    Continuous learning is not constant weight updates.

    It is careful coordination across time scales.

    Continuous learning systems don’t constantly retrain. They carefully consolidate what works.

    References

    Concept Paper
    LoRA LoRA: Low-Rank Adaptation (Hu et al. 2021)
    RAG Retrieval-Augmented Generation (Lewis et al. 2020)
    RLM Recursive Language Models (Zhou et al. 2024)
    Share Shared LoRA Subspaces (2025)
    Engram Engram: Conditional Memory (DeepSeek 2025)

    Series Summary

    Part Key Insight
    1. Time Scales Learning happens at different layers and speeds
    2. Forgetting vs Rot Different failures need different fixes
    3. Weight-Based Change the brain carefully
    4. Memory-Based Store facts outside the brain
    5. Context & RLM Rebuild focus instead of dragging baggage
    6. Continuous Learning Learn in memory, consolidate in weights
    7. Full Architecture Layered coordination enables safe improvement

    Continuous learning is layered coordination.

  • 419 words3 min readAbstract

    Five ML Concepts - #27

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #27
    Video

    References

    Concept Reference
    Elastic Weight Consolidation Overcoming catastrophic forgetting (Kirkpatrick et al. 2017)
    Replay Buffers Experience Replay for Continual Learning (Rolnick et al. 2019)
    Parameter Routing Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
    Memory-Augmented Networks Neural Turing Machines (Graves et al. 2014)
    Model Editing Editing Large Language Models (Yao et al. 2023)

    Today’s Five

    1. Elastic Weight Consolidation

    Adding a penalty that discourages changing parameters important to previous tasks. Importance is estimated using Fisher information from prior training.

    This helps models learn new tasks without catastrophic forgetting.

    Like protecting well-worn neural pathways while building new ones.

    2. Replay Buffers

    Storing examples from earlier tasks and mixing them into new training. Past data is replayed alongside current examples during optimization.

    This reinforces previous knowledge while learning new data.

    Like reviewing old flashcards while studying new material.

    3. Parameter Routing

    Activating different subsets of model parameters depending on the task or input. Mixture-of-experts and conditional computation route inputs to specialized weights.

    Enables specialization without fully separate models.

    Like having different experts handle different questions.

    4. Memory-Augmented Networks

    Adding external memory modules that neural networks can read from and write to. The model learns to store and retrieve information during inference.

    Extends beyond purely weight-based memory to explicit storage.

    Like giving a calculator access to a notepad.

    5. Model Editing

    Targeted weight updates to modify specific behaviors without full retraining. Locate and adjust the parameters responsible for particular facts or behaviors.

    Allows fast corrections and knowledge updates post-training.

    Like editing a specific entry in an encyclopedia instead of rewriting the whole book.

    Quick Reference

    Concept One-liner
    Elastic Weight Consolidation Protecting important parameters during new learning
    Replay Buffers Mixing past examples to prevent forgetting
    Parameter Routing Activating task-specific parameter subsets
    Memory-Augmented Networks External memory modules for neural networks
    Model Editing Targeted weight updates without full retraining

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 691 words4 min readAbstract

    How AI Learns Part 6: Toward Continuous Learning

    Continuous learning aims to:

    • Learn new skills
    • Retain old skills
    • Avoid retraining from scratch
    • Avoid catastrophic forgetting
    Resource Link
    Related Sleepy Coder Part 1 | Sleepy Coder Part 2

    The Continuous Learning Loop

    Flow diagram showing Agent to Logs to Evaluate to Cluster to Train to Validate to Deploy cycle, with Memory branch
    Periodic consolidation, not constant updates.

    The Core Tradeoff

    Goal Description
    Plasticity Learn new things quickly
    Stability Retain old things reliably

    You cannot maximize both simultaneously. The art is in the balance.

    Approaches to Continuous Learning

    1. Replay-Based Methods

    Keep (or synthesize) some old data. Periodically retrain on old + new.

    How it works:

    • Store representative examples from each task
    • Mix old data into new training batches
    • Periodically consolidate

    Recent work: FOREVER adapts replay timing using “model-centric time” (based on optimizer update magnitude) rather than fixed training steps.

    Pros Cons
    Strong retention Storage costs
    Conceptually simple Privacy concerns
    Well-understood Data governance complexity

    2. Replay-Free Regularization

    Constrain weight updates to avoid interference, without storing old data.

    Efficient Lifelong Learning Algorithm (ELLA) (Jan 2026): Regularizes updates using subspace de-correlation. Reduces interference while allowing transfer.

    Share (Feb 2026): Maintains a single evolving shared low-rank subspace. Integrates new tasks without storing many adapters.

    Pros Cons
    No replay needed Still active research
    Privacy-friendly Evaluation complexity
    Constant memory Subtle failure modes

    3. Modular Adapters

    Keep base model frozen. Train task-specific adapters. Merge or switch as needed.

    Evolution:

    1. Low-Rank Adaptation (LoRA): Individual adapters per task
    2. Shared LoRA spaces: Adapters share subspace
    3. Adapter banks: Library of skills to compose
    Pros Cons
    Modular, versioned Adapter proliferation
    Low forgetting risk Routing complexity
    Easy rollback Composition challenges

    4. Memory-First Learning

    Store experiences in external memory. Only consolidate to weights what’s proven stable.

    Pattern:

    • New information → Memory (fast)
    • Validated patterns → Adapters (slow)
    • Fundamental capabilities → Weights (rare)

    This separates the speed of learning from the permanence of changes.

    The Practical Loop

    A working continuous learning system:

    1. Run agent (with Recursive Language Model (RLM) context management)
    2. Collect traces: prompts, tool calls, outcomes, failures
    3. Score outcomes: tests, static analysis, user signals
    4. Cluster recurring failure patterns
    5. Train lightweight updates (LoRA/adapters)
    6. Validate retention (did old skills degrade?)
    7. Deploy modular update (with rollback capability)
    

    This is not real-time learning. It’s periodic consolidation.

    Human analogy: Sleep. Process experiences, consolidate important patterns, prune noise.

    Time Scales of Update

    Frequency What Changes Method
    Every query Nothing (inference only) -
    Per session Memory Retrieval-Augmented Generation (RAG)/Engram
    Daily Adapters (maybe) Lightweight Parameter-Efficient Fine-Tuning (PEFT)
    Weekly Validated adapters Reviewed updates
    Monthly Core weights Major consolidation

    Most systems should:

    • Update memory frequently
    • Update adapters occasionally
    • Update core weights rarely

    Evaluation Is Critical

    Continuous learning without continuous evaluation is dangerous.

    Required:

    • Retention tests (what got worse?)
    • Forward transfer tests (what got better?)
    • Regression detection
    • Rollback capability

    Without these, you’re flying blind.

    References

    Concept Paper
    ELLA Subspace Learning for Lifelong ML (2024)
    Share Shared LoRA Subspaces (2025)
    FOREVER Model-Centric Replay (2024)
    EWC Overcoming Catastrophic Forgetting (Kirkpatrick et al. 2017)

    Coming Next

    In Part 7, we’ll put it all together: designing a practical continuous learning agent with layered architecture, logging, feedback loops, and safety.


    Learn often in memory. Consolidate carefully in weights.

  • 424 words3 min readAbstract

    Five ML Concepts - #26

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #26
    Video

    References

    Concept Reference
    Data Augmentation A survey on Image Data Augmentation (Shorten & Khoshgoftaar 2019)
    Caching Strategies Systems engineering practice (no canonical paper)
    Constitutional AI Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022)
    Goodhart’s Law Goodhart’s Law and Machine Learning (Sevilla et al. 2022)
    Manifold Hypothesis An Introduction to Variational Autoencoders (Kingma & Welling 2019)

    Today’s Five

    1. Data Augmentation

    Creating additional training examples using label-preserving transformations. Rotate, flip, crop, or color-shift images without changing what they represent.

    Effectively increases dataset size and improves generalization.

    Like practicing piano pieces at different tempos to build flexibility.

    2. Caching Strategies

    Storing previous computation results to reduce repeated work and latency. Cache embeddings, KV states, or frequently requested outputs.

    Essential for production inference at scale.

    Like keeping frequently used books on your desk instead of the library.

    3. Constitutional AI

    Training models to follow explicit written principles alongside other alignment methods. The constitution provides clear rules for behavior.

    Models critique and revise their own outputs against these principles.

    Like giving someone written house rules instead of vague instructions.

    4. Goodhart’s Law

    When a measure becomes a target, it can stop being a good measure. Optimizing for a proxy metric can diverge from the true objective.

    A core challenge in reward modeling and evaluation design.

    Like studying only for the test instead of learning the subject.

    5. Manifold Hypothesis

    The idea that real-world data lies on lower-dimensional structures within high-dimensional space. Images of faces don’t fill all possible pixel combinations.

    This structure is what representation learning exploits.

    Like faces varying along a few key features instead of every pixel independently.

    Quick Reference

    Concept One-liner
    Data Augmentation Expanding training data with transformations
    Caching Strategies Reducing latency by reusing computation
    Constitutional AI Training models to follow explicit principles
    Goodhart’s Law Optimizing metrics distorts objectives
    Manifold Hypothesis Data lies on lower-dimensional structures

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 697 words4 min readAbstract

    music-pipe-rs: Web Demo and Multi-Instrument Arrangements

    Since the initial music-pipe-rs post, the project has grown. There’s now a web demo with playable examples, a new seq stage for explicit note sequences, and multi-instrument arrangements that work in GarageBand.

    Resource Link
    Video YouTube
    Live Demo music-pipe-rs Samples
    Source GitHub
    Previous Unix Pipelines for MIDI

    Web Demo

    The live demo showcases pre-built examples with playable audio:

    Tab Style Description
    Bach Toccata (Organ) Classical Multi-voice church organ with octave doubling and pedal bass
    Bach Toccata (8-bit) Chiptune Gyruss-inspired arcade version with square wave
    Bach-esque Algorithmic Procedurally generated baroque-style background music
    Baroque Chamber Ensemble Six-channel piece with strings, harpsichord, and recorder

    Each tab shows the pipeline script alongside playable audio. See exactly what commands produce each result.

    The seq Stage

    The new seq stage allows explicit note sequences instead of algorithmic generation:

    seed | seq "C4/4 D4/4 E4/4 F4/4 G4/2" | to-midi --out scale.mid
    

    Notation: NOTE/DURATION where duration is in beats. Combine with other stages:

    seed | seq "D5/4 C#5/8 R/4 B4/4" | transpose --semitones 5 | humanize | to-midi --out melody.mid
    

    The R represents rests. This enables transcribing existing melodies or composing precise phrases.

    Multi-Instrument Arrangements

    The Baroque chamber piece demonstrates six-channel composition:

    {
        seed 42 | seq "..." --ch 0 --patch 48;  # Strings melody
        seed 42 | seq "..." --ch 1 --patch 6;   # Harpsichord
        seed 42 | seq "..." --ch 2 --patch 74;  # Recorder
        # ... additional voices
    } | humanize | to-midi --out baroque.mid
    

    Each instrument gets its own channel and General MIDI patch. The same seed ensures timing coherence across parts.

    GarageBand Integration

    Import the MIDI files directly into GarageBand:

    1. Generate arrangement: ./examples/trio-demo.sh
    2. Open GarageBand, create new project
    3. Drag the .mid file into the workspace
    4. GarageBand creates tracks for each channel
    5. Assign software instruments to taste

    The demo includes a jazz trio arrangement:

    • Piano: Bluesy melody with chords and swing
    • Bass: Walking bass line with acoustic bass patch
    • Drums: Hi-hat, snare, kick with dynamic variation

    All generated from pipeline scripts.

    Inspiration

    This project was inspired by research into generative music tools and techniques:

    References

    Topic Link
    Analog Synthesizers Code Self Study
    Drum Synthesis JavaScript Drum Synthesis
    Generative Music Code Self Study
    Music Projects Software and Hardware
    FOSS Music Tools Open Source Music Production
    Eurorack Programming Patch.Init() Tutorial
    Opusmodus Algorithmic Composition in Lisp

    The key insight from Opusmodus: algorithmic composition isn’t random music—it’s programmable composition. Motif transformation, rule systems, deterministic generation. music-pipe-rs brings these ideas to Unix pipes.

    What’s Next

    The pipeline architecture makes extension natural:

    • More generators: Markov chains, L-systems, cellular automata
    • More transforms: Inversion, retrograde, quantization
    • Live mode: Real-time MIDI output with clock sync

    Each new capability is just another stage in the pipeline.


    Series: Personal Software (Part 5) Previous: music-pipe-rs: Unix Pipelines

    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Watch the Video

    Unmute to hear narration.

  • 900 words5 min readAbstract

    Lucy 20%: Upgrading My Home AI Cluster

    Lucy is getting an upgrade. I’m adding an X99 motherboard with an RTX 3090 to expand my AI cluster from 10% to 20% brain power.

    Resource Link
    Video Lucy 20% Upgrade
    Video
    Previous Lucy 10%
    Video

    New Hardware: Queenbee

    The cluster uses bee-themed naming. The new node is called queenbee:

    Component Specification
    Motherboard X99
    CPU Intel Xeon E5-2660 v4 (28 threads)
    RAM 64 GB DDR4 ECC
    GPU RTX 3090 (24GB VRAM)
    Storage 1TB NVMe SSD + 4TB HDD

    New AI Capabilities

    With queenbee online, Lucy gains several new abilities:

    Capability Model What It Does
    Voice Cloning VoxCPM High-quality text-to-speech with voice cloning
    Text-to-Image FLUX schnell Fast image generation from text prompts
    Text-to-Video Wan 2.2 Generate video clips from text descriptions
    Image-to-Video SVD Animate still images into video

    The Active Cluster

    Currently active for AI workloads:

    Node Role GPU
    hive MuseTalk lip-sync 2x P40 (48GB total)
    queenbee Generative AI workloads RTX 3090 (24GB)

    Together, they handle the full pipeline: generate images, animate them to video, add lip-synced speech, and produce the final output. See the full apiary inventory below.

    Why Local AI?

    Running AI locally means:

    • Privacy - Data never leaves my network
    • No API costs - Unlimited generations after hardware investment
    • Customization - Full control over models and parameters
    • Learning - Deep understanding of how these systems work

    The 24GB of VRAM on the 3090 opens up models that wouldn’t fit on smaller cards. FLUX schnell produces high-quality images in seconds. VoxCPM creates natural-sounding speech that can clone voices from short audio samples.

    Bee-Themed Host Names

    The full apiary (current and planned nodes):

    Host System CPU Cores RAM GPU
    apiary HPE DL360 G10 1x Xeon Gold 5188 12C/24T 188G -
    bees HPE DL360 G9 2x E5-2650 v4 24C/48T 128G -
    brood HPE DL380 G9 2x E5-2680 v4 28C/56T 64G 2x P100-16G
    colony Supermicro 6028U 2x E5-2680 v3 24C/48T TBD 2x K80-24G
    drones HPE DL380 G9 2x E5-2620 v4 16C/32T 256G -
    hive HPE DL380 G9 2x E5-2698 v3 32C/64T 128G 2x P40-24G
    honeycomb HPE DL180 G9 1x E5-2609 v4 8C/8T TBD -
    queenbee X99 1x E5-2660 v4 14C/28T 64G RTX 3090-24G
    swarm HPE DL380 G9 2x E5-2698 v3 32C/64T 374G 2x P100-12G
    workers HPE DL560 G8 4x E5-4617 v1 TBD 640G TBD

    Notes: Some nodes pending upgrade or configuration. Workers may upgrade to 4x E5-4657L v2 (48C/96T). Honeycomb needs unbrick. K80 GPUs are old and difficult to configure (limited CUDA version support)—will be replaced with M40 GPUs.

    Power and Control

    Remote management is essential for a home datacenter. The HPE servers include iLO (Integrated Lights-Out) for out-of-band access to BIOS, diagnostics, monitoring, and power control—even when the OS is down.

    Category Technology Purpose
    Remote Management HPE iLO BIOS access, diagnostics, monitoring, power control
    IP KVM JetKVM, Sipeed KVM Console access for non-HPE servers (planned)
    Power Monitoring Kill-A-Watt, clones Per-outlet power consumption tracking
    Smart Outlets Home Assistant + Zigbee Remote power control, scheduling, automation
    Additional Circuits Bluetti LFP power stations Extra capacity to run more servers, remote control via BT/WiFi/Zigbee

    The combination of iLO and smart outlets means I can remotely power-cycle any server, access its console, and monitor power draw—all from my phone or Home Assistant dashboard. The Bluetti stations primarily provide additional circuits so I can run more servers simultaneously—home electrical limits are a real constraint. More LFP power stations will be needed to power Lucy at 100%.

    Networking

    Each server has 3 or more NICs, segmented by purpose:

    Speed Purpose Switch
    1G iLO/KVM management 1G switch
    2.5G SSH, SCP, Chrome Remote Desktop 2x 2.5G switches
    10G fiber Server-to-server data transfer (large models) 10G switch

    The 10G backbone is essential for moving multi-gigabyte model files between nodes. Loading a 70B parameter model over 1G would take forever—10G fiber makes it practical. The 2.5G network handles interactive work and smaller transfers (using USB NICs where needed), while the 1G management network stays isolated for out-of-band access.

    Additional networking notes:

    • WiFi 7 for wireless connectivity
    • Managed switches with VLANs planned for better network segmentation
    • Linux network bonding experiments to increase aggregate transfer rates
    • Sneaker net - most servers have hot-swap SAS SSDs and hard drives, so physically moving drives between nodes is sometimes the fastest option for very large transfers

    What’s Next

    The 20% milestone is just a step. Future upgrades could include:

    • Additional GPU nodes for parallel processing
    • Larger language models for local inference
    • Real-time video generation pipelines
    • Integration with more specialized models

    The bee hive keeps growing.


    Building AI infrastructure one node at a time.

    Watch the Video

    Unmute to hear narration.

  • 631 words4 min readAbstract

    How AI Learns Part 5: Context Engineering & Recursive Reasoning

    Large context windows are not a complete solution.

    As context grows:

    • Attention dilutes
    • Errors compound
    • Reasoning quality degrades
    Resource Link
    Related RLM | ICL Revisited

    The Context Problem

    Transformers have finite attention. With limited attention heads and capacity, the model cannot attend equally to everything. As tokens accumulate:

    • Earlier instructions lose influence
    • Patterns average toward generic responses
    • Multi-step reasoning fails

    This is context rot—not forgetting weights, but losing signal in noise.

    In-Context Learning (ICL)

    The model adapts temporarily via examples in the prompt.

    Aspect ICL
    Updates weights? No
    Persists across sessions? No
    Speed Instant
    Mechanism Activations, not gradients

    ICL is powerful but ephemeral. It’s working memory, not learning.

    Limitation: As context grows, ICL examples compete with other content for attention.

    Recursive Language Models (RLM)

    Circular flow diagram showing LLM connected to Tools, Memory, Context, and Evaluation in a recursive loop
    Rebuild context each step instead of dragging it forward.

    RLMs decompose reasoning into multiple passes. Instead of dragging entire context forward:

    1. Query relevant memory
    2. Retrieve what’s needed now
    3. Execute tools
    4. Evaluate results
    5. Reconstruct focused context
    6. Repeat

    This treats context as a dynamic environment, not a static blob.

    Why RLM Works

    Traditional approach:

    [System prompt + 50k tokens of history + query]
    

    RLM approach:

    [System prompt + retrieved relevant context + current query]
    

    Each reasoning step starts fresh with focused attention.

    Context Engineering Techniques

    Technique How It Helps
    Summarization Compress old context, preserve essentials
    Chunking Process in segments, aggregate results
    Retrieval Pull relevant content, not everything
    Tool offloading Store state externally, query on demand
    Structured prompts Clear sections, explicit priorities

    Tool Use as Context Management

    Tools aren’t just for actions—they’re for state management.

    Instead of keeping everything in context:

    • Store in files, databases, or structured formats
    • Query when needed
    • Return focused results

    This converts unbounded context into bounded queries.

    The Agent Loop

    Modern agents combine these ideas:

    while not done:
        # 1. Assess current state
        relevant = retrieve_from_memory(query)
    
        # 2. Build focused context
        context = [system_prompt, relevant, current_task]
    
        # 3. Reason
        action = llm(context)
    
        # 4. Execute
        result = execute_tool(action)
    
        # 5. Update memory
        memory.store(result)
    
        # 6. Evaluate
        if goal_achieved(result):
            done = True
    

    Each iteration rebuilds context. No rot accumulation.

    Test-Time Adaptation

    A related technique: temporarily update weights during inference.

    Aspect Test-Time Learning
    Updates weights? Yes, lightly (LoRA)
    Persists? No (rolled back)
    Purpose Adapt to input distribution

    This sits between ICL (no updates) and fine-tuning (permanent updates).

    Key Insight

    Context is not a static buffer. It’s a dynamic workspace.

    Systems that treat context as “append everything” will rot. Systems that actively manage context stay coherent.

    References

    Concept Paper
    RLM Recursive Language Models (Zhou et al. 2024)
    ICL What Can Transformers Learn In-Context? (Garg et al. 2022)
    Test-Time Training TTT for Language Models (2024)
    Chain-of-Thought Chain-of-Thought Prompting (Wei et al. 2022)

    Coming Next

    In Part 6, we’ll connect all of this to continuous learning: replay methods, subspace regularization, adapter evolution, and consolidation loops.


    Rebuild focus instead of dragging baggage.

  • 406 words3 min readAbstract

    Five ML Concepts - #25

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #25
    Video

    References

    Concept Reference
    Label Smoothing Rethinking the Inception Architecture (Szegedy et al. 2015)
    Miscalibration On Calibration of Modern Neural Networks (Guo et al. 2017)
    Representation Learning Representation Learning: A Review (Bengio et al. 2013)
    Adversarial Examples Intriguing properties of neural networks (Szegedy et al. 2013)
    Double Descent Deep Double Descent (Nakkiran et al. 2019)

    Today’s Five

    1. Label Smoothing

    Replacing hard one-hot labels with softened target distributions during training. Instead of 100% confidence in one class, distribute small probability to other classes.

    Reduces overconfidence and can improve generalization.

    Like allowing small uncertainty instead of absolute certainty.

    2. Miscalibration

    When predicted confidence does not match observed accuracy. A model that says “90% confident” should be right 90% of the time.

    Modern neural networks tend to be overconfident. Temperature scaling can help.

    Like a forecast that sounds certain but is often wrong.

    3. Representation Learning

    Learning useful internal features automatically from raw data. Instead of hand-crafting features, the model discovers what matters.

    The foundation of deep learning’s success across domains.

    Like detecting edges before recognizing full objects.

    4. Adversarial Examples

    Inputs modified to cause incorrect predictions. Small, often imperceptible changes can flip model outputs.

    A security concern and a window into model vulnerabilities.

    Like subtle changes that fool a system without obvious differences.

    5. Double Descent

    Test error that decreases, increases, then decreases again as model capacity grows. The classical bias-variance tradeoff captures only the first part.

    Modern overparameterized models operate in the second descent regime.

    Like getting worse before getting better—twice.

    Quick Reference

    Concept One-liner
    Label Smoothing Softening targets to reduce overconfidence
    Miscalibration Confidence not matching accuracy
    Representation Learning Automatically learning useful features
    Adversarial Examples Inputs crafted to cause errors
    Double Descent Test error decreasing twice with model size

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 627 words4 min readAbstract

    How AI Learns Part 4: Memory-Based Learning

    Modern AI systems increasingly rely on external memory.

    This shifts “learning” away from parameters.

    Resource Link
    Related Engram | Engram Revisited | Multi-hop RAG

    The Memory Paradigm

    Diagram showing brain (model) connected to notebook (memory) with RAG, CAG, and Engram types
    Store facts outside the brain.

    Why External Memory?

    Most “learning new facts” should not modify weights.

    Weights are for generalization. They encode reasoning patterns, language structure, and capability.

    Memory is for storage. It holds specific facts, documents, and experiences.

    If you store everything in weights:

    • You create interference
    • You risk forgetting
    • You must retrain

    If you store facts in memory:

    • No forgetting
    • Fast updates
    • Survives model upgrades

    Retrieval-Augmented Generation (RAG)

    Documents are embedded into vectors. At query time:

    1. Embed the query
    2. Search the vector database
    3. Retrieve relevant documents
    4. Inject into prompt
    5. Generate grounded response

    The model does not need to remember facts internally. It retrieves them on demand.

    RAG Benefits

    Benefit Description
    No forgetting External storage, not weights
    Persistent Survives restarts and model changes
    Scalable Add documents without retraining
    Verifiable Can cite sources

    RAG Challenges

    • Retrieval precision (wrong docs = bad answers)
    • Latency (search takes time)
    • Index maintenance
    • Chunk boundaries

    Cache-Augmented Generation (CAG)

    Instead of retrieving from vector DB, cache previous context or KV states.

    Use cases:

    • Repeated knowledge tasks
    • Multi-turn conversations
    • Pre-computed context windows

    Benefits over RAG:

    • Often faster (no embedding + search)
    • More deterministic
    • Good for structured repeated workflows

    Trade-offs:

    • Less flexible
    • Cache management complexity

    Engram-Style Memory

    Recent proposals (e.g., DeepSeek research) introduce conditional memory modules with direct indexing.

    Instead of scanning long context or searching vectors:

    • Memory slots indexed directly
    • O(1) lookup instead of O(n) attention
    • Separates static knowledge from dynamic reasoning

    The goal: Constant-time memory access that doesn’t scale with context length.

    This changes the compute story:

    • Don’t waste attention on “known facts”
    • Reserve compute for reasoning
    • Avoid context rot

    Model Editing

    A related technique: surgically patch specific facts without full fine-tuning.

    Example: The model says “The capital of Australia is Sydney.” You edit the specific association to “Canberra” without retraining.

    Pros:

    • Targeted fixes
    • Fast

    Cons:

    • Side effects possible
    • Consistency not guaranteed

    The Key Distinction

    Aspect Weight Learning Memory Learning
    Location Parameters External storage
    Persistence Model lifetime Storage lifetime
    Forgetting risk High None
    Update speed Slow (training) Fast (database)
    Survives model change? No Yes

    When to Use What

    Situation Approach
    Need new reasoning capability Weight-based (fine-tune)
    Need to know new facts Memory-based (RAG)
    Need domain expertise Weight-based (LoRA)
    Need to cite sources Memory-based (RAG)
    Frequently changing data Memory-based (RAG/CAG)

    References

    Concept Paper
    RAG Retrieval-Augmented Generation (Lewis et al. 2020)
    Engram Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025)
    REALM REALM: Retrieval-Augmented Pre-Training (Guu et al. 2020)
    Model Editing Editing Factual Knowledge (De Cao et al. 2021)

    Coming Next

    In Part 5, we’ll examine context engineering and recursive reasoning: ICL, RLM, and techniques that prevent context rot during inference.


    The brain stays stable. The notebook grows.

  • 426 words3 min readAbstract

    Five ML Concepts - #24

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #24
    Video

    References

    Concept Reference
    Warmup Accurate, Large Minibatch SGD (Goyal et al. 2017)
    Data Leakage Leakage in Data Mining (Kaufman et al. 2012)
    Mode Collapse Generative Adversarial Nets (Goodfellow et al. 2014)
    Blue/Green Deployment MLOps best practice (no canonical paper)
    Reward Hacking Concrete Problems in AI Safety (Amodei et al. 2016)

    Today’s Five

    1. Warmup

    Gradually increasing the learning rate at the start of training as part of a learning rate schedule. This helps stabilize early training when gradients can be noisy.

    Warmup is especially important for large batch training.

    Like stretching before a sprint instead of starting at full speed.

    2. Data Leakage

    When information unavailable at deployment accidentally influences model training. This creates artificially high validation scores that don’t reflect real-world performance.

    Common sources include future data, preprocessing on full dataset, or duplicate samples.

    Like memorizing test answers instead of learning the material.

    3. Mode Collapse

    When a generative model produces limited output diversity. The generator learns to produce only a few outputs that fool the discriminator.

    A major challenge in GAN training that various architectures attempt to address.

    Like a musician who only plays one song no matter the request.

    4. Blue/Green Deployment

    Maintaining two production environments and switching traffic between them. One serves live traffic while the other is updated and tested.

    Enables instant rollback if problems occur.

    Like having a backup stage ready so the show never stops.

    5. Reward Hacking

    When agents exploit reward functions in unintended ways. The agent optimizes the reward signal rather than the intended objective.

    A key challenge in reinforcement learning and AI alignment.

    Like gaming the grading rubric instead of learning the material.

    Quick Reference

    Concept One-liner
    Warmup Gradually increasing learning rate at start
    Data Leakage Training on unavailable deployment info
    Mode Collapse Limited generative output variety
    Blue/Green Deployment Switching between parallel environments
    Reward Hacking Exploiting reward function flaws

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1231 words7 min readAbstract

    TBT (5/?): IBM 1130 System Emulator - Experience 1960s Computing

    The IBM 1130, introduced in 1965, was a 16-bit minicomputer that brought computing to universities and small businesses. This browser-based system emulator recreates the complete experience: console panel with authentic indicator lights, keypunch, printer, and assembly programming.

    Status: Work in progress. Core features functional, enhancements planned.

    The System

    This isn’t just an assembly emulator—it’s a full system visualization:

    Component What It Does
    Console Panel Authentic indicator lights, toggle switches, speed control
    Assembler Game Write and execute IBM 1130 code with real-time visualization
    Keypunch IBM 029 text cards and 1442 object deck visualization
    Printer IBM 1131 console printer with greenbar paper

    Console Panel

    The console panel recreates the physical operator interface with all indicator light groups documented in IBM’s Functional Characteristics manual.

    Register Display (6 rows × 16 positions)

    Row Register Bits Shown Purpose
    1 IAR 15 Instruction Address Register (program counter)
    2 SAR 15 Storage Address Register (memory access)
    3 SBR 16 Storage Buffer Register (data word)
    4 AFR 16 Arithmetic Factor Register (operand)
    5 ACC 16 Accumulator (main arithmetic register)
    6 EXT 16 Extension (double-precision, multiply/divide)

    Right-Side Indicators

    Beyond the register displays, the console shows:

    • Operation Register (5 bits) - Binary op-code of current instruction
    • Format/Tag Indicators - Long instruction format, index register selection
    • Cycle Control (T0-T7) - Internal timing pulses for debugging
    • Status Lights - Wait, Run, Fetch, Execute, Indirect Address

    Control Panel Lights

    Light Purpose
    DISK UNLOCK Safe to swap 2315 disk cartridge
    FILE READY Disk drive up to speed
    FORMS CHECK Printer out of paper
    RUN CPU executing instructions
    PARITY Memory parity error
    FREEZE Fatal hardware error

    Operator Controls

    • 16-bit toggle switches for manual data entry
    • 7-position speed knob - Single Step, SMC, INT RUN, RUN, SI, DISP, LOAD
    • Lamp test to verify all indicators function
    • Emergency stop button

    Assembler Game

    Learn the IBM 1130 instruction set interactively:

    • Complete instruction set - LD, STO, LDX, STX, A, S, AND, OR, SLA, SRA, BSC, BSI, WAIT
    • Memory-mapped index registers - XR1-3 at addresses 1, 2, 3 (historically accurate)
    • Step-by-step execution with change highlighting
    • Interactive examples covering arithmetic, indexing, shifts
    • Progressive challenges with validation

    Keypunch

    The keypunch simulation supports two card types:

    IBM 029 Text Cards

    • Hollerith encoding - Standard character-to-punch mapping
    • Visual card display - Watch holes appear as you type
    • Multi-card decks - Manage multiple cards

    IBM 1130 Object Deck (1442 Output)

    • Binary card visualization - Machine code punch patterns
    • Object deck format - Matches authentic assembler output
    • No character printing - Pure binary data cards

    The IBM 029 Keypunch produced human-readable text cards. For binary object decks (compiled programs), the IBM 1442 Card Read-Punch would create cards with arbitrary punch patterns that don’t map to characters.

    Printer

    The IBM 1131 Console Printer simulation:

    • Greenbar paper rendering - Authentic line printer output
    • Typewriter-style characters - Period-appropriate appearance
    • Console output - System messages and program output

    Technology

    Component Choice
    Language Rust
    Target WebAssembly
    UI Framework Yew
    Build Tool Trunk
    Hosting GitHub Pages

    Planned Enhancements

    This is a work in progress. Planned features include:

    • Additional challenges (10 total)
    • Code save/load functionality
    • URL sharing of programs
    • Breakpoints and memory watches
    • Keyboard shortcuts
    • Full 1442 Card Read-Punch integration

    IBM Documentation References

    Document Description
    GA26-5881 Functional Characteristics - Console panel details
    GA26-5717 Operating Procedures - Operator instructions
    GA26-5914 Physical Planning - System dimensions
    Bitsavers Collection Complete IBM 1130 documentation archive

    Project Goals

    This is an early proof-of-concept for trying out components that could be extended to produce a more realistic system of devices that could actually run programs. The modular architecture allows each peripheral (console, keypunch, printer) to be developed and refined independently.

    A key goal is educational challenges that teach assembly language step by step. The assembler game provides progressive exercises that build understanding from basic load/store operations through arithmetic, indexing, and control flow.

    Historical Significance

    The IBM 1130 was the first computer for many programmers in the late 1960s and 1970s. Its clean architecture and accessible price point (~$32,000) made it ideal for education.

    A Transitional Technology

    The IBM 1130 arrived after mechanical calculators and vacuum tube computers, but before dense integrated circuits and microprocessors. This was a unique moment in computing history when machines were complex enough to be powerful, yet simple enough to be fully understood by one person.

    The system shipped with complete schematics and diagnostic listings. A field engineer could use an oscilloscope to probe the pins on every transistor. The “integrated circuit” of the era was a small can with a 4×4 pin grid containing just two transistors, mounted on a pluggable card connected via a wire-wrapped backplane. When something failed, you could see it, touch it, and replace it.

    Non-Volatile Core Memory

    One remarkable feature: magnetic core memory was non-volatile. You could stop the system, power down overnight, come back in the morning, power up, and start your program exactly where it left off—without reloading from cards, tape, or disk.

    Each bit was stored as the magnetic polarity of a tiny ferrite ring. No electricity required to maintain state. This made the 1130 remarkably resilient and practical for environments where power wasn’t guaranteed.

    Notable fact: The Forth programming language was developed on the IBM 1130 by Charles Moore in the late 1960s.

    Personal Experience

    In the late 1970s, I worked as an IBM Customer Engineer maintaining a large number of IBM 1130 and 1800 systems used primarily by IBM manufacturing facilities in Kingston, Poughkeepsie, and East Fishkill, New York.

    Field service on these machines was hands-on in ways that seem almost unimaginable today. I would often hand-assemble code on paper, converting mnemonics to binary, then enter machine code via the console toggle switches to create a small program. That program’s job? To punch another program onto a card.

    I could then insert that punched card into a diagnostic deck to loop on an error condition while I used an oscilloscope and logic schematics to diagnose a failing circuit card. The blinking lights weren’t decoration—they were essential debugging tools that showed exactly what the CPU was doing at each moment.

    This emulator recreates that experience: the same indicator lights, the same toggle switches, the same intimate connection between human and machine that made these systems so memorable to work with.


    Experience 1960s computing in your browser. Work in progress.

    Watch the Video

    Unmute to hear narration.

  • 649 words4 min readAbstract

    How AI Learns Part 3: Weight-Based Learning

    Weight-based learning modifies the neural network itself.

    It is slow. It is powerful. It is dangerous.

    The Weight-Based Methods

    Diagram showing LoRA adapters, distillation flow, and alignment pipeline
    Weight-based learning modifies the brain itself.

    Pretraining

    This creates the base model.

    It encodes language structure, reasoning patterns, and general world knowledge. The process:

    • Trains on terabytes of text
    • Uses self-supervised learning (predict next token)
    • Runs for weeks or months
    • Costs millions of dollars

    This learning is rarely repeated for cost reasons. The result is a foundation that everything else builds upon.

    Fine-Tuning

    Fine-tuning adapts models for specific tasks.

    Standard Fine-Tuning

    Adjust some or all weights using task-specific data.

    Pros:

    • Can significantly change behavior
    • Works with small datasets

    Cons:

    • Risk of catastrophic forgetting
    • Expensive if you modify all weights
    • Hard to undo

    Supervised Fine-Tuning (SFT)

    Train on instruction → response pairs.

    This teaches the model to:

    • Follow directions
    • Produce helpful outputs
    • Maintain conversation structure

    Risk: Can reduce other capabilities if data is narrow.

    Preference Optimization

    Instead of “correct answers,” train from comparisons: preferred vs rejected responses.

    Method Description
    Reinforcement Learning from Human Feedback (RLHF) Reward model + reinforcement learning
    Direct Preference Optimization (DPO) Simpler alternative to RLHF
    RLAIF AI-generated preferences

    Pros: Strong style/safety/helpfulness steering

    Cons: Can drift (“over-align”), may conflict with domain competence

    Parameter-Efficient Fine-Tuning (PEFT)

    Instead of changing all weights, inject small trainable modules.

    LoRA (Low-Rank Adaptation)

    Insert small low-rank matrices into transformer layers. Only train these matrices.

    Benefits:

    • Faster training: Fewer parameters to update
    • Modular: Can swap adapters
    • Version control: Different adapters for different tasks
    • Lower forgetting risk: Base weights frozen

    Other PEFT Methods

    • Prompt tuning: Learn soft prompts
    • Prefix tuning: Prepend learned vectors
    • Adapters: Small bottleneck layers
    • IA³: Learned vectors that scale activations

    Shared LoRA Subspaces

    Multiple tasks share adapter subspaces to reduce interference.

    Recent work (ELLA, Share) maintains evolving shared low-rank subspaces that:

    • Reduce interference between tasks
    • Enable continual learning
    • Keep memory constant

    Distillation

    Train a smaller model using a larger model as teacher.

    Aspect Teacher Student
    Size Large Small
    Cost High inference Low inference
    Knowledge Full Compressed

    Distillation benefits:

    • Speeds up inference
    • Often improves consistency
    • Can reduce hallucination
    • Makes deployment cheaper

    This is not runtime learning—it’s offline structural learning.

    The Alignment Pipeline

    Modern models typically go through:

    1. Pretraining → General competence
    2. SFT → Follow instructions
    3. RLHF/DPO → Align with preferences
    4. Safety fine-tuning → Reduce harmful outputs

    Each step modifies weights. Each step risks forgetting previous capabilities.

    Key Insight

    Fine-tuning changes the brain. RAG changes the notes on the desk.

    Weight-based learning is the core capability layer. It’s slow to change, expensive to update, and risky to modify—but it forms the stable foundation that everything else builds upon.

    References

    Concept Paper
    LoRA LoRA: Low-Rank Adaptation (Hu et al. 2021)
    RLHF Training LMs with Human Feedback (Ouyang et al. 2022)
    DPO Direct Preference Optimization (Rafailov et al. 2023)
    Distillation Distilling Knowledge in Neural Networks (Hinton et al. 2015)
    Adapters Parameter-Efficient Transfer Learning (Houlsby et al. 2019)

    Coming Next

    In Part 4, we’ll explore memory-based learning: RAG, CAG, Engram, and other techniques that learn without touching weights.


    Change the brain carefully.

  • 440 words3 min readAbstract

    Five ML Concepts - #23

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #23
    Video

    References

    Concept Reference
    Emergent Behavior Emergent Abilities of Large Language Models (Wei et al. 2022)
    Tool Use Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al. 2023)
    Loss Surface Sharpness On Large-Batch Training for Deep Learning (Keskar et al. 2016)
    Learning Rate Schedules SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter 2016)
    Canary Deployment MLOps best practice (no canonical paper)

    Today’s Five

    1. Emergent Behavior

    Some capabilities appear only when models reach sufficient scale. These behaviors were not directly programmed but arise from learned representations.

    Emergence is a key phenomenon in large language models.

    Like a child learning words and then suddenly understanding full sentences.

    2. Tool Use

    Modern AI systems can generate structured commands to call external tools. These include search engines, calculators, or code interpreters.

    This extends model capabilities beyond internal knowledge.

    Like asking a librarian to look something up instead of guessing.

    3. Loss Surface Sharpness

    Sharp minima are sensitive to small weight changes. Flatter minima tend to be more robust and often generalize better.

    Training methods that find flatter regions can improve test performance.

    Like standing on a plateau instead of balancing on a narrow peak.

    4. Learning Rate Schedules

    Instead of keeping the learning rate constant, training often starts high and gradually reduces it. Schedules like step decay or cosine annealing improve convergence.

    Warm restarts can help escape local minima.

    Like running fast at first, then slowing down to finish precisely.

    5. Canary Deployment

    A new model version is rolled out to a small percentage of users first. If problems appear, rollout stops before affecting everyone.

    Essential MLOps practice for safe production updates.

    Like tasting food before serving it to all your guests.

    Quick Reference

    Concept One-liner
    Emergent Behavior Capabilities appearing at sufficient scale
    Tool Use AI calling external tools
    Loss Surface Sharpness Flatter minima generalize better
    Learning Rate Schedules Adjusting learning rate during training
    Canary Deployment Gradually rolling out new models safely

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 641 words4 min readAbstract

    How AI Learns Part 2: Catastrophic Forgetting vs Context Rot

    There are two fundamentally different failure modes in modern AI systems.

    They are often confused. They should not be.

    The Two Failures

    Split diagram showing catastrophic forgetting (weight interference) vs context rot (attention dilution)
    Two different failure modes require two different solutions.

    Catastrophic Forgetting (Weight-Space Failure)

    When you fine-tune a model on new tasks, performance on older tasks may degrade.

    This happens because gradient descent updates overlap in parameter space. The model does not “know” which weights correspond to which task. It optimizes globally.

    Example: Fine-tune a model on medical text. Its ability to write code degrades. The new learning overwrote old capabilities.

    Why It Happens

    Neural networks store knowledge distributed across many weights. When you update those weights for Task D, you modify the same parameters that encoded Task A. The old knowledge gets overwritten.

    This is the stability vs plasticity tradeoff:

    • Plasticity: Learn new things quickly
    • Stability: Retain old things reliably

    You cannot maximize both simultaneously.

    Solutions

    Method How It Helps
    Replay Train on old + new data
    Subspace regularization Constrain weight updates to avoid interference
    Shared Low-Rank Adaptation (LoRA) spaces Modular updates that don’t overwrite base weights
    Freezing base weights Keep foundation stable, train adapters only

    Context Rot (Inference-Time Failure)

    Context rot is not weight damage.

    It happens when:

    • Prompts grow too large
    • Earlier instructions get diluted
    • Attention spreads thin
    • The model begins averaging patterns instead of reasoning

    Example: A 50,000 token conversation. The original system prompt is still there, but the model stops following it. Earlier context gets “forgotten” even though it’s technically present.

    Why It Happens

    Transformer attention is finite. With limited attention heads and capacity, the model cannot attend equally to everything. As context grows, earlier tokens receive less attention weight.

    This creates:

    • Instruction drift: Original instructions lose influence
    • Pattern averaging: The model reverts to generic responses
    • Lost coherence: Multi-step reasoning fails

    Solutions

    Method How It Helps
    Retrieval-based context Pull relevant passages, not everything
    Recursive Language Models (RLM) Rebuild context each step
    Summarization Compress old context
    Memory indexing Constant-time lookup instead of linear attention
    Structured tool calls Offload state to external systems

    The Critical Distinction

    Aspect Catastrophic Forgetting Context Rot
    Where Weights Prompt window
    When During training During inference
    Persists? Permanently Session only
    Analogy Brain damage Working memory overload

    Why This Matters

    If you confuse these failure modes, you apply the wrong fix.

    • Forgetting problem? Don’t add more context. Fix your training.
    • Context rot problem? Don’t retrain. Fix your context management.

    Many “AI agents that forget” discussions conflate both. Modern systems need solutions for both simultaneously.

    References

    Concept Paper
    Catastrophic Forgetting Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017)
    Continual Learning Survey A Comprehensive Survey of Continual Learning (Wang et al. 2023)
    ELLA ELLA: Subspace Learning for Lifelong Machine Learning (2024)
    Share Share: Shared LoRA Subspaces for Continual Learning (2025)
    RLM Recursive Language Models (Zhou et al. 2024)

    Coming Next

    In Part 3, we’ll examine weight-based learning in detail: pretraining, fine-tuning, LoRA, alignment methods, and distillation.


    Different failures need different fixes.

  • 472 words3 min readAbstract

    Five ML Concepts - #22

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #22
    Video

    References

    Concept Reference
    RSFT Scaling Relationship on Learning Mathematical Reasoning (Yuan et al. 2023)
    Model Steerability Controllable Generation from Pre-trained Language Models (Zhang et al. 2023)
    LSTM Long Short-Term Memory (Hochreiter & Schmidhuber 1997)
    More Data Beats Better Models The Unreasonable Effectiveness of Data (Halevy et al. 2009)
    System Reliability vs Quality MLOps best practice (no canonical paper)

    Today’s Five

    1. RSFT (Rejection Sampling Fine-Tuning)

    A method where many model outputs are generated, weaker ones are filtered out, and the best samples are used for further fine-tuning. It improves output quality without full reinforcement learning.

    The model learns from its own best attempts.

    Like practicing many attempts and studying only your best ones.

    2. Model Steerability

    The ability to adjust a model’s behavior through prompts, parameters, or control mechanisms. This allows flexible behavior without retraining.

    Steerable models can adapt to different tasks or styles at inference time.

    Like steering a car instead of letting it move in a fixed direction.

    3. LSTM (Long Short-Term Memory)

    A recurrent neural network architecture with gates that regulate memory flow. It was designed to mitigate vanishing gradient problems in sequence modeling.

    LSTMs decide what to remember and what to forget at each time step.

    Like a notebook where you choose what to keep and what to forget.

    4. Why More Data Beats Better Models

    In many cases, adding high-quality data improves performance more than small architecture improvements. Data scale often matters as much as model design.

    This is sometimes called “the unreasonable effectiveness of data.”

    Like practicing with many real conversations instead of perfecting one grammar rule.

    5. System Reliability vs Model Quality

    A slightly less accurate model that runs reliably can outperform a fragile but slightly better one. Engineers balance uptime, latency, and stability against pure accuracy.

    Production systems need both correctness and dependability.

    Like choosing a reliable car over a faster one that breaks down often.

    Quick Reference

    Concept One-liner
    RSFT Fine-tuning on filtered best outputs
    Model Steerability Adjusting behavior at inference time
    LSTM Gated memory for sequence modeling
    More Data Beats Better Models Data scale trumps architecture tweaks
    System Reliability vs Quality Balancing accuracy with uptime

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1393 words7 min readAbstract

    Many-Eyes Learning: Intrinsic Rewards and Diversity

    In Part 1, we demonstrated that multiple scouts dramatically improve learning in sparse-reward environments. Five scouts achieved 60% success where a single scout achieved 0%.

    This post explores how scouts explore: intrinsic rewards that drive novelty-seeking behavior, and what happens when you mix different exploration strategies.

    Recap: The Many-Eyes Architecture

    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │   Scout 1   │  │   Scout 2   │  │   Scout N   │
    │ (strategy A)│  │ (strategy B)│  │ (strategy N)│
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           v                v                v
    ┌─────────────────────────────────────────────────┐
    │              Experience Buffer                   │
    └─────────────────────────────────────────────────┘
                           │
                           v
    ┌─────────────────────────────────────────────────┐
    │               Shared Learner                     │
    └─────────────────────────────────────────────────┘
    

    Scouts are information gatherers, not independent learners. They explore with different strategies, pool their discoveries, and a shared learner benefits from the combined experience.

    New Scout Strategies

    CuriousScout: Count-Based Novelty

    IRPO formalizes intrinsic rewards as the mechanism that drives scout exploration. CuriousScout implements count-based curiosity:

    class CuriousScout(Scout):
        def __init__(self, bonus_scale: float = 1.0):
            self.state_counts = defaultdict(int)
            self.bonus_scale = bonus_scale
    
        def intrinsic_reward(self, state):
            count = self.state_counts[state]
            return self.bonus_scale / sqrt(count + 1)
    

    How it works:

    • Track how many times each state has been visited
    • Reward = bonus_scale / √(count + 1)
    • Novel states get high rewards; familiar states get diminishing returns

    The intuition: A curious scout is drawn to unexplored territory. The first visit to a state is exciting (reward = 1.0). The fourth visit is mundane (reward = 0.5). This creates natural pressure to explore widely.

    OptimisticScout: Optimism Under Uncertainty

    A different philosophy: assume unknown states are valuable until proven otherwise.

    class OptimisticScout(Scout):
        def __init__(self, optimism: float = 10.0):
            self.optimism = optimism
    
        def initial_q_value(self):
            return self.optimism  # Instead of 0
    

    How it works:

    • Initialize all Q-values to a high value (e.g., 10.0)
    • The agent is “optimistic” about unvisited state-action pairs
    • As it explores and receives actual rewards, Q-values decay toward reality

    The intuition: If you’ve never tried something, assume it might be great. This drives exploration without explicit novelty bonuses.

    Strategy Comparison

    Strategy Mechanism Best For
    Random Uniform random actions Baseline, maximum coverage
    Epsilon-Greedy Random with probability ε, greedy otherwise Balancing exploit/explore
    CuriousScout Novelty bonus for unvisited states Systematic coverage
    OptimisticScout High initial Q-values Early exploration pressure

    The Diversity Experiment

    Does mixing strategies help, or is it enough to have multiple scouts with the same good strategy?

    Setup

    • 7x7 sparse grid, 100 training episodes
    • All configurations use exactly 5 scouts (fair comparison)
    • 5 random seeds for statistical significance

    Configurations

    1. Homogeneous Random: 5 identical random scouts
    2. Homogeneous Epsilon: 5 identical epsilon-greedy scouts (ε=0.2)
    3. Diverse Mix: Random + 2 epsilon-greedy (ε=0.1, 0.3) + CuriousScout + OptimisticScout

    Results

    Configuration Success Rate
    Random baseline 7%
    Homogeneous random 20%
    Homogeneous epsilon 40%
    Diverse mix 40%

    Analysis

    Finding: Strategy quality matters more than diversity in simple environments.

    • Epsilon-greedy (homogeneous or mixed) outperforms pure random
    • Diverse mix performs the same as homogeneous epsilon-greedy
    • Having 5 good scouts beats having 5 diverse but weaker scouts

    Why doesn’t diversity help here?

    In a simple 7x7 grid, the exploration problem is primarily about coverage, not strategy complementarity. Five epsilon-greedy scouts with different random seeds already explore different regions due to stochastic action selection.

    Diversity likely provides more benefit in:

    • Complex environments with multiple local optima
    • Tasks requiring different behavioral modes
    • Environments with deceptive reward structures

    Web Visualization

    The web visualization demonstrates Many-Eyes Learning with real-time parallel scout movement. (The upcoming video walks through this demo—the post focuses on the underlying mechanism.)

    Many-Eyes Web Visualization

    How It Works

    The web version uses Q-learning with a shared Q-table (simpler than DQN for clarity). All scouts contribute to the same Q-table—the core “many eyes” concept: more explorers = faster Q-value convergence.

    Scout Role Epsilon Behavior
    Random Baseline 1.0 (constant) Always random, never follows policy
    Scouts 1-N Learning Agents 0.5-0.8 → 0.01 Epsilon-greedy with decay

    Exploration Modes

    The UI provides a dropdown to select different exploration strategies:

    Mode Heatmap Diversity Learning Performance
    Shared Policy Low (identical paths) Best (lowest avg steps)
    Diverse Paths High (distinct paths) Worse (biases override optimal)
    High Exploration High Worst (never fully exploits)
    Boltzmann Moderate Moderate

    The Diversity vs Performance Trade-off

    There’s a fundamental trade-off between visual diversity and learning performance:

    • Shared Policy wins on performance: The “many eyes” benefit comes from diverse exploration during learning (finding the goal faster). But once Q-values converge, all scouts should follow the same optimal policy.

    • Diverse Paths sacrifices performance for visuals: Scout-specific directional biases (Scout 1 prefers right, Scout 2 prefers down) create visually interesting heatmaps but suboptimal behavior.

    • High Exploration never converges: Fixed 50% random actions means scouts never fully exploit the learned policy.

    Key insight: For best learning, use Shared Policy. Use other modes to visualize how different exploration strategies affect the learning process, but expect higher average steps.

    Learning Phases

    Phase Episodes Avg Steps Behavior
    Random 1-5 ~70 All scouts exploring randomly
    Early Learning 5-15 40-60 Policy starts forming
    Convergence 15-30 15-25 Clear optimal path emerges
    Stable 30+ 12-18 Near-optimal with random scout noise

    Why “Average Steps to Goal”?

    Success rate is coarse-grained—with 5 scouts, only 6 values are possible (0%, 20%, 40%, 60%, 80%, 100%). After ~10 episodes, all scouts typically reach the goal. Average steps shows continued policy refinement, dropping from ~70 (random) to ~8 (optimal).

    Running the Visualization

    ./scripts/serve.sh   # Open http://localhost:3200
    
    • Yew/WASM frontend with FastAPI backend
    • Speed control from 1x to 100x
    • Replay mode to step through recorded training

    What’s Next

    Potential future directions:

    Direction Why It Matters
    Larger environments Test scaling to 15x15, 25x25 grids
    Scout communication Real-time sharing vs passive pooling
    Adaptive intrinsic rewards Learn the reward function (closer to full IRPO)
    Multi-goal environments Multiple sparse rewards to discover

    Key Takeaways

    1. Intrinsic rewards drive exploration. CuriousScout and OptimisticScout implement different philosophies: novelty bonuses vs optimistic initialization.

    2. Strategy quality > diversity in simple environments. Five good scouts beat five diverse but weaker scouts.

    3. Diversity during learning, convergence after. The “many eyes” benefit comes from diverse exploration during learning. Once Q-values converge, all scouts should follow the same optimal policy.

    4. Shared Q-table enables collective learning. All scouts contribute to one Q-table—more explorers means faster convergence.

    5. Visual diversity costs performance. Modes like “Diverse Paths” create interesting heatmaps but suboptimal behavior. Use “Shared Policy” for best learning results.

    References

    Concept Paper
    IRPO Intrinsic Reward Policy Optimization (Cho & Tran 2026)
    Reagent Reasoning Reward Models for Agents (Fan et al. 2026)
    ICM Curiosity-driven Exploration (Pathak et al. 2017)

    Diverse exploration, convergent execution. Many eyes see more, but the best path is the one they all agree on.

    Watch the Video

    Unmute to hear narration.

  • 592 words3 min readAbstract

    How AI Learns Part 1: The Many Meanings of Learning

    When people say, “AI learned something,” they usually mean one of at least four very different things.

    Large Language Models (LLMs) do not learn in one single way. They learn at different time scales, in different locations, and with very different consequences. To understand modern AI systems—especially agents—we need to separate these layers.

    Resource Link
    Related ICL Revisited | RLM | Engram

    Four Time Scales of Learning

    Concentric rings showing four time scales of learning: core weights, adapters, external memory, and prompt/context
    Learning happens at different layers with different persistence and speed.

    1. Pretraining (Years)

    This is the foundation.

    The model trains on massive datasets using gradient descent. The result is a set of weights—billions of parameters—encoding statistical structure of language and knowledge.

    This learning:

    • Is slow and expensive
    • Persists across restarts
    • Cannot easily be reversed
    • Is vulnerable to interference if modified later

    Think of this as long-term biological memory.

    2. Fine-Tuning (Days to Weeks)

    Fine-tuning modifies the weights further, but with narrower data.

    This includes:

    • Instruction tuning (following directions)
    • Alignment methods (Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO))
    • Domain adaptation
    • Parameter-efficient methods like Low-Rank Adaptation (LoRA)

    This is still weight-based learning.

    It persists across restarts. It risks catastrophic forgetting. It modifies the brain itself.

    3. Memory-Based Learning (Seconds to Minutes)

    This is where many modern systems shift.

    Instead of changing weights, they store information externally:

    • RAG (Retrieval-Augmented Generation)
    • CAG (Cache-Augmented Generation)
    • Vector databases
    • Engram-style memory modules

    The model retrieves relevant memory per query.

    The brain stays stable. The notebook grows.

    This learning:

    • Persists across restarts
    • Survives model upgrades
    • Does not cause forgetting
    • Is fast

    4. In-Context Learning (Milliseconds)

    This is temporary reasoning scaffolding.

    Information exists only in the prompt window.

    It:

    • Does not update weights
    • Does not persist across sessions
    • Is powerful but fragile
    • Suffers from context rot

    This is working memory.

    Why This Matters

    Most discussions collapse all of this into “the model learned.”

    But:

    • Updating weights risks forgetting
    • Updating memory does not
    • Updating prompts does not persist
    • Updating adapters can be modular and reversible

    Continuous learning systems must coordinate all four.

    Persistence Comparison

    Mechanism Persists Across Chat? Persists Across Restart? Persists Across Model Change?
    Pretraining Yes Yes No
    Fine-tune Yes Yes No
    LoRA Yes Yes Usually
    Distillation Yes Yes No
    ICL No No No
    RAG Yes Yes Yes
    Engram Yes Yes Yes
    CAG Yes Yes Yes

    That last column is subtle but powerful for agents.

    References

    Concept Paper
    LoRA LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
    RAG Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020)
    ICL What Can Transformers Learn In-Context? (Garg et al. 2022)
    Engram Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025)
    DPO Direct Preference Optimization (Rafailov et al. 2023)

    Coming Next

    In Part 2, we’ll examine the two fundamental failure modes that arise from confusing these layers: catastrophic forgetting and context rot.


    Learning happens in layers of permanence.

  • 1173 words6 min readAbstract

    music-pipe-rs: Unix Pipelines for MIDI Composition

    After building midi-cli-rs for quick mood-based generation, I wanted something more surgical. What if music generation worked like Unix commands—small tools connected by pipes?

    Resource Link
    Code music-pipe-rs
    Related midi-cli-rs
    Next Web Demo and Multi-Instrument

    The Unix Philosophy for Music

    Most generative music tools are monolithic. You get one application with a closed workflow. If you want to inspect intermediate results, you can’t. If you want to swap one transformation for another, you rebuild everything.

    Unix solved this decades ago: small tools that do one thing well, connected by pipes. Each tool reads from stdin, writes to stdout. You can inspect any point in the pipeline with head, filter with grep, transform with jq.

    music-pipe-rs applies this philosophy to MIDI composition.

    A Pipeline in Action

    seed 12345 | motif --notes 16 --bpm 120 | humanize | to-midi --out melody.mid
    

    Four stages:

    1. seed establishes the random seed for the entire pipeline
    2. motif generates a melodic pattern (using the pipeline seed)
    3. humanize adds timing and velocity variation (using the same seed)
    4. to-midi converts the event stream to a standard .mid file

    The output plays in any DAW.

    Seed-First Architecture

    The seed stage goes at the head of the pipeline:

    # Explicit seed for reproducibility
    seed 12345 | motif --notes 16 | humanize | to-midi --out melody.mid
    
    # Auto-generated seed (printed to stderr)
    seed | motif --notes 16 | humanize | to-midi --out melody.mid
    # stderr: seed: 1708732845
    

    All downstream stages read the seed from the event stream. No --seed arguments scattered across the pipeline. One seed, set once, used everywhere.

    This means:

    • Same seed = identical output across all random stages
    • Different seed = different composition with same structure
    • Reproducibility is trivial: just save the seed number

    JSONL: The Intermediate Format

    Between stages, events flow as JSONL (JSON Lines). Each line is a complete event:

    {"type":"Seed","seed":12345}
    {"type":"NoteOn","t":0,"ch":0,"key":60,"vel":80}
    {"type":"NoteOff","t":480,"ch":0,"key":60}
    

    This format is human-readable and tool-friendly:

    # See the first 10 events
    seed 42 | motif --notes 8 | head -10
    
    # Count how many NoteOn events
    seed 42 | motif --notes 16 | grep NoteOn | wc -l
    
    # Pretty-print with jq
    seed 42 | motif --notes 4 | jq .
    

    No binary formats to decode. No proprietary protocols. Just text.

    Visualization with viz

    The viz stage prints a sparkline to stderr while passing events through:

    seed 12345 | motif --notes 16 | viz | humanize | to-midi --out melody.mid
    

    Output on stderr:

    ▃▅▇▅▃▁▂▄▆▇▆▄▂▁▃▅
    

    For more detail, use piano roll mode:

    seed 12345 | motif --notes 16 | viz --roll
    
     G6 │···█············│
    F#6 │·····█··········│
     F6 │····█···········│
     G5 │·██·········█···│
     F5 │···········█····│
     E5 │·········██···█·│
     C5 │█·····███····█·█│
    

    The visualization goes to stderr; the JSONL events continue to stdout. You can inspect the music without breaking the pipeline.

    Available Stages

    Stage Type Description
    seed Start Establish random seed for pipeline
    motif Generate Create melodic patterns
    euclid Generate Euclidean rhythm generation
    transpose Transform Shift notes by semitones
    scale Transform Constrain notes to a scale
    humanize Transform Add timing/velocity variation
    viz Inspect Print sparkline visualization
    to-midi Output Convert to .mid file

    Each stage is a separate binary. Mix and match as needed.

    Euclidean Rhythms

    The euclid stage generates Euclidean rhythms—mathematically optimal distributions of hits across steps:

    # 3 hits distributed across 8 steps (Cuban tresillo)
    seed | euclid --pulses 3 --steps 8 --note 36 | to-midi --out kick.mid
    
    # 4-on-the-floor kick pattern
    seed | euclid --pulses 4 --steps 16 --note 36 | to-midi --out four-floor.mid
    

    These patterns appear in music worldwide because they “feel right”—the spacing is as even as possible.

    Scale Locking

    The scale stage constrains notes to a musical scale:

    seed 42 | motif --notes 16 | scale --root C --mode minor | to-midi --out c-minor.mid
    

    No wrong notes. Every pitch fits the harmonic context.

    Layering Streams

    Generate drum and melody separately, then combine:

    {
        seed 100 | euclid --pulses 4 --steps 16 --note 36 --ch 9;
        seed 100 | motif --notes 16 | scale --root C --mode pentatonic;
    } | to-midi --out layered.mid
    

    Channel 9 is General MIDI drums. Same seed ensures coherence between parts. Both streams merge into a single MIDI file.

    Why Not Just Use midi-cli-rs?

    Different tools for different needs:

    Tool Strength Use Case
    midi-cli-rs Quick mood presets “Give me 5 seconds of jazz”
    music-pipe-rs Compositional control “Generate a motif, constrain to scale, add swing”

    midi-cli-rs is high-level: pick a mood, get music. music-pipe-rs is low-level: build compositions from primitive operations.

    Both are useful. Both work with AI coding agents.

    The Personal Software Pattern

    This continues the theme: build small tools that compose well. Don’t try to solve everything in one application. Let Unix handle orchestration.

    The best part? Standard tools still work. head, grep, jq, wc—all participate in the pipeline. No special music knowledge required to inspect the data.


    Series: Personal Software (Part 4) Previous: midi-cli-rs: Custom Mood Packs Next: music-pipe-rs: Web Demo

    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

  • 447 words3 min readAbstract

    Five ML Concepts - #21

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #21
    Video

    References

    Concept Reference
    Prompt Injection Prompt Injection attack against LLM-integrated Applications (Liu et al. 2023)
    Jailbreaks Jailbroken: How Does LLM Safety Training Fail? (Wei et al. 2023)
    GRU Empirical Evaluation of Gated Recurrent Neural Networks (Chung et al. 2014)
    Planning vs Prediction Between accurate prediction and poor decision making (Zaffalon et al. 2023)
    Production Rollbacks MLOps best practice (no canonical paper)

    Today’s Five

    1. Prompt Injection

    Malicious instructions embedded in user input that override intended system behavior. An attacker crafts text that tricks an AI into ignoring its original instructions.

    This is a major security concern for LLM-integrated applications.

    Like slipping a forged instruction into a trusted document.

    2. Jailbreaks

    Techniques that attempt to bypass safety constraints in AI systems. These attacks exploit gaps between a model’s capabilities and its safety training.

    Safety training can fail due to competing objectives or mismatched generalization.

    Like convincing a guard to bend the rules.

    3. GRU (Gated Recurrent Unit)

    A recurrent neural network unit with gates that control memory flow. GRUs decide what information to keep and what to discard at each time step.

    Simpler than LSTM but designed for similar sequence modeling tasks.

    Like a notepad where you decide what to keep and what to erase.

    4. Planning vs Prediction

    Prediction forecasts likely outcomes. Planning evaluates actions across possible futures. Accurate predictions don’t guarantee good decisions—you also need to model how actions affect outcomes.

    This is a key gap in many AI/ML systems.

    Like knowing it will rain versus deciding whether to bring an umbrella.

    5. Production Rollbacks

    Reverting to a previous stable model version after deployment issues. When a new model causes problems in production, rolling back quickly minimizes impact.

    Essential MLOps practice for maintaining system reliability.

    Like reloading a saved game state when something breaks.

    Quick Reference

    Concept One-liner
    Prompt Injection Malicious instructions overriding AI behavior
    Jailbreaks Bypassing safety constraints
    GRU Gated memory for sequence modeling
    Planning vs Prediction Action evaluation vs forecasting
    Production Rollbacks Reverting to stable model versions

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1300 words7 min readAbstract

    midi-cli-rs: Extending with Custom Mood Packs

    Personal Software doesn’t stop at “it works.” It evolves. After building midi-cli-rs for AI agents to generate music, I wanted more moods—without recompiling Rust every time.

    The solution: a plugin system that lets anyone create custom mood packs using simple TOML files.

    The Problem with Built-in Only

    The original midi-cli-rs shipped with a handful of mood presets: suspense, eerie, upbeat, calm, ambient, jazz. Useful, but limited. What if you want synthwave? Chillout? Something faster or in a different key?

    Hardcoding every possible mood isn’t practical. And asking users to modify Rust source code isn’t friendly.

    Three Levels of Extensibility

      Level What You Get What You Change Skill Required
    Built-in Moods 9 curated generators Nothing—use as-is None
    Plugin Moods Parameter variations TOML config files Text editing
    Custom Generators New musical patterns Rust source code Programming (future)

    This post covers Plugin Moods—the middle tier. You can preset combinations of tempo, key, and intensity, but you’re still using the built-in generators’ musical logic. Want a “smooth-jazz” preset (slower, mellower)? Plugin mood. Want bebop or Latin jazz with different chord progressions? That requires a custom generator.

    Custom generators (writing new Rust code) will be covered in a future post when the plugin editor ships.

    The Plugin Architecture

    Custom moods live in ~/.midi-cli-rs/moods/ as TOML files. Each file is a “mood pack” that can define multiple moods. The CLI discovers them automatically.

    Here’s how it works:

    ~/.midi-cli-rs/
    └── moods/
        ├── electronic.toml    # Your synthwave, techno, etc.
        ├── cinematic.toml     # Epic, tension, wonder
        └── seasonal.toml      # Holiday themes
    

    Creating a Mood Pack

    A plug-in mood pack has two parts: pack metadata and mood definitions.

    [pack]
    name = "electronic"
    version = "1.0.0"
    author = "Your Name"
    description = "Electronic music styles"
    
    [[moods]]
    name = "synthwave"
    base_mood = "upbeat"
    default_tempo = 118
    default_key = "Am"
    default_intensity = 65
    description = "80s synthwave vibes"
    tags = ["electronic", "retro"]
    
    [[moods]]
    name = "chillout"
    base_mood = "ambient"
    default_tempo = 85
    default_key = "Em"
    default_intensity = 40
    description = "Relaxed electronic"
    

    Each mood delegates to a built-in generator (base_mood) but overrides specific parameters. You get the musical logic of the built-in mood with your customizations applied.

    Available Base Moods

    Your custom moods can extend any of the nine built-in generators:

    Base Mood Character
    suspense Tense, building
    eerie Dark, unsettling
    upbeat Energetic, positive
    calm Peaceful, slow
    ambient Atmospheric, textural
    jazz Swing, improvisation
    chiptune 8-bit, retro gaming
    orchestral Classical instruments
    show Broadway, theatrical

    Configuration Options

    Each mood definition supports these overrides:

    Field Description Example
    name CLI name (required) "synthwave"
    base_mood Built-in to extend (required) "upbeat"
    default_tempo BPM 118
    default_key Musical key "Am", "C", "Eb"
    default_intensity 0-100 energy level 65
    description Human-readable description "80s vibes"
    tags Discovery tags ["electronic", "retro"]

    How Seeds Create Variation

    Seeds aren’t random—they’re deterministic variation selectors. The same mood + same seed always produces identical output. But different seeds create observable musical differences across multiple dimensions:

    Parameter Variation Range
    Tempo ±15% from base
    Layer inclusion Which instruments appear
    Melodic contour 16 different phrase shapes
    Note density 0.6x to 1.4x
    Rest probability 0% to 35% silence
    Phrase length 3-8 notes
    Velocity -15 to +15 offset

    The system uses hash-based mixing with unique salts for each parameter. This means adjacent seeds (42 vs 43) produce completely different outputs—no gradual transitions between seeds.

    When you combine plugin moods with seed variation, you get a matrix: your custom tempo/key/intensity settings applied across different seed-driven variations of the underlying generator’s patterns.

    Using Custom Moods

    Once your TOML file is in place, the mood appears automatically:

    # List all moods (built-in + plugins)
    midi-cli-rs moods
    
    # Generate with your custom mood
    midi-cli-rs preset -m synthwave -d 5 -s 42 -o output.wav
    

    The seed system still works—same mood + same seed = identical output.

    Example: Electronic Pack

    Here’s a complete pack with four electronic moods:

    [pack]
    name = "electronic"
    version = "1.0.0"
    description = "Electronic music styles"
    
    [[moods]]
    name = "synthwave"
    base_mood = "upbeat"
    default_tempo = 118
    default_key = "Am"
    default_intensity = 65
    
    [[moods]]
    name = "chillout"
    base_mood = "ambient"
    default_tempo = 85
    default_key = "Em"
    default_intensity = 40
    
    [[moods]]
    name = "techno"
    base_mood = "upbeat"
    default_tempo = 130
    default_key = "Dm"
    default_intensity = 85
    
    [[moods]]
    name = "8bit"
    base_mood = "chiptune"
    default_tempo = 140
    default_key = "C"
    default_intensity = 70
    

    Drop this in ~/.midi-cli-rs/moods/electronic.toml and you have four new moods.

    What’s Next

    This plugin system handles mood variations—different tempos, keys, and intensities applied to existing generators. A future update will add a plugin editor for creating entirely new musical patterns without writing Rust.

    For now, the delegation model covers most use cases: want faster jazz? Darker ambient? Major-key suspense? Create a TOML file and you’re done.

    The Personal Software Pattern

    This follows the Personal Software philosophy: start with something that works, then extend it as needs emerge. The plugin system wasn’t in the original design. It grew from actual use—wanting more moods without recompiling.

    Good personal software leaves room to grow.


    Series: Personal Software (Part 3) Previous: midi-cli-rs: Music for AI Agents Next: music-pipe-rs: Unix Pipelines

    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Watch the Video

    Unmute to hear narration.

  • 456 words3 min readAbstract

    Five ML Concepts - #20

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #20
    Video

    References

    Concept Reference
    VAEs Auto-Encoding Variational Bayes (Kingma & Welling 2013)
    Uncertainty Estimation What Uncertainties Do We Need in Bayesian Deep Learning? (Kendall & Gal 2017)
    Interpretability Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez & Kim 2017)
    Gradient Noise Stochastic Gradient Descent as Approximate Bayesian Inference (Mandt et al. 2017)
    Human-in-the-Loop Human-in-the-Loop Machine Learning (Monarch 2021)

    Today’s Five

    1. Variational Autoencoders (VAEs)

    VAEs are probabilistic autoencoders that learn a structured latent distribution. By sampling from that distribution, they can generate new examples similar to the training data.

    The key innovation is regularizing the latent space to be smooth and continuous.

    Like learning not just to summarize books, but to create new ones in a similar style.

    2. Uncertainty Estimation

    Models can estimate how confident they should be in predictions. Some uncertainty comes from noisy data (aleatoric), and some from limited knowledge (epistemic).

    Knowing when a model is uncertain enables safer decision-making.

    Like a weather forecast giving seventy percent chance of rain instead of a simple yes or no.

    3. Why Interpretability Is Hard

    Neural networks represent information across many interacting parameters. No single component cleanly maps to a human concept.

    Distributed representations enable powerful learning but resist simple explanations.

    Like trying to explain a dream by pointing to individual neurons.

    4. Gradient Noise

    When training with mini-batches, gradients vary from step to step. A little noise can help exploration, but too much can slow convergence.

    Batch size, learning rate, and gradient clipping all influence this noise level.

    Like getting slightly different directions each time you ask for help.

    5. Human-in-the-Loop Systems

    Humans review, supervise, or override model decisions in critical workflows. This improves safety and accountability in high-stakes applications.

    The approach combines model efficiency with human judgment where it matters most.

    Like a pilot monitoring autopilot and stepping in when necessary.

    Quick Reference

    Concept One-liner
    VAEs Generative models with structured latent spaces
    Uncertainty Estimation Know when you don’t know
    Interpretability Distributed representations resist explanation
    Gradient Noise Mini-batch variation in training
    Human-in-the-Loop Human oversight for critical decisions

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 643 words4 min readAbstract

    In-Context Learning Revisited: From Mystery to Engineering

    It was 2020 when GPT-3 shocked everyone. It could learn from examples in the query—without updating its weights. We called it In-Context Learning. But was it magic, or was it doing something deeper?

    Resource Link
    Video ICL Revisited
    Video
    Papers 4 References

    Phase 1: The Empirical Discovery (2020)

    The GPT-3 paper showed that large models could perform few-shot learning. Give them examples, and they generalize. No gradient updates. No retraining. Just forward passes.

    The surprising part was that scaling alone seemed to unlock it.

    Paper: Language Models are Few-Shot Learners

    ELI5: Show a big language model a few examples of a task in your prompt, and it figures out how to do the task—without any retraining. Nobody told it to do this. It just emerged when models got big enough.

    Main idea: Scale unlocks emergent capabilities. ICL was discovered, not designed.

    Phase 2: Mechanistic Explanations (2022)

    By 2022, researchers began probing the internal mechanisms. Several papers proposed that transformers implement implicit meta-learning. The model appears to learn during inference by performing gradient-descent-like operations internally.

    Paper: What Explains In-Context Learning in Transformers?

    ELI5: When you give a transformer examples, its attention layers do something that looks like fitting a simple linear model to those examples—on the fly, during the forward pass. It’s not memorizing; it’s computing a mini-solution.

    Main idea: ICL works because attention can simulate linear regression internally.

    Paper: Transformers Learn In-Context by Gradient Descent

    ELI5: The transformer’s forward pass is secretly doing something similar to training. The attention mechanism acts like one step of gradient descent over the examples you provided. Learning happens inside inference.

    Main idea: ICL is implicit gradient descent—learning hidden inside prediction.

    Phase 3: Engineering the Effect

    Once researchers understood that ordering and structure affect ICL, prompt design became less of an art and more of an optimization problem. The quality and arrangement of demonstrations directly shape performance.

    ICL became tunable. Researchers could now deliberately improve it rather than just observe it.

    Phase 4: Interactive ICL (2026)

    Recent work pushes this further. Models are trained to predict natural language critiques and feedback. If a model can predict what a teacher would say, it can internalize that signal. External correction becomes an internal capability.

    Paper: Improving Interactive In-Context Learning from Natural Language Feedback

    ELI5: Train a model to guess what feedback a human would give. Now the model has internalized the “teacher” and can improve itself without needing the actual teacher present. Self-correction without weight updates.

    Main idea: Models can learn to learn from feedback, making ICL interactive and self-improving.

    Beyond Language

    Newer work applies ICL to neuroscience discovery, showing that the mechanism is not limited to text tasks. It becomes a flexible reasoning substrate across domains. That’s when you know a concept has matured.

    The Arc

    Phase Era Key Insight
    Discovery 2020 Emerges from scale
    Explanation 2022 Implicit gradient descent
    Engineering 2023-24 Prompt design as optimization
    Self-improvement 2026 Learning from feedback

    The Deeper Insight

    In-Context Learning started as an emergent surprise. Now it’s becoming an engineered learning substrate inside transformers.

    It was not magic. It was meta-learning hiding in plain sight.

    References

    Paper Link
    Language Models are Few-Shot Learners (GPT-3) arXiv:2005.14165
    What Explains In-Context Learning in Transformers? arXiv:2202.12837
    Transformers Learn In-Context by Gradient Descent arXiv:2212.07677
    Improving Interactive ICL from Natural Language Feedback arXiv:2602.16066

    ICL started as “whoa, it works.” Now we understand “why it works.” Next: engineering it deliberately.

    Watch the Video

    Unmute to hear narration.

  • 451 words3 min readAbstract

    Five ML Concepts - #19

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #19
    Video

    References

    Concept Reference
    Autoencoders Reducing the Dimensionality of Data with Neural Networks (Hinton & Salakhutdinov 2006)
    Correlation vs Causation Causality (Pearl 2009)
    Curriculum Learning Curriculum Learning (Bengio et al. 2009)
    Failure Analysis Practical Machine Learning for Computer Vision (Lakshmanan et al. 2021)
    Covariate Shift Dataset Shift in Machine Learning (Quinonero-Candela et al. 2009)

    Today’s Five

    1. Autoencoders

    Autoencoders are neural networks trained to compress inputs into a smaller representation and reconstruct them. The bottleneck forces the model to capture essential structure.

    This learned compression is useful for dimensionality reduction, denoising, and feature learning.

    Like summarizing a book into key points and then rebuilding the story from that summary.

    2. Correlation vs Causation

    Two variables can move together without one causing the other. Models typically learn correlations present in data, not true cause-and-effect relationships.

    This matters because interventions based on correlation alone may not produce intended effects.

    Like noticing umbrella sales rise with rain—umbrellas don’t cause rain.

    3. Curriculum Learning

    Training starts with easier examples and gradually introduces harder ones. This can improve stability and learning speed in some settings.

    The approach mirrors how humans learn complex subjects incrementally.

    Like teaching math by starting with addition before moving to calculus.

    4. Failure Analysis

    Failure analysis groups model errors into categories to understand where performance breaks down. This helps target improvements instead of guessing.

    Systematic error analysis often reveals actionable patterns invisible in aggregate metrics.

    Like a teacher reviewing which types of questions students miss most often.

    5. Covariate Shift

    Covariate shift occurs when the input distribution changes between training and deployment, while the task itself remains the same. The model may underperform because it sees unfamiliar inputs.

    Monitoring input distributions helps detect this shift early.

    Like training a driver in sunny weather and testing them in snow.

    Quick Reference

    Concept One-liner
    Autoencoders Compress and reconstruct to learn structure
    Correlation vs Causation Co-occurrence isn’t cause
    Curriculum Learning Start easy, progress to hard
    Failure Analysis Categorize errors to guide fixes
    Covariate Shift New inputs, same task

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 2241 words12 min readAbstract

    JSON et al: A Deep Dive into Data Serialization Formats

    JSON is everywhere. APIs. Logs. Databases. Configuration files. But it’s not alone. A whole ecosystem of formats exists—each optimizing for different tradeoffs.

    This post expands on the JSON et al short, providing technical depth on each format: when it was created, where it’s specified, and what problems it solves.


    The Tradeoff Triangle

    Before diving in, understand the fundamental constraint. Data formats balance three competing goals:

    Goal Description
    Human Readability Can a developer read and edit it directly?
    Compactness How many bytes does it take to represent data?
    Query Performance How fast can you access specific fields?

    You usually only get two. JSON optimizes readability. Protobuf optimizes compactness. JSONB optimizes query performance. No format wins everywhere.


    JSON: The Ubiquitous Baseline

    Created: 2001 (discovered/formalized by Douglas Crockford) Specification: ECMA-404 (2013), RFC 8259 (2017) File Extension: .json

    JSON (JavaScript Object Notation) emerged from JavaScript’s object literal syntax but became language-agnostic. Crockford didn’t invent it—he “discovered” it already existing in JavaScript and formalized the specification.

    Technical Details

    • Encoding: UTF-8 text (UTF-16/32 allowed but rare)
    • Data Types: Objects {}, arrays [], strings, numbers, booleans, null
    • Schema: None required
    • Comments: Not allowed in strict JSON

    Strengths

    • Universal parser support (every language has one)
    • Human readable without tools
    • Web-native (JavaScript parses it natively)
    • Simple specification (fits on a business card)

    Weaknesses

    • Verbose (field names repeated for every object)
    • No binary data type (must base64-encode)
    • No comments (frustrating for config files)
    • Parsing overhead (tokenization + string decoding every time)

    ELI5

    Like typing a long email instead of sending a terse text. Every message spells everything out—clear, but verbose.

    When to Use

    REST APIs, configuration (when comments aren’t needed), data interchange between systems, anywhere human readability matters more than efficiency.


    JSONL / NDJSON: Streaming JSON

    Created: ~2013 (formalized) Specification: JSON Lines, NDJSON File Extension: .jsonl, .ndjson

    JSONL (JSON Lines) and NDJSON (Newline-Delimited JSON) are the same concept: one valid JSON object per line, separated by newlines.

    Technical Details

    {"name": "Alice", "score": 95}
    {"name": "Bob", "score": 87}
    {"name": "Carol", "score": 92}
    

    No wrapping array. Each line is independently parseable.

    Strengths

    • Streaming: Process line-by-line without loading entire file
    • Append-only: Add records without rewriting the file
    • Parallel processing: Split by line, distribute to workers
    • Fault-tolerant: One corrupt line doesn’t invalidate the file

    Weaknesses

    • Not valid JSON (can’t parse with standard JSON parser)
    • Still text-based (same verbosity as JSON)
    • No random access by index

    ELI5

    Like removing one comma per line to save some typing. Each line is self-contained, so you can grab and process them one at a time.

    When to Use

    Log files, big data pipelines (Spark, Pandas), ML datasets, event streams, anywhere you need to process records incrementally.


    JSONB: Binary JSON for Databases

    Created: 2014 (PostgreSQL 9.4) Specification: Implementation-specific (no universal standard) Storage: Database column type

    JSONB isn’t a file format—it’s a database storage optimization. PostgreSQL’s JSONB differs from MongoDB’s BSON, which differs from other implementations.

    PostgreSQL JSONB Details

    • Parsed once: Text converted to binary on INSERT
    • Keys sorted: Deterministic ordering for indexing
    • Duplicates removed: Last value wins
    • Offset table: O(log n) field lookup instead of O(n) text scanning

    MongoDB BSON

    Specification: bsonspec.org

    BSON (Binary JSON) is MongoDB’s serialization format. Unlike PostgreSQL’s JSONB, BSON is a standalone binary format:

    • Type-prefixed values
    • Supports additional types (Date, Binary, ObjectId)
    • Length-prefixed for fast skipping
    • ~10-15% smaller than JSON typically

    Strengths

    • Fast queries without re-parsing
    • Indexable (GIN indexes on JSONB in PostgreSQL)
    • Type coercion at storage time

    Weaknesses

    • Not portable (implementation-specific)
    • Not human-readable
    • INSERT overhead (parsing cost upfront)

    ELI5

    Instead of cooking from scratch every time, you heat a pre-made meal. The prep work happens once (on INSERT), so serving (queries) is fast.

    When to Use

    Database storage where you query into JSON structures. PostgreSQL JSONB + GIN indexes enable fast @> containment queries.


    Protocol Buffers: Google’s Schema-First Format

    Created: 2001 (internal Google), 2008 (open-sourced) Specification: developers.google.com/protocol-buffers File Extension: .proto (schema), binary wire format

    Protocol Buffers (Protobuf) is Google’s language-neutral, schema-required serialization format. It powers gRPC.

    Technical Details

    Schema definition:

    message Sensor {
      int32 temperature = 1;
      int32 humidity = 2;
    }
    

    Wire format uses field numbers, not names:

    Field 1: 72
    Field 2: 40
    

    Key Features

    • Varint encoding: Small integers use fewer bytes
    • Field numbers: Enable backward compatibility
    • Code generation: .proto → language-specific classes
    • No self-description: Receiver must know schema

    Strengths

    • Extremely compact (3-10x smaller than JSON typically)
    • Fast serialization/deserialization
    • Strong versioning semantics
    • gRPC integration

    Weaknesses

    • Requires schema agreement
    • Not human-readable
    • Tooling required for debugging
    • Schema evolution has rules

    ELI5

    Everyone agrees upfront what “field 1” means. You don’t waste space spelling out “temperature”—you just send the number 1 and the value. Both sides know the code.

    When to Use

    Microservices (gRPC), internal APIs, anywhere bandwidth and latency matter more than debuggability.


    ASN.1: The Telecom Veteran

    Created: 1984 (ITU-T X.208) Specification: ITU-T X.680-X.683 Encoding Rules: BER, DER, PER, XER, and more

    ASN.1 (Abstract Syntax Notation One) predates all modern formats. It defines both schema and encoding, with multiple encoding rules for different use cases.

    Encoding Rules Comparison

    Rule Use Case
    BER (Basic Encoding Rules) Flexible, general purpose
    DER (Distinguished Encoding Rules) Deterministic, for cryptography
    PER (Packed Encoding Rules) Most compact, for bandwidth-constrained
    XER (XML Encoding Rules) XML-based, for interop

    Where You See ASN.1

    • X.509 certificates (SSL/TLS certs are DER-encoded ASN.1)
    • LDAP (directory services)
    • SNMP (network management)
    • Telecom protocols (SS7, GSM, LTE)

    Strengths

    • Bit-level precision
    • Proven over 40 years
    • Multiple encoding options
    • Formal verification possible

    Weaknesses

    • Complex specification
    • Steep learning curve
    • Tooling can be expensive
    • Security vulnerabilities in parsers (historically)

    ELI5

    Same idea as Protobuf—everyone agrees upfront what each field number means. ASN.1 just got there 20 years earlier and handles even more edge cases.

    When to Use

    You probably won’t choose ASN.1 for new projects. You’ll encounter it in cryptography, certificates, and legacy telecom systems.


    YAML: Human-Friendly Configuration

    Created: 2001 (Clark Evans, Ingy döt Net, Oren Ben-Kiki) Specification: yaml.org/spec/1.2.2 File Extension: .yaml, .yml

    YAML (YAML Ain’t Markup Language) prioritizes human readability. It’s a superset of JSON—any valid JSON is valid YAML.

    Technical Details

    # Comments allowed!
    server:
      host: localhost
      port: 8080
      features:
        - auth
        - logging
    

    Key Features

    • Indentation-based: Whitespace matters
    • Comments: # for single-line
    • Anchors/aliases: &name and *name for references
    • Multiple documents: --- separator

    Strengths

    • Highly readable
    • Comments supported
    • Multi-line strings without escaping
    • Complex data structures

    Weaknesses

    • “Norway problem”: NO parses as boolean false
    • Whitespace sensitivity causes errors
    • Multiple ways to express same data
    • Security concerns (arbitrary code execution in some parsers)

    ELI5

    Optimized for clarity, not bandwidth. YAML is for humans editing config files—not for machines exchanging data over networks.

    When to Use

    Configuration files (Kubernetes, Docker Compose, CI/CD), anywhere humans edit data directly and comments help.


    TOML: Minimal Configuration

    Created: 2013 (Tom Preston-Werner) Specification: toml.io File Extension: .toml

    TOML (Tom’s Obvious Minimal Language) emerged as a reaction to YAML’s complexity. It’s used by Rust (Cargo.toml), Python (pyproject.toml), and others.

    Technical Details

    [server]
    host = "localhost"
    port = 8080
    
    [server.features]
    auth = true
    logging = true
    

    Key Features

    • Explicit typing: Dates, times, arrays have clear syntax
    • Sections: [section] and [section.subsection]
    • No anchors: Intentionally simpler than YAML
    • Deterministic: Same data = same representation

    Strengths

    • Easy to read and write
    • Unambiguous parsing
    • Clear error messages
    • Growing ecosystem support

    Weaknesses

    • Less expressive than YAML
    • Nested structures can be verbose
    • Smaller ecosystem than JSON/YAML

    ELI5

    Same goal as YAML—clarity for humans, not bandwidth for machines—but with stricter rules so you make fewer mistakes.

    When to Use

    Configuration files where YAML’s complexity isn’t needed. Rust projects (mandatory). Python packaging (pyproject.toml).


    TOON: Token-Optimized for LLMs

    Created: October 2025 (toon-format organization) Specification: github.com/toon-format/toon (v3.0) File Extension: .toon Media Type: text/toon (provisional)

    TOON (Token Oriented Object Notation) is the newest format in this list, designed specifically for LLM input. It’s a lossless representation of JSON that minimizes tokens.

    Technical Details

    TOON combines YAML-style indentation for nested objects with CSV-like tabular layouts for uniform arrays:

    users[2]{name,age}:
    Alice,25
    Bob,30
    

    Equivalent JSON:

    {"users": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]}
    

    Key Features

    • Header-based: Field names declared once, values follow
    • 40% fewer tokens: Than equivalent JSON typically
    • Lossless: Round-trips to JSON perfectly
    • UTF-8 always: No encoding ambiguity

    Performance

    Metric JSON TOON
    Accuracy 69.7% 73.9%
    Efficiency (acc/1K tokens) 15.3 26.9

    Strengths

    • Significant token savings at scale
    • Better context window utilization
    • Lower API costs for LLM applications
    • Human-readable (unlike binary formats)

    Weaknesses

    • New format (October 2025)
    • Limited tooling compared to JSON
    • Requires conversion layer for existing systems
    • Not yet widely adopted

    ELI5

    Like having one header row for each column in a table instead of repeating the column name for every single row. You declare field names once, then just list the values.

    When to Use

    LLM prompts with structured data, RAG applications, anywhere token efficiency matters. Especially useful for large datasets with uniform object arrays.

    Implementations

    • TypeScript: Reference implementation
    • Python: toons (Rust-based, fast)
    • Go, Rust, .NET: Available via toon-format org

    Alternatives Not in the Video

    MessagePack

    Created: 2008 (Sadayuki Furuhashi) Specification: msgpack.org

    Binary JSON without schema. Type-prefixed values, efficient numeric encoding.

    Use when: You want JSON semantics but smaller/faster.

    CBOR

    Created: 2013 (IETF) Specification: RFC 8949

    Concise Binary Object Representation. Designed for constrained environments (IoT).

    Use when: Resource-constrained devices, need a standard binary format.

    Apache Avro

    Created: 2009 (Apache, Doug Cutting) Specification: avro.apache.org

    Schema-based, row-oriented binary format. Schema embedded or stored separately. Strong schema evolution support.

    Use when: Big data pipelines (Hadoop, Kafka), schema evolution is critical.

    Apache Parquet

    Created: 2013 (Twitter + Cloudera) Specification: parquet.apache.org

    Columnar storage format. Not for serialization—for analytics storage.

    Use when: Large-scale analytics, data warehousing, Spark/Pandas workflows.

    Cap’n Proto

    Created: 2013 (Kenton Varda, ex-Protobuf author) Specification: capnproto.org

    Zero-copy serialization. The serialized form is the in-memory form.

    Use when: Extreme performance requirements, inter-process communication.

    FlatBuffers

    Created: 2014 (Google) Specification: google.github.io/flatbuffers

    Zero-copy like Cap’n Proto but with better tooling. Used in games, mobile.

    Use when: Games, mobile apps, anywhere memory allocation matters.


    Quick Reference

    Format Year Schema Binary Human-Readable Best For
    JSON 2001 No No Yes APIs, interchange
    JSONL 2013 No No Yes Logs, streaming
    JSONB 2014 No Yes No Database queries
    Protobuf 2008 Yes Yes No Microservices
    ASN.1 1984 Yes Yes No Crypto, telecom
    YAML 2001 No No Yes Config files
    TOML 2013 No No Yes Simple config
    TOON 2025 No No Yes LLM prompts
    MessagePack 2008 No Yes No Fast JSON
    CBOR 2013 Optional Yes No IoT
    Avro 2009 Yes Yes No Big data

    Key Takeaways

    1. No “best” format exists. Each optimizes for different constraints.

    2. Text formats favor humans. JSON, YAML, TOML prioritize readability over efficiency.

    3. Binary formats favor machines. Protobuf, MessagePack, CBOR prioritize compactness and speed.

    4. Schema formats favor correctness. Protobuf, Avro, ASN.1 catch errors at compile time.

    5. The tradeoff triangle is real. Readability, compactness, query performance—pick two.

    The question isn’t “which format wins?” The question is: what problem are you solving?


    Resources


    Data formats are design decisions. Choose based on your constraints, not trends.

    Questions? Find me on YouTube @SoftwareWrighter.

    Watch the Video

    Unmute to hear narration.

  • 444 words3 min readAbstract

    Five ML Concepts - #18

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #18
    Video

    References

    Concept Reference
    Preference Learning Learning to summarize from human feedback (Stiennon et al. 2020)
    Ensembling Ensemble Methods in Machine Learning (Dietterich 2000)
    ML Fragility Distribution Shift (Quinonero-Candela et al. 2009)
    Epoch Deep Learning (Goodfellow et al. 2016), Chapter 8
    Cost vs Quality Efficient Transformers: A Survey (Tay et al. 2022)

    Today’s Five

    1. Preference Learning

    Instead of learning from fixed labels, models are trained from comparisons between outputs. This helps align model behavior with human judgments.

    The approach works well when absolute quality is hard to define but relative preferences are easier to express.

    Like learning to cook by asking which dish tastes better.

    2. Ensembling

    Ensembling combines predictions from multiple models. Different models make different errors, and combining them can improve robustness.

    Common strategies include voting, averaging, and stacking models together.

    Like asking several experts and averaging their opinions.

    3. Why ML Is Fragile

    Models rely on statistical patterns learned from data. When those patterns shift, performance can degrade quickly.

    This fragility emerges because models optimize for training distributions, not arbitrary future scenarios.

    Like a spell checker that works on common words but struggles with unusual ones.

    4. Epoch

    An epoch is one complete pass through the training dataset. Multiple epochs allow the model to refine its weights over repeated passes.

    Training typically continues for many epochs until validation performance stops improving.

    Like reading a textbook from beginning to end more than once.

    5. Cost vs Quality Tradeoffs

    Increasing model size or compute often improves performance, but also increases cost and latency. Engineers balance quality against budget and responsiveness.

    Production systems often use smaller, faster models rather than the largest available.

    Like choosing between a luxury car and an economy car depending on your needs.

    Quick Reference

    Concept One-liner
    Preference Learning Train from comparisons, not labels
    Ensembling Combine models for robustness
    ML Fragility Statistical models break on distribution shift
    Epoch One pass through training data
    Cost vs Quality Bigger isn’t always better in production

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1063 words6 min readAbstract

    midi-cli-rs: Music Generation for AI Coding Agents

    AI coding agents can write code, generate images, and produce text. But what about music? When I needed background audio for explainer videos, I wanted a tool that AI agents could use directly—no music theory required.

    Resource Link
    Video midi-cli-rs Explainer
    Video
    Examples Listen to Samples
    Code midi-cli-rs

    The Problem

    Generating music programmatically is hard. Traditional approaches require understanding music theory, MIDI specifications, instrument mappings, and audio synthesis. That’s a lot to ask of an AI agent that just needs a 5-second intro.

    I wanted something simpler: a CLI tool where an agent could say “give me 5 seconds of suspenseful music” and get a usable WAV file.

    The Solution: Mood Presets

    midi-cli-rs solves this with mood presets—curated musical generators that produce complete compositions from a single command:

    # Generate a 5-second suspenseful intro
    midi-cli-rs preset --mood suspense --duration 5 -o intro.wav
    
    # Upbeat outro with specific key
    midi-cli-rs preset -m upbeat -d 7 --key C --seed 42 -o outro.wav
    

    Six moods are available:

    Mood Character
    suspense Low drones, tremolo strings, tension
    eerie Sparse tones, diminished harmony
    upbeat Rhythmic chords, energetic
    calm Warm pads, gentle arpeggios
    ambient Textural drones, pentatonic bells
    jazz Walking bass, brushed drums, piano trio

    Each mood generates multi-layer compositions with appropriate instruments, rhythms, and harmonies. The --seed parameter ensures reproducibility—same seed, same output. Different seeds produce meaningful variations in melody contour, rhythm patterns, and instrument choices.

    Melodic Variation

    The presets don’t just randomize notes—they use a contour-based variation system. Changing the seed produces melodies that follow different shapes (ascending, descending, arch, wave) while staying musically coherent. This means you can generate multiple versions of a mood and pick the one that fits best.

    How It Works

    The tool generates MIDI programmatically, then renders to WAV using FluidSynth:

    Mood Preset → MIDI Generation → FluidSynth → WAV Output
    

    MIDI generation uses the midly crate to create standard MIDI files. Each preset generates multiple tracks with different instruments, note patterns, and dynamics.

    Audio rendering calls FluidSynth as a subprocess with a SoundFont (instrument samples). This avoids LGPL licensing complications—subprocess execution doesn’t trigger copyleft.

    Note-Level Control

    When presets aren’t enough, you can specify exact notes:

    # Note format: PITCH:DURATION:VELOCITY[@OFFSET]
    midi-cli-rs generate \
        --notes "C4:0.5:80@0,E4:0.5:80@0.5,G4:0.5:80@1,C5:1:90@1.5" \
        -i piano -t 120 -o arpeggio.wav
    

    Or use JSON for complex multi-track arrangements:

    echo '{"tempo":90,"instrument":"piano","notes":[
      {"pitch":"C4","duration":0.5,"velocity":80,"offset":0},
      {"pitch":"E4","duration":0.5,"velocity":80,"offset":0.5},
      {"pitch":"G4","duration":1,"velocity":90,"offset":1}
    ]}' | midi-cli-rs generate --json -o output.wav
    

    Web UI

    For interactive composition, there’s a browser-based interface:

    midi-cli-rs serve  # Starts on http://127.0.0.1:3105
    

    The Presets tab lets you adjust mood, key, duration, intensity, and tempo with immediate audio preview. Click the clock button to generate a time-based seed for unique but reproducible results.

    The Melodies tab provides note-by-note composition with keyboard shortcuts:

    • a-g for note pitch
    • [ / ] to adjust duration
    • + / - to change octave
    • Tab to navigate between notes

    For AI Agents

    The CLI is designed for AI agent usage:

    1. Simple commands: One line generates complete audio
    2. Reproducible: Seed values ensure consistent output
    3. Self-documenting: --help includes agent-specific instructions
    4. Composable: Generate tracks separately, combine with ffmpeg
    # AI agent workflow
    midi-cli-rs preset -m suspense -d 5 --seed 1 -o intro.wav
    midi-cli-rs preset -m upbeat -d 10 --seed 2 -o main.wav
    ffmpeg -i intro.wav -i main.wav -filter_complex concat=n=2:v=0:a=1 final.wav
    

    SoundFont Quality Matters

    The quality of generated audio depends heavily on the SoundFont used. SoundFonts are collections of audio samples for each instrument—a tiny SoundFont with compressed samples will sound thin and artificial, while a larger one with high-quality recordings produces professional results.

    SoundFont Size Quality License
    TimGM6mb ~6MB Basic GPL v2
    GeneralUser GS ~30MB Good Permissive
    FluidR3_GM ~140MB Very Good MIT
    MuseScore_General ~200MB Excellent MIT

    For anything beyond quick prototypes, use a quality SoundFont. The difference is dramatic—the same MIDI file can sound like a toy keyboard or a real instrument depending on the samples.

    The tool auto-detects SoundFonts in common locations (~/.soundfonts/, /opt/homebrew/share/soundfonts/, etc.), or specify one explicitly with --soundfont.

    Technical Details

    Built with Rust 2024 edition using permissively licensed dependencies:

    Crate Purpose
    midly MIDI file generation
    clap CLI argument parsing
    serde JSON serialization
    rand Randomization for presets
    axum Web server (for serve command)

    FluidSynth is called as a subprocess for WAV rendering, keeping the main codebase MIT-licensed.

    Try It

    Listen to sample outputs, or build locally:

    git clone https://github.com/softwarewrighter/midi-cli-rs.git
    cd midi-cli-rs
    cargo build --release
    ./target/release/midi-cli-rs preset -m jazz -d 5 -o jazz.wav
    

    Requires FluidSynth for WAV output (brew install fluid-synth on macOS).


    Series: Personal Software (Part 2) Previous: cat-finder: Local ML in Rust Next: midi-cli-rs: Custom Mood Packs

    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Watch the Video

    Unmute to hear narration.

  • 472 words3 min readAbstract

    Five ML Concepts - #17

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #17
    Video

    References

    Concept Reference
    Benchmark Leakage Rethinking the Inception Architecture for Computer Vision (Szegedy et al. 2016)
    Concept/Data Drift Learning under Concept Drift: A Review (Lu et al. 2018)
    Weight Decay Decoupled Weight Decay Regularization (Loshchilov & Hutter 2019)
    Scaling Laws Scaling Laws for Neural Language Models (Kaplan et al. 2020)
    Shadow Deployment Reliable Machine Learning (Cathy Chen et al. 2022)

    Today’s Five

    1. Benchmark Leakage

    When benchmark or test data influences training, tuning, or model selection, evaluation results become unreliable. This inflates reported performance beyond real-world capability.

    Strict separation between development and evaluation data is essential for honest assessment.

    Like practicing with the exact questions that will appear on the final exam.

    2. Concept Drift vs Data Drift

    Data drift occurs when input distributions change. Concept drift occurs when the relationship between inputs and outputs changes. Both can degrade model performance over time.

    Data drift: customers buy different products. Concept drift: what “good” means has changed.

    Like customers buying different products versus products changing what they mean.

    3. Weight Decay

    A regularization method that penalizes large weights, often implemented as L2 regularization. This encourages simpler models that generalize better.

    Weight decay adds a term proportional to the squared magnitude of weights to the loss function.

    Like encouraging shorter, simpler answers instead of overly complicated ones.

    4. Scaling Laws

    Empirical relationships showing how performance tends to improve as model size, data, or compute increase. These relationships follow predictable power-law curves.

    Scaling laws help predict resource requirements for target performance levels.

    Like noticing that adding horsepower often increases a car’s speed, but with diminishing returns.

    5. Shadow Deployment

    Running a new model in parallel with production without affecting live user decisions. The shadow model processes real traffic but its outputs are only logged, not served.

    This allows safe evaluation before full deployment.

    Like a new chef preparing the same dishes in the back kitchen before serving customers.

    Quick Reference

    Concept One-liner
    Benchmark Leakage Test data contaminating training/selection
    Concept vs Data Drift Changed relationships vs changed inputs
    Weight Decay L2 penalty discourages large weights
    Scaling Laws Performance scales predictably with resources
    Shadow Deployment Test safely alongside production

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1069 words6 min readAbstract

    TBT (4/?): ToonTalk - Teaching Robots to Program

    I first discovered ToonTalk during the Windows XP era—probably around 2003 or 2004. It was unlike anything I’d seen: a programming environment disguised as a video game where you trained robots by showing them what to do. The concept stuck with me for two decades.

    Resource Link
    Video ToonTalk in Rust
    Video
    tt-rs Demo Live Demo
    tt-rs Repo tt-rs

    What is ToonTalk?

    ToonTalk is a visual programming environment created by Ken Kahn in 1995. The “Toon” stands for cartoon—every abstract programming concept is mapped to a concrete, animated metaphor:

    Concept ToonTalk Metaphor
    Variables Boxes with numbered holes
    Values Numbers, text, images in boxes
    Comparison Scales that tip when values differ
    Functions Robots that watch and learn
    Message passing Birds that carry items to nests
    Garbage collection Trucks that haul away unused items

    The design was influenced by games like The Legend of Zelda and Robot Odyssey—the kind of games that made you think while you played.

    Programming by Demonstration

    The core idea is radical: you don’t write code, you show a robot what to do.

    1. Create a robot and put it in “training mode”
    2. Perform actions while the robot watches (move items, compare values, etc.)
    3. The robot records your actions as a program
    4. Give the robot a box matching the training pattern—it executes the learned behavior

    This is programming by demonstration. The robot generalizes from your example, matching patterns and applying transformations. It’s the same conceptual model as teaching a child: “Watch what I do, then you try.”

    Three Generations

    ToonTalk has existed in three forms:

    Version Era Technology
    Original ToonTalk 1995-2009 C++, 3D desktop application
    ToonTalk Reborn 2014-2017 JavaScript/jQuery web app
    tt-rs 2025-2026 Rust/WebAssembly/Yew

    The original was a full 3D world—cities, houses, helicopters, even bombs for debugging. Ken Kahn later created ToonTalk Reborn, a simplified JavaScript version that runs in browsers.

    Why I Built tt-rs

    When I rediscovered ToonTalk Reborn a few years ago, I wanted to experiment with the concepts myself. But diving into a large jQuery codebase wasn’t appealing. So I did what any reasonable person would do: I vibe coded my own version in Rust.

    tt-rs is a modern reimplementation using:

    • Rust for core logic
    • WebAssembly for browser execution
    • Yew for reactive UI
    • SVG/CSS for graphics and animations

    It’s not a port—it’s a fresh implementation inspired by the same ideas. Building it myself lets me understand the concepts deeply and experiment with variations.

    Three Learning Levels

    The demo introduces concepts progressively through three levels:

    Level Concepts Widgets
    tt1 Basics Numbers, boxes, scales, wand, vacuum
    tt2 Messaging Birds and nests for communication
    tt3 Automation Sensors (time, random) + robots

    Level one covers the fundamentals: numbers with arithmetic, boxes as containers, scales for comparison, and tools for copying and removing. Level two adds asynchronous messaging—birds carry items to their paired nests. Level three brings sensors that produce values and robots that automate actions.

    Current Features

    The live demo includes:

    Widgets:

    • Numbers: Rational arithmetic with +, -, *, / operators
    • Boxes: Configurable containers with 0-9 holes (resize with keyboard)
    • Text: Basic text display
    • Scales: Visual comparison that tips when values differ
    • Robot: Training mode, action recording, execution
    • Bird/Nest: Message passing with pairing and delivery
    • Sensors: Time (milliseconds) and random number generation

    Tools:

    • Wand: Copy any widget
    • Vacuum: Remove widgets
    • Magnifier: Inspect nest message queues and robot actions

    Interactions:

    • Drag-and-drop with visual feedback
    • Box joining (drop box on edge of another)
    • Box splitting (drop box on a number)
    • Contextual help panel with level-specific content
    • Puzzle system with animated “Show Me” demos

    Robot Training

    The core feature is programming by demonstration:

    1. Click robot to enter training mode (yellow glow indicates “I’m watching”)
    2. Perform actions while the robot records (arithmetic, copy, remove, move to box)
    3. Click robot again to stop training
    4. Click robot to replay—it executes the recorded sequence

    The tutorials demonstrate this workflow step by step. In the “Train Robot” tutorial, you teach a robot to move a number into a box. In “Robot Sensors,” you train a robot to generate random numbers, apply modulo, and send results to a nest via a bird.

    Interactive Tutorials

    Each tutorial has two parts:

    1. Show Me: Watch an animated demonstration where a cursor walks through the solution
    2. Practice: Try it yourself with the same widgets

    The tutorials cover:

    • Fill a box with numbers
    • Add numbers together
    • Copy widgets with the wand
    • Send messages with birds and nests
    • Train your first robot
    • Combine robots with sensors

    What’s Next

    The immediate priorities:

    1. Pattern matching - Robot generalizes from specific values to “any number”
    2. Watched execution - See robot work step-by-step with animated cursor
    3. Persistence - Save and load workspaces

    Long term, I’d like to add the 3D elements from the original—the cities, the houses, the helicopter view. But that’s a much larger project.

    The Enduring Appeal

    What makes ToonTalk fascinating isn’t just the visual metaphors—it’s the computational model. Under the hood, ToonTalk implements concurrent constraint logic programming. The robots are essentially guarded Horn clauses. The birds and nests implement the actor model.

    Heavy concepts, but you don’t need to know any of that to use it. You just train robots by example. The abstraction is complete.

    That’s why it stuck with me for twenty years. Good abstractions are rare. When you find one, it’s worth understanding deeply.

    References

    Resource Link
    ToonTalk Website toontalk.com
    ToonTalk on Wikipedia Wikipedia
    ToonTalk Reborn (JS) github.com/ToonTalk/ToonTalk
    ToonTalk Reborn Demo toontalk.github.io/ToonTalk
    ToonTalk Reborn Wiki Wiki
    Ken Kahn’s Page Ken Kahn
    Original Paper (1995) ERIC - ToonTalk: An Animated Programming Environment
    Ken Kahn’s Research Academia.edu

    Some ideas are worth rediscovering. ToonTalk is one of them.

    Watch the Video

    Unmute to hear narration.

  • 468 words3 min readAbstract

    Five ML Concepts - #16

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #16
    Video

    References

    Concept Reference
    Train/Val/Test Split Deep Learning (Goodfellow et al. 2016), Chapter 5
    Overconfidence On Calibration of Modern Neural Networks (Guo et al. 2017)
    Batch Normalization Batch Normalization: Accelerating Deep Network Training (Ioffe & Szegedy 2015)
    Optimization vs Generalization Understanding Deep Learning Requires Rethinking Generalization (Zhang et al. 2017)
    A/B Testing Controlled Experiments on the Web (Kohavi et al. 2009)

    Today’s Five

    1. Train / Validation / Test Split

    Data is divided into training, validation, and test sets. Training learns patterns, validation tunes hyperparameters, test evaluates final performance.

    Never use test data for any decisions during development—it should only be touched once.

    Like practicing on homework, checking with practice tests, then taking the real exam.

    2. Overconfidence

    Models can assign very high probabilities to incorrect predictions. This is often related to poor calibration and can be dangerous in high-stakes applications.

    Temperature scaling and other calibration methods can help align confidence with accuracy.

    Like a student who is absolutely certain of a wrong answer.

    3. Batch Normalization

    Normalizes layer activations during training to improve stability and convergence. Each mini-batch’s activations are normalized to have zero mean and unit variance.

    This reduces internal covariate shift and often allows higher learning rates.

    Like keeping everyone on a similar pace during training so no one runs too far ahead.

    4. Optimization vs Generalization

    Training loss can decrease while test performance does not improve. Good optimization does not guarantee good generalization.

    A model can perfectly fit training data while failing on new examples—this is overfitting.

    Like memorizing last year’s exam instead of understanding the subject.

    5. A/B Testing Models

    Comparing two model versions using controlled live traffic experiments. Users are randomly assigned to see predictions from model A or model B.

    Statistical analysis determines which model performs better on real-world metrics.

    Like taste-testing two recipes with real customers to see which works better.

    Quick Reference

    Concept One-liner
    Train/Val/Test Separate data for learning, tuning, and evaluation
    Overconfidence High probability on wrong predictions
    Batch Normalization Normalize activations for stable training
    Optimization vs Generalization Low train loss ≠ good test performance
    A/B Testing Compare models with live experiments

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 796 words4 min readAbstract

    Multi-Hop Reasoning (2/2): The Distribution Trap

    In Part 1, a tiny 135M model achieved 75% accuracy on multi-hop reasoning. This time we scale up to 360M—and discover that RSFT on easy examples makes performance worse.

    Resource Link
    Paper KG-Guided RAG (arXiv)
    Code multi-hop-reasoning
    ELI5 eli5.md
    Demo Live Demo
    Explainer Coming soon

    Scaling Up: SmolLM-360M

    Part 1 used the 135M model. For better reasoning traces and demo quality, we trained the 360M variant:

    Model Parameters Platform
    SmolLM-135M-Instruct 135M MLX (macOS)
    SmolLM-360M-Instruct 360M MLX + Unsloth (cross-platform)

    The 360M model produces more coherent traces and is used by the live inference demo.

    The Distribution Trap

    Here’s what happened when we trained RSFT on the “easy” training data:

    Phase Training Data Accuracy Notes
    Base 0% No format compliance
    SFT (500 iters) Easy (1-3 hop) 37% Learns TRACE + ANSWER format
    RSFT Easy (1-3 hop) 27% Worse than SFT!

    RSFT on easy examples performed worse than the SFT baseline.

    Why?

    The training examples (1-3 hops) don’t match the evaluation distribution (4-5 hops). The model learns shortcuts that work on easy problems but fail on hard ones.

    Training Distribution Eval Distribution Result
    Easy (1-3 hop) Hard (4-5 hop) 27% (worse)
    Hard (4-5 hop) Hard (4-5 hop) 75% (Part 1 result)

    The rejection sampling “winners” from easy examples teach strategies that don’t generalize.

    The Key Finding

    Rejection sampling must match your target distribution.

    This is counterintuitive. You might expect that training on more examples (even easy ones) would help. Instead:

    • Easy winners use shortcuts (fewer reasoning steps)
    • Hard eval requires full chain reasoning
    • Model learns the wrong patterns

    The fix: train RSFT on eval.jsonl (hard examples), not train.jsonl (easy examples).

    Demo Improvements

    The demo now includes four interactive tabs:

    Tab Feature
    Training Animated SFT→RSFT visualization with KG scoring
    Inference Pre-recorded inference examples
    Try It Live inference with 360M model
    Distribution Interactive visualization of the key finding

    Try It: Live Inference

    Ask DevOps troubleshooting questions and watch the model reason:

    Question: What causes TLSHandshakeError?
    
    TRACE: TLSHandshakeError is caused by ClockSkew,
    and ClockSkew leads to CertificateExpired,
    and CertificateExpired is fixed by RenewCert...
    ANSWER: B
    

    The knowledge graph scores the reasoning path during training, but at inference the model reasons independently.

    Cross-Platform Support

    The pipeline now runs on both platforms:

    Platform Framework Command
    macOS (Apple Silicon) MLX make train-360m
    Linux (NVIDIA CUDA) Unsloth make train-360m-unsloth

    Unsloth provides 2x faster training with 60% less memory on NVIDIA GPUs.

    Current Status

    Component Status
    SFT training (360M) Complete
    RSFT (wrong distribution) Complete (27%)
    RSFT (correct distribution) Next step
    Live demo with Try It Complete
    Cross-platform support Complete

    Next Steps

    Priority Task Expected Result
    High Retrain RSFT on eval.jsonl 75%+ accuracy
    Medium Update demo to use corrected model Better live inference
    Medium Curriculum learning (easy→hard) Smoother training
    Low Larger models (1B+) Higher ceiling

    The corrected RSFT training:

    python3 -m core.rsft \
      --examples data/eval.jsonl \  # Hard examples!
      --kg data/kg.json \
      --sft-adapter data/runs/run_360m/models/sft \
      --output data/runs/run_360m/models/rsft_eval \
      --model HuggingFaceTB/SmolLM-360M-Instruct \
      --k-samples 8 \
      --max-examples 50
    

    Lessons Learned

    1. Distribution Matching is Non-Negotiable

    This isn’t a minor optimization—it’s the difference between 27% and 75% accuracy. Wrong distribution = wrong winners = wrong model.

    2. Easy Examples Can Hurt

    More training data isn’t always better. Easy examples teach shortcuts that fail on hard problems.

    3. Verify Your Pipeline

    We trained a full RSFT model before realizing the distribution mismatch. Always check that training data matches eval distribution.

    4. The Fix is Simple

    Once identified, the fix is one flag change: --examples data/eval.jsonl instead of train.jsonl.

    Resources


    Training distribution matters. Easy examples teach easy shortcuts.

  • 775 words4 min readAbstract

    Towards Continuous LLM Learning (2): Routing Prevents Forgetting

    In Part 1, naive LoRA fine-tuning caused catastrophic forgetting. Now we’re implementing the Share algorithm properly—and we’re about 60% of the way to verifying the paper’s claims.

    Resource Link
    Code sleepy-coder
    Part 1 When Fine-Tuning Fails
    ELI5 eli5.md
    Share Paper arXiv:2602.06043

    Paper Claims vs Implementation Status

    We’re systematically verifying the claims from the Share and UWSH papers:

    Paper Claim Infrastructure Demonstrated?
    Shared basis via SVD Complete Yes
    ~100x parameter reduction Complete (76x) Yes
    Task routing beats averaging Tested (Exp 1b) Partial
    Prevents catastrophic forgetting Tested (Exp 1b) Partial
    Sequential learning Not tested No
    UWSH subspace stability Not tested No

    Overall: ~60% complete. Infrastructure is solid. Routing tested. Sequential learning remains.

    What We Built

    The full Share algorithm implementation:

    • Phase 1: SVD-based subspace extraction from 51 LoRA adapters (60% variance threshold)
    • Phase 2: Coefficient-only training with frozen basis (83K params vs 1.6M full LoRA)
    • Phase 3: Basis merging and updates
    • Routing: Error pattern classification for coefficient selection

    Bug Fixes That Unlocked Progress

    Two critical bugs blocked proper Phase 2 training:

    Bug 1: Zero-Gradient Saddle Point

    Both coefficient matrices initialized to zero:

    eps_beta = 0, eps_alpha = 0
    → delta_W = 0 @ 0 = 0
    → zero gradients, no learning
    

    Fix: Dual small-random initialization.

    Bug 2: Half-Parameter Training

    LoRA-style initialization only trained one coefficient set:

    Before: 112/224 parameters getting gradients
    After:  224/224 parameters getting gradients
    

    Fix: Both coefficient matrices need random initialization.

    Experiment 1b: Routing Works

    With gradient-trained v4 coefficients and proper routing:

    Strategy Pass Rate BC RH TB Regressions
    Baseline (no LoRA) 46.7% 70% 40% 30%
    Averaged 50.0% 70% 40% 40% 1
    Routed 50.0% 70% 50% 30% 0

    Result handling improved 40% → 50%. Zero regressions. This is the first positive transfer from Share coefficients.

    The Forgetting Heatmap

    We applied each coefficient individually to all 30 koans:

    Koan       BL  mut_bc dbl_mt ret_lr mis_cl mis_hs mis_or opt_ok res_me ROUTED
    bc_001-009 P   P      P      P      P      P      P      P      P      P
    bc_003,5,10.   .      .      .      .      .      .      .      .      .
    rh_002     .   .     +GAIN   .      .     +GAIN  +GAIN  +GAIN  +GAIN  +GAIN
    rh_008     P  -LOST  -LOST  -LOST  -LOST  -LOST  -LOST  -LOST  -LOST   P
    tb_005     P   P      P      P      P     -LOST   P      P      P      P
    

    Key finding: rh_008 regresses under every coefficient applied globally. But routing saves it by falling back to the base model when no pattern matches.

    This is exactly what the Share paper predicts: task-specific coefficients improve targeted patterns without interfering with unrelated ones.

    What the Papers Claim vs What We’ve Verified

    Verified

    1. Shared basis via SVD — We extract principal components from 51 adapters. Works.

    2. 76x parameter reduction — 83K coefficient parameters vs 1.6M full LoRA. Verified.

    3. Routing prevents forgetting — Zero regressions with routed inference. The fragile rh_008 koan survives because it falls back to base model.

    4. Positive transfer possible — Result handling improved 40% → 50% with routed coefficients.

    Not Yet Verified

    1. Sequential learning — The core continual learning claim. Train task 1 → eval → train task 2 → eval (verify task 1 still passes). This is next.

    2. UWSH subspace stability — Do different adapter subsets converge to similar subspaces? Grassmann distance measurement needed.

    Next Experiments

    Priority Experiment Target
    High Sequential learning curve No degradation on prior tasks
    High Fix k_alpha=32 (paper recommends) Match paper exactly
    Medium UWSH verification >70% subspace overlap
    Medium Add rank update vectors Full algorithm

    The Architecture

    Day:   Agent attempts to fix Rust errors
           ↓
           Successes and failures logged
           ↓
    Night: Train coefficients (frozen basis)
           ↓
           83K params per task
           ↓
    Eval:  Route to appropriate coefficients
           ↓
           Pattern-matched inference
           ↓
    (repeat)
    

    The key insight: train small, route smart. The shared basis captures common structure. Per-task coefficients specialize without interference.

    Resources


    60% of the way to verifying the papers. Sequential learning is next.

  • 470 words3 min readAbstract

    Five ML Concepts - #15

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #15
    Video

    References

    Concept Reference
    Perplexity A Neural Probabilistic Language Model (Bengio et al. 2003)
    Catastrophic Forgetting Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017)
    Weight Initialization Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010)
    Curse of Dimensionality The Elements of Statistical Learning (Hastie et al. 2009), Chapter 2
    Monitoring & Drift Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift (Rabanser et al. 2019)

    Today’s Five

    1. Perplexity

    A metric for language models that reflects how well the model predicts the next token. Lower perplexity means better predictive performance.

    Perplexity is the exponentiated average negative log-likelihood per token.

    Like a test where lower scores mean you found the answers easier to guess.

    2. Catastrophic Forgetting

    When training on new tasks causes a model to lose performance on previously learned tasks. This is a key challenge in continual learning.

    Techniques like elastic weight consolidation help preserve important weights.

    Like learning a new phone number and forgetting the old one.

    3. Weight Initialization

    The starting values of model weights influence how well training progresses. Poor initialization can cause vanishing or exploding gradients.

    Xavier and He initialization are common strategies for setting initial weights appropriately.

    Like starting a race from a good position instead of stuck in a ditch.

    4. Curse of Dimensionality

    In high-dimensional spaces, data becomes sparse and distances behave differently, making learning harder. Points that seem close in low dimensions can be far apart in high dimensions.

    Feature selection and dimensionality reduction help mitigate this effect.

    Like searching for a friend in a city versus across the entire universe.

    5. Monitoring & Drift Detection

    Continuously tracking model performance and detecting shifts in input data distributions. Production models can degrade silently without proper monitoring.

    Automated alerts help catch problems before they affect users.

    Like a weather station alerting you when conditions change.

    Quick Reference

    Concept One-liner
    Perplexity How surprised the model is by the data
    Catastrophic Forgetting New learning erases old knowledge
    Weight Initialization Starting values affect training stability
    Curse of Dimensionality High dimensions make data sparse
    Monitoring & Drift Track performance and data changes

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 448 words3 min readAbstract

    Five ML Concepts - #14

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #14
    Video

    References

    Concept Reference
    ROC/AUC An Introduction to ROC Analysis (Fawcett 2006)
    Spurious Correlations Unbiased Look at Dataset Bias (Torralba & Efros 2011)
    Gradient Clipping On the Difficulty of Training Recurrent Neural Networks (Pascanu et al. 2013)
    Loss Landscapes Visualizing the Loss Landscape of Neural Nets (Li et al. 2018)
    Cold Start Addressing Cold Start in Recommender Systems (Schein et al. 2002)

    Today’s Five

    1. ROC / AUC

    ROC curves plot true positive rate against false positive rate across all classification thresholds. AUC (Area Under the Curve) summarizes overall ranking performance in a single number.

    AUC of 0.5 means random guessing; 1.0 means perfect ranking.

    Like judging a student by considering every possible passing grade cutoff.

    2. Spurious Correlations

    Coincidental patterns in training data that don’t reflect true relationships. Models that rely on them can fail when the coincidence disappears.

    Dataset curation and diverse evaluation help detect spurious features.

    Like assuming umbrellas cause rain because you always see them together.

    3. Gradient Clipping

    Limiting the size of gradients during backpropagation. This helps prevent exploding gradients and unstable training, especially in recurrent networks.

    Clipping can be by value or by global norm.

    Like putting a speed limit on a car so it doesn’t lose control.

    4. Loss Landscapes

    How model error changes across different parameter settings. Training is like navigating this surface toward regions of lower loss.

    Flat minima may generalize better than sharp ones.

    Like hiking through mountains searching for the lowest valley, feeling the slope beneath your feet.

    5. Cold Start Problems

    Difficulty predicting for new users or items with no history. Without prior data, personalization becomes difficult.

    Solutions include content-based features, popularity fallbacks, or asking initial questions.

    Like a librarian trying to recommend books to someone who just walked in.

    Quick Reference

    Concept One-liner
    ROC / AUC Classifier performance across thresholds
    Spurious Correlations Coincidental patterns that don’t generalize
    Gradient Clipping Limit gradient size for stability
    Loss Landscapes Error surface over parameter space
    Cold Start No history for new users/items

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 448 words3 min readAbstract

    Five ML Concepts - #13

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #13
    Video

    References

    Concept Reference
    Calibration On Calibration of Modern Neural Networks (Guo et al. 2017)
    Shortcut Learning Shortcut Learning in Deep Neural Networks (Geirhos et al. 2020)
    Early Stopping Early Stopping - But When? (Prechelt 1998)
    Universal Approximation Approximation by Superpositions of a Sigmoidal Function (Cybenko 1989)
    Checkpointing Training Deep Nets with Sublinear Memory Cost (Chen et al. 2016)

    Today’s Five

    1. Calibration

    How well a model’s predicted probabilities match real-world outcomes. If a model predicts 70% confidence many times, it should be correct about 70% of those cases.

    Well-calibrated models enable better decision-making under uncertainty.

    Like a weather forecast that predicts rain 30% of the time and is right about 30% of those forecasts.

    2. Shortcut Learning

    When models rely on superficial patterns instead of meaningful features. For example, identifying cows by detecting grass and failing when cows appear indoors.

    Shortcuts can inflate benchmark scores while masking poor real-world performance.

    Like passing a test by memorizing answer positions instead of learning the material.

    3. Early Stopping

    Training is stopped when validation performance stops improving. This helps prevent overfitting by halting before the model memorizes training data.

    Patience hyperparameters control how long to wait before stopping.

    Like knowing when to stop practicing before you start reinforcing mistakes.

    4. Universal Approximation

    The theorem stating that neural networks can approximate any continuous function, given enough capacity. In practice, finding the right weights through optimization is the challenge.

    The theorem guarantees existence, not learnability.

    Like having enough Lego blocks to build almost any shape—assembly is still hard.

    5. Checkpointing

    Saving the model’s state during training. This allows recovery from interruptions and comparison across training stages.

    Checkpoints also enable selecting the best model rather than just the final one.

    Like saving your game progress so you can reload if something goes wrong.

    Quick Reference

    Concept One-liner
    Calibration Predicted probabilities match outcomes
    Shortcut Learning Exploiting spurious patterns
    Early Stopping Stop when validation plateaus
    Universal Approximation NNs can approximate any function
    Checkpointing Save model state during training

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 488 words3 min readAbstract

    Five ML Concepts - #12

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #12
    Video

    References

    Concept Reference
    Precision/Recall The Truth of the F-Measure (Sasaki 2007)
    OOD Detection A Baseline for Detecting Misclassified and Out-of-Distribution Examples (Hendrycks & Gimpel 2017)
    Batch Size On Large-Batch Training for Deep Learning (Goyal et al. 2017)
    Inductive Bias Relational Inductive Biases, Deep Learning, and Graph Networks (Battaglia et al. 2018)
    Latency/Throughput Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al. 2021)

    Today’s Five

    1. Precision vs Recall

    Precision measures how often positive predictions are correct. Recall measures how many actual positives are successfully found. Improving one often reduces the other.

    The tradeoff depends on your application: spam filters favor precision, medical screening favors recall.

    Like a search party: you can find everyone but raise false alarms, or be very certain and miss some people.

    2. OOD Inputs (Out-of-Distribution)

    Data that differs significantly from what the model saw during training. Models may fail silently or produce confident but wrong answers.

    Detecting OOD inputs is an active research area for safer AI deployment.

    Like asking a chef trained only in Italian food to make sushi.

    3. Batch Size

    The number of training examples processed before updating model weights. Larger batches can be more efficient computationally, but may generalize worse.

    Finding the right batch size involves balancing speed, memory, and model quality.

    Like grading tests one at a time or waiting to grade a full stack.

    4. Inductive Bias

    The assumptions built into a model that guide how it learns from data. Without inductive bias, models cannot generalize beyond training examples.

    CNNs assume spatial locality; transformers assume tokens can attend to any position.

    Like expecting nearby houses to have similar prices before looking at the data.

    5. Latency vs Throughput

    Latency is how long a single request takes. Throughput is how many requests can be handled per second. Optimizing one often comes at the expense of the other.

    Batching improves throughput but increases latency for individual requests.

    Like a restaurant serving one table quickly or many tables at once.

    Quick Reference

    Concept One-liner
    Precision vs Recall Correct positives vs finding all positives
    OOD Inputs Data unlike training distribution
    Batch Size Examples per weight update
    Inductive Bias Built-in learning assumptions
    Latency vs Throughput Speed per request vs total capacity

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1048 words6 min readAbstract

    Neural-Net-RS: An Educational Neural Network Platform

    I wanted a neural network implementation where every step is visible—no framework magic hiding the math. Something I could use to teach the fundamentals, with a CLI for quick experiments and a web UI for visual demonstrations. Claude Code built it.

    This is Personal Software for education: a complete neural network training platform with multiple interfaces, all from a single Rust codebase.

    Resource Link
    Repo neural-net-rs
    Video Neural-Net-RS Explainer
    Video

    Why Build Your Own Neural Network?

    Frameworks like PyTorch and TensorFlow are production-ready, but they hide the fundamentals. When teaching or learning, you want to see:

    • How weights and biases actually change during training
    • Why XOR needs a hidden layer when AND doesn’t
    • What backpropagation really computes

    Neural-Net-RS exposes all of this. No autograd magic—every gradient is computed explicitly. No tensor abstractions—just matrices with clear row-major storage.

    What Got Built

    A modular Rust workspace with multiple interfaces to the same core:

    neural-net-rs/
    ├── matrix/              # Linear algebra foundation
    ├── neural-network/      # Core ML implementation
    ├── neural-net-cli/      # Command-line interface
    ├── neural-net-server/   # REST API with SSE streaming
    └── neural-net-wasm/     # WebAssembly for browser
    

    One codebase, three ways to interact:

    • CLI: Train from terminal with progress bars
    • Web UI: Visual training with real-time loss charts
    • WASM: Run entirely in browser, no server needed

    The Classic Problems

    The platform includes 8 built-in examples that teach ML concepts progressively:

    Problem Architecture Key Concept
    AND, OR 2→2→1 Linear separability
    XOR 2→3→1 Why hidden layers matter
    Parity3 3→6→1 Scaling non-linearity
    Quadrant 2→8→4 Multi-class classification
    Adder2 4→8→3 Learning arithmetic
    Iris 4→8→3 Real-world dataset
    Pattern3x3 9→6→4 Visual pattern recognition

    The XOR Problem

    XOR is the canonical neural network problem. AND and OR are linearly separable—a single line can divide the outputs. XOR isn’t. You need a hidden layer.

    AND: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1  ← One line separates
    XOR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0  ← No line works
    

    Watch XOR training and you see why neural networks are powerful: they learn to create intermediate representations that make non-linear problems separable.

    Implementation Details

    Feed-Forward with Backpropagation

    pub struct Network {
        pub layers: Vec<usize>,      // [input, hidden..., output]
        pub weights: Vec<Matrix>,    // Learned connections
        pub biases: Vec<Matrix>,     // Per-neuron offsets
        pub learning_rate: f64,      // Training step size
    }
    

    Forward pass: Each layer computes activation(weights × input + bias)

    Backward pass: Gradients flow backward using the chain rule, updating weights to reduce error.

    The sigmoid activation function maps any input to (0, 1):

    σ(x) = 1 / (1 + e^(-x))
    

    Custom Matrix Library

    Educational clarity over maximum performance:

    pub struct Matrix {
        rows: usize,
        cols: usize,
        data: Vec<f64>,  // Row-major storage
    }
    

    Operations: dot product, transpose, element-wise multiply, map. Everything visible, nothing hidden.

    Checkpoint System

    Training can be interrupted and resumed:

    # Train for 5000 epochs, save checkpoint
    neural-net-cli train xor --epochs 5000 --checkpoint model.json
    
    # Resume from checkpoint
    neural-net-cli train xor --epochs 10000 --resume model.json
    

    Checkpoints include version metadata to prevent loading incompatible models.

    CLI Usage

    # List available examples
    neural-net-cli examples
    
    # Train XOR with progress bar
    neural-net-cli train xor --epochs 10000 --learning-rate 0.5
    
    # Predict with trained model
    neural-net-cli predict model.json --input "0,1"
    
    # Run web UI
    neural-net-cli serve --port 8080
    

    The CLI uses indicatif for real-time progress bars:

    Training XOR [=========>   ] 7500/10000 (75%) Loss: 0.0023
    

    Web Interface

    The server embeds all assets at compile time—one binary serves everything:

    • Training panel: Select problem, set hyperparameters, watch loss decrease
    • Network visualization: See layer structure and connection strengths
    • Prediction panel: Test the trained model interactively
    • Loss chart: Real-time plotting via Server-Sent Events

    Two training modes:

    • Local (WASM): Runs entirely in browser
    • Remote (API): Server-side with streaming progress

    Technology Choices

    Component Purpose
    Rust Performance, safety, single-binary distribution
    Axum Lightweight async web framework
    wasm-bindgen Rust → WebAssembly compilation
    Indicatif Terminal progress bars
    Serde JSON serialization for checkpoints

    The WASM module is ~248KB after optimization.

    Test Coverage

    136+ tests across the workspace:

    • Matrix operations (unit tests)
    • Network training (integration tests)
    • CLI commands (integration tests)
    • Server endpoints (integration tests)
    • WASM bindings (unit tests)

    Zero clippy warnings. Reproducible results via seeded RNG.

    References

    Resource Link
    Backpropagation Learning representations by back-propagating errors (Rumelhart et al. 1986)
    Multi-Layer Perceptron Multilayer perceptron (Wikipedia)
    XOR Problem Perceptrons (Minsky & Papert 1969)
    Weight Initialization Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010)
    Inspired by codemoonsxyz/neural-net-rs

    The Vibe Coding Process

    This project grew through iterative conversation with Claude Code:

    1. “Build a basic neural network in Rust with backpropagation”
    2. “Add a CLI with progress bars”
    3. “Add a web UI with real-time training visualization”
    4. “Compile to WASM so it runs in the browser”
    5. “Add checkpoint save/resume”
    6. “Include classic ML examples with educational documentation”

    Each request built on the previous. The AI handled architecture decisions, chose appropriate crates, and maintained test coverage throughout.


    When you want to understand how neural networks actually work, sometimes you need to see every weight update. That’s what this platform provides—education through transparency.

    Watch the Video

    Unmute to hear narration.

  • 914 words5 min readAbstract

    Cat Finder: Personal Software via Vibe Coding

    I needed to find cat photos scattered across my system. Instead of searching the app store, signing up for a cloud service, or uploading my personal photos to someone else’s servers, I asked Claude Code to build me the tool I needed. An hour later, I had it.

    This is Personal Software—software that exists because you needed it, built the way you want it, running entirely under your control.

    Resource Link
    Repo cat-finder
    Video Cat Finder Explainer
    Video

    The Vibe Coding Approach

    Vibe Coding is about describing what you want and letting AI handle the implementation details. No boilerplate, no Stack Overflow rabbit holes, no fighting with build systems. You focus on the what, the AI handles the how.

    For Cat Finder, the conversation went something like:

    “I want a CLI tool that scans directories for images containing cats. Run locally, no cloud. Use YOLO for detection. Output just the file paths so I can pipe them to other commands.”

    Claude Code chose the tech stack (Rust, YOLOv8n, ONNX Runtime), handled the tensor math, figured out the COCO class IDs, and produced a working tool. I guided the direction; the AI wrote the code.

    Why Personal Software?

    The traditional options for “find cat photos” would be:

    1. Cloud service: Upload photos to Google/Apple/Amazon, let them scan everything, hope they respect your privacy
    2. Desktop app: Find something in an app store, hope it does what you want, deal with subscription fees or ads
    3. Write it yourself: Spend days learning YOLO integration, tensor formats, image preprocessing

    Personal Software offers a fourth path: describe what you need, get exactly that, own the result completely.

    Cat Finder runs entirely on my machine. No accounts, no uploads, no subscriptions, no ads. The code is mine to modify, extend, or share.

    What Got Built

    A Rust CLI tool using YOLOv8n (the nano variant) through ONNX Runtime:

    Directory Traversal → Image Preprocessing → YOLO Inference → Cat Detection → Output
    

    The Detection Pipeline

    1. Walk directories recursively, finding image files (jpg, png, gif, webp, etc.)
    2. Preprocess each image: resize to 640×640, normalize to 0.0-1.0, convert to NCHW tensor format
    3. Run inference through the YOLOv8n ONNX model
    4. Parse output for class ID 15 (cat in COCO ordering) above confidence threshold
    5. Print matching paths to stdout for easy piping to other tools

    Unix Philosophy

    # stdout: just paths (machine-parseable)
    # stderr: logging and progress
    
    cat-finder ~/Photos | xargs -I {} cp {} ~/CatPhotos/
    

    This separation enables composable Unix pipelines. The tool does one thing well and plays nicely with others.

    Technology Stack

    Component Purpose
    Rust Memory-safe, high-performance core
    YOLOv8n Lightweight object detection (12MB model)
    ONNX Runtime Cross-platform inference engine
    clap CLI argument parsing
    ndarray Tensor operations
    walkdir Recursive directory traversal

    Total footprint: ~80MB (runtime + model + binary)

    I didn’t choose this stack—Claude Code did, based on the requirements. It made good choices.

    Usage

    # Basic usage
    cat-finder ~/Photos
    
    # Adjust confidence threshold
    cat-finder --confidence 0.5 ~/Photos
    
    # Verbose output with timestamps
    cat-finder -v -t ~/Photos
    
    # Copy all cat photos to a new folder
    cat-finder ~/Photos | xargs -I {} cp {} ~/CatAlbum/
    

    Honest About Limitations

    The README documents failure cases transparently:

    Image Type Detection Success
    Clear photographs High
    Artistic/stylized images Low
    Cats in clothing Low
    Small/partial cats Variable
    Low quality/blurry Variable

    Test results: 7 of 9 cat images detected (77.8% recall). Oil paintings and anthropomorphized cats confuse models trained on photographs. This is documented, not hidden.

    Bonus Features

    The project grew organically based on related needs:

    Duplicate Finder: A second binary for finding duplicate images using size-based filtering followed by SHA-256 checksums.

    find-duplicates ~/Photos
    

    Web Demo: A Flask-based interface for visual feedback with real-time progress via Server-Sent Events.

    These emerged from “while you’re at it…” requests during development. Vibe coding makes feature additions nearly frictionless.

    Setup

    git clone https://github.com/sw-ml-study/cat-finder
    cd cat-finder
    ./scripts/setup.sh  # Downloads model, builds project
    ./cat-finder ~/Photos
    

    The Personal Software Philosophy

    Privacy-first: All processing happens locally. No cloud APIs, no external services, no data leaving your machine.

    Ownership: The code is yours. Modify it, extend it, share it, delete it.

    Fit-for-purpose: Built for exactly what you need, nothing more, nothing less.

    Transparency: Known limitations documented. No marketing spin.

    References

    Resource Link
    YOLOv8 Ultralytics YOLOv8 - State-of-the-art object detection
    ONNX Runtime ONNX Runtime - Cross-platform inference engine
    ort crate ort - Rust bindings for ONNX Runtime
    COCO Dataset COCO Classes - Class ID 15 = cat

    You don’t always need an app store or a cloud service. Sometimes you just need to describe what you want and let an AI build it for you. That’s vibe coding. That’s personal software.

    Watch the Video

    Unmute to hear narration.

  • 503 words3 min readAbstract

    Five ML Concepts - #11

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #11
    Video

    References

    Concept Reference
    RNN Learning representations by back-propagating errors (Rumelhart et al. 1986)
    Chain of Thought Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)
    Softmax Deep Learning (Goodfellow et al. 2016), Chapter 6
    MoE Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al. 2017)
    Distribution Shift Dataset Shift in Machine Learning (Quiñonero-Candela et al. 2009)

    Today’s Five

    1. RNN (Recurrent Neural Network)

    Networks designed for sequential data that maintain a hidden state carrying information across time steps. This makes them useful for language, time series, and audio.

    LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are improved variants that better handle long-range dependencies.

    Like reading a story while keeping mental notes about characters and plot as you go.

    2. Chain of Thought

    A prompting technique that encourages step-by-step reasoning in language models. Instead of producing an answer immediately, the model generates intermediate steps.

    This can improve performance on math, logic, and multi-step problems.

    Like showing your work on a math test instead of just writing the final answer.

    3. Softmax

    Converts a vector of scores into a probability distribution where each output falls between zero and one, and all outputs sum to one. It is commonly used in classification models.

    Softmax makes raw scores easier to interpret as probabilities.

    Like turning test scores into percentages that add up to 100%.

    4. MoE (Mixture of Experts)

    Instead of one large network, the model contains many smaller expert networks with a routing mechanism that selects which experts process each input. This allows models to scale capacity while keeping computation efficient.

    Only a subset of experts activates for any given input.

    Like a hospital with specialists where a receptionist directs you to the right doctor.

    5. Distribution Shift

    Occurs when deployment data differs from training data, causing a model trained on one environment to perform poorly in another. Common causes include seasonal changes, user behavior shifts, or new populations.

    Monitoring for drift and retraining helps maintain performance.

    Like a weather model trained on summer data struggling to predict winter storms.

    Quick Reference

    Concept One-liner
    RNN Sequential processing with memory across time
    Chain of Thought Step-by-step reasoning in prompts
    Softmax Scores to normalized probabilities
    MoE Route inputs to specialized experts
    Distribution Shift Training vs deployment data mismatch

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 995 words5 min readAbstract

    RLM: Recursive Language Models for Massive Context

    What happens when your data won’t fit in a context window? RLM expands the workspace instead of cramming everything into limited memory. This post covers the MIT paper, my Rust implementation, and six video demonstrations.

    Resource Link
    Paper arXiv:2512.24601
    Code rlm-project
    Playlist RLM Implementations

    The Problem: Context Limits

    Large language models have a hard limit. They can only process so much text at once.

    Imagine a cookie jar that holds 100 cookies. What if you need to search through ten thousand? When you force too much in, the model forgets things—this is called context rot.

    Bigger models help, but the limit always exists. We need a different approach.

    The RLM Solution

    Recursive Language Models flip the problem. Instead of bigger jars, use better tools.

    The data stays in a context box. The model gets tools to peek inside:

    Tool Purpose
    slice Get a character range
    find Search for text
    regex Pattern matching
    count Count occurrences
    llm_query Ask a sub-LLM to analyze a chunk

    Small, focused, deliberate. The model thinks about what it needs, then asks for just that.

    The Results

    From the MIT paper—on tasks that don’t fit in context:

    Approach Accuracy
    Standard prompting 0%
    RLM 87-91%

    Results hold across GPT-4, Claude, Llama, Mistral, and Gemini.

    My Implementation: Four Capability Levels

    I built a Rust implementation with four capability levels:

    Level Name Description
    L1 DSL Built-in commands (find, regex, count)
    L2 WASM LLM generates Rust → compiles to WebAssembly sandbox
    L3 CLI LLM generates Rust → compiles to native binary
    L4 LLM Recursive delegation to sub-LLMs

    Each level trades off safety for capability:

    • L1 is instant but limited to predefined operations
    • L2 runs custom code but in a sandboxed environment
    • L3 breaks free for large datasets that would timeout in WASM
    • L4 uses LLM reasoning for semantic analysis

    The Video Series

    Six videos demonstrate RLM in action:

    1. RLM Explained

    RLM Explained

    The foundational video. Covers the MIT paper, the cookie jar analogy, and benchmark results showing 0% → 91% accuracy improvement.

    Key insight: Expand the workspace, not the context.


    2. War and Peace Demo

    War and Peace Demo

    Can AI read all of War and Peace to find a hidden secret? The full text is 3.2 MB with 65,666 lines—way too big for any context window.

    RLM finds “the password to Prince Andrei’s secret vault” in just 2 iterations using only 3,000 tokens. That’s 100% savings compared to sending the full document.


    3. WASM Sandboxing

    WASM Sandboxing

    What if your LLM could write custom analysis code on the fly? Level 2 demonstrates WebAssembly sandboxing.

    The LLM writes Rust code that compiles to WASM and runs in a secure sandbox. Demos include:

    • Error ranking in logs
    • Response time percentiles
    • Unique IP counting

    Trade-offs: ASCII only, 64MB memory limit, subset of Rust.


    4. Native CLI Binaries

    Native CLI Binaries

    When 5,000 lines would timeout in WASM, Level 3 breaks free. Native Rust binaries process massive datasets with no limits.

    Four CLI demos:

    • Error ranking: Hash map counts error types
    • Unique IPs: Hash set finds distinct addresses
    • Percentiles: Sort and index for p50/p95/p99
    • Word frequency: Tokenize, filter stop words, count

    5. Detective Mystery Demo

    Detective Mystery Demo

    A murder at the manor. Seven suspects. Dozens of clues. Can an LLM solve it?

    Level 4 delegates reasoning to sub-LLMs. Instead of code execution, the model calls other models to:

    • Analyze witness statements
    • Compare alibis
    • Draw conclusions

    Watch as L4 examines each suspect and identifies the killer.


    6. Large Context Processing

    Large Context Processing

    War and Peace is 3MB—far too large for any context window. This video shows Level 4 extracting noble family relationships from the entire novel.

    The process:

    1. L3 extracts relationship sentences (father, mother, son, daughter…)
    2. L4 analyzes filtered data with sub-LLMs
    3. Final output: structured family trees

    Three million characters → structured family trees in ~90 seconds.


    Architecture

    ┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
    │   Client    │────▶│  RLM Server     │────▶│  Root LLM   │
    │  /visualize │     │  (Rust/Axum)    │     │  (DeepSeek) │
    └─────────────┘     └────────┬────────┘     └─────────────┘
                                 │
                        ┌────────▼────────┐
                        │ Command Executor │
                        │  slice, find,   │
                        │  regex, count,  │
                        │  llm_query...   │
                        └────────┬────────┘
                                 │
                  ┌──────────────┼──────────────┐
                  ▼              ▼              ▼
            ┌──────────┐  ┌──────────┐  ┌──────────┐
            │  Ollama  │  │  Ollama  │  │  Ollama  │
            │ (local)  │  │ (remote) │  │ (other)  │
            └──────────┘  └──────────┘  └──────────┘
                  Sub-LM Pool (for llm_query)
    

    Quick Start

    cd rlm-orchestrator
    
    # Configure providers in config.toml
    export DEEPSEEK_API_KEY="your-key"
    
    # Run the server
    cargo run --bin rlm-server
    
    # Open visualizer
    open http://localhost:8080/visualize
    

    Think of it like this:

    • Old way: Dump everything on the table, then dig through the mess
    • RLM way: Use a scoop—grab just the cookies you need

    The key insight is simple: expand the workspace, not the context.

    Resources


    When context windows aren’t enough, RLM gives your LLM tools to explore. Six videos, four capability levels, one insight: expand the workspace, not the context.

    Watch the Video

    Unmute to hear narration.

  • 499 words3 min readAbstract

    Five ML Concepts - #10

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #10
    Video

    References

    Concept Reference
    CNN ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al. 2012)
    Encoder-Decoder Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
    RAG Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020)
    Few-shot Learning Language Models are Few-Shot Learners (Brown et al. 2020)
    Distillation Distilling the Knowledge in a Neural Network (Hinton et al. 2015)

    Today’s Five

    1. CNN (Convolutional Neural Network)

    Networks designed for image data that use small filters sliding across an image to detect edges, textures, and shapes. Early layers find simple patterns, while deeper layers recognize complex objects.

    CNNs are a foundation of modern computer vision.

    Like scanning a photo with a magnifying glass that learns to recognize patterns at different scales.

    2. Encoder-Decoder

    A model architecture with two parts: the encoder compresses input into a representation, and the decoder generates an output from that representation. This pattern is common in translation, summarization, and speech systems.

    The representation acts as a bottleneck that captures essential information.

    Like summarizing a book into notes, then writing a new version from those notes.

    3. RAG (Retrieval-Augmented Generation)

    Instead of relying only on learned parameters, the model retrieves relevant documents and uses them during generation. This helps ground responses in external information and can reduce hallucinations.

    RAG combines the strengths of retrieval systems and generative models.

    Like an open-book exam where you can look up facts instead of relying purely on memory.

    4. Few-shot Learning

    Adapting behavior from just a few examples provided directly in the prompt. Instead of retraining, the model infers the pattern and applies it to new inputs.

    Zero-shot learning relies only on instructions, without examples.

    Like learning a card game by watching a few hands before playing.

    5. Distillation

    Transferring knowledge from a large teacher model to a smaller student. The student learns to match the teacher’s outputs, not its internal weights.

    This produces models that are smaller and cheaper while retaining much of the original capability.

    Like an apprentice learning by imitating a master’s finished work, not by copying their brain.

    Quick Reference

    Concept One-liner
    CNN Sliding filters for hierarchical image features
    Encoder-Decoder Compress input, then generate output
    RAG Retrieve context before generating
    Few-shot Learning Learn from examples in the prompt
    Distillation Small student mimics large teacher

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1633 words9 min readAbstract

    TBT (3/?): Vector Graphics Games

    Before pixels, there were vectors. This Throwback Thursday explores the evolution of vector graphics gaming—from military radar displays to arcade classics—and my attempt to recreate them in Rust and WebAssembly.

    My First Vector Display: The IBM 2250

    IBM 2250 Graphics Display Unit with light pen, October 1969
    IBM 2250 at Brown University, 1969. Photo credit

    My first encounter with vector graphics was an IBM 2250 Graphics Display Unit—introduced in 1964, costing around $280,000 in period dollars. It connected to an IBM 1130 that acted as a graphics controller for an IBM S/370 mainframe where the graphical applications ran. At that price, nobody was playing games on it—Computer Aided Design was the killer app.

    The 2250’s specifications were impressive for its era:

    Specification Value
    Display 21-inch P39 phosphor CRT
    Resolution 1024 × 1024 addressable points
    Usable area 12” × 12” (square aspect)
    Refresh rate ~40 frames/second
    Input Light pen for direct interaction
    Vector drawing Hardware character generator optional

    The CRT drew lines by steering an electron beam directly—no pixel grid, no rasterization. Just pure geometry traced in phosphor glow. The green P39 phosphor had long persistence, reducing flicker but creating ghostly trails on moving objects.

    The light pen was revolutionary: you could point directly at displayed geometry and the system knew which vector you were touching. Interactive graphics in 1964.

    The Arcade Era

    Vector displays found their way into arcades, where they defined a visual style that’s still recognizable today:

    Game Year Innovation
    Lunar Lander 1979 Physics simulation, thrust/gravity
    Asteroids 1979 Wrap-around space, particle effects
    BattleZone 1980 Green wireframe 3D, first-person tanks
    Tempest 1981 Multi-colored vectors, pseudo-3D depth

    (Note: Pong (1972) was actually a raster game using discrete logic, but its simple geometry makes it a natural fit for vector recreation.)

    Each generation built on the last. White vectors on black screens gave way to green wireframes, then full color. The hardware pushed boundaries that feel primitive now but were revolutionary then.

    The Vectorcade Project

    Vectorcade recreates these mechanics using modern tools:

    • Rust for game logic and rendering
    • WebAssembly for browser deployment
    • wgpu for GPU-accelerated vector rendering
    • Yew for the web frontend

    Multi-Repo Architecture

    The project architecture emerged from a design session with ChatGPT, exploring how to structure a multi-agent development workflow. The result: a DAG of repositories, each with clear ownership boundaries:

    vectorcade-shared/      (Pure Rust API contracts)
        ↓
    vectorcade-fonts/       (Vector font styles)
        ↓
    vectorcade-games/       (Game logic: Pong, Asteroids, etc.)
        ↓
    vectorcade-render-wgpu/ (wgpu + lyon tessellation)
        ↓
    vectorcade-web-yew/     (Yew web shell)
    

    This DAG structure allows parallel development with assigned agent roles:

    Agent Repo Focus
    A vectorcade-shared Core API steward: minimal, stable, pure
    B vectorcade-fonts Font stylist: 3-5 distinct vector styles
    C vectorcade-games Game logic: Pong → Asteroids → Lunar Lander
    D vectorcade-render-wgpu Renderer: lyon tessellation → wgpu triangles
    E vectorcade-web-yew Integrator: UI, mobile controls, PWA

    Each agent works against stable interfaces—the DrawCmd display list and Game trait—so they don’t step on each other.

    The Display List Model

    Games don’t render directly. They emit draw commands that the renderer interprets:

    pub enum DrawCmd {
        Clear { color: Rgba },
        Line(Line2),
        Polyline { pts: Vec<[f32;2]>, closed: bool, stroke: Stroke },
        Text { pos: [f32;2], s: String, size_px: f32, color: Rgba },
        PushTransform(Transform2),
        PopTransform,
    }
    

    This keeps game logic portable. The same Asteroids code can render through wgpu on desktop, WebGPU in browsers, or even a software rasterizer.

    Vector Fonts

    Classic arcade games had distinctive lettering. Vectorcade includes multiple font styles to match:

    Style Look Games
    ATARI Boxy, utilitarian Asteroids, Lunar Lander
    CINEMATRONICS Thin, angular Star Castle
    MIDWAY Slightly rounded Defender
    VECTOR_SCANLINE Broken segments “Beam jitter” effect

    Each font is pure vector geometry—no bitmaps, no texture atlases.

    3D Projection

    BattleZone and Tempest need 3D-to-2D projection. Instead of a full 3D renderer, Vectorcade uses a “2.5D pipeline”:

    pub struct Camera3 {
        pub pos: [f32;3],
        pub yaw: f32,
        pub pitch: f32,
        pub fov_y_rad: f32,
    }
    
    pub fn project_polyline(cam: &Camera3, pts3: &[[f32;3]]) -> Vec<[f32;2]>;
    

    Games maintain 3D geometry; the core projects it to 2D lines. Depth-based brightness gives the classic “farther = dimmer” effect.

    Why Rust + WASM?

    The combination solves several problems:

    1. Performance: Games need consistent frame rates; Rust delivers
    2. Portability: Same code runs native and in browsers
    3. Safety: No dangling pointers in the game loop
    4. Modern tooling: Cargo, wasm-pack, Trunk make deployment straightforward

    The wgpu + lyon stack provides cross-platform GPU rendering with proper thick-line support (WebGL’s lineWidth is notoriously inconsistent).

    Current Status

    Component Status
    vectorcade-shared Functional
    vectorcade-fonts Functional
    vectorcade-games Playable (5 demos)
    vectorcade-render-wgpu Functional
    vectorcade-web-yew Functional

    The core architecture works. All five demos are playable in the browser. Polish and audio remain.

    The Demos

    The video showcases five demonstrations, progressing from static display to full gameplay:

    1. IBM 2250 Chessboard

    A static image rendered in the style of the original IBM 2250. The 2250 was mainly used for Computer Aided Design, but programmers did create games on it—this chessboard pays tribute to that era.

    2. Pong (Playable)

    A vector implementation of the classic. The original Pong (1972) wasn’t actually a vector game—it used discrete logic and a raster display—but some clones used vector hardware. This recreation captures the pure-geometry aesthetic.

    3. Asteroids (Playable)

    One of the most popular vector arcade games. Rotate, thrust, and shoot to survive. The ship and asteroids wrap around screen edges, creating the classic “infinite space” feel.

    4. BattleZone (Playable)

    Green wireframe 3D tanks. Drive through a battlefield, shooting enemies and dodging missiles. One of the first games with true 3D perspective—rendered entirely with vectors.

    5. Tempest (Playable)

    The pinnacle of vector arcade hardware. Move around the edge of geometric tubes, shooting enemies that climb up from the depths. Each level changes the tube shape and color scheme.

    Implementation

    Each game implements the same Game trait:

    pub trait Game {
        fn metadata(&self) -> GameMeta;
        fn reset(&mut self, ctx: &mut GameCtx);
        fn update(&mut self, ctx: &mut GameCtx, dt: f32);
        fn render(&mut self, ctx: &mut GameCtx, out: &mut Vec<DrawCmd>);
    }
    

    This makes games drop-in replaceable in the web shell—no renderer changes needed.

    TODO

    The demos are playable but not finished. Remaining work:

    • GPU rendering: Switch from Canvas 2D emulation to actual wgpu GPU rendering [Ed. Completed 2/13]
    • Music and sound effects: Authentic arcade audio
    • More aggressive opponents: AI improvements for challenge
    • Additional levels/difficulties: Progression and replay value
    • More animations: Explosions, transitions, effects

    Resources


    Before pixels, there were vectors. Vectorcade brings them back—in Rust, for the browser, with phosphor glow optional.

    Credits

    Role Credit
    Director Mike Wright
    Research & Architecture ChatGPT
    vectorcade-shared Claude Code CLI agent
    vectorcade-fonts Claude Code CLI agent
    vectorcade-games Claude Code CLI agent
    vectorcade-render-wgpu Claude Code CLI agent
    vectorcade-web-yew Claude Code CLI agent
    Explainer Video Claude Code
    Blog Post Claude Code

    Timeline: First pass vibe coded in one day (February 12, 2026)

    • First commit: 11:08 AM PST
    • Last commit: 5:08 PM PST
    • Total commits: 52 across 4 repositories
    • WGPU support added February 13, 2026

    References

    IBM 2250 Photo: “HES IBM 2250 Console grlloyd Oct1969” by Gregory Lloyd, October 1969. Brown University Hypertext Editing System (HES) demonstration. Licensed under CC BY-SA 4.0. Used with attribution.

    Watch the Video

    Unmute to hear narration.

  • 781 words4 min readAbstract

    DyTopo: Dynamic Topology for Multi-Agent AI

    When multiple AI agents work together, how should they communicate? Fixed patterns fail at scale. DyTopo rebuilds the communication graph each round based on what agents need and what they can offer.

    Resource Link
    Video DyTopo
    Video
    Paper arXiv:2505.16128
    Code dytopo-rs

    The Problem: Fixed Topologies Don’t Scale

    Multi-agent systems need communication patterns. The obvious approaches have problems:

    Topology Problem
    All-to-all Context explosion—every agent reads every message
    Chain Bottlenecks—one slow agent blocks everyone
    Star Single point of failure at the hub

    As agent count grows, fixed topologies either explode in messages or create chokepoints.

    The DyTopo Solution: Dynamic Routing

    DyTopo (Dynamic Topology) solves this by reconstructing the communication graph each round. The key insight: agents know what they need and what they can offer.

    Each round, every agent emits:

    • Query: What information do I need?
    • Key: What can I contribute?

    The router computes semantic similarity between all keys and queries, then builds a sparse directed graph:

    score(sender → receiver) = cosine(sender.key, receiver.query)
    

    High-scoring pairs connect. Low-scoring pairs are ignored. The result: efficient, adaptive communication.

    How It Works

    Round N:
      1. Manager broadcasts goal
      2. Each agent produces:
         - Query (what I need)
         - Key (what I offer)
         - Draft (my current contribution)
      3. Router embeds keys and queries
      4. Similarity matrix → sparse graph (top-K per receiver)
      5. Messages flow along edges
      6. Trace written to JSONL
    

    The topology adapts every round. An agent working on parsing might connect to the syntax expert in round 1, then the error-handling expert in round 2.

    The Implementation: Rust, Zero Python

    dytopo-rs is a fully Rust implementation with no Python dependencies:

    Crate Purpose
    dytopo-core Shared types (AgentId, Topology, TraceEvent)
    dytopo-embed Text embedding (hash-based baseline, semantic planned)
    dytopo-router Sparse graph construction
    dytopo-agents Agent implementations
    dytopo-orchestrator Main execution loop
    dytopo-viz DOT export for visualization
    dytopo-cli Command-line interface

    Why Rust?

    1. Zero-cost abstractions for performance-critical embedding/routing
    2. Strong type system catches protocol mismatches at compile time
    3. No Python dependency for baseline demos
    4. Fearless concurrency for future parallelization

    Running the Demo

    cargo run -p dytopo-cli -- demo --rounds 3 --agents 5 --topk 2
    

    This produces:

    • Per-round topology printed to stdout
    • ./traces/trace_*.jsonl for machine-readable analysis
    • DOT files for graph visualization

    Current Status

    Milestone 0 is complete—the system runs end-to-end with stub agents and hash-based embeddings.

    Feature Status
    Core types and traits Done
    Hash embedder (deterministic) Done
    Top-K sparse routing Done
    Stub agents with templates Done
    Orchestrator loop Done
    JSONL tracing Done
    DOT visualization Done

    Planned

    • Semantic embeddings (fastembed/candle)
    • LLM-backed agents (Ollama integration)
    • Inbox summarization for long conversations
    • Evaluation harness comparing topologies

    Key Design Decisions

    Why Hash Embeddings First?

    The baseline uses deterministic hash-based embeddings:

    • Reproducible demos for debugging
    • No external dependencies to download
    • Validates the full pipeline before adding ML complexity

    Semantic embeddings are planned as drop-in replacements.

    Why Sparse Graphs?

    Each agent receives at most topk messages per round:

    • Prevents context explosion as agent count grows
    • Makes communication interpretable—you can trace why agents connected
    • Matches the paper’s approach

    Why JSONL Traces?

    Every event is logged to JSONL:

    • Append-only for streaming
    • Line-based for grep/filtering
    • Machine-parseable for analysis tools
    • Human-readable for debugging

    Topology Comparison

    The system supports multiple topology strategies for comparison:

    Strategy Description Use Case
    dynamic DyTopo routing Adaptive, sparse
    fully_connected All-to-all Baseline comparison
    chain Sequential Pipeline tasks
    star Hub-and-spoke Centralized coordination

    What’s Next

    1. LLM Agent Support (Milestone 2)—Replace stubs with real reasoning
    2. Semantic Embeddings (Milestone 1)—Meaningful routing decisions
    3. Evaluation Harness (Milestone 4)—Quantify DyTopo advantages

    Resources


    Dynamic topology lets agents find the right collaborators each round. No context explosion. No bottlenecks. Just efficient, adaptive communication.

    Watch the Video

    Unmute to hear narration.

  • 1211 words7 min readAbstract

    Towards Continuous LLM Learning (1): Sleepy Coder - When Fine-Tuning Fails

    What if your AI coding assistant could learn from its mistakes? Not just for one session, but across training cycles. We built exactly that—and fifty-one adapters later, learned the mistake was trying to teach it at all.

    Resource Link
    Video Sleepy Coder
    Video
    Code sleepy-coder
    Share Paper arXiv:2602.06043
    UWSH Paper arXiv:2512.05117
    Part 2 Routing Prevents Forgetting

    The Dream: Day/Night Learning

    AI coding agents have a memory problem. They fix a bug today, then make the same mistake next week. Every session starts from the same frozen model. Nothing carries forward.

    The idea was elegant: build an agent that improves overnight.

    DAY CYCLE (Inference)
      Agent attempts to fix Rust compiler errors
      Successes and failures are logged
            ↓
    NIGHT CYCLE (Training)
      Fine-tune on failure patterns using LoRA
      Create specialized adapters
            ↓
    EVAL
      Test against benchmark
      Measure improvement
            ↓
    (repeat)
    

    During the day, the agent works and we log its failures—the error messages, the broken code, and the fixes that worked. Overnight, we fine-tune the model on those failures. Each morning, a new checkpoint should wake up a little better than before.

    We based this on two papers from the Johns Hopkins team (Kaushik, Vaidya, Chaudhari, Chellappa, Yuille):

    1. Share LoRA Subspaces (arXiv:2602.06043) — Learn a shared low-rank basis across tasks, then train only coefficients (76x fewer parameters per task)

    2. UWSH (arXiv:2512.05117) — The Universal Weight Subspace Hypothesis suggests neural networks converge to shared spectral subspaces

    The theory was sound. The implementation worked. The results were devastating.

    The System

    The Sleepy Coder agent runs in a Rust runtime, fixing compiler errors on 30 “koans” (small coding exercises) across 5 error families:

    • Borrow Checker: Ownership and lifetime errors
    • Type Bounds: Missing trait implementations
    • Result Handling: Option/Result conversions
    • Type Mismatches: Incompatible types
    • Missing Items: Undefined functions or modules

    The base model: Qwen2.5-Coder-1.5B-Instruct — small enough to train on a single GPU, capable enough to pass most koans without any fine-tuning.

    The Journey: From Hope to Reality

    Chapter 1: Naive LoRA

    First attempt: standard fine-tuning on failure patterns.

    Metric Before After
    Pass Rate 73.3% 60.0%
    Change -13.3%

    Catastrophic forgetting. The model learned the new patterns but forgot how to do everything else.

    Chapter 2: The Paper Chase

    We found the Share paper promising “continual learning without forgetting.” The UWSH paper provided theoretical backing: neural networks naturally converge to shared low-rank subspaces.

    Key insight from Share:

    Train ONLY the coefficients. Keep the basis FROZEN.

    This meant ~21,000 trainable parameters instead of ~1.6 million. A 76x reduction.

    Chapter 3: The Proper Implementation

    SVD: Singular Value Decomposition breaks a matrix into components that reveal its underlying structure. In Share, SVD finds the common “directions” that multiple LoRA adapters share—a compressed basis that captures what they have in common.

    We rebuilt everything:

    • Phase 1: Extract shared basis from 51 adapters via SVD
    • Phase 2: Train only coefficient vectors (frozen basis)
    • Phase 3: Merge and update basis periodically

    We trained 51 pattern-specific adapters. We followed the algorithm precisely.

    Chapter 4: The Stubborn Seven

    No matter what we tried, 7 tasks kept failing:

    Task The Problem
    bc_003 Mutable borrow while immutable exists
    bc_005 Double mutable borrow
    bc_010 Returning reference to local data
    tb_002 Missing Clone trait
    tb_007 Missing Hash trait
    tb_008 Missing Ord trait
    rh_004 Option to Result conversion

    These require deep understanding of Rust’s ownership system—something a 1.5B model can’t reliably learn.

    Chapter 5: The Final Score

    Approach Pass Rate vs Baseline Regressions
    Baseline (no training) 73.3% 0
    Naive LoRA 60.0% -13.3% Many
    Targeted LoRA (7 patterns) 63.3% -10% 4+
    Replay buffer 70.0% -3.3% 2
    Phase 2 coef-only (10K params) 66.7% -6.6% 2
    Share Full (Ph2+Ph3) 73.3% 0% 0

    The Share algorithm did exactly what it claimed: it prevented forgetting. But it couldn’t improve beyond baseline because there was nothing to improve.

    What Went Wrong

    1. The Model Already Knows

    The base model already passes 73% of patterns. Training on these patterns doesn’t add knowledge—it dilutes what’s there.

    2. Training Causes Forgetting

    Even training only on the 7 failure patterns (44 examples) caused 4 new regressions. The model’s knowledge is interconnected.

    3. Averaging Destroys Specialization

    The Share paper assumes task routing at inference—selecting the right coefficients for each task. We averaged coefficients, which negated any specialization.

    4. More Adapters Made It Worse

    Adapter Count Pass Rate
    6 adapters 73.3%
    51 adapters 70.0%

    More adapters meant more subspace dilution when averaging. The signal got lost in the noise.

    The Critical Insight

    LoRA fine-tuning cannot improve a capable base model for tasks it already handles reasonably well.

    The model’s knowledge is interconnected. Even 10,000 trainable parameters (0.0007% of the model) can break things. The baseline represents the ceiling, not the floor.

    What We Learned

    1. Read the room. If your base model passes 73%, maybe it doesn’t need fine-tuning. Maybe it needs better prompts.

    2. Negative results are results. 51 failed experiments taught us more than a successful one would have.

    3. Catastrophic forgetting is real. Small models especially can’t absorb new knowledge without losing old.

    4. Share prevents forgetting, not ignorance. The algorithm does what it claims—it just can’t create knowledge from nothing.

    5. Sometimes the answer is “don’t.” The best LoRA adapter for this task is no adapter.

    6. Task routing vs averaging matters. The Share paper assumes you select coefficients based on task type, not blend them together.

    7. AI coding agents cut corners. When implementing research papers, AI agents repeatedly stopped before completing all phases of the algorithm. I had to direct the agent to re-read the papers many times before it implemented them correctly.

    Paths Forward

    Since fine-tuning doesn’t work here, alternatives:

    Approach Tradeoff
    Prompt engineering No weight changes, limited by context
    Multi-turn repair Uses base model reasoning, slower
    Larger model (7B+) More capacity to absorb knowledge
    Task routing with Share Select coefficients, don’t average
    Model ensemble Multiple models, pick best output
    Accept baseline 73% may be good enough

    The Numbers

    Experiments run:        51 adapters, multiple algorithms
    Parameters trained:     From 10K to 1.6M per adapter
    Best achieved:          73.3% (matches baseline)
    Target:                 ≥76.7%
    Conclusion:             Target not achievable with LoRA
    

    Resources


    Sometimes the most valuable research shows what doesn’t work. Fifty-one adapters later, we know: let sleeping models lie.

    Watch the Video

    Unmute to hear narration.

  • 470 words3 min readAbstract

    Five ML Concepts - #9

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #9
    Video

    References

    Concept Reference
    Dropout Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
    RLHF Training language models to follow instructions with human feedback (Ouyang et al. 2022)
    Inference Deep Learning (Goodfellow et al. 2016), Chapter 5
    Quantization A Survey of Quantization Methods for Efficient Neural Network Inference (Gholami et al. 2021)
    Flash Attention FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al. 2022)

    Today’s Five

    1. Dropout

    A regularization technique that randomly disables units during training. This encourages the network to rely on multiple pathways instead of memorizing patterns.

    It helps reduce overfitting, especially in large models.

    Like training a team where random members sit out each practice, so no one becomes a single point of failure.

    2. RLHF (Reinforcement Learning from Human Feedback)

    A training approach where humans rank or compare model outputs to produce a reward signal. The model is then optimized to better match human preferences.

    This technique is central to aligning language models with human intent.

    Like teaching by grading essays instead of dictating every word.

    3. Inference

    The process of running a trained model to make predictions on new data. Training updates the model’s parameters; inference uses them.

    The distinction matters for optimization, deployment, and cost.

    Like the difference between studying for an exam and actually taking it.

    4. Quantization

    Reducing the numerical precision used to store and compute model weights. This can shrink model size and speed up inference, sometimes with a small accuracy tradeoff.

    Essential for deploying large models on limited hardware.

    Like compressing a high-resolution photo into a smaller file that still looks good.

    5. Flash Attention

    An optimized attention algorithm designed to reduce memory usage. It avoids materializing the full attention matrix by computing attention in blocks.

    This enables longer sequences and faster training.

    Like reading a book chapter by chapter instead of photocopying the whole thing first.

    Quick Reference

    Concept One-liner
    Dropout Random disabling to prevent overfitting
    RLHF Learn from human preference comparisons
    Inference Using a trained model for predictions
    Quantization Lower precision for smaller, faster models
    Flash Attention Block-wise attention for memory efficiency

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 477 words3 min readAbstract

    Five ML Concepts - #8

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #8
    Video

    References

    Concept Reference
    Bias-Variance The Elements of Statistical Learning (Hastie et al. 2009), Chapter 7
    Diffusion Denoising Diffusion Probabilistic Models (Ho et al. 2020)
    KV Cache Fast Transformer Decoding (Pope et al. 2022)
    Mixed Precision Mixed Precision Training (Micikevicius et al. 2017)
    MLA DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek-AI 2024)

    Today’s Five

    1. Bias-Variance Tradeoff

    A fundamental tension where simpler models tend to underfit (high bias), and more flexible models can overfit (high variance). The goal is finding a balance that generalizes well to unseen data.

    One of the oldest ideas in machine learning, still relevant today.

    Like choosing between a ruler that only draws straight lines and one so flexible it traces every bump.

    2. Diffusion Models

    A generative approach that trains a model to reverse a gradual noising process. During generation, the model starts from noise and removes it step by step.

    The foundation of image generators like Stable Diffusion and DALL-E.

    Like learning to restore a photo by practicing on progressively more damaged versions.

    3. KV Cache

    A technique that stores attention key and value tensors from earlier tokens so they don’t need to be recomputed during generation. This significantly speeds up autoregressive inference.

    Essential for efficient LLM serving.

    Like keeping notes from earlier in a conversation instead of rereading everything.

    4. Mixed Precision

    A training strategy that uses lower-precision math for most operations, while keeping some calculations in higher precision for stability. This reduces memory use and often speeds up training with little accuracy loss.

    Standard practice for modern deep learning.

    Like drafting in pencil and only using ink for the final signature.

    5. MLA (Multi-head Latent Attention)

    An attention variant that compresses key and value information into a lower-dimensional latent space. This reduces memory usage for long sequences while retaining useful context.

    Used in DeepSeek-V2 and related architectures.

    Like summarizing meeting notes instead of recording every word verbatim.

    Quick Reference

    Concept One-liner
    Bias-Variance Balance underfitting vs overfitting
    Diffusion Generate by learning to denoise
    KV Cache Store past keys/values for fast inference
    Mixed Precision Lower precision for speed, higher for stability
    MLA Compress attention into latent space

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1033 words6 min readAbstract

    Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation

    We started by training models to act like they had memory. Then we found an open source implementation that does it for real. This is what we learned.

    Resource Link
    Paper arXiv:2601.07372
    Our Code engram-poc
    Reference weagan/Engram
    Video Engram Revisited
    Video
    Playlist All Engram Videos

    The Journey

    Phase 1: Behavioral Emulation

    Part 2 described our first approach: LoRA fine-tuning to make a model behave like it has memory. Train on patterns, and the model learns to respond consistently.

    Metric Baseline LoRA-tuned
    Accuracy 8.6% 14.1%
    Improvement - +63% relative

    It worked, but the architecture was unchanged. We were approximating Engram benefits, not implementing them.

    Phase 2: The Discovery

    Then we found weagan/Engram on GitHub—real hash-based memory in ~300 lines of Python:

    class EnhancedEngramModule(nn.Module):
        def __init__(self, table_size=50000, d_model=512):
            # Large learnable memory table
            self.memory_table = nn.Parameter(torch.zeros(table_size, d_model))
    
            # Gate decides when to trust memory
            self.gate = nn.Sequential(
                nn.Linear(d_model * 2, d_model),
                nn.ReLU(),
                nn.Linear(d_model, 1),
                nn.Sigmoid()
            )
    
        def forward(self, hidden_states, input_ids):
            # O(1) hash lookup
            indices = self.multi_head_hash(input_ids)
            retrieved = F.embedding(indices, self.memory_table)
    
            # Gated injection
            gate_score = self.gate(torch.cat([hidden_states, retrieved], dim=-1))
            return hidden_states + gate_score * retrieved
    

    The key insight: the gate decides when to trust the lookup. Not every token needs memory.

    Phase 3: Integration with HuggingFace

    We ported the module to work with HuggingFace models:

    SmolLM-135M (frozen)
            ↓
    EnhancedEngramModule (per layer)
      - 50K slot memory table
      - O(1) hash-based lookup
      - Learned gating
            ↓
    Output
    

    The proof it works—O(1) lookup regardless of sequence length:

    Sequence Length Lookup Time Expected if O(n)
    64 tokens 0.15 ms -
    2048 tokens 2.77 ms 4.8 ms

    Sub-linear scaling proves constant-time hash lookup.

    The Reality Check

    Here’s where it gets interesting. Real Engram memory excels at some tasks and hurts others.

    Where Engram Helps

    Task Type Baseline Engram Change
    Acronym expansion 25% 75% +200%
    Element symbols 33% 67% +103%
    Long-term fact recall 90% 100% +11%

    For exact-match lookups with structured keys, Engram dominates.

    Where Engram Hurts

    Task Type Baseline Engram Change
    World capitals 83% 67% -19%
    Pattern completion 14% 11% -21%

    For tasks where the base model already knows the answer, Engram’s hash collisions add noise.

    The Key Insight

    Engram is a specialized tool, not a general enhancement.

    Use Engram For Don’t Use Engram For
    FAQ responses Creative generation
    Terminology lookup Novel combinations
    Entity facts Context-dependent answers
    Code boilerplate Reasoning tasks

    The gating mechanism is critical: it must learn to suppress memory when it doesn’t help. Without proper gating, hash collisions inject noise into every token.

    Obstacles Encountered

    1. Hash Collisions

    Different inputs can map to the same memory slot. The gate must learn to ignore irrelevant retrievals.

    2. Parameter Explosion

    50K slots × 768 dimensions × 30 layers = 1.2B additional parameters. We had to inject selectively (every 4th layer) to stay practical.

    3. Training Dynamics

    Memory tables start at zero. They need higher learning rates (10x) to develop meaningful representations before the model learns to use them.

    4. Evaluation Mismatch

    Our pattern completion task wasn’t ideal for hash-based memory. Engram shines on exact-match retrieval, not generalization.

    Combined Approach

    The best results came from combining both methods:

    Base Model (SmolLM-135M)
            ↓
    EnhancedEngramModule
      - Long-term fact storage
      - O(1) lookup for known patterns
            ↓
    LoRA Adapters
      - Pattern completion
      - Domain-specific behaviors
            ↓
    Output
    

    This gives you:

    • Long-term memory from hash tables
    • Pattern consistency from behavioral training
    • Flexibility to disable either component

    What We Learned

    1. Emulation vs Implementation: LoRA fine-tuning approximates memory behavior; hash tables implement it. Both have their place.

    2. Gating is Essential: The learned gate prevents hash collisions from degrading performance. Never use Engram without gating.

    3. Match Task to Tool: Hash-based memory excels at exact lookups, not pattern generalization. Use it where applicable.

    4. Selective Application: Don’t inject Engram everywhere. Target layers and use cases where it helps.

    5. The Gate as a Safety Valve: When the gate learns to output near-zero for a task, that’s the model telling you Engram doesn’t help there. Listen to it.

    Resources

    Series Recap

    Part Topic Key Insight
    1 mHC Doubly-stochastic constraints bound signal amplification
    2 Engram Intro O(1) lookup beats recomputing through attention
    3 Engram Revisited Use Engram where applicable; gate to avoid worse results

    Hash-based memory is powerful but specialized. The gate decides when to use it—and when not to.

    Watch the Video

    Unmute to hear narration.

  • 469 words3 min readAbstract

    Five ML Concepts - #7

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #7
    Video

    References

    Concept Reference
    Cross-Validation A Study of Cross-Validation and Bootstrap (Kohavi 1995)
    GPT Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
    GQA GQA: Training Generalized Multi-Query Transformer Models (Ainslie et al. 2023)
    Context Window Attention Is All You Need (Vaswani et al. 2017)
    Self-Attention Attention Is All You Need (Vaswani et al. 2017)

    Today’s Five

    1. Cross-Validation

    A technique that splits data into multiple folds to evaluate model performance on data it wasn’t trained on. By rotating which data is held out, it gives a more reliable estimate of generalization.

    Essential for honest model evaluation.

    Like practicing with different sets of flashcards to see if you actually learned the material.

    2. GPT

    Generative Pre-trained Transformer. A family of autoregressive language models trained to predict the next token in a sequence.

    Many AI assistants and chatbots are built on this approach.

    Like autocomplete, but scaled up and trained on vast text data.

    3. GQA (Grouped Query Attention)

    An attention variant where multiple query heads share key and value projections. This reduces memory usage and can speed up inference compared to standard multi-head attention.

    Widely adopted in efficient transformer architectures.

    Like several students sharing one set of notes instead of copying everything separately.

    4. Context Window

    The maximum number of tokens a model can process in a single forward pass. Larger context windows allow longer inputs, but increase memory and compute costs.

    A key constraint in language model design.

    Like the size of a desk that limits how many papers you can spread out at once.

    5. Self-Attention

    A mechanism where each token computes attention scores with other tokens in the same sequence. This lets the model weigh which parts of the input are most relevant to each position.

    The core operation inside transformers.

    Like everyone in a meeting deciding who to listen to based on the conversation.

    Quick Reference

    Concept One-liner
    Cross-Validation Rotate held-out data for reliable evaluation
    GPT Predict next token, at scale
    GQA Shared keys/values for efficient attention
    Context Window How much the model sees at once
    Self-Attention Each token attends to all others

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 491 words3 min readAbstract

    Five ML Concepts - #6

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #6
    Video

    References

    Concept Reference
    Regularization Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
    BERT BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018)
    RoPE RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al. 2021)
    Prompting Language Models are Few-Shot Learners (Brown et al. 2020)
    Positional Encoding Attention Is All You Need (Vaswani et al. 2017)

    Today’s Five

    1. Regularization

    Techniques that reduce overfitting by adding constraints or penalties during training. Common examples include L2 weight decay, L1 sparsity, dropout, and early stopping.

    The goal is better generalization, not just fitting the training set.

    Like adding friction so a model can’t take the easiest overfit path.

    2. BERT

    Bidirectional Encoder Representations from Transformers. A transformer encoder trained with masked language modeling: predicting hidden tokens using context from both sides.

    It was a major step forward for many NLP tasks after its 2018 release.

    Like filling in blanks by reading the whole sentence, not just the words before it.

    3. RoPE (Rotary Positional Embeddings)

    A way to represent token position inside attention by rotating query and key vectors as a function of position. This gives attention information about relative order and distance.

    It’s widely used in modern transformer models.

    Like turning a dial differently for each position so the model can tell where tokens are.

    4. Prompting

    Crafting inputs to steer a model toward the output you want. Small changes in instructions, examples, or format can change behavior significantly.

    A key skill for working effectively with language models.

    Like asking a question in just the right way to get a useful answer.

    5. Positional Encoding

    Transformers need a way to represent token order, because attention alone doesn’t include sequence position. Different methods do this, including learned embeddings and rotary approaches like RoPE.

    Without it, “the cat sat on the mat” would be indistinguishable from “mat the on sat cat the.”

    Like numbering the pages of a shuffled book so you can read them in order.

    Quick Reference

    Concept One-liner
    Regularization Add constraints to prevent overfitting
    BERT Bidirectional masked language modeling
    RoPE Position info via rotation in attention
    Prompting Craft inputs to steer model outputs
    Positional Encoding Tell the model where tokens are in sequence

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 493 words3 min readAbstract

    Five ML Concepts - #5

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #5
    Video

    References

    Concept Reference
    Perceptron The Perceptron: A Probabilistic Model (Rosenblatt 1958)
    Pre-training BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018)
    Speculative Decoding Fast Inference from Transformers via Speculative Decoding (Leviathan et al. 2022)
    ICL Language Models are Few-Shot Learners (Brown et al. 2020)
    Latent Space Auto-Encoding Variational Bayes (Kingma & Welling 2013)

    Today’s Five

    1. Perceptron

    The simplest neural network: a single linear unit with weights and a bias. It computes a weighted sum and applies a threshold or activation.

    It inspired modern neural networks, even though today’s models are far more complex.

    Like a single voter weighing inputs before deciding yes or no.

    2. Pre-training

    Training a model on a large, general dataset before adapting it to a specific task. This gives the model broad patterns that later training can refine.

    BERT, GPT, and most modern LLMs use this approach.

    Like going to medical school before choosing a specialty.

    3. Speculative Decoding

    A technique where a small, fast model proposes tokens, and a larger model verifies or rejects them in parallel. This can speed up inference without changing final outputs.

    A key optimization for production LLM deployments.

    Like a junior writer drafting text for a senior editor to approve in batches.

    4. In-Context Learning (ICL)

    When a model adapts its behavior using examples in the prompt, without updating its weights. It allows flexible task behavior at inference time.

    This emergent capability surprised researchers when GPT-3 demonstrated it.

    Like solving a new puzzle after seeing a few worked examples.

    5. Latent Space

    The internal representations a model learns as it processes data. In this space, similar inputs tend to be located near each other.

    It’s not a literal place, but a useful way to think about how models organize information.

    Like a map where cities are arranged by similarity instead of geography.

    Quick Reference

    Concept One-liner
    Perceptron Single linear unit—the neural network ancestor
    Pre-training Learn general patterns before specializing
    Speculative Decoding Draft fast, verify in parallel
    ICL Adapt from prompt examples without training
    Latent Space Internal representations where similar things cluster


    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 453 words3 min readAbstract

    Five ML Concepts - #4

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #4
    Video

    References

    Concept Reference
    Activation Functions Deep Learning (Goodfellow et al. 2016), Chapter 6
    Transfer Learning A Survey on Transfer Learning (Pan & Yang 2010)
    VLM Learning Transferable Visual Models (CLIP) (Radford et al. 2021)
    Adam Adam: A Method for Stochastic Optimization (Kingma & Ba 2014)
    Superposition Toy Models of Superposition (Elhage et al. 2022)

    Today’s Five

    1. Activation Functions

    Functions like ReLU, sigmoid, and tanh that transform neuron outputs. They introduce nonlinearity, allowing networks to learn complex patterns beyond simple linear relationships.

    Without them, stacking layers would just be matrix multiplication.

    Like an on-off switch that can also dim the lights.

    2. Transfer Learning

    Using knowledge a model learned on one task to improve performance on a related task. This often reduces training time and data requirements dramatically.

    Pre-trained models can be fine-tuned for specific applications.

    Like a chef who already knows French cooking learning Japanese cuisine faster.

    3. VLM (Vision-Language Models)

    Models trained to work with both images and text. They learn shared representations that connect visual and language understanding.

    CLIP, GPT-4V, and LLaVA are examples of this approach.

    Like someone who can look at a photo and describe what’s happening.

    4. Adam

    An optimizer that adapts learning rates for each parameter using information from past gradients. It combines ideas from momentum and adaptive learning-rate methods.

    One of the most popular optimizers in deep learning.

    Like a hiker who adjusts step size for each part of the trail, steep or flat.

    5. Superposition

    A way neural networks represent many concepts using overlapping directions in the same space. This allows models to pack more information into fewer neurons than expected.

    It’s why interpretability is hard—features aren’t neatly separated.

    Like discovering a painting has hidden layers that appear under the right light.

    Quick Reference

    Concept One-liner
    Activation Functions Introduce nonlinearity to enable complex patterns
    Transfer Learning Reuse knowledge from one task for another
    VLM Joint understanding of images and text
    Adam Adaptive per-parameter learning rates
    Superposition Many concepts packed into overlapping representations

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 524 words3 min readAbstract

    Five ML Concepts - #3

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #3
    Video

    References

    Concept Reference
    Loss Function A Survey of Loss Functions for Deep Neural Networks (Janocha & Czarnecki 2017)
    Overfitting Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
    Fine-tuning A Survey on Transfer Learning (Zhuang et al. 2020)
    LoRA LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
    Tokenization Neural Machine Translation of Rare Words with Subword Units (Sennrich et al. 2015)

    Today’s Five

    1. Loss Function

    A formula that measures how far off the model’s predictions are from the correct answers. It quantifies the gap between what the model predicted and what it should have predicted.

    Training a neural network means minimizing this function.

    Like a scorecard that tells the model how badly it messed up.

    2. Overfitting

    When a model learns the training data too well, including noise and outliers, and fails on new data. The model performs great on examples it has seen but poorly on anything new.

    One of the most common pitfalls in machine learning.

    Like memorizing the answers to a test instead of understanding the subject.

    3. Fine-tuning

    Taking a pre-trained model and training it further on a specific task or dataset. Instead of training from scratch, you start from a model that already understands language or images, then specialize it.

    This makes powerful models accessible without massive compute budgets.

    Like teaching a chef who already knows cooking to specialize in sushi.

    4. LoRA (Low-Rank Adaptation)

    An efficient fine-tuning method that trains a small number of added parameters instead of the full model. It inserts small trainable matrices into each layer while keeping the original weights frozen.

    This dramatically reduces the memory and compute needed for fine-tuning.

    Like adding sticky notes to a textbook instead of rewriting the whole thing.

    5. Tokenization

    The process of breaking text into smaller units called tokens that a model can process. Most modern models use subword tokenization, splitting words into common pieces rather than individual characters or whole words.

    It determines what the model actually “sees” and affects everything from vocabulary size to multilingual performance.

    Like chopping sentences into bite-sized pieces a model can digest.

    Quick Reference

    Concept One-liner
    Loss Function How far off the model’s predictions are
    Overfitting Memorizing the test instead of learning the subject
    Fine-tuning Specializing a pre-trained model for a new task
    LoRA Efficient fine-tuning with small added matrices
    Tokenization Breaking text into the pieces a model actually reads

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 1779 words9 min readAbstract

    TBT (2/?): Pipelines on OS/390

    Unix invented pipes. Mainframes reinvented them—for records, not bytes.

    This is the second Throwback Thursday post—revisiting technologies that shaped how I think about programming. This time: CMS/TSO Pipelines, and a vibe coding project that brings them back to life in Rust for education, fun, and nostalgic reasons.

    Resource Link
    Code pipelines-rs
    Demo Live Demo
    Video Pipelines on OS/390 #TBT

    The 1996 Olympics and a Pair of Mainframes

    In 1996, IBM hosted the Olympics Web Server—one of the largest public web properties at the time. Many distributed IBM systems in different regions served dynamic web pages. The logs from all of them were funneled to a pair of IBM S/390 mainframes I was in charge of, running OS/390 (formerly MVS).

    When you’re processing millions of log records for statistics and forensics, you need tools that think in records, not lines. That’s where Pipelines for TSO/E came in.

    Pipelines for TSO/E was the MVS/ESA port of CMS Pipelines, which ran on VM/ESA. Both let you chain stages together to filter, transform, and aggregate record-oriented data—record-oriented pipelines that evolved in parallel with Unix’s byte-stream pipes.

    Two Traditions of Piping

    Unix pipes came first—Thompson and McIlroy at Bell Labs, 1969–1974. Byte streams, file descriptors, the | operator. Brutally simple. Explosively powerful. POSIX.1-1988 standardized pipe(2) and shell pipelines, though POSIX work began in the mid-1980s.

    CMS Pipelines emerged on IBM mainframes in the mid-to-late 1980s. They weren’t a Unix clone—they were convergent evolution under different pressures. Where Unix piped bytes between small programs, CMS piped records through declarative stages. Pipelines for TSO/E followed in the late 1980s and early 1990s, porting CMS concepts to the MVS multi-user environment. Unlike CMS Pipelines (which ships with z/VM), the TSO/E port is typically installed separately on z/OS.

    Neither tradition was “behind.” They were optimizing different dimensions:

      Unix Pipes CMS/TSO Pipelines
    Era 1969–1974 Mid-to-late 1980s
    Data unit Byte stream Records (fixed or variable length)
    Stage input stdin (bytes) Record buffer
    Field access awk, cut (text parsing) Column positions (direct)
    Execution Typically a process per stage Stages in one address space
    Topology Linear by default; fan-out/fan-in via tee, FIFOs, or process substitution Multi-stream, fan-out/fan-in built in
    Philosophy Small tools, ad hoc composition Declarative data transformation

    Many datasets on mainframes are record-structured. Records can be fixed-length or variable-length. CMS and TSO/E Pipelines treat records as byte arrays—character-oriented stages assume EBCDIC text, while position/length stages are binary-safe. A fixed-length 80-byte record isn’t arbitrary text—columns 1-8 are the name, 9-18 are the department, 19-26 are the salary. You don’t parse. You just read the right columns.

    Unix won culturally—cheap hardware, academic distribution, C portability. But IBM’s record-oriented pipelines were better at structured dataflow, and they anticipate or parallel patterns seen in ETL frameworks like Spark and Beam.

    CMS Pipelines ships with z/VM and is still used; Pipelines for TSO/E exists for z/OS but isn’t universally installed. These are not historical curiosities—mainframes continue to process a significant share of high-value transactions, and pipelines remain an available tool for data transformation on those systems.

    What a Pipeline Looks Like

    CMS Pipelines uses a DSL with PIPE as the command, | to chain stages, and ? as a command terminator (it suppresses the console from being used as implicit input):

    PIPE CONSOLE
    | FILTER 18,10 = "SALES"
    | SELECT 0,8,0; 8,10,8
    | CONSOLE
    ?
    

    This reads input records, keeps only those where columns 18–27 equal “SALES”, extracts the name fields, and writes the result. No regex. No string splitting. Just column positions.

    Note: pipelines-rs uses 0-based offsets (e.g., SELECT 0,8,0). Historical CMS Pipelines uses 1-based column positions.

    Compare with the Unix equivalent:

    cat input.txt | awk '$3 == "SALES" {print $1, $2}'
    

    The Unix version looks simpler—until your fields contain spaces, or your records contain non-text bytes, or you need to chain 15 stages without spawning 15 processes.

    Bringing It Back in Rust (Vibe Coding)

    pipelines-rs is a nostalgia-driven vibe coding project—my attempt to emulate Pipelines for TSO/E in Rust, not because it’s practical, but because these ideas deserve to be celebrated. It supports a subset of stages and features two execution models:

    The Two Executors

    Batched processes all records through one stage before moving to the next:

    All records → Stage 1 → All records → Stage 2 → All records → Stage 3
    

    This emulates the correct output and is faster, but doesn’t demonstrate record-oriented dataflow well.

    Record-At-a-Time (RAT) sends each record through the entire pipeline before reading the next:

    Record 1 → Stage 1 → Stage 2 → Stage 3 → Output
    Record 2 → Stage 1 → Stage 2 → Stage 3 → Output
    Record 3 → Stage 1 → Stage 2 → Stage 3 → Output
    

    RAT is the implementation shown in the video. It’s a naive approach—more buffers, more copying—but it shows the dataflow concepts clearly and enables the visual debugger. Both run in linear time (records × stages) and produce identical output for all 23 test specifications.

    A future version will aim for fewer buffers and fewer copy operations. Whether it’s faster than Batched remains to be seen.

    The 80-Byte Record

    The Rust implementation supports fixed-length records only. The fundamental data type is the Record—exactly 80 bytes, matching historical punch card width. Variable-length input lines are accepted and padded to 80 bytes:

    pub const RECORD_WIDTH: usize = 80;
    
    pub struct Record {
        data: [u8; RECORD_WIDTH],
    }
    

    Fields are accessed by column position and length. No parsing, no delimiters. The data is always right where you expect it.

    Supported Stages

    The current implementation supports 14 stages:

    Stage Purpose Example
    FILTER Keep/reject records by field value FILTER 18,10 = "SALES"
    LOCATE Keep records containing a pattern LOCATE "ERROR"
    NLOCATE Keep records NOT containing a pattern NLOCATE "DEBUG"
    SELECT Extract and reposition fields SELECT 0,8,0; 8,10,8
    CHANGE Text replacement CHANGE "SALES" "MKTG"
    COUNT Count records COUNT
    TAKE Keep first N records TAKE 5
    SKIP Skip first N records SKIP 2
    DUPLICATE Repeat each record N times DUPLICATE 3
    LITERAL Append a literal record LITERAL "--- END ---"
    UPPER/LOWER Case conversion UPPER
    REVERSE Reverse record text REVERSE
    HOLE Discard all input HOLE
    CONSOLE Driver stage: source or sink depending on position CONSOLE

    The CLI

    Both executors have identical CLIs:

    # Batch executor
    pipe-run specs/filter-sales.pipe specs/input-fixed-80.data -v
    
    # Record-at-a-time executor
    pipe-run-rat specs/filter-sales.pipe specs/input-fixed-80.data -v
    

    Given this input data:

    SMITH   JOHN      SALES     00050000
    JONES   MARY      ENGINEER  00075000
    DOE     JANE      SALES     00060000
    WILSON  ROBERT    MARKETING 00055000
    CHEN    LISA      ENGINEER  00080000
    GARCIA  CARLOS    SALES     00045000
    TAYLOR  SUSAN     MARKETING 00065000
    BROWN   MICHAEL   ENGINEER  00090000
    

    And this pipeline:

    PIPE CONSOLE
    | FILTER 18,10 = "SALES"
    | CONSOLE
    ?
    

    The output is:

    SMITH   JOHN      SALES     00050000
    DOE     JANE      SALES     00060000
    GARCIA  CARLOS    SALES     00045000
    Records:  8 in -> 3 out
    

    Exactly what I’d have gotten on OS/390 in 1996, but with Web Server log data showing client IP address, OS, browser type/version, user cookies, timestamps, URLs, and more, instead of accounting data. 😊

    The Web UI for Two pipelines-rs Implementations

    The web interface runs entirely in the browser via WebAssembly. It has three panels: input records with an 80-column ruler, the pipeline editor, and the output.

    Tutorial Mode

    The tutorial walks through each stage with examples, running pipelines automatically to show results. You can step through manually or let it auto-advance.

    The Visual Debugger

    The debugger is the reason RAT exists. It lets you:

    • Step through execution one pipe point at a time
    • Watch data at specific pipe points between stages
    • Set breakpoints to pause at specific stages
    • See stage state for stateful stages like COUNT

    You load a pipeline, click Run, then Step to watch each record flow through each stage. The debugger highlights which stages have been reached with a green border. For COUNT and other aggregation stages, you can watch the flush phase where accumulated state becomes output.

    What’s Next

    The current RAT executor is intentionally naive—it uses a buffer at every pipe point and copies each record between them. A better implementation would minimize buffers and copy operations while preserving the record-at-a-time semantics.

    Multi-pipe features are also planned—CMS Pipelines supported fan-out (one input, multiple output streams) and fan-in (multiple inputs merged), which enabled complex processing topologies beyond simple linear chains.

    How pipelines-rs Differs from IBM Pipelines

      IBM CMS/TSO/E Pipelines pipelines-rs
    Indexing 1-based column positions 0-based offsets
    Record format Fixed or variable length, EBCDIC Fixed 80-byte ASCII only (variable-length input padded)
    Stages Hundreds of built-in stages 14 implemented so far
    Topology Multi-stream: fan-out, fan-in, multi-pipe Linear only (multi-pipe planned)
    Environment z/VM, z/OS mainframes CLI (native) and browser (WASM)
    Character set EBCDIC ASCII/UTF-8

    This is a teaching tool and nostalgia project, not a production replacement.

    Implementation Details

    Metric Value
    Language Rust (2024 edition)
    Web UI Yew framework, compiled to WASM
    Stages 14 implemented
    Test Specs 23 pipeline specifications
    Tests 60+ (including batch/RAT equivalence)
    License MIT
    Live Demo sw-comp-history.github.io/pipelines-rs

    Resources

    Credits

    Role Who
    Concept & direction Mike Wright
    Content creation Claude (Anthropic)
    Editorial review ChatGPT (OpenAI)

    Mainframe ideas, modern tools. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 985 words5 min readAbstract

    Small Models (6/6): Which Small AI Fits YOUR Laptop?

    Maximum AI capability on minimum hardware. The 2-3B efficient frontier.

    This is Part 6 (the finale) of the Small Models, Big Brains series. We’re benchmarking the best small models to help you choose the right one for your laptop.

    The Efficient Frontier

    In economics, the “efficient frontier” is the set of optimal portfolios offering the highest return for a given level of risk.

    In AI, it’s the models offering the best capability for a given size.

    The Contenders

    Model Params Source Key Strength
    Phi-2 2.7B Microsoft Reasoning, synthetic data
    Gemma-2B 2B Google Distillation, multilingual
    SmolLM2-1.7B 1.7B HuggingFace 11T tokens, fast inference
    SmolLM3-3B 3B HuggingFace Dual reasoning, 6 languages

    Benchmark Results

    Actual measurements on Apple Silicon (M-series) from efficient-llm:

    Model MMLU GSM8K HumanEval Speed (CPU) Memory
    Phi-2 61.7% 57.0% 50.0% 7.1 tok/s 5.2GB
    Gemma-2B 38.9% 18.0% 90.0% 8.5 tok/s 4.7GB
    SmolLM2 55.6% * * 3.7 tok/s 3.2GB

    *SmolLM2 GSM8K/HumanEval scores reflect prompt format incompatibility, not capability.

    The Key Insight: Data Quality Beats Parameters

    Phi-2 achieves 61.7% MMLU with only 2.7B parameters.

    For comparison:

    • Llama-2-7B: ~46% MMLU
    • Llama-2-13B: ~55% MMLU

    Phi-2 beats models 5x its size. The secret? Synthetic textbook training.

    Microsoft generated high-quality educational content specifically designed to teach reasoning. Quality data > quantity data > model size.

    Model Profiles

    Phi-2: The Reasoning Champion

    Strengths: Math, logic, code understanding
    Weakness:  Less conversational
    Best for:  Technical tasks, chain-of-thought
    

    Phi-2 was trained on “textbook quality” synthetic data. It thinks like a textbook explains.

    Gemma-2B: The Distillation Expert

    Strengths: Multilingual, edge deployment
    Weakness:  Lower benchmark scores
    Best for:  Production apps, Google ecosystem
    

    Google distilled knowledge from larger models into this compact package. Great tooling and documentation.

    SmolLM2-1.7B: The Speed Demon

    Strengths: Fastest inference, smallest footprint
    Weakness:  Prompt format sensitivity
    Best for:  Memory-constrained environments
    

    HuggingFace trained on 11T tokens—massive overtraining like TinyLlama but at a slightly larger scale.

    SmolLM3-3B: The Balanced Choice

    Strengths: Dual reasoning modes, 6 languages
    Weakness:  Newest, less battle-tested
    Best for:  General-purpose small model needs
    

    The latest from HuggingFace, designed to be the go-to small model.

    Decision Framework

    ├── Need best reasoning?           → Phi-2
    ├── Need instruction following?    → SmolLM2 or SmolLM3
    ├── Need multilingual?             → Gemma-2B or SmolLM3
    ├── Memory constrained (<4GB)?     → SmolLM2 + INT4
    ├── Need Google ecosystem?         → Gemma-2B
    ├── General purpose?               → SmolLM3
    └── Maximum quality per byte?      → Phi-2
    

    Running the Benchmarks

    git clone https://github.com/softwarewrighter/efficient-llm
    cd efficient-llm
    
    # Setup
    uv venv && source .venv/bin/activate
    uv pip install torch transformers accelerate bitsandbytes datasets tqdm
    
    # HuggingFace login (required for Gemma)
    huggingface-cli login
    
    # Download and benchmark
    python download_models.py
    python benchmark_quality.py
    python benchmark_speed.py
    python benchmark_memory.py
    
    # Interactive demos
    python demo_reasoning.py
    python demo_code.py
    python demo_chat.py
    

    Hardware Requirements

    Setup Models You Can Run
    4GB RAM SmolLM2 (INT4)
    8GB RAM All models (INT4)
    16GB RAM All models (FP16)
    Apple Silicon All models (MPS)

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 7 .py files
    Estimated Size ~1.4 KLOC
    Framework Transformers, PyTorch
    Build System uv / pip
    Key Features MMLU/GSM8K/HumanEval benchmarks, demos

    Good for you if: You want to benchmark 2-3B models, compare quality vs speed tradeoffs, or run interactive comparisons between Phi-2, Gemma, and SmolLM.

    Complexity: Low. Similar structure to billion-llm. Standalone Python scripts for each benchmark and demo. Requires HuggingFace authentication for Gemma access.

    Series Recap

    Over six parts, we’ve explored the cutting edge of small model research:

    Part Model/Topic Key Insight
    1 TRM (<1K params) Iteration beats scale
    2 MobileLLM (350M) Offline AI is practical
    3 HRM (27M) Hierarchy enables reasoning
    4 BDH Sparsity enables interpretability
    5 1B models The efficiency sweet spot
    6 2-3B models Data quality beats parameters

    Key Takeaways

    1. Data quality beats parameter count. Phi-2 proves careful curation outperforms brute scaling.

    2. The 2-3B range is remarkably capable. These models handle real tasks, not just demos.

    3. Each model has its niche. Match the model to your use case.

    4. Quantization makes everything accessible. INT4 lets you run 3B models on 4GB RAM.

    5. The frontier keeps moving. SmolLM3 is weeks old. Better models are coming.

    What We’ve Learned

    Small models aren’t a compromise—they’re a different optimization target. When you can’t throw compute at a problem, you’re forced to be clever:

    • Recursive reasoning (TRM)
    • Mobile-optimized architectures (MobileLLM)
    • Hierarchical decomposition (HRM)
    • Sparse interpretable activations (BDH)
    • Overtraining on quality data (TinyLlama, Phi-2)

    These techniques will eventually feed back into large models too. Small model research isn’t a dead end—it’s the frontier.

    Resources


    Part 6 of 6 in the Small Models, Big Brains series. Thanks for following along!

    Have questions? Find me on YouTube @SoftwareWrighter or Discord.

    Watch the Video

    Unmute to hear narration.

  • 446 words3 min readAbstract

    Five ML Concepts - #2

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #2
    Video

    References

    Concept Reference
    Gradient Descent An overview of gradient descent optimization algorithms (Ruder 2016)
    Attention Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2014)
    DPO Direct Preference Optimization (Rafailov et al. 2023)
    Learning Rate Cyclical Learning Rates (Smith 2015)
    Temperature On the Properties of Neural Machine Translation (Cho et al. 2014)

    Today’s Five

    1. Gradient Descent

    A general optimization method used across machine learning. It improves a model by taking small steps in the direction that reduces error the most.

    Many learning algorithms rely on it, especially neural networks.

    Like walking downhill in fog, adjusting each step based on the slope beneath your feet.

    2. Attention

    A mechanism that lets models weigh different parts of the input by importance. Instead of treating everything equally, attention highlights what matters most.

    This was key to breakthroughs in translation and language models.

    Like reading a sentence and focusing more on the important words.

    3. DPO (Direct Preference Optimization)

    A method for aligning language models with human preferences. Unlike RLHF, it trains directly on preference comparisons and avoids an explicit reward model.

    This simplifies training while achieving comparable alignment.

    Like learning preferences by observing choices, not by designing a scoring system.

    4. Learning Rate

    Controls how large each update step is during training. Too large and learning becomes unstable. Too small and training is slow or gets stuck.

    One of the most important hyperparameters to tune.

    Like choosing how fast to walk downhill without losing balance.

    5. Temperature

    A parameter that controls randomness during text generation. Low temperature favors predictable, high-probability outputs. Higher temperature increases variety and surprise.

    A tradeoff between consistency and creativity.

    Like adjusting a dial from cautious to adventurous.

    Quick Reference

    Concept One-liner
    Gradient Descent Walk downhill to minimize error
    Attention Focus on what matters in the input
    DPO Align models from preference pairs directly
    Learning Rate Step size that balances speed and stability
    Temperature Dial between predictable and creative

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 839 words5 min readAbstract

    Small Models (5/6): Max AI Per Watt

    One billion parameters. The sweet spot for AI.

    Big enough to reason. Small enough to run anywhere. Maximum capability per watt.

    This is Part 5 of the Small Models, Big Brains series, comparing four models at the 1B parameter point.

    Resource Link
    Code billion-llm
    TinyLlama jzhang38/TinyLlama
    Llama 3.2 ai.meta.com/llama
    Pythia EleutherAI/pythia
    Video Max AI Per Watt
    Video

    Why One Billion?

    Range Reality
    Below 1B Models struggle with complex reasoning
    Above 1B Hardware requirements increase significantly
    At 1B Maximum capability per watt

    1B parameters is where you get:

    • Real language understanding
    • Ability to follow instructions
    • Fine-tuning in minutes on a laptop
    • Deployment anywhere (phone, Raspberry Pi, browser)

    The Contenders

    Model Params Key Strength Training Data
    TinyLlama 1.1B Overtrained on 3T tokens Community
    Llama-3.2-1B 1B Official Meta ecosystem Meta
    StableLM-1.6B 1.6B Multilingual, 2T tokens Stability AI
    Pythia-1B 1.08B 154 research checkpoints EleutherAI

    TinyLlama: The Overtraining Champion

    TinyLlama breaks the rules. The Chinchilla scaling laws suggest training tokens should scale with parameters. TinyLlama uses 100x more data than optimal.

    Chinchilla-optimal for 1B: ~30B tokens
    TinyLlama actual:          3T tokens (3,000B)
    

    The result? A tiny model that punches well above its weight.

    Benchmarks

    From the billion-llm repository:

    Model MMLU HumanEval Speed Memory
    TinyLlama 25.3% 12.2% Fast 2.2GB
    Llama-3.2-1B 32.1% 18.5% Fast 2.4GB
    StableLM-1.6B 30.8% 15.1% Medium 3.2GB
    Pythia-1B 26.4% 10.3% Fast 2.2GB

    Llama-3.2-1B leads on quality. TinyLlama offers the best value when you factor in the open training recipe.

    LoRA Fine-Tuning in Minutes

    All these models can be fine-tuned on a laptop using LoRA:

    cd billion-llm
    python finetune_demo.py --model tinyllama --epochs 3
    

    LoRA adds small trainable adapters without modifying base weights:

    Base Model (frozen): 1.1B parameters
    LoRA Adapters:       ~4M parameters (0.4%)
    Training time:       5-10 minutes on M1 Mac
    

    Speculative Decoding: 2-3x Speedup

    Use a fast 1B model to draft tokens, verify with a slower 7B model:

    Draft (1B):   "The quick brown fox" → [jumps, over, the, lazy]
    Verify (7B):  Accept [jumps, over, the] → Reject [lazy] → Generate [sleepy]
    

    The 1B model generates candidates quickly. The 7B model only needs to verify, not generate from scratch.

    python speculative_demo.py
    

    Results: 2-3x speedup on autoregressive generation.

    Hardware Requirements

    Setup What You Can Run
    CPU only All models (slower, INT4 quantized)
    4GB VRAM All models (INT4 quantized)
    8GB VRAM All models (FP16)
    Apple Silicon All models (MPS acceleration)

    Quick Start

    git clone https://github.com/softwarewrighter/billion-llm
    cd billion-llm
    
    # Setup
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    
    # Download models
    python download_models.py
    
    # Run benchmarks
    python benchmark.py
    
    # Interactive comparison
    python demo_chat.py --compare tinyllama llama3.2-1b
    

    Which Model Should You Choose?

    ├── Need Meta ecosystem compatibility? → Llama-3.2-1B
    ├── Need multilingual support?         → StableLM-1.6B
    ├── Need research reproducibility?     → Pythia-1B (154 checkpoints)
    ├── Need maximum performance/size?     → TinyLlama
    └── Just getting started?              → Any of them work!
    

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 8 .py files
    Estimated Size ~1.4 KLOC
    Framework Transformers, PyTorch
    Build System uv / pip
    Key Features Benchmarking, LoRA fine-tuning, speculative decoding

    Good for you if: You want to benchmark small LLMs, learn LoRA fine-tuning, experiment with speculative decoding, or compare models head-to-head.

    Complexity: Low. Clean Python scripts with HuggingFace Transformers. Each script is standalone—run benchmarks, chat demos, or fine-tuning independently. Well-documented with shell scripts for common tasks.

    Key Takeaways

    1. 1B is the efficiency sweet spot. Below this, capability drops. Above, hardware costs rise.

    2. Overtraining works. TinyLlama proves you can compensate for size with data.

    3. LoRA makes fine-tuning accessible. Customize models on consumer hardware.

    4. Speculative decoding is free speed. Use small models to accelerate large ones.

    5. All roads lead to open weights. Every model here is fully open.

    What’s Next

    Part 6 explores the 2-3B efficient frontier—Phi-2, Gemma, and SmolLM pushing the limits of small model capability.

    Resources


    Watch the Video

    Unmute to hear narration.

  • 411 words3 min readAbstract

    Five ML Concepts - #1

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #1
    Video

    References

    Concept Reference
    Backprop Learning representations by back-propagating errors (Rumelhart, Hinton, Williams 1986)
    Transformer Attention Is All You Need (Vaswani et al. 2017)
    Mamba Mamba: Linear-Time Sequence Modeling (Gu & Dao 2023)
    Hallucination Survey of Hallucination in NLG (Ji et al. 2023)
    Embedding Word2Vec (Mikolov et al. 2013)

    Today’s Five

    1. Backpropagation

    Back propagation of errors. It’s how neural networks learn—flowing error backward through the network to adjust each weight.

    Without it, modern deep learning wouldn’t be practical.

    Think of it like retracing your steps to see which earlier choices caused the mistake.

    2. Transformer

    The architecture behind GPT, Claude, and most modern language models. Instead of processing words one at a time, transformers use attention to weigh relationships between all tokens.

    This enables parallel training and rich context awareness.

    Like understanding a sentence by seeing how every word relates to every other.

    3. Mamba (State Space Models)

    A newer alternative to transformers that processes sequences in linear time instead of quadratic.

    This allows scaling to very long documents with much lower memory use.

    Like a smart conveyor belt that carries forward only what matters.

    4. Hallucination

    When a model generates confident-sounding nonsense. It happens because language models predict plausible next words, not true facts.

    They optimize for likelihood, not correctness.

    Like a student who writes confidently without verifying sources.

    5. Embedding

    Turning words, images, or concepts into vectors of numbers. Similar meanings end up close together in this space.

    This lets math capture semantic relationships.

    Think of it as a coordinate system for meaning.

    Quick Reference

    Concept One-liner
    Backprop Learn by flowing error backward
    Transformer Attention over all tokens at once
    Mamba Linear-time sequence modeling
    Hallucination Confident nonsense from likelihood optimization
    Embedding Meaning as coordinates in vector space

    Short, accurate ML explainers. Follow for more.

    Watch the Video

    Unmute to hear narration.

  • 842 words5 min readAbstract

    Small Models (4/6): This AI Has a Visible Brain

    LLMs are black boxes. Baby Dragon Hatchling (BDH) is different—a brain-inspired language model with sparse, interpretable activations.

    Train it on Shakespeare and actually see what’s happening inside.

    This is Part 4 of the Small Models, Big Brains series, exploring interpretability through sparsity.

    Resource Link
    Paper Pathway (Sparse Coding)
    Original Code pathwaycom/bdh
    Fork (with tools) softwarewrighter/bdh
    Video This AI Has a Visible Brain
    Video

    The Black Box Problem

    Modern neural networks are opaque:

    • Billions of parameters
    • Dense activations everywhere
    • No clear mapping from neurons to concepts
    • “It works, but we don’t know why”

    This isn’t just an academic concern. We’re deploying AI systems we don’t understand.

    Baby Dragon Hatchling: A Different Approach

    BDH takes inspiration from biological brains, which use sparse coding:

    Biological Brains Dense Neural Networks
    ~1-5% neurons active ~100% neurons active
    Energy efficient Computationally expensive
    Interpretable patterns Distributed, opaque
    Robust to noise Brittle

    Sparse Activations

    BDH enforces 80% sparsity—only 20% of neurons are active for any given token.

    Dense Network:    [████████████████████] 100% active
    BDH:              [████░░░░░░░░░░░░░░░░]  20% active
    

    This constraint forces the network to learn meaningful, localized representations.

    Training on Shakespeare

    The demo trains BDH on Shakespeare’s works:

    Training Progress:
    Epoch 1:   Loss 0.86
    Epoch 50:  Loss 0.54
    Epoch 100: Loss 0.38
    Epoch 200: Loss 0.22
    

    Loss drops from 0.86 to 0.22—the architecture works.

    Seeing Inside the Model

    With sparse activations, you can actually inspect what neurons mean:

    # Which neurons fire for "love"?
    activations = model.forward("love")
    active_neurons = activations.nonzero()
    
    # Neuron 47: fires for emotional words
    # Neuron 112: fires for abstract nouns
    # Neuron 203: fires for relationship terms
    

    When only 20% of neurons fire, each one carries interpretable meaning.

    Running the Code

    The bdh repository is a fork of Pathway’s original with added inspection tools:

    git clone https://github.com/softwarewrighter/bdh
    cd bdh
    pip install -r requirements.txt
    
    # Train on Shakespeare
    python train.py --dataset shakespeare --sparsity 0.8
    
    # Inspect activations
    python inspect.py --model checkpoint.pt --text "To be or not to be"
    

    GPU recommended (Nvidia or Apple Silicon) for reasonable training times.

    Why Sparsity Enables Interpretability

    Dense Networks

    Every neuron participates in every computation. The “meaning” of any single neuron is distributed across all inputs it ever sees.

    Input: "cat"  → All neurons contribute → Output
    Input: "dog"  → All neurons contribute → Output
    Input: "love" → All neurons contribute → Output
    

    Trying to understand one neuron means understanding everything.

    Sparse Networks

    Only a small subset of neurons fire for each input. Neurons develop specialization.

    Input: "cat"  → Neurons [12, 47, 89] fire → Output
    Input: "dog"  → Neurons [12, 52, 89] fire → Output
    Input: "love" → Neurons [47, 112, 203] fire → Output
    

    Neuron 12 might mean “animal.” Neuron 47 might mean “emotional/living.” You can actually trace meaning.

    Comparison with Other Sparse Architectures

    Model Sparsity Type Purpose
    Mixture of Experts Routing sparsity Efficiency
    Top-k attention Attention sparsity Memory
    BDH Activation sparsity Interpretability

    BDH’s sparsity is specifically designed for understanding, not just efficiency.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 9 .py files
    Estimated Size ~1.5 KLOC
    Framework PyTorch
    Build System pip / requirements.txt
    GPU Support CUDA, MPS (Apple Silicon)

    Good for you if: You want to experiment with sparse neural architectures, study interpretability techniques, or train small language models with visible internals.

    Complexity: Low-Moderate. Standard PyTorch project structure. The sparse activation mechanism is well-documented. Fork includes additional inspection tools not in the original.

    Key Takeaways

    1. Sparsity enables interpretability. When fewer neurons fire, each one means more.

    2. Brain-inspired design works. Biological neural coding principles transfer to AI.

    3. Interpretability doesn’t require sacrifice. BDH learns effectively despite constraints.

    4. We can build AI we understand. Black boxes aren’t inevitable.

    Current Limitations

    • Early research stage
    • Smaller scale than production models
    • Training requires more epochs
    • Not yet competitive with dense models on benchmarks

    But the principle is sound: constraint breeds clarity.

    What’s Next

    Part 5 dives into the 1B parameter sweet spot—comparing TinyLlama, Llama 3.2, StableLM, and Pythia.

    Resources


    Watch the Video

    Unmute to hear narration.

  • 1473 words8 min readAbstract

    Solving Sparse Rewards with Many Eyes

    Single explorer: 0% success. Five explorers: 60% success.

    Learning often fails not because models are slow, but because they see too little. In sparse-reward environments, a single explorer is likely to miss the rare feedback entirely. The solution? Put many eyes on the problem.

    The Problem: Sparse Rewards Create Blindness

    As IRPO formalizes: in sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal.

    A 7x7 grid with a single goal demonstrates this perfectly:

    • Random agent success rate: ~9%
    • With limited training (75 episodes), a single learner exploring alone never finds the goal

    This isn’t a compute problem. It’s an information problem.

    Challenge Effect Paper Connection
    Rare rewards Weak gradient signal IRPO’s core problem statement
    Single explorer Limited coverage Why multiple scouts help
    Random exploration Misses valuable states Why intrinsic rewards matter
    No feedback structure Can’t distinguish “almost right” from “nonsense” Reagent’s motivation

    The Solution: Many Eyes

    Instead of one explorer, use multiple scouts—independent exploratory agents that gather diverse information.

    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │   Scout 1   │  │   Scout 2   │  │   Scout N   │
    │ (strategy A)│  │ (strategy B)│  │ (strategy N)│
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           v                v                v
    ┌─────────────────────────────────────────────────┐
    │              Experience Buffer                   │
    └─────────────────────────────────────────────────┘
                           │
                           v
    ┌─────────────────────────────────────────────────┐
    │               Shared Learner                     │
    └─────────────────────────────────────────────────┘
    

    Each scout explores with its own strategy. Their discoveries are aggregated to improve a shared learner.

    Results

    On a 7x7 sparse grid with 75 training episodes:

    Method Success Rate
    Random baseline 9%
    Single scout 0%
    Many eyes (3 scouts) 40%
    Many eyes (5 scouts) 60%

    Same total environment steps. Dramatically better outcomes.

    Why It Works

    Single Scout Fails Because:

    In IRPO terms: sparse reward → sparse gradient signal → no learning.

    1. Random exploration rarely reaches the goal (~9%)
    2. Insufficient successful trajectories
    3. DQN can’t learn from sparse positive examples
    4. The policy gradient has near-zero magnitude

    Many Eyes Succeeds Because:

    IRPO’s key insight: multiple exploratory policies manufacture signal.

    1. More coverage: Different scouts explore different regions (intrinsic rewards drive novelty-seeking)
    2. More discoveries: Higher probability of reaching goal (scouts find extrinsic reward)
    3. Signal routing: Scout discoveries update the shared learner (surrogate gradient in IRPO, experience pooling in many-eyes)
    4. Better gradients: Aggregated experience provides meaningful learning signal

    Scout Strategies (Intrinsic Rewards)

    IRPO uses intrinsic rewards to drive exploration. The many-eyes-learning project implements several strategies:

    Strategy Intrinsic Motivation IRPO Connection
    Epsilon-greedy Random action with probability ε Simple exploration noise
    Curious Bonus for novel states: 1/√(count+1) Count-based intrinsic reward
    Optimistic High initial Q-values Optimism under uncertainty
    Random Pure random baseline Maximum entropy exploration
    # CuriousScout intrinsic reward (simplified)
    def intrinsic_reward(self, state):
        count = self.state_counts[state]
        return self.bonus_scale / sqrt(count + 1)
    

    Scouts can be homogeneous (same strategy, different seeds) or heterogeneous (different strategies). IRPO supports swapping intrinsic reward functions—many-eyes makes this concrete with pluggable scout types.

    Running the Demo

    git clone https://github.com/softwarewrighter/many-eyes-learning
    cd many-eyes-learning
    
    # Setup
    uv venv .venv
    source .venv/bin/activate
    uv pip install -e ".[dev]"
    
    # Interactive CLI demo
    python experiments/cli_demo.py
    
    # Full experiment
    python experiments/run_experiment.py --episodes 75 --scouts 1 3 5
    
    # Generate plots
    python experiments/plot_results.py
    

    Results appear in ~5-10 minutes on a laptop.

    Diversity Experiment

    Does diversity of strategies matter, or just number of scouts?

    Configuration Success Rate
    5 random scouts 20%
    5 epsilon-greedy scouts 40%
    5 diverse scouts (mixed strategies) 40%

    Finding: In simple environments, strategy quality matters more than diversity. Epsilon-greedy beats random regardless of diversity.

    Key Insight

    The problem isn’t that learning is slow. The problem is that learning is blind.

    Many eyes make learning better.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files ~12 .py files
    Estimated Size ~1.5 KLOC
    Framework PyTorch, NumPy
    Platform CPU (no GPU required)

    Good for you if: You want to understand exploration in RL, experiment with sparse-reward environments, or see a clean implementation of scout-based learning.

    Complexity: Low-Moderate. Clean codebase with CLI demos. Runs on a laptop in minutes.

    Design Philosophy

    The project prioritizes clarity over performance:

    • Single-file implementations where practical
    • Minimal dependencies
    • Sequential mode is first-class (parallel optional)
    • Reproducible experiments with fixed seeds

    Simplifications from IRPO

    Full IRPO computes Jacobians to route gradients from exploratory policies back to the base policy. Many-eyes-learning simplifies this:

    IRPO Many-Eyes-Learning
    Jacobian chain rule Experience pooling
    Surrogate gradient Standard DQN updates
    Learned intrinsic rewards Hand-designed strategies

    The core insight remains: scouts explore with intrinsic motivation, discoveries benefit the shared learner. The math is simpler, the demo runs on a laptop, and the concept is clear.

    Key Takeaways

    1. Sparse rewards create information bottlenecks. Learning fails not from lack of compute, but lack of signal.

    2. More eyes = more information. Multiple scouts increase coverage and discovery rate.

    3. Diversity helps, but quality matters more. In simple environments, good exploration strategy beats diversity.

    4. Same compute, better outcomes. Many-eyes improves sample efficiency, not wall-clock speed.

    The Papers Behind Many-Eyes

    This project builds on two recent papers that attack the same fundamental problem: sparse rewards starve learning of signal.

    IRPO: Intrinsic Reward Policy Optimization

    IRPO (Cho & Tran, UIUC) formalizes the scouts concept mathematically.

    The core insight: In sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal. Learning stalls.

    IRPO’s solution:

    ┌─────────────────────────────────────────────────┐
    │  1. Train exploratory policies (scouts)         │
    │     using INTRINSIC rewards                     │
    ├─────────────────────────────────────────────────┤
    │  2. Scouts discover EXTRINSIC rewards           │
    │     through exploration                         │
    ├─────────────────────────────────────────────────┤
    │  3. Route extrinsic signal back to base policy  │
    │     via surrogate gradient (Jacobian chain)     │
    └─────────────────────────────────────────────────┘
    
    IRPO Concept What It Means
    Intrinsic rewards “Explore what’s new” - reward novelty
    Exploratory policies Scouts driven by intrinsic motivation
    Surrogate gradient Trade bias for signal - approximate gradient that actually has magnitude
    Base policy The learner that benefits from scout discoveries

    How many-eyes-learning demonstrates this:

    • Scouts implement intrinsic motivation (CuriousScout uses count-based novelty bonuses)
    • Multiple exploration strategies create diverse coverage
    • Aggregated experience routes discoveries to the shared DQN learner
    • Simplified gradient routing - we pool experiences rather than compute full Jacobians

    Reagent: Reasoning Reward Models for Agents

    Reagent (Fan et al., CUHK/Meituan) takes a different approach: make feedback richer and more structured.

    The problem with sparse rewards: They can’t distinguish “almost right, failed at the end” from “complete nonsense.” Both get the same zero reward.

    Reagent’s solution: Build a Reasoning Reward Model that emits:

    Signal Purpose
    <think> Explicit reasoning trace
    <critique> Targeted natural-language feedback
    <score> Overall scalar reward

    This provides dense-ish supervision without hand-labeling every step.

    How many-eyes-learning relates:

    • Both papers recognize sparse rewards as an information problem
    • Reagent enriches the reward signal; IRPO multiplies the exploration
    • Many-eyes takes the IRPO path: more explorers finding the sparse signal
    • Future work could combine both: scouts + richer feedback per trajectory

    The Shared Meta-Lesson

    Both papers are saying the same thing:

    Sparse signals are a tragedy. Let’s smuggle in richer ones.

    • IRPO: via intrinsic-reward exploration gradients
    • Reagent: via language-based reward feedback

    Many-eyes-learning demonstrates the IRPO intuition in a simple, visual, reproducible way.

    Resources


    Sparse rewards are an information problem. Many eyes provide the solution.

    Watch the Video

    Unmute to hear narration.

  • 661 words4 min readAbstract

    MCP: Teaching Claude to Play (and Trash Talk)

    Claude learned to play tic-tac-toe. And trash talk. Using one protocol that works with any language model.

    The Problem

    Language models are stuck in text. They can’t click buttons, make moves, or interact with real systems. Every integration is custom—different for Claude, GPT, Gemini.

    The Solution: MCP

    Model Context Protocol is a standard way for models to use tools. Define your tools once, they work with Claude, GPT, or any MCP-compatible agent.

    The protocol is simple:

    • JSON-RPC 2.0 over stdio
    • No HTTP needed
    • Clean request/response cycle

    The Demo: Trash Talkin’ Tic Tac Toe

    This proof-of-concept implements 6 MCP tools:

    Tool Purpose
    view_game_state See the board, players, status
    get_turn Whose turn is it?
    make_move Play a square (row, col)
    taunt_player Send trash talk to opponent
    restart_game Start a new game
    get_game_history All moves with timestamps

    The AI calls tools, the server responds. Claude can play a full game AND talk trash—all through the same protocol.

    Architecture

    ┌─────────────────────────────────────────────┐
    │            Claude Code (AI)                 │
    │              (MCP Client)                   │
    └──────────────────┬──────────────────────────┘
                       │ JSON-RPC 2.0 via stdio
                       ▼
    ┌─────────────────────────────────────────────┐
    │         MCP Server (Rust Binary)            │
    │  ┌───────────────────────────────────────┐  │
    │  │  6 Tools: view, turn, move, taunt,   │  │
    │  │           restart, history            │  │
    │  └───────────────────────────────────────┘  │
    │                   ▼                         │
    │  ┌───────────────────────────────────────┐  │
    │  │      SQLite (game.db)                 │  │
    │  │  • Games • Moves • Taunts             │  │
    │  └───────────────────────────────────────┘  │
    └─────────────────────────────────────────────┘
             ▲                           ▲
             │ REST API                  │ MCP
             │                           │
        Browser (UI)              AI Agent
        (Yew/WASM)              (Claude Code)
    

    Running It

    git clone https://github.com/sw-game-dev/game-mcp-poc
    cd game-mcp-poc
    
    # Development mode (with hot-reload)
    ./scripts/dev.sh
    
    # Or production build
    ./scripts/build.sh
    ./scripts/serve.sh
    

    The server runs on http://localhost:7397 serving:

    • REST API for UI interactions
    • MCP endpoint for AI agents
    • SSE for real-time updates
    • Yew/WASM frontend

    Configuring Claude Code

    Add to ~/.config/claude-code/mcp.json:

    {
      "mcpServers": {
        "tic-tac-toe": {
          "command": "/path/to/game-mcp-poc/target/release/game-mcp-server",
          "args": [],
          "env": {
            "GAME_DB_PATH": "/path/to/game.db"
          }
        }
      }
    }
    

    Restart Claude Code, then:

    You: "Let's play tic-tac-toe! Show me the board."
    You: "I'll take the center."
    You: "Your turn!"
    You: "Can you taunt me?"
    

    Implementation Details

    Metric Value
    Language Rust 2024 Edition
    Frontend Yew + WebAssembly
    Database SQLite
    Tests 175+ passing
    LOC ~2,500 (backend) + ~1,500 (tests)
    Binary Size ~8 MB

    Good for you if: You want to learn MCP, build AI-tool integrations, or see a production-quality Rust game server.

    Complexity: Moderate. Clean architecture with TDD. Requires Rust toolchain and understanding of JSON-RPC.

    Key Takeaways

    1. MCP standardizes AI tools. Define once, works with any compatible model.

    2. JSON-RPC over stdio is elegant. No HTTP complexity for local tools.

    3. Rust + WASM = fast everywhere. Same language for server and (via Yew) frontend.

    4. Trash talk is essential. Games without taunting are just… exercises.

    Resources


    MCP turns language models into tool users. This demo proves it works—and that AI can talk trash.

    Watch the Video

    Unmute to hear narration.

  • 789 words4 min readAbstract

    Small Models (3/6): Planner + Doer = Genius

    27 million parameters beats o3-mini on ARC.

    The hardest reasoning benchmark. Most LLMs score under 5 percent. This tiny model scores 40 percent.

    This is Part 3 of the Small Models, Big Brains series, exploring the Hierarchical Reasoning Model (HRM)—a brain-inspired architecture that separates planning from execution.

    Resource Link
    Paper Hierarchical Reasoning Model
    Original Code sapientinc/HRM
    Visualization viz-hrm-ft
    Video Planner + Doer = Genius
    Video

    The ARC Challenge

    The Abstraction and Reasoning Corpus (ARC) tests:

    • Abstract reasoning
    • Pattern matching
    • Spatial logic
    • Puzzles requiring real thinking

    These aren’t problems you can memorize. Each puzzle is unique, requiring genuine understanding of the underlying pattern.

    Why LLMs Struggle

    Challenge LLM Limitation
    Novel patterns Can’t rely on training data
    Spatial reasoning Text-based thinking is linearized
    Multi-step logic Each step compounds errors
    Abstraction Pattern matching isn’t generalization

    Meet HRM: The Hierarchical Reasoning Model

    HRM uses just 27 million parameters but achieves remarkable results by mimicking how the brain thinks: plan first, then act.

    Two-Module Architecture

    ┌─────────────────────────────────────┐
    │           PLANNER                   │
    │   Thinks slow and abstract          │
    │   Sets goals and strategies         │
    └─────────────┬───────────────────────┘
                  │ Goals
                  ▼
    ┌─────────────────────────────────────┐
    │            DOER                     │
    │   Moves fast                        │
    │   Takes concrete actions            │
    └─────────────────────────────────────┘
    
    Module Speed Function
    Planner Slow Abstract thinking, goal setting
    Doer Fast Concrete actions, execution

    This mirrors the brain’s dual-process theory: System 1 (fast, intuitive) and System 2 (slow, deliberate).

    Results

    Benchmark HRM (27M) o3-mini GPT-4
    ARC 40% <40% <5%
    Hard Mazes 99% - ~0%
    Complex Sudoku 99% - -

    A 27M parameter model outperforming models 1000x larger on reasoning tasks.

    The Visualization Tool

    The viz-hrm-ft repository provides a React app to visualize HRM’s reasoning process:

    • Watch the Planner form strategies
    • See the Doer execute actions
    • Visualize the feedback loop between modules
    • Simulate fine-tuning on BabyAI tasks
    git clone https://github.com/softwarewrighter/viz-hrm-ft
    cd viz-hrm-ft
    npm install
    npm start
    

    Why Hierarchy Works

    Traditional Flat Models

    Input → [Single Network] → Output
    

    Everything happens in one pass. Complex problems overwhelm the network.

    Hierarchical Models

    Input → [Planner] → Strategy
                      ↓
    Strategy → [Doer] → Action
                      ↓
    Action → [Environment] → Feedback
                           ↓
    Feedback → [Planner] → Refined Strategy
                         ↓
                        ...
    

    The Planner doesn’t worry about details. The Doer doesn’t worry about strategy. Each module focuses on what it does best.

    Key Insights

    1. Separation of concerns scales. Splitting planning from execution lets each module specialize.

    2. Iteration enables refinement. The Planner-Doer loop allows course correction.

    3. Small can beat big. 27M parameters with good architecture beats 100B+ with brute force.

    4. Brain-inspired design works. Mimicking cognitive architecture yields better results.

    Comparison with Part 1 (TRM)

    Aspect TRM HRM
    Parameters <1,000 27M
    Architecture Think-Act cycles Planner-Doer hierarchy
    Strength Maze solving Abstract reasoning
    Key insight Iteration Hierarchical decomposition

    Both use recursive reasoning, but HRM adds hierarchical structure for more complex tasks.

    Implementation Details

    Metric Value
    Primary Language TypeScript
    Source Files 26 .ts/.tsx, 7 .js
    Estimated Size ~4 KLOC
    Framework React
    Build System npm / Create React App
    Visualization Canvas-based rendering

    Good for you if: You want to visualize neural reasoning processes, build interactive ML demos, or learn React with a real project.

    Complexity: Low-Moderate. Standard React/TypeScript project. No ML training code—this is a visualization tool for understanding the HRM architecture. Easy to extend with new visualizations.

    Key Takeaways

    1. Plan, then act. Separating strategy from execution mirrors effective human thinking.

    2. Hierarchy enables complexity. Multi-level reasoning handles problems flat networks can’t.

    3. Architecture > Scale for reasoning tasks.

    4. ARC remains unsolved by brute-force scaling—clever architectures are the path forward.

    What’s Next

    Part 4 explores Baby Dragon Hatchling (BDH)—a brain-inspired model with visible, interpretable activations.

    Resources


    Watch the Video

    Unmute to hear narration.

  • 705 words4 min readAbstract

    Deepseek Papers (2/3): Engram - Conditional Memory for Transformers

    Deepseek publishes papers. I implement them. This paper tackles another fundamental transformer problem: redundant computation.

    This post covers my implementation of Engram (Conditional Memory via Scalable Lookup)—running on both Apple Silicon and NVIDIA GPUs.

    Resource Link
    Paper arXiv:2601.07372
    Code engram-poc
    Video 1 Engram Part 1
    Video
    Video 2 Engram Part 2
    Video

    The Problem: Redundant Computation

    LLMs waste compute reconstructing patterns they’ve seen before:

    • Style rules repeated across files
    • Common code idioms re-derived each call
    • Boilerplate knowledge injected repeatedly

    Attention computes everything from scratch every time. For recurring patterns, this is wasteful.

    The Engram Solution: O(1) Lookup

    Engram introduces conditional memory as a complementary sparsity axis. Instead of recomputing common patterns through attention, look them up in O(1) time.

    Think of it as a cache for the model’s learned patterns:

    Without Engram With Engram
    Recompute pattern every call Look up cached result
    O(n²) attention O(1) deterministic lookup
    Implicit knowledge Explicit, inspectable memory

    The PoC Approach

    The full Engram paper describes in-model memory. The engram-poc repo approximates the benefits through behavioral fine-tuning:

    1. Pattern Injection: Training data encodes lookup-like patterns
    2. LoRA Adapters: Learn to recognize and consistently respond
    3. Evaluation: Compare baseline vs tuned model

    Pattern Categories

    The PoC includes 131 patterns across 4 categories:

    Category Examples
    Code Idioms for i in range(len(items)):
    Factual Recall HTTP status for 'Not Found'?404
    Format Transforms snake_case: getUserNameget_user_name
    Error Fixes Fix: if x = 5:if x == 5:

    Results

    Training on SmolLM-135M-Instruct:

    Metric Value
    Training Examples 337
    Training Time ~10 seconds (M-series Mac)
    Loss Reduction 58.2% (4.34 → 1.82)

    Behavioral change:

    Prompt: Complete: for i in range(
    
    Baseline:     "Here is a Python function that implements this approach..."
    Engram-tuned: "len(items)):"
    

    The tuned model produces direct, pattern-completing responses instead of verbose explanations.

    Running the Engram Demo

    git clone https://github.com/softwarewrighter/engram-poc
    cd engram-poc
    
    # Apple Silicon
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    ./scripts/run_all.sh
    
    # NVIDIA GPU (separate directory)
    cd unsloth-nvidia
    uv venv && source .venv/bin/activate
    uv pip install torch --index-url https://download.pytorch.org/whl/cu124
    uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    ./scripts/run_all.sh
    

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 24 .py, 10 .sh, 6 .yaml
    Estimated Size ~3.0 KLOC
    Frameworks MLX-LM, Unsloth
    Platforms Apple Silicon, NVIDIA CUDA
    Key Features LoRA fine-tuning, pattern evaluation, interactive demo

    Good for you if: You want to experiment with LoRA fine-tuning, understand behavioral pattern injection, or compare MLX vs Unsloth workflows.

    Complexity: Moderate. Includes extensive documentation and video recording guides. Pattern data is human-readable YAML.

    Key Takeaways

    1. Engram reduces redundant computation. O(1) lookup for recurring patterns beats recomputing through attention.

    2. LoRA makes experimentation accessible. Fine-tune small models in seconds on a laptop.

    3. Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, with different tooling for each.

    4. Deepseek publishes useful research. Their papers address real problems with practical solutions.

    What’s Next

    Part 3 will cover Engram Revisited—what happened when we moved from behavioral emulation to real hash-based memory implementation. Spoiler: it works, but not everywhere.

    Resources


    Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.

    Watch the Video

    Unmute to hear narration.

  • 692 words4 min readAbstract

    Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs

    A tiny 135M parameter model goes from 0% to 75% accuracy in 5 minutes of training. The secret? Knowledge graph-guided training with rejection sampling.

    The Problem: Multi-Hop Reasoning

    LLMs struggle with questions requiring multiple reasoning steps. “What’s the fix for a crash caused by a corrupted config file on a system running outdated firmware?” requires connecting several facts:

    1. Corrupted config → need config reset
    2. Outdated firmware → need firmware update
    3. Crash context → check dependencies between these fixes

    Standard fine-tuning teaches pattern matching. Multi-hop reasoning requires following logical chains.

    The Paper’s Approach

    Training wheels

    Learn with training wheels, remove them after learning completes.

    Knowledge Graph-Guided RAG from Princeton proposes using knowledge graphs during training to score reasoning quality—then removing the graph at inference.

    The key insight: train with scaffolding, test without it.

    My Implementation

    The repo implements this for a software troubleshooting domain:

    Component Details
    Knowledge Graph ~200 entities, ~600 edges (symptoms, causes, fixes)
    Training Data MCQs with 1-3 hop paths
    Eval Data MCQs with 4-5 hop paths (harder)
    Model SmolLM-135M-Instruct
    Framework MLX (Apple Silicon native)

    The Training Pipeline

    ┌─────────────────────────────────────────┐
    │  1. SFT: Learn output format            │
    │     TRACE: <reasoning>                  │
    │     ANSWER: A|B|C|D                     │
    ├─────────────────────────────────────────┤
    │  2. RSFT: Rejection Sampling FT         │
    │     - Generate multiple answers         │
    │     - Score with knowledge graph        │
    │     - Keep only correct traces          │
    │     - Train on winners                  │
    └─────────────────────────────────────────┘
    

    The Reward Function

    The knowledge graph scores outputs during training:

    • R_corr: +1.0 correct answer, -2.0 incorrect
    • R_path: Entity coverage (did the trace mention relevant nodes?)
    • P_spam: -0.5 penalty for repeating entities (prevents gaming)

    At inference, the graph is removed. The model must reason from learned patterns.

    Results

    Phase Accuracy Training Time
    Base model 0% -
    After SFT 30% ~2 min
    After RSFT 75% ~3 min

    The critical finding: distribution matching matters.

    Training on easy examples (1-2 hops) hurt performance on hard eval (4-5 hops). Training on examples matching the eval distribution achieved 75%.

    Running It

    git clone https://github.com/softwarewrighter/multi-hop-reasoning
    cd multi-hop-reasoning
    
    # Setup (Apple Silicon)
    make setup-mlx
    
    # Full pipeline
    make train
    

    Results appear in ~5 minutes on an M-series Mac.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 12 .py files
    Estimated Size ~1.5 KLOC
    Framework MLX, Transformers
    Platform Apple Silicon (MLX native)

    Good for you if: You want to understand knowledge graph-guided training, experiment with rejection sampling fine-tuning, or see how small models can learn reasoning patterns.

    Complexity: Moderate. Clean codebase with Make targets for each step. Requires understanding of fine-tuning concepts.

    Key Takeaways

    1. Scaffolded training works. Use structured feedback during training, remove it at inference.

    2. Distribution matching matters. Train on examples that match your eval distribution.

    3. Small models can reason. 135M parameters is enough for 75% accuracy on 4-5 hop questions.

    4. MLX makes iteration fast. Full pipeline runs in 5 minutes on a MacBook.

    Resources


    Knowledge graphs as training wheels—helping small models learn to reason, then letting go.

    Watch the Video

    Unmute to hear narration.

  • 765 words4 min readAbstract

    Small Models (2/6): AI in Your Pocket

    AI on your phone. All day. No internet required.

    This is Part 2 of the Small Models, Big Brains series. Today we’re putting a language model in your pocket with Pocket Eliza++—a modern AI therapist that runs completely offline on Android.

    Resource Link
    Paper MobileLLM (ICML 2024)
    Code pocket-llm
    Runtime llama.cpp
    Video AI in Your Pocket
    Video

    Why Offline Matters

    Benefit Description
    Privacy Data never leaves your device
    Speed No network latency
    Cost No API fees
    Offline Works without internet
    Battery Efficient on-device inference

    Cloud AI is convenient, but sometimes you want a conversation that stays on your device.

    MobileLLM: Meta’s Edge Champion

    MobileLLM is Meta’s sub-500M parameter model optimized specifically for on-device inference.

    Architecture Optimizations

    Technique Benefit
    Deep-thin design More layers, fewer parameters per layer
    SwiGLU activation Better performance than ReLU
    Embedding sharing Saves 30% of parameters
    Grouped-query attention Faster inference

    The result: a 260MB quantized model (Q4_K_M) that runs smoothly on phones.

    Pocket Eliza++

    Eliza taking notes

    The original ELIZA (1966) used pattern matching to simulate a Rogerian therapist. Pocket Eliza++ uses the same therapeutic approach but with actual language understanding.

    Therapeutic Design

    The system prompt instructs the model to:

    • Ask one short question at a time
    • Never repeat questions
    • Vary question types (feelings, motivations, specifics)
    • Never give advice or explanations

    It’s a reflective listener, not a problem solver.

    Technical Stack

    ┌─────────────────────────────────┐
    │     Kotlin + Jetpack Compose    │  UI Layer
    ├─────────────────────────────────┤
    │            JNI Bridge           │
    ├─────────────────────────────────┤
    │           llama.cpp             │  Inference Engine
    ├─────────────────────────────────┤
    │    MobileLLM-350M (Q4_K_M)      │  Model (260MB)
    └─────────────────────────────────┘
    
    • Model: MobileLLM-350M quantized to Q4_K_M (260MB GGUF)
    • Runtime: llama.cpp compiled for Android via NDK
    • Interface: Kotlin + Jetpack Compose
    • Bridge: JNI bindings connect Kotlin to native llama.cpp

    Building the App

    # Clone the repository
    git clone https://github.com/softwarewrighter/pocket-llm
    cd pocket-llm/android-demo
    
    # Clone llama.cpp into native source
    git clone https://github.com/ggerganov/llama.cpp.git \
        app/src/main/cpp/llama.cpp
    
    # Download the model (260MB)
    mkdir -p app/src/main/assets
    curl -L -o app/src/main/assets/MobileLLM-376M-Q4_K_M.gguf \
        "https://huggingface.co/pjh64/MobileLLM-350M-GGUF/resolve/main/MobileLLM-376M-Q4_K_M.gguf"
    
    # Build and install
    ./gradlew assembleDebug
    adb install -r app/build/outputs/apk/debug/app-debug.apk
    

    Build Requirements

    Requirement Value
    Target SDK 35 (Android 15)
    Min SDK 28 (Android 9.0)
    ABI arm64-v8a
    NDK CMake for native build
    Kotlin 2.0.0

    Quick CLI Demo

    Don’t want to build the Android app? Test with Ollama:

    pip install -r requirements.txt
    ollama pull smollm:360m
    python3 eliza.py
    

    Performance

    On a mid-range Android phone (Snapdragon 7 series):

    • First token: ~500ms
    • Generation: ~10 tokens/second
    • Memory: ~400MB RAM
    • Battery: Minimal impact for short sessions

    Implementation Details

    Metric Value
    Languages Kotlin (UI), Python (CLI), C++ (JNI)
    Source Files 6 .kt, 4 .py, 2 .cpp
    Estimated Size ~1.3 KLOC
    Android Target SDK 28+ (Android 9.0)
    Build System Gradle + CMake (NDK)
    Key Dependency llama.cpp (vendored)

    Good for you if: You want to deploy LLMs on Android, learn JNI/NDK integration, or build privacy-focused mobile AI apps.

    Complexity: Moderate-High. Requires Android Studio, NDK setup, and understanding of JNI bridges. The llama.cpp integration is the tricky part; the Kotlin UI is straightforward Jetpack Compose.

    Key Takeaways

    1. Sub-500M models are phone-ready. MobileLLM proves useful AI fits in your pocket.

    2. llama.cpp is the universal runtime. Same engine runs on Mac, Linux, Windows, and Android.

    3. Privacy doesn’t require sacrifice. Offline AI can still be conversational and helpful.

    4. Quantization is essential. Q4_K_M brings 350M parameters down to 260MB with minimal quality loss.

    What’s Next

    Part 3 explores the Hierarchical Reasoning Model (HRM)—a 27M parameter model that beats o3-mini on abstract reasoning.

    Resources


    Watch the Video

    Unmute to hear narration.

  • 760 words4 min readAbstract

    Deepseek Papers (1/3): mHC - Training Stability at Any Depth

    Deepseek publishes papers. I implement them. This paper tackles a fundamental transformer problem: training stability in deep networks.

    This post covers my implementation of mHC (Manifold-Constrained Hyper-Connections)—running on both Apple Silicon and NVIDIA GPUs.

    Resource Link
    Paper arXiv:2512.24880
    Code mHC-poc
    ELI5 eli5-mHC.md
    ELI4 eli4-mHC.md
    Video 1 mHC Demo
    Video
    Video 2 mHC Explained
    Video
    Video 3 mHC Results
    Video

    The Problem: Deep Networks Explode

    Residual connections revolutionized deep learning. Skip connections let gradients flow through hundreds of layers. But there’s a catch.

    Standard residual connections:

    output = layer(input) + input
    

    This works, but the signal accumulates. With many layers, small amplifications compound into instability.

    Hyper-Connections (HC) tried to fix this by learning connection weights:

    output = α₁ × layer(input) + α₂ × input
    

    Better expressiveness, but learned weights can still cause explosion. At 48 layers, HC becomes unstable.

    The mHC Solution: Doubly-Stochastic Constraints

    mHC constrains the connection weights using Sinkhorn-Knopp iteration—a mathematical technique that ensures weights form a doubly-stochastic matrix.

    What does “doubly-stochastic” mean?

    • Each row sums to 1
    • Each column sums to 1

    This bounds the total signal flow. No matter how deep the network, amplification stays controlled.

    # Sinkhorn-Knopp iteration (simplified)
    def make_doubly_stochastic(weights, iterations=5):
        for _ in range(iterations):
            weights = weights / weights.sum(dim=0)  # Column normalize
            weights = weights / weights.sum(dim=1)  # Row normalize
        return weights
    

    Results: Stability at Depth

    The mHC-poc repo stress-tests this with a depth sweep:

    Depth Baseline HC mHC
    12 layers Stable Stable Stable
    24 layers Struggling Stable Stable
    48 layers Oscillating Explodes Stable

    At 48 layers:

    • HC gain proxy: 10²⁷ (catastrophic amplification)
    • mHC gain proxy: 10⁻⁰·⁶ (bounded, healthy)

    HC’s final loss at 48 layers: 5.54 (never learns) mHC’s final loss at 48 layers: 0.0002 (perfect convergence)

    Cross-Platform Validation

    The implementation runs on both Apple Silicon (MLX) and NVIDIA (PyTorch/CUDA):

    Metric MLX (Apple) CUDA (NVIDIA)
    Gain Proxy (24L) -0.6 -0.602
    Gradient Stability Stable Stable
    NaN Events 0 0

    Identical results confirm the Sinkhorn-Knopp projection works correctly on both platforms.

    Running the mHC Demo

    git clone https://github.com/softwarewrighter/mHC-poc
    cd mHC-poc
    
    # Apple Silicon (MLX)
    uv venv && source .venv/bin/activate
    uv pip install -r mlx/requirements.txt
    bash scripts/run_depth_sweep.sh
    
    # NVIDIA (CUDA)
    cd cuda
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    bash scripts/run_cuda_depth_sweep.sh
    

    Results go to runs/ with plots showing loss, gradient norms, and gain proxy across depths.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 29 .py, 3 .sh, 10 .yaml
    Estimated Size ~2.5 KLOC
    Frameworks MLX, PyTorch
    Platforms Apple Silicon, NVIDIA CUDA
    Key Features Depth sweep, cross-platform validation, visualization

    Good for you if: You want to understand mHC’s stability benefits, compare MLX vs PyTorch implementations, or experiment with residual connection variants.

    Complexity: Moderate. Well-documented with ELI5 explanations in docs/. Requires understanding of residual connections and matrix constraints.

    Key Takeaways

    1. mHC solves deep network instability. Doubly-stochastic constraints bound signal amplification at any depth.

    2. Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, validated to produce identical results.

    3. Deepseek publishes useful research. Their papers address real problems with practical solutions.

    What’s Next

    Part 2 covers Engram—Deepseek’s approach to reducing redundant computation through conditional memory.

    Resources


    Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.

    Watch the Video

    Unmute to hear narration.

  • 703 words4 min readAbstract

    Small Models (1/6): 976 Parameters Beat Billions

    The best large language models score zero on hard mazes. A model with under 1,000 parameters scores 85 percent.

    This is Part 1 of the Small Models, Big Brains series, exploring how tiny models with clever architectures outperform massive ones on specific tasks.

    Why LLMs Fail at Mazes

    Large language models generate one token at a time. They cannot backtrack. One wrong move and the entire solution fails.

    Maze solving requires:

    • Exploring dead ends
    • Backtracking when stuck
    • Maintaining spatial awareness
    • Planning multiple steps ahead

    Autoregressive generation is fundamentally incompatible with these requirements.

    Meet TRM: The Tiny Recursive Model

    The Tiny Recursive Model uses under 1,000 parameters. Instead of being bigger, it thinks in loops.

    Input → Think → Act → Think → Act → ... → Output
    

    A simple two-layer network that iterates until the solution emerges.

    The Architecture

    TRM alternates between two phases:

    Phase Purpose
    Think Update internal latent state by processing input, current answer, and previous state
    Act Update the answer based on the refined latent state

    This process repeats for multiple cycles, progressively improving the output.

    TRMConfig {
        input_dim: 5,
        output_dim: 5,
        hidden_dim: 16,
        latent_dim: 16,
        l_layers: 2,      // Network depth
        h_cycles: 3,      // Outer think-act cycles
        l_cycles: 4,      // Inner think cycles
    }
    

    The Secret: Deep Supervision

    The key insight isn’t just recursion—it’s supervising every step, not just the final answer.

    Traditional training:

    Input → [black box] → Final Output → Loss
    

    TRM training:

    Input → Step 1 → Loss₁
          → Step 2 → Loss₂
          → Step 3 → Loss₃
          → ...
          → Final  → Loss_n
    

    Every iteration gets feedback. The model learns to make progress at each step.

    Results

    Model Maze Accuracy
    GPT-4 ~0% on hard mazes
    Claude ~0% on hard mazes
    TRM (976 params) 85%

    Iteration beats scale.

    Running the Code

    The train-trm repo provides a complete Rust implementation:

    # Clone and build
    git clone https://github.com/softwarewrighter/train-trm
    cd train-trm
    ./scripts/build.sh --release
    
    # Train a model
    ./scripts/train.sh --epochs 1000 --lr 0.01
    
    # Evaluate
    ./scripts/eval.sh
    
    # Or launch the web UI
    cargo install --locked trunk
    ./scripts/web-serve.sh
    

    The web UI includes interactive maze visualization with solution paths and real-time training charts.

    Implementation Details

    Metric Value
    Primary Language Rust
    Source Files 21 .rs files
    Estimated Size ~2.5 KLOC
    Also Includes HTML (web UI), Shell scripts
    Build System Cargo + Trunk (WASM)
    Dependencies ndarray, serde, clap, wasm-bindgen

    Good for you if: You want to learn Rust ML from scratch, experiment with recursive architectures, or need a web-based training visualization.

    Complexity: Moderate. Clean Rust code with good documentation. The neural network is implemented from scratch (no PyTorch/TensorFlow), making it educational but requiring Rust familiarity.

    Key Takeaways

    1. Parameter count isn’t everything. Architecture and training strategy matter more for certain tasks.

    2. Recursion enables backtracking. By iterating, TRM can explore and refine solutions.

    3. Deep supervision accelerates learning. Feedback at every step, not just the end.

    4. Task-specific models excel. TRM isn’t a general-purpose LLM—it’s optimized for maze-like reasoning.

    What’s Next

    Part 2 explores MobileLLM and running AI completely offline on your Android phone.

    Resources


    Watch the Video

    Unmute to hear narration.

  • 1013 words6 min readAbstract

    Welcome to Software Wrighter Lab

    Welcome to Software Wrighter Lab—a blog, YouTube channel, Discord server, and GitHub repos for exploring the intersection of AI coding agents, systems programming, and practical machine learning.

    I’m Mike Wright, a software engineer with over four decades of experience, currently focused on AI-assisted development with Rust and WebAssembly.

    Quick Links  
    YouTube @SoftwareWrighter
    GitHub softwarewrighter
    Discord SW Lab

    Contents:

    About Me

    I’ve been writing code professionally for over 35 years—an Emacs user since 1989, still going strong.

    My background spans mainframes to startups:

    • IBM Data Processing Division - MVS Dynamic Reconfiguration and Standalone Dump (SADUMP)
    • IBM T.J. Watson Research - Advisory Programmer on MVS Batch Pipes, Automatic Restart Manager, Java Record I/O, and IMS Data Sharing
    • Forte Software / Sun Microsystems - Senior Programmer on Forte 4GL/Conductor/Fusion, Open Enterprise Service Bus, and Glassfish
    • Startups - Individual contributor and management roles including LogiCoy (Open ESB), Likestream (Facebook Clojure App), Guidewire (Platform), Illumio (Network Security Web UI), and Signifyd (Gradle/Docker performance tuning)

    Areas I’ve worked in: mainframe O/S development, EAI/B2B middleware, platform engineering, build/release engineering, and embedded programming.

    Programming Languages

    Over the years, I’ve written production code in:

    Era Languages
    Mainframe APL, Assembler (S/370, S/390), IBM PL/S, PL/AS, PL/X, CMS/TSO Pipelines
    Systems C, C++
    Enterprise Java, Forte 4GL, Guidewire Gosu, Groovy
    Web/Modern JavaScript, TypeScript, Go, Clojure, ClojureScript
    Current Elisp, JavaScript, Kotlin, Python, Rust, WebAssembly

    Each language taught me something different about how to think about problems. APL taught me array thinking. Assembler taught me what the machine is actually doing. CMS/TSO Pipelines taught me dataflow composition (an area I plan to revisit in Throwback Thursday posts). Lisp (via Clojure) taught me functional composition. Rust is teaching me ownership and fearless concurrency.

    I’m a lifelong learner. When Rust emerged as a modern systems language, I dove in. When AI coding agents became capable enough to be genuine collaborators, I started exploring how they change the way we build software.

    This blog and the accompanying YouTube channel document that exploration.

    What This Blog Covers

    Software Wrighter Lab focuses on three main areas:

    1. AI Coding Agents

    How do tools like Claude Code, Cursor, and other AI assistants actually perform on real projects? I build the same applications with different agents to compare their strengths and weaknesses.

    • Vibe coding comparisons (Claude vs GLM, different models)
    • Practical workflows (parallel coding with git worktrees, hooks, custom scripts)
    • Tool development (guardian-cli, proact, ralphy)

    2. Machine Learning Research Implementation

    When interesting ML papers come out, I implement them to understand how they work. The goal isn’t to compete with research labs—it’s to learn by building.

    Recent implementations include:

    • Tiny Recursive Model (TRM) - Under 1,000 parameters solving mazes
    • Hierarchical Reasoning Model (HRM) - Planner-Doer architecture for abstract reasoning
    • MobileLLM - Running LLMs offline on Android
    • Deepseek papers (mHC, Engram) - Novel architectures for efficient inference
    • MIT’s Recursive Language Model - Implemented in Rust with WASM

    3. Rust, WebAssembly, and Practical Tools

    Rust is my language of choice for new projects. Combined with WebAssembly, it enables building tools that run anywhere—CLI, browser, or embedded.

    Topics include:

    • Rust/Yew/WASM web applications
    • Visualization (Three.js, d3.js, pure CSS approaches)
    • Video production tools (TTS, lip sync, explainer generation)
    • Developer utilities (installation scripts, repo assistants, modularizers)

    Why “Software Wrighter”?

    A “wright” is a craftsperson—someone who builds things. A wheelwright builds wheels. A playwright builds plays.

    A Software Wrighter builds software, with attention to craft.

    The name reflects my belief that good software comes from treating programming as a craft: learning continuously, choosing tools deliberately, and building things that work well and last.

    What to Expect

    Posts on this blog will typically include:

    • Links to papers, repos, and videos (above the fold)
    • Implementation details (language, LOC, complexity assessment)
    • Working code you can clone and run
    • Honest assessments of what works and what doesn’t

    I’m not trying to sell you anything. This is a lab notebook—a record of experiments, some successful, some not.

    Current Projects

    As of February 2026, I’m actively working on:

    Project Description Status
    Small Models, Big Brains 6-part series on efficient LLMs Publishing
    Deepseek papers mHC and Engram implementations In progress
    Explainer pipeline AI-generated video production Ongoing
    RLM implementations Recursive Language Models in Rust Complete

    Technology Stack

    Most of my current work uses:

    Layer Technology
    Systems Rust
    Web Yew, WASM, TypeScript
    ML Python, PyTorch, HuggingFace
    AI Agents Claude Code, Cursor
    Video OBS, FFmpeg, TTS tools

    Get Involved

    If any of this resonates with you:

    I’m always interested in discussing these topics with other engineers exploring similar territory.

    What’s Next

    The first content series, Small Models, Big Brains, starts tomorrow. It’s a 6-part deep dive into small language models that outperform much larger ones on specific tasks:

    1. TRM: 976 parameters beating GPT-4 on mazes
    2. MobileLLM: AI running offline on your phone
    3. HRM: 27M parameters beating o3-mini on abstract reasoning
    4. BDH: A language model with visible, interpretable activations
    5. Billion-parameter models: The efficiency sweet spot
    6. The 2-3B efficient frontier: Phi-2, Gemma, SmolLM

    Each post maps to a YouTube video, a GitHub repo, and working code you can run yourself.

    Thanks for reading. Let’s build something interesting.


    Mike Wright Software Wrighter LLC San Francisco Bay Area

  • 1138 words6 min readAbstract

    TBT (1/?): My First Program Was a Horse Race

    My first program was a horse race. Written in APL. On a mainframe. In 1972.

    This is the first Throwback Thursday post—a series where I revisit the technologies, languages, and ideas that shaped how I think about programming.

    APL: A Programming Language

    APL was created by Kenneth Iverson at IBM in the 1960s. The name literally means “A Programming Language”—Iverson was a mathematician who designed it as a notation for describing algorithms before it became an actual programming language.

    What made APL special:

    Feature Description
    Array-oriented Operations work on entire arrays, not single values
    Symbolic notation Greek letters and mathematical symbols as operators
    Interactive REPL-style development decades before it was common
    Terse Complex operations in a few characters

    APL programs look like nothing else:

    POS←POS+?5⍴3
    

    This single line adds random values (1-3) to all five horse positions simultaneously. No loops. No iteration. The operation just happens across the entire array.

    The IBM 2741 Experience

    In 1972, APL\360 ran on IBM mainframes. You accessed it through terminals like the IBM 2741—essentially a modified Selectric typewriter with a special APL typeball.

    IBM Selectric APL typeball
    APL typeball for IBM Selectric

    The typeball had all the APL glyphs: ⍴ ⍳ ∇ ⎕ ← ⌈ ⌊ ⍵ ⍺ ∘ ⊃ ⊂ and dozens more. You physically typed these symbols. The keyboard layout was completely different from anything you’d seen before.

    When you made an error, there was no backspace in the modern sense. You’d overstrike characters or start the line over. Programs were stored in workspaces, saved to tape or disk.

    The terminal printed on paper. Every interaction left a physical record.

    The Horse Race Program

    Horse race simulations were popular APL demonstrations. They showed off several things:

    1. Random number generation (? roll operator)
    2. Array operations (updating all positions at once)
    3. Character graphics (crude but effective visualization)
    4. Interactive output (watching the race unfold)

    Here’s the verbose version from the repo:

    ∇ RACE;HORSES;POS;FINISH;ROUND;_
      HORSES←'LUCKY  ' 'THUNDER' 'SHADOW ' 'COMET  ' 'BLAZE  '
      POS←5⍴0
      FINISH←15
      ROUND←0
      ⎕←'══════════════════════════════════════════'
      ⎕←'           THE RACE IS ON!'
      ⎕←'══════════════════════════════════════════'
    LOOP:ROUND←ROUND+1
      ⎕←'--- ROUND ',(⍕ROUND),' ---'
      POS←POS+?5⍴3
      SHOWHORSES
      →DONE×⍳∨/POS≥FINISH
      →LOOP
    DONE:⎕←'WINNER: ',((⊃(POS=⌈/POS)/⍳5)⊃HORSES),'!'
    ∇
    

    Key APL Idioms

    Array creation:

    POS←5⍴0    ⍝ Create array of 5 zeros
    

    The (rho) operator reshapes. 5⍴0 means “reshape 0 into a 5-element array.”

    Random numbers:

    ?5⍴3       ⍝ Roll 5 dice, each 1-3
    

    The ? operator is “roll”—like rolling dice. ?5⍴3 rolls five 3-sided dice.

    Finding the winner:

    (⊃(POS=⌈/POS)/⍳5)⊃HORSES
    

    This reads right-to-left:

    • ⌈/POS — maximum of all positions
    • POS=⌈/POS — boolean mask: which horses are at max?
    • /⍳5 — compress: keep only those indices
    • — take the first one
    • ⊃HORSES — select that horse’s name

    One line. No loops. Pure array thinking.

    The Idiomatic Version

    APL programmers pride themselves on terseness. The idiomatic version does the same thing in fewer characters:

    HORSES←'LUCKY  ' 'THUNDER' 'SHADOW ' 'COMET  ' 'BLAZE  '
    
    ∇ SHOW;I
      I←1
    N:⎕←(I⊃HORSES),'│',((I⊃POS)⍴'░'),'▓'
      I←I+1
      →N×⍳I≤5
    ∇
    
    ∇ RACE;POS;_
      POS←5⍴0
      ⎕←'THE RACE IS ON!'
    L:_←⎕DL 0.3
      POS←POS+?5⍴3
      SHOW
      ⎕←''
      →L×⍳~∨/POS≥15
      ⎕←'WINNER: ',(⊃(POS=⌈/POS)/⍳5)⊃HORSES
    ∇
    

    The entire program fits on a single screen. This was the APL aesthetic: powerful ideas expressed concisely.

    Running It Today

    GNU APL implements ISO 13751 (Extended APL) and runs on modern systems:

    # macOS
    brew install gnu-apl
    
    # Arch Linux
    yay -S gnu-apl
    
    # Run the race
    git clone https://github.com/sw-comp-history/apl-horse-race
    cd apl-horse-race
    apl -f src/race.apl
    

    Sample output:

    ══════════════════════════════════════════
               THE RACE IS ON!
    ══════════════════════════════════════════
    
    --- ROUND 1 ---
    LUCKY   │▓▓▓◄
    THUNDER │▓▓◄
    SHADOW  │▓◄
    COMET   │▓▓▓◄
    BLAZE   │▓▓◄
    

    The horses advance randomly until one crosses the finish line.

    What APL Taught Me

    APL shaped how I think about programming in ways that persist today:

    1. Think in arrays, not loops.

    When I see a problem, I ask: can this be expressed as an operation on a whole collection? Languages like NumPy, R, and Julia carry this forward.

    2. Notation matters.

    Good notation can make complex ideas simple. Bad notation obscures them. APL’s symbols were controversial, but they made array operations visible in ways that verbose syntax doesn’t.

    3. The REPL is powerful.

    Interactive development—type an expression, see the result immediately—was central to APL decades before it became fashionable again with Jupyter notebooks and modern REPLs.

    4. Terseness has value.

    Not obfuscation for its own sake, but the ability to see an entire algorithm at once. When your program fits on one screen, you can reason about the whole thing.

    APL’s Legacy

    APL influenced many languages:

    Language Year APL Influence
    J 1990 Iverson’s ASCII-only redesign
    K/Q 1993 Powers financial systems at Kx
    A+ 1988 Morgan Stanley’s open-source APL
    BQN 2020 Modern APL with cleaner semantics
    NumPy 2006 Array operations in Python
    R 1993 Vector operations for statistics

    The ideas live on, even if the glyphs don’t.

    Implementation Details

    Metric Value
    Primary Language APL
    Source Files 2 .apl files
    Lines of Code ~50 lines total
    Runtime GNU APL
    Also Includes Documentation, PNG samples for Unicode issues

    Good for you if: You want to understand array programming origins, learn basic APL, or experience what programming felt like in the 1970s.

    Complexity: Low. The program is intentionally simple—a teaching example, not production code. The repo includes extensive documentation explaining every line.

    Why Throwback Thursday?

    Programming didn’t start with Python and JavaScript. Every abstraction we use today was invented by someone solving a real problem.

    TBT posts will explore:

    • Languages that shaped my thinking (APL, Lisp, Forth)
    • Technologies that were ahead of their time (CMS/TSO Pipelines, dataflow)
    • Ideas worth revisiting with modern tools

    Understanding where we came from helps us see where we’re going.

    Resources


    Have your own “first program” story? Find me on YouTube @SoftwareWrighter.

    Watch the Video

    Unmute to hear narration.

subscribe via RSS