Software Wrighter Lab Blog

March 4, 2026 • Software Wrighter

457 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Neural Collapse (late-stage geometric convergence of class representations), Grokking (sudden generalization after prolonged memorization), SAM (optimizing for flat loss regions under perturbations), Mechanistic Interpretability (analyzing internal circuits of neural networks), Self-Training Instability (feedback loops that amplify errors in self-generated data).

Five ML Concepts - #29

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #29

References

Concept	Reference
Neural Collapse	Prevalence of Neural Collapse (Papyan et al. 2020)
Grokking	Grokking: Generalization Beyond Overfitting (Power et al. 2022)
SAM	Sharpness-Aware Minimization (Foret et al. 2021)
Mechanistic Interpretability	Transformer Circuits (Anthropic 2021)
Self-Training Instability	Understanding Self-Training (Wei et al. 2020)

Today’s Five

1. Neural Collapse

In overparameterized networks trained past the point of zero training error, class representations converge late in training to a symmetric, maximally separated structure. The last-layer features and classifiers align into a simplex equiangular tight frame.

This geometric pattern has been observed across many architectures and datasets.

Like students settling into evenly spaced seats by the end of class.
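To make "simplex equiangular tight frame" concrete, here is a minimal sketch that builds one for K classes and checks its geometry. The function names are illustrative, not taken from the paper: every class vector has unit length, and every pair has the same cosine, -1/(K-1), which is the largest equal separation K vectors can have.

```python
import math

def simplex_etf(num_classes):
    """Return num_classes unit vectors (as rows) forming a simplex ETF."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Row i is the one-hot vector for class i, centred on the mean and rescaled.
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

means = simplex_etf(4)
pairwise = [cosine(means[i], means[j]) for i in range(4) for j in range(i + 1, 4)]
# With 4 classes every pairwise cosine equals -1/3 (up to floating-point error).
```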

2. Grokking

In some tasks, especially small algorithmic ones, models memorize quickly but only later suddenly generalize. The jump from memorization to understanding can happen long after training accuracy saturates and training loss is near zero.

Weight decay and extended training often appear to be important for this phase transition.

Like cramming facts for an exam, then later realizing you truly understand.

3. SAM (Sharpness-Aware Minimization)

Instead of minimizing loss at a single point, SAM minimizes loss under small weight perturbations, finding flatter regions. Flatter minima tend to generalize better than sharp ones.

The optimizer seeks robustness to parameter noise.

Like settling on a wide valley floor instead of the bottom of a narrow ravine.
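The two-step SAM update can be sketched in a few lines. This is a toy illustration under stated assumptions: the quadratic `loss` and its `grad` are made up for the example, and `rho` (perturbation radius) and `lr` are arbitrary values, not recommendations.

```python
import math

def loss(w):
    # Toy loss (hypothetical): a bowl that is much steeper along the second axis.
    return w[0] ** 2 + 10.0 * w[1] ** 2

def grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update: climb to a nearby worst-case point, then descend from w."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Step 1: perturb the weights by rho along the normalized gradient (ascent direction).
    w_perturbed = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Step 2: apply the gradient measured at the perturbed point to the original weights.
    g_perturbed = grad_fn(w_perturbed)
    return [wi - lr * gi for wi, gi in zip(w, g_perturbed)]

w = [1.0, 1.0]
start = loss(w)
for _ in range(50):
    w = sam_step(w, grad)
end = loss(w)
```

Each step costs two gradient evaluations instead of one, which is the main practical price of the method.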

4. Mechanistic Interpretability

Researchers analyze activations and internal circuits to understand how specific computations are implemented inside models. The goal is reverse-engineering neural networks into understandable components.

This work has revealed induction heads and other interpretable circuits inside attention layers.

Like mapping the wiring of an unknown machine to see how it works.

5. Self-Training Instability

When models train on their own generated data, feedback loops can amplify small errors over time. Each iteration compounds mistakes, causing distributional drift.

Careful filtering and external grounding help mitigate this.

Like copying a copy repeatedly until the meaning drifts.

Quick Reference

Concept	One-liner
Neural Collapse	Late-stage geometric convergence of class representations
Grokking	Sudden generalization after prolonged memorization
SAM	Optimizing for flat loss regions under perturbations
Mechanistic Interpretability	Analyzing internal circuits of neural networks
Self-Training Instability	Feedback loops that amplify errors in self-generated data

Short, accurate ML explainers. Follow for more.

March 3, 2026 • Software Wrighter

443 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Lottery Ticket Hypothesis (small winning subnetworks within large models), Sparse Activation (using only part of a model per input), Conditional Computation (dynamically routing inputs for efficiency), Inference Parallelism (distributing inference across devices), Compute Optimality (balancing model size, data, and compute).

Five ML Concepts - #28

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #28

References

Concept	Reference
Lottery Ticket Hypothesis	The Lottery Ticket Hypothesis (Frankle & Carbin 2019)
Sparse Activation	Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
Conditional Computation	Sparsely-Gated MoE + Switch Transformers
Inference Parallelism	Megatron-LM (Shoeybi et al. 2019)
Compute Optimality	Chinchilla Scaling Laws (Hoffmann et al. 2022)

Today’s Five

1. Lottery Ticket Hypothesis

Large neural networks contain smaller subnetworks that, when trained from the right initialization, achieve similar performance. These “winning tickets” exist before training begins.

The key insight: the winning subnetwork can be trained in isolation, although finding it has typically required training and pruning the full network first.

Like finding a winning lottery ticket hidden among many losing ones.
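One round of the usual search procedure (train, prune the smallest weights, rewind the survivors to their initial values) might look like the sketch below. The flat weight lists and the `keep_fraction` value are illustrative assumptions; real experiments prune per layer and repeat the cycle several times.

```python
def winning_ticket(init_weights, trained_weights, keep_fraction=0.2):
    """One round of magnitude pruning followed by a rewind to initialization."""
    n_keep = max(1, round(keep_fraction * len(trained_weights)))
    # Rank positions by the magnitude of the *trained* weights.
    ranked = sorted(
        range(len(trained_weights)),
        key=lambda i: abs(trained_weights[i]),
        reverse=True,
    )
    keep = set(ranked[:n_keep])
    mask = [1 if i in keep else 0 for i in range(len(trained_weights))]
    # The ticket is the surviving connections reset to their *initial* values.
    ticket = [w * m for w, m in zip(init_weights, mask)]
    return mask, ticket

init = [0.3, -0.1, 0.05, 0.2, -0.4]
trained = [0.9, -0.02, 0.01, -1.2, 0.1]
mask, ticket = winning_ticket(init, trained, keep_fraction=0.4)
```

The masked network would then be retrained from `ticket` and compared against the full model.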

2. Sparse Activation

Only a subset of neurons activate for each input, even in models with many parameters. This allows large capacity without using everything at once.

Mixture-of-experts architectures explicitly design for this pattern.

Like a library where only relevant books light up for each query.

3. Conditional Computation

The model dynamically activates only certain components depending on the input. Different inputs route to different experts or pathways.

This improves efficiency and scalability without proportional compute increase.

Like routing patients to the right specialist instead of seeing every doctor.
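A rough sketch of top-k routing, the mechanism behind both sparse activation and conditional computation in mixture-of-experts layers. The experts here are plain functions and the gate scores are passed in by hand; in a real layer a small learned gating network would compute them from the input.

```python
import math

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    peak = max(gate_scores[i] for i in top)
    exps = {i: math.exp(gate_scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, gate_scores, k=2):
    weights = route_top_k(gate_scores, k)
    # Only the selected experts are evaluated; the others are never called.
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts, chosen only so the arithmetic is easy to follow.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
scores = [0.1, 2.0, 2.0, -1.0]
weights = route_top_k(scores, k=2)
y = moe_forward(3.0, experts, scores, k=2)
```

With four experts and k=2, half of the expert parameters are untouched for this input, which is where the compute saving comes from.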

4. Inference Parallelism

Model execution can be split across multiple devices to reduce latency or increase throughput. Tensor parallelism splits the weight matrices inside each layer across devices; pipeline parallelism assigns consecutive groups of layers (stages) to different devices.

Essential for serving large models in production.

Like dividing a puzzle so multiple people work on it simultaneously.
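The difference between the two splits can be shown with plain lists standing in for devices. This is a single-process sketch with made-up weights; real systems such as Megatron-LM do the same arithmetic with collective communication between GPUs.

```python
def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split w column-wise into equal shards, one per simulated device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Tensor parallelism: each device holds a column shard and computes part of the output.
shards = split_columns(w, parts=2)
partial_outputs = [matmul(x, shard) for shard in shards]
tensor_parallel = [v for part in partial_outputs for v in part]  # concatenate the parts

# Pipeline parallelism: each device holds whole layers and passes activations along.
layer_on_device_0 = w
layer_on_device_1 = [[1.0], [0.0], [0.0], [1.0]]
pipeline_out = matmul(matmul(x, layer_on_device_0), layer_on_device_1)
```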

5. Compute Optimality Hypothesis

Empirical scaling laws suggest performance improves when model size, data, and compute are balanced. Adding only one resource may not yield optimal gains.

Chinchilla showed many models were undertrained relative to their size.

Like baking a cake where proportions matter more than just adding extra ingredients.
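As a back-of-the-envelope sketch, the balance can be computed from two commonly quoted rules of thumb that are not stated in this post: training compute is roughly 6 x parameters x tokens, and the Chinchilla results suggest on the order of 20 training tokens per parameter. Both numbers are approximations, so treat the output as a starting point.

```python
import math

def compute_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~= 6 * N * D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget of about 5.88e23 FLOPs lands near 70B parameters and 1.4T tokens.
n_params, n_tokens = compute_optimal(5.88e23)
```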

Quick Reference

Concept	One-liner
Lottery Ticket Hypothesis	Small winning subnetworks hidden in large models
Sparse Activation	Using only part of a model per input
Conditional Computation	Dynamically routing inputs for efficiency
Inference Parallelism	Distributing inference across devices
Compute Optimality	Balancing model size, data, and compute

Short, accurate ML explainers. Follow for more.

March 2, 2026 • Software Wrighter

894 words • 5 min read • Abstract

A robust architecture: core model (rarely updated) + adapters (modular skills) + external memory (facts) + context manager (RLM-style) + logging and evaluation loop. Errors feed into memory first. Only recurring, validated improvements reach adapters.

How AI Learns Part 7: Designing a Continuous Learning Agent

A robust continuous learning agent contains:

Core model (rarely updated)
Adapters (modular skills)
External memory (facts)
Context manager (Recursive Language Model (RLM)-style)
Logging & evaluation loop

Resource	Link
Related	RLM | Engram | Sleepy Coder

The Layered Architecture

Four-layer architecture showing Context Manager, External Memory, Adapters, and Core Weights with feedback and evaluation loops — Continuous learning is layered coordination.

Layer by Layer

Layer 4: Core Weights (Bottom)

The foundation. Trained once, changed rarely.

Aspect	Details
Contains	General reasoning, language, base knowledge
Update frequency	Months or never
Update method	Full fine-tune or major consolidation
Risk of change	High (forgetting, capability shifts)

Rule: Don’t touch this unless you have a very good reason.

Layer 3: Adapters (Parameter-Efficient Fine-Tuning (PEFT) / Low-Rank Adaptation (LoRA))

Modular skills that plug into the base.

Aspect	Details
Contains	Task-specific capabilities
Update frequency	Weekly to monthly
Update method	Lightweight PEFT training
Risk of change	Medium (isolated, but validate)

Rule: Train adapters for validated, recurring patterns. Version them. Enable rollback.

Layer 2: External Memory

Facts, experiences, and retrieved knowledge.

Aspect	Details
Contains	Documents, logs, structured data
Update frequency	Continuous
Update method	Database writes
Risk of change	Low (doesn’t affect weights)

Rule: Store experiences here first. Memory is cheap and safe.

Layer 1: Context Manager (Top)

The RLM-style interface that rebuilds focus each step.

Aspect	Details
Contains	Current context, retrieved data, active state
Update frequency	Per call
Update method	Reconstruction from memory + query
Risk of change	None (ephemeral)

Rule: Don’t drag context forward. Rebuild it.

The Feedback Loop

Logging

Capture everything the agent does:

Prompts received
Actions taken
Tool calls made
Errors encountered
User signals

This is your training data.

Evaluation

Before any update reaches production:

Check	Purpose
Retention tests	Did old skills degrade?
Forward transfer	Did new skills improve?
Regression suite	Known failure cases
Safety checks	Harmful outputs?

Without evaluation, you’re updating blind.

Deployment

Updates should be:

Modular: Can isolate and rollback
Versioned: Know what changed when
Staged: Test before full rollout
Monitored: Track post-deployment metrics

The Error Flow

Where do errors go?

Error occurs
    ↓
Log it (immediate)
    ↓
Store in memory (same day)
    ↓
Pattern emerges over multiple occurrences
    ↓
Train adapter update (weekly/monthly)
    ↓
Validate update (before deployment)
    ↓
Deploy with rollback capability

Errors feed into memory first. Only validated, recurring improvements reach adapters. Core weights almost never change.
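The error flow above can be sketched as a small class. The names (`ErrorPipeline`, `promote_after`) and the error signatures are hypothetical; the point is only that logging and memory writes happen immediately, while adapter training is proposed once a pattern recurs.

```python
from collections import Counter

class ErrorPipeline:
    """Errors go to memory first; only recurring patterns are proposed for adapters."""

    def __init__(self, promote_after=3):
        self.log = []              # raw events, appended immediately
        self.memory = Counter()    # error signature -> occurrence count
        self.promote_after = promote_after

    def record(self, signature, detail):
        self.log.append((signature, detail))
        self.memory[signature] += 1

    def adapter_candidates(self):
        """Signatures seen often enough to justify an adapter training run."""
        return sorted(s for s, n in self.memory.items() if n >= self.promote_after)

pipeline = ErrorPipeline(promote_after=3)
for sig in ["borrow-checker", "off-by-one", "borrow-checker", "borrow-checker"]:
    pipeline.record(sig, detail="trace omitted")
candidates = pipeline.adapter_candidates()
```

Validation and rollback would sit after `adapter_candidates()`; they are left out to keep the sketch short.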

What This Architecture Achieves

Problem	Solution
Catastrophic forgetting	Core weights frozen; adapters isolated
Context rot	RLM rebuilds focus each step
Hallucination	Memory grounds responses
Slow adaptation	Memory updates continuously
Unsafe changes	Evaluation before deployment

Design Principles

1. Separate Storage from Reasoning

Facts belong in memory. Reasoning belongs in weights. Don’t blur them.

2. Separate Speed from Permanence

Fast learning (memory) is temporary. Slow learning (weights) is permanent. Match the update speed to the desired permanence.

3. Evaluate Before Consolidating

Every update to adapters or weights must be validated. Regressions are silent killers.

4. Enable Rollback

Version everything. If an update causes problems, you must be able to undo it.

5. Log Everything

You cannot improve what you cannot measure. Structured logging is the foundation of continuous learning.

The Big Picture

AI does not learn in one place.

It learns in layers:

Permanent (weights)
Modular (adapters)
External (memory)
Temporary (context)

Continuous learning is not constant weight updates.

It is careful coordination across time scales.

Continuous learning systems don’t constantly retrain. They carefully consolidate what works.

References

Concept	Paper
LoRA	LoRA: Low-Rank Adaptation (Hu et al. 2021)
RAG	Retrieval-Augmented Generation (Lewis et al. 2020)
RLM	Recursive Language Models (Zhou et al. 2024)
Share	Shared LoRA Subspaces (2025)
Engram	Engram: Conditional Memory (DeepSeek 2025)

Series Summary

Part	Key Insight
1. Time Scales	Learning happens at different layers and speeds
2. Forgetting vs Rot	Different failures need different fixes
3. Weight-Based	Change the brain carefully
4. Memory-Based	Store facts outside the brain
5. Context & RLM	Rebuild focus instead of dragging baggage
6. Continuous Learning	Learn in memory, consolidate in weights
7. Full Architecture	Layered coordination enables safe improvement

Continuous learning is layered coordination.

March 2, 2026 • Software Wrighter

419 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Elastic Weight Consolidation (protecting important parameters during new task learning), Replay Buffers (mixing past examples to prevent forgetting), Parameter Routing (activating task-specific parameter subsets), Memory-Augmented Networks (external memory modules for neural networks), Model Editing (targeted weight updates without full retraining).

Five ML Concepts - #27

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #27

References

Concept	Reference
Elastic Weight Consolidation	Overcoming catastrophic forgetting (Kirkpatrick et al. 2017)
Replay Buffers	Experience Replay for Continual Learning (Rolnick et al. 2019)
Parameter Routing	Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
Memory-Augmented Networks	Neural Turing Machines (Graves et al. 2014)
Model Editing	Editing Large Language Models (Yao et al. 2023)

Today’s Five

1. Elastic Weight Consolidation

Adding a penalty that discourages changing parameters important to previous tasks. Importance is estimated using Fisher information from prior training.

This helps models learn new tasks without catastrophic forgetting.

Like protecting well-worn neural pathways while building new ones.
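The penalty itself is a weighted quadratic. A minimal sketch, assuming the parameters and their Fisher values are flat lists and using a hypothetical `strength` hyperparameter for the penalty weight:

```python
def ewc_penalty(params, old_params, fisher, strength=1.0):
    """Quadratic penalty: (strength / 2) * sum_i F_i * (theta_i - theta_old_i) ** 2."""
    return 0.5 * strength * sum(
        f * (p - p_old) ** 2 for p, p_old, f in zip(params, old_params, fisher)
    )

def total_loss(new_task_loss, params, old_params, fisher, strength=1.0):
    return new_task_loss + ewc_penalty(params, old_params, fisher, strength)

old = [1.0, -2.0]
fisher = [10.0, 0.1]   # the first parameter mattered much more to the old task
moved_important = ewc_penalty([1.5, -2.0], old, fisher)
moved_unimportant = ewc_penalty([1.0, -1.5], old, fisher)
```

Moving the high-Fisher parameter by 0.5 costs a hundred times more than moving the low-Fisher one by the same amount, so the optimizer is steered toward the second.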

2. Replay Buffers

Storing examples from earlier tasks and mixing them into new training. Past data is replayed alongside current examples during optimization.

This reinforces previous knowledge while learning new data.

Like reviewing old flashcards while studying new material.
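A small sketch of a replay buffer that keeps a bounded, uniformly sampled subset of everything seen (reservoir sampling) and mixes it into new batches. Class and method names are illustrative, and the examples are placeholder tuples.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over everything seen so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Keep the new example with probability capacity / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = example

    def mixed_batch(self, new_examples, n_replay):
        replayed = self.rng.sample(self.items, min(n_replay, len(self.items)))
        return list(new_examples) + replayed

buffer = ReplayBuffer(capacity=5)
for i in range(100):
    buffer.add(("task_a", i))
batch = buffer.mixed_batch([("task_b", 0), ("task_b", 1)], n_replay=2)
```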

3. Parameter Routing

Activating different subsets of model parameters depending on the task or input. Mixture-of-experts and conditional computation route inputs to specialized weights.

Enables specialization without fully separate models.

Like having different experts handle different questions.

4. Memory-Augmented Networks

Adding external memory modules that neural networks can read from and write to. The model learns to store and retrieve information during inference.

Extends beyond purely weight-based memory to explicit storage.

Like giving a calculator access to a notepad.
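Reading from such a memory is usually a soft lookup rather than an exact one. The sketch below shows content-based addressing in the style of Neural Turing Machines: compare a query with each key, turn the similarities into weights, and return the weighted mix of values. The `sharpness` parameter and the tiny memory are assumptions for illustration.

```python
import math

def read_memory(query, keys, values, sharpness=5.0):
    """Content-based read: softmax over cosine similarities, then a weighted sum."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    scores = [sharpness * cosine(query, k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return weights, read

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]
weights, read = read_memory([1.0, 0.1], keys, values)
```

Because every step is differentiable, the network can learn what to write and which keys to query by gradient descent.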

5. Model Editing

Targeted weight updates to modify specific behaviors without full retraining. Locate and adjust the parameters responsible for particular facts or behaviors.

Allows fast corrections and knowledge updates post-training.

Like editing a specific entry in an encyclopedia instead of rewriting the whole book.
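A simplified sketch of the idea: treat one weight matrix as a key-to-value lookup and apply a rank-one update so that a chosen key now maps to a new value. This is only in the spirit of locate-and-edit methods; published approaches add more machinery (for example, statistics over many keys) to limit side effects.

```python
def matvec(weight, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def rank_one_edit(weight, key, new_value):
    """Return weight + (new_value - weight @ key) key^T / (key . key)."""
    residual = [v - c for v, c in zip(new_value, matvec(weight, key))]
    key_norm_sq = sum(k * k for k in key)
    return [
        [w + r * k / key_norm_sq for w, k in zip(row, key)]
        for row, r in zip(weight, residual)
    ]

weight = [[1.0, 0.0],
          [0.0, 1.0]]
edited = rank_one_edit(weight, key=[1.0, 0.0], new_value=[3.0, 4.0])
```

After the edit the chosen key maps exactly to the new value, while any input orthogonal to that key produces the same output as before.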

Quick Reference

Concept	One-liner
Elastic Weight Consolidation	Protecting important parameters during new learning
Replay Buffers	Mixing past examples to prevent forgetting
Parameter Routing	Activating task-specific parameter subsets
Memory-Augmented Networks	External memory modules for neural networks
Model Editing	Targeted weight updates without full retraining

Short, accurate ML explainers. Follow for more.

March 1, 2026 • Software Wrighter

691 words • 4 min read • Abstract

Continuous learning aims to absorb new information and skills over time without losing old capabilities. The key: learn often in memory, consolidate carefully in weights. Periodic consolidation, not constant updates.

How AI Learns Part 6: Toward Continuous Learning

Continuous learning aims to:

Learn new skills
Retain old skills
Avoid retraining from scratch
Avoid catastrophic forgetting

Resource	Link
Related	Sleepy Coder Part 1 | Sleepy Coder Part 2

The Continuous Learning Loop

Flow diagram showing Agent to Logs to Evaluate to Cluster to Train to Validate to Deploy cycle, with Memory branch — Periodic consolidation, not constant updates.

The Core Tradeoff

Goal	Description
Plasticity	Learn new things quickly
Stability	Retain old things reliably

You cannot maximize both simultaneously. The art is in the balance.

Approaches to Continuous Learning

1. Replay-Based Methods

Keep (or synthesize) some old data. Periodically retrain on old + new.

How it works:

Store representative examples from each task
Mix old data into new training batches
Periodically consolidate

Recent work: FOREVER adapts replay timing using “model-centric time” (based on optimizer update magnitude) rather than fixed training steps.

Pros	Cons
Strong retention	Storage costs
Conceptually simple	Privacy concerns
Well-understood	Data governance complexity

2. Replay-Free Regularization

Constrain weight updates to avoid interference, without storing old data.

Efficient Lifelong Learning Algorithm (ELLA) (Jan 2026): Regularizes updates using subspace de-correlation. Reduces interference while allowing transfer.

Share (Feb 2026): Maintains a single evolving shared low-rank subspace. Integrates new tasks without storing many adapters.

Pros	Cons
No replay needed	Still active research
Privacy-friendly	Evaluation complexity
Constant memory	Subtle failure modes

3. Modular Adapters

Keep base model frozen. Train task-specific adapters. Merge or switch as needed.

Evolution:

Low-Rank Adaptation (LoRA): Individual adapters per task
Shared LoRA spaces: Adapters share subspace
Adapter banks: Library of skills to compose

Pros	Cons
Modular, versioned	Adapter proliferation
Low forgetting risk	Routing complexity
Easy rollback	Composition challenges

4. Memory-First Learning

Store experiences in external memory. Only consolidate to weights what’s proven stable.

Pattern:

New information → Memory (fast)
Validated patterns → Adapters (slow)
Fundamental capabilities → Weights (rare)

This separates the speed of learning from the permanence of changes.

The Practical Loop

A working continuous learning system:

Run agent (with Recursive Language Model (RLM) context management)
Collect traces: prompts, tool calls, outcomes, failures
Score outcomes: tests, static analysis, user signals
Cluster recurring failure patterns
Train lightweight updates (LoRA/adapters)
Validate retention (did old skills degrade?)
Deploy modular update (with rollback capability)

This is not real-time learning. It’s periodic consolidation.

Human analogy: Sleep. Process experiences, consolidate important patterns, prune noise.

Time Scales of Update

Frequency	What Changes	Method
Every query	Nothing (inference only)	-
Per session	Memory	Retrieval-Augmented Generation (RAG)/Engram
Daily	Adapters (maybe)	Lightweight Parameter-Efficient Fine-Tuning (PEFT)
Weekly	Validated adapters	Reviewed updates
Monthly	Core weights	Major consolidation

Most systems should:

Update memory frequently
Update adapters occasionally
Update core weights rarely

Evaluation Is Critical

Continuous learning without continuous evaluation is dangerous.

Required:

Retention tests (what got worse?)
Forward transfer tests (what got better?)
Regression detection
Rollback capability

Without these, you’re flying blind.
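A deployment gate built from these checks might look like the sketch below. The skill names, scores, and `tolerance` are invented for illustration; the scores would come from whatever retention and forward-transfer suites the system runs.

```python
def should_deploy(before, after, new_skills, tolerance=0.01):
    """Approve an update only if old skills hold steady and new skills improve."""
    regressions = {
        skill: (before[skill], after[skill])
        for skill in before
        if skill not in new_skills and after[skill] < before[skill] - tolerance
    }
    improved = all(after[s] > before.get(s, 0.0) for s in new_skills)
    return (not regressions and improved), regressions

before = {"python_refactor": 0.82, "sql_queries": 0.77, "rust_lifetimes": 0.40}
after = {"python_refactor": 0.82, "sql_queries": 0.70, "rust_lifetimes": 0.65}
ok, regressions = should_deploy(before, after, new_skills={"rust_lifetimes"})
```

Here the new skill improved, but the drop in `sql_queries` blocks the rollout, which is exactly the case a retention test exists to catch.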

References

Concept	Paper
ELLA	Subspace Learning for Lifelong ML (2024)
Share	Shared LoRA Subspaces (2025)
FOREVER	Model-Centric Replay (2024)
EWC	Overcoming Catastrophic Forgetting (Kirkpatrick et al. 2017)

Coming Next

In Part 7, we’ll put it all together: designing a practical continuous learning agent with layered architecture, logging, feedback loops, and safety.

Learn often in memory. Consolidate carefully in weights.

March 1, 2026 • Software Wrighter

424 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Data Augmentation (expanding training data with transformations), Caching Strategies (reducing latency by reusing computation), Constitutional AI (training models to follow explicit principles), Goodhart's Law (optimizing metrics distorts objectives), Manifold Hypothesis (data lies on lower-dimensional structures).

Five ML Concepts - #26

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #26

References

Concept	Reference
Data Augmentation	A survey on Image Data Augmentation (Shorten & Khoshgoftaar 2019)
Caching Strategies	Systems engineering practice (no canonical paper)
Constitutional AI	Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022)
Goodhart’s Law	Goodhart’s Law and Machine Learning (Sevilla et al. 2022)
Manifold Hypothesis	An Introduction to Variational Autoencoders (Kingma & Welling 2019)

Today’s Five

1. Data Augmentation

Creating additional training examples using label-preserving transformations. Rotate, flip, crop, or color-shift images without changing what they represent.

Effectively increases dataset size and improves generalization.

Like practicing piano pieces at different tempos to build flexibility.
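A tiny sketch of label-preserving augmentation on an image stored as a nested list. Whether a transform really preserves the label depends on the task: a rotated cat is still a cat, but a rotated 6 can become a 9.

```python
def horizontal_flip(image):
    """Mirror each row; the label of the image is unchanged."""
    return [row[::-1] for row in image]

def rotate_90(image):
    """Rotate a row-major 2D image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image, label):
    """Yield the original plus label-preserving variants."""
    yield image, label
    yield horizontal_flip(image), label
    yield rotate_90(image), label

image = [[1, 2],
         [3, 4]]
examples = list(augment(image, label="cat"))
```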

2. Caching Strategies

Storing previous computation results to reduce repeated work and latency. Cache embeddings, KV states, or frequently requested outputs.

Essential for production inference at scale.

Like keeping frequently used books on your desk instead of the library.
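The simplest version of this is memoizing an expensive function. In the sketch below `embed` is a stand-in for a real embedding call, and the cache size is an arbitrary choice.

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=1024)
def embed(text):
    """Stand-in for an expensive embedding call; returns a hashable tuple."""
    calls["count"] += 1
    return tuple(float(ord(ch)) for ch in text[:8])

first = embed("hello world")
second = embed("hello world")   # served from the cache, no recomputation
info = embed.cache_info()
```

KV caching in transformers follows the same principle at a finer grain: keys and values for earlier tokens are stored so each new token only computes its own.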

3. Constitutional AI

Training models to follow explicit written principles alongside other alignment methods. The constitution provides clear rules for behavior.

Models critique and revise their own outputs against these principles.

Like giving someone written house rules instead of vague instructions.

4. Goodhart’s Law

When a measure becomes a target, it can stop being a good measure. Optimizing for a proxy metric can diverge from the true objective.

A core challenge in reward modeling and evaluation design.

Like studying only for the test instead of learning the subject.

5. Manifold Hypothesis

The idea that real-world data lies on lower-dimensional structures within high-dimensional space. Images of faces don’t fill all possible pixel combinations.

This structure is what representation learning exploits.

Like faces varying along a few key features instead of every pixel independently.

Quick Reference

Concept	One-liner
Data Augmentation	Expanding training data with transformations
Caching Strategies	Reducing latency by reusing computation
Constitutional AI	Training models to follow explicit principles
Goodhart’s Law	Optimizing metrics distorts objectives
Manifold Hypothesis	Data lies on lower-dimensional structures

Short, accurate ML explainers. Follow for more.

February 28, 2026 • Software Wrighter

697 words • 4 min read • Abstract

Continuing the music-pipe-rs story: a web demo with Bach and Baroque arrangements, the seq command for explicit note sequences, and GarageBand integration. Plus the generative music resources that inspired this project.

music-pipe-rs: Web Demo and Multi-Instrument Arrangements

Since the initial music-pipe-rs post, the project has grown. There’s now a web demo with playable examples, a new seq stage for explicit note sequences, and multi-instrument arrangements that work in GarageBand.

Resource	Link
Video	YouTube
Live Demo	music-pipe-rs Samples
Source	GitHub
Previous	Unix Pipelines for MIDI

Web Demo

The live demo showcases pre-built examples with playable audio:

Tab	Style	Description
Bach Toccata (Organ)	Classical	Multi-voice church organ with octave doubling and pedal bass
Bach Toccata (8-bit)	Chiptune	Gyruss-inspired arcade version with square wave
Bach-esque	Algorithmic	Procedurally generated baroque-style background music
Baroque Chamber	Ensemble	Six-channel piece with strings, harpsichord, and recorder

Each tab shows the pipeline script alongside playable audio. See exactly what commands produce each result.

The seq Stage

The new seq stage allows explicit note sequences instead of algorithmic generation:

seed | seq "C4/4 D4/4 E4/4 F4/4 G4/2" | to-midi --out scale.mid

Notation: NOTE/DURATION where duration is in beats. Combine with other stages:

seed | seq "D5/4 C#5/8 R/4 B4/4" | transpose --semitones 5 | humanize | to-midi --out melody.mid

The R represents rests. This enables transcribing existing melodies or composing precise phrases.
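To illustrate how a NOTE/DURATION string breaks down, here is a hypothetical parser written in Python. It is not the tool's own parser: it assumes the common convention that C4 is MIDI note 60, handles only the sharps and rests shown above, and keeps the duration as the number written without interpreting it.

```python
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def parse_token(token):
    """Parse 'C#5/8' into (midi_note, duration); rests ('R/4') give (None, duration)."""
    name, duration = token.split("/")
    if name == "R":
        return None, float(duration)
    letter, rest = name[0], name[1:]
    sharp = 1 if rest.startswith("#") else 0
    octave = int(rest[sharp:])
    # Assumes the common convention where C4 is MIDI note 60.
    return 12 * (octave + 1) + SEMITONES[letter] + sharp, float(duration)

def parse_sequence(text):
    return [parse_token(tok) for tok in text.split()]

notes = parse_sequence("D5/4 C#5/8 R/4 B4/4")
```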

Multi-Instrument Arrangements

The Baroque chamber piece demonstrates six-channel composition:

{
    seed 42 | seq "..." --ch 0 --patch 48;  # Strings melody
    seed 42 | seq "..." --ch 1 --patch 6;   # Harpsichord
    seed 42 | seq "..." --ch 2 --patch 74;  # Recorder
    # ... additional voices
} | humanize | to-midi --out baroque.mid

Each instrument gets its own channel and General MIDI patch. The same seed ensures timing coherence across parts.

GarageBand Integration

Import the MIDI files directly into GarageBand:

Generate arrangement: ./examples/trio-demo.sh
Open GarageBand, create new project
Drag the .mid file into the workspace
GarageBand creates tracks for each channel
Assign software instruments to taste

The demo includes a jazz trio arrangement:

Piano: Bluesy melody with chords and swing
Bass: Walking bass line with acoustic bass patch
Drums: Hi-hat, snare, kick with dynamic variation

All generated from pipeline scripts.

Inspiration

This project was inspired by research into generative music tools and techniques:

References

Topic	Link
Analog Synthesizers	Code Self Study
Drum Synthesis	JavaScript Drum Synthesis
Generative Music	Code Self Study
Music Projects	Software and Hardware
FOSS Music Tools	Open Source Music Production
Eurorack Programming	Patch.Init() Tutorial
Opusmodus	Algorithmic Composition in Lisp

The key insight from Opusmodus: algorithmic composition isn’t random music—it’s programmable composition. Motif transformation, rule systems, deterministic generation. music-pipe-rs brings these ideas to Unix pipes.

What’s Next

The pipeline architecture makes extension natural:

More generators: Markov chains, L-systems, cellular automata
More transforms: Inversion, retrograde, quantization
Live mode: Real-time MIDI output with clock sync

Each new capability is just another stage in the pipeline.

Series: Personal Software (Part 5)

Previous: music-pipe-rs: Unix Pipelines

Disclaimer

You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

February 28, 2026 • Software Wrighter

900 words • 5 min read • Abstract

Expanding my home AI cluster from 10% to 20% brain power with a new X99 motherboard and RTX 3090. Adding VoxCPM voice cloning, FLUX text-to-image, and Wan 2.2 text-to-video capabilities.

Lucy 20%: Upgrading My Home AI Cluster

Lucy is getting an upgrade. I’m adding an X99 motherboard with an RTX 3090 to expand my AI cluster from 10% to 20% brain power.

Resource	Link
Video	Lucy 20% Upgrade
Previous	Lucy 10%

New Hardware: Queenbee

The cluster uses bee-themed naming. The new node is called queenbee:

Component	Specification
Motherboard	X99
CPU	Intel Xeon E5-2660 v4 (28 threads)
RAM	64 GB DDR4 ECC
GPU	RTX 3090 (24GB VRAM)
Storage	1TB NVMe SSD + 4TB HDD

New AI Capabilities

With queenbee online, Lucy gains several new abilities:

Capability	Model	What It Does
Voice Cloning	VoxCPM	High-quality text-to-speech with voice cloning
Text-to-Image	FLUX schnell	Fast image generation from text prompts
Text-to-Video	Wan 2.2	Generate video clips from text descriptions
Image-to-Video	SVD	Animate still images into video

The Active Cluster

Currently active for AI workloads:

Node	Role	GPU
hive	MuseTalk lip-sync	2x P40 (48GB total)
queenbee	Generative AI workloads	RTX 3090 (24GB)

Together, they handle the full pipeline: generate images, animate them to video, add lip-synced speech, and produce the final output. See the full apiary inventory below.

Why Local AI?

Running AI locally means:

Privacy - Data never leaves my network
No API costs - Unlimited generations after hardware investment
Customization - Full control over models and parameters
Learning - Deep understanding of how these systems work

The 24GB of VRAM on the 3090 opens up models that wouldn’t fit on smaller cards. FLUX schnell produces high-quality images in seconds. VoxCPM creates natural-sounding speech that can clone voices from short audio samples.

Bee-Themed Host Names

The full apiary (current and planned nodes):

Host	System	CPU	Cores	RAM	GPU
apiary	HPE DL360 G10	1x Xeon Gold 5118	12C/24T	188G	-
bees	HPE DL360 G9	2x E5-2650 v4	24C/48T	128G	-
brood	HPE DL380 G9	2x E5-2680 v4	28C/56T	64G	2x P100-16G
colony	Supermicro 6028U	2x E5-2680 v3	24C/48T	TBD	2x K80-24G
drones	HPE DL380 G9	2x E5-2620 v4	16C/32T	256G	-
hive	HPE DL380 G9	2x E5-2698 v3	32C/64T	128G	2x P40-24G
honeycomb	HPE DL180 G9	1x E5-2609 v4	8C/8T	TBD	-
queenbee	X99	1x E5-2660 v4	14C/28T	64G	RTX 3090-24G
swarm	HPE DL380 G9	2x E5-2698 v3	32C/64T	374G	2x P100-12G
workers	HPE DL560 G8	4x E5-4617 v1	TBD	640G	TBD

Notes: Some nodes pending upgrade or configuration. Workers may upgrade to 4x E5-4657L v2 (48C/96T). Honeycomb needs unbrick. K80 GPUs are old and difficult to configure (limited CUDA version support)—will be replaced with M40 GPUs.

Power and Control

Remote management is essential for a home datacenter. The HPE servers include iLO (Integrated Lights-Out) for out-of-band access to BIOS, diagnostics, monitoring, and power control—even when the OS is down.

Category	Technology	Purpose
Remote Management	HPE iLO	BIOS access, diagnostics, monitoring, power control
IP KVM	JetKVM, Sipeed KVM	Console access for non-HPE servers (planned)
Power Monitoring	Kill-A-Watt, clones	Per-outlet power consumption tracking
Smart Outlets	Home Assistant + Zigbee	Remote power control, scheduling, automation
Additional Circuits	Bluetti LFP power stations	Extra capacity to run more servers, remote control via BT/WiFi/Zigbee

The combination of iLO and smart outlets means I can remotely power-cycle any server, access its console, and monitor power draw—all from my phone or Home Assistant dashboard. The Bluetti stations primarily provide additional circuits so I can run more servers simultaneously—home electrical limits are a real constraint. More LFP power stations will be needed to power Lucy at 100%.

Networking

Each server has 3 or more NICs, segmented by purpose:

Speed	Purpose	Switch
1G	iLO/KVM management	1G switch
2.5G	SSH, SCP, Chrome Remote Desktop	2x 2.5G switches
10G fiber	Server-to-server data transfer (large models)	10G switch

The 10G backbone is essential for moving multi-gigabyte model files between nodes. Loading a 70B parameter model over 1G would take forever—10G fiber makes it practical. The 2.5G network handles interactive work and smaller transfers (using USB NICs where needed), while the 1G management network stays isolated for out-of-band access.
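A quick estimate shows the gap between the link speeds. This is best-case arithmetic with no protocol or disk overhead, and it assumes the 70B model is stored at 2 bytes per parameter (about 140 GB); quantized checkpoints would be smaller.

```python
def transfer_minutes(size_gb, link_gbps, efficiency=1.0):
    """Best-case minutes to move size_gb gigabytes over a link_gbps link."""
    seconds = (size_gb * 8.0) / (link_gbps * efficiency)
    return seconds / 60.0

# Assumption: 70B parameters at 2 bytes each is about 140 GB on disk.
model_gb = 70e9 * 2 / 1e9
for gbps in (1.0, 2.5, 10.0):
    print(f"{gbps:>4} Gb/s: {transfer_minutes(model_gb, gbps):.1f} min")
```

At line rate that is roughly 19 minutes on 1G versus under 2 minutes on 10G, and real transfers will be slower than either figure.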

Additional networking notes:

WiFi 7 for wireless connectivity
Managed switches with VLANs planned for better network segmentation
Linux network bonding experiments to increase aggregate transfer rates
Sneaker net - most servers have hot-swap SAS SSDs and hard drives, so physically moving drives between nodes is sometimes the fastest option for very large transfers

What’s Next

The 20% milestone is just a step. Future upgrades could include:

Additional GPU nodes for parallel processing
Larger language models for local inference
Real-time video generation pipelines
Integration with more specialized models

The bee hive keeps growing.

Building AI infrastructure one node at a time.

February 28, 2026 • Software Wrighter

631 words • 4 min read • Abstract

Large context windows are not a complete solution. As context grows, attention dilutes and instructions drift. Recursive Language Models treat context as a dynamic environment, rebuilding focus each step instead of dragging everything forward.

How AI Learns Part 5: Context Engineering & Recursive Reasoning

Large context windows are not a complete solution.

As context grows:

Attention dilutes
Errors compound
Reasoning quality degrades

Resource	Link
Related	RLM | ICL Revisited

The Context Problem

Transformers have finite attention. Softmax attention weights sum to one, so every added token shrinks the share available to the rest, and the model cannot attend equally to everything. As tokens accumulate:

Earlier instructions lose influence
Patterns average toward generic responses
Multi-step reasoning fails

This is context rot—not forgetting weights, but losing signal in noise.

In-Context Learning (ICL)

The model adapts temporarily via examples in the prompt.

Aspect	ICL
Updates weights?	No
Persists across sessions?	No
Speed	Instant
Mechanism	Activations, not gradients

ICL is powerful but ephemeral. It’s working memory, not learning.

Limitation: As context grows, ICL examples compete with other content for attention.

Recursive Language Models (RLM)

Circular flow diagram showing LLM connected to Tools, Memory, Context, and Evaluation in a recursive loop — Rebuild context each step instead of dragging it forward.

RLMs decompose reasoning into multiple passes. Instead of dragging entire context forward:

Query relevant memory
Retrieve what’s needed now
Execute tools
Evaluate results
Reconstruct focused context
Repeat

This treats context as a dynamic environment, not a static blob.

Why RLM Works

Traditional approach:

[System prompt + 50k tokens of history + query]

RLM approach:

[System prompt + retrieved relevant context + current query]

Each reasoning step starts fresh with focused attention.

Context Engineering Techniques

Technique	How It Helps
Summarization	Compress old context, preserve essentials
Chunking	Process in segments, aggregate results
Retrieval	Pull relevant content, not everything
Tool offloading	Store state externally, query on demand
Structured prompts	Clear sections, explicit priorities

Tool Use as Context Management

Tools aren’t just for actions—they’re for state management.

Instead of keeping everything in context:

Store in files, databases, or structured formats
Query when needed
Return focused results

This converts unbounded context into bounded queries.
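A minimal sketch of that idea, assuming a hypothetical in-memory NoteStore and a toy word-overlap relevance score (a real system would use a file, a database, or a vector index):

```python
class NoteStore:
    """Hypothetical external store: keeps every note, returns only a few."""

    def __init__(self):
        self.notes = []

    def store(self, text: str) -> None:
        self.notes.append(text)

    def query(self, question: str, k: int = 2) -> list[str]:
        # Toy relevance: count the words a note shares with the question.
        words = set(question.lower().split())
        ranked = sorted(
            self.notes,
            key=lambda note: len(words & set(note.lower().split())),
            reverse=True,
        )
        return ranked[:k]


store = NoteStore()
for i in range(1000):
    store.store(f"log entry {i}")
store.store("deploy key rotated on tuesday")

# The store holds 1001 notes, but only k of them ever reach the prompt.
hits = store.query("when was the deploy key rotated", k=2)
print(hits[0])
```

The store can grow without limit while the amount of text injected into the prompt stays fixed at k notes.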

The Agent Loop

Modern agents combine these ideas:

done = False
while not done:
    # 1. Retrieve only the memory relevant to the current task
    relevant = retrieve_from_memory(current_task)

    # 2. Build focused context
    context = [system_prompt, relevant, current_task]

    # 3. Reason
    action = llm(context)

    # 4. Execute
    result = execute_tool(action)

    # 5. Update memory
    memory.store(result)

    # 6. Evaluate
    done = goal_achieved(result)

Each iteration rebuilds context. No rot accumulation.

Test-Time Adaptation

A related technique: temporarily update weights during inference.

Aspect	Test-Time Learning
Updates weights?	Yes, lightly (LoRA)
Persists?	No (rolled back)
Purpose	Adapt to input distribution

This sits between ICL (no updates) and fine-tuning (permanent updates).

Key Insight

Context is not a static buffer. It’s a dynamic workspace.

Systems that treat context as “append everything” will rot. Systems that actively manage context stay coherent.

References

Concept	Paper
RLM	Recursive Language Models (Zhou et al. 2024)
ICL	What Can Transformers Learn In-Context? (Garg et al. 2022)
Test-Time Training	TTT for Language Models (2024)
Chain-of-Thought	Chain-of-Thought Prompting (Wei et al. 2022)

Coming Next

In Part 6, we’ll connect all of this to continuous learning: replay methods, subspace regularization, adapter evolution, and consolidation loops.

Rebuild focus instead of dragging baggage.

February 28, 2026 • Software Wrighter

406 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Label Smoothing (softening targets to reduce overconfidence), Miscalibration (confidence not matching accuracy), Representation Learning (automatically learning useful features), Adversarial Examples (inputs crafted to cause errors), Double Descent (test error decreasing twice with model size).

Five ML Concepts - #25

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #25

References

Concept	Reference
Label Smoothing	Rethinking the Inception Architecture (Szegedy et al. 2015)
Miscalibration	On Calibration of Modern Neural Networks (Guo et al. 2017)
Representation Learning	Representation Learning: A Review (Bengio et al. 2013)
Adversarial Examples	Intriguing properties of neural networks (Szegedy et al. 2013)
Double Descent	Deep Double Descent (Nakkiran et al. 2019)

Today’s Five

1. Label Smoothing

Replacing hard one-hot labels with softened target distributions during training. Instead of 100% confidence in one class, distribute small probability to other classes.

Reduces overconfidence and can improve generalization.

Like allowing small uncertainty instead of absolute certainty.
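A small sketch of the usual formulation, where the one-hot target is mixed with a uniform distribution. The value epsilon = 0.1 is a common choice, used here only for illustration:

```python
def smooth_labels(true_class: int, num_classes: int, epsilon: float = 0.1) -> list[float]:
    """Mix a one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == true_class else uniform
        for i in range(num_classes)
    ]


targets = smooth_labels(true_class=2, num_classes=4, epsilon=0.1)
print(targets)  # the true class keeps most of the mass; the rest is spread evenly
```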

2. Miscalibration

When predicted confidence does not match observed accuracy. A model that says “90% confident” should be right 90% of the time.

Modern neural networks tend to be overconfident. Temperature scaling can help.

Like a forecast that sounds certain but is often wrong.
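Temperature scaling divides the logits by a scalar T before the softmax. The sketch below uses made-up logits and an arbitrary T = 2.0; in practice T is fit on a held-out validation set.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # illustrative values
raw = softmax(logits)
cooled = softmax(logits, temperature=2.0)

print(f"raw confidence:    {max(raw):.3f}")
print(f"scaled confidence: {max(cooled):.3f}")
```

The predicted class does not change, only the confidence attached to it, which is why temperature scaling leaves accuracy untouched.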

3. Representation Learning

Learning useful internal features automatically from raw data. Instead of hand-crafting features, the model discovers what matters.

The foundation of deep learning’s success across domains.

Like detecting edges before recognizing full objects.

4. Adversarial Examples

Inputs modified to cause incorrect predictions. Small, often imperceptible changes can flip model outputs.

A security concern and a window into model vulnerabilities.

Like subtle changes that fool a system without obvious differences.

5. Double Descent

Test error that decreases, increases, then decreases again as model capacity grows. The classical bias-variance tradeoff captures only the first part.

Modern overparameterized models operate in the second descent regime.

Like improving, getting worse, then improving again.

Quick Reference

Concept	One-liner
Label Smoothing	Softening targets to reduce overconfidence
Miscalibration	Confidence not matching accuracy
Representation Learning	Automatically learning useful features
Adversarial Examples	Inputs crafted to cause errors
Double Descent	Test error decreasing twice with model size

Short, accurate ML explainers. Follow for more.


February 27, 2026 • Software Wrighter

627 words • 4 min read • Abstract

Modern AI systems increasingly rely on external memory. RAG, CAG, and Engram-style modules shift 'learning' away from weights. The brain stays stable. The notebook grows.

How AI Learns Part 4: Memory-Based Learning

Modern AI systems increasingly rely on external memory.

This shifts “learning” away from parameters.

Resource	Link
Related	Engram | Engram Revisited | Multi-hop RAG

The Memory Paradigm

Diagram showing brain (model) connected to notebook (memory) with RAG, CAG, and Engram types — Store facts outside the brain.

Why External Memory?

Most “learning new facts” should not modify weights.

Weights are for generalization. They encode reasoning patterns, language structure, and capability.

Memory is for storage. It holds specific facts, documents, and experiences.

If you store everything in weights:

You create interference
You risk forgetting
You must retrain

If you store facts in memory:

No forgetting
Fast updates
Survives model upgrades

Retrieval-Augmented Generation (RAG)

Documents are embedded into vectors. At query time:

Embed the query
Search the vector database
Retrieve relevant documents
Inject into prompt
Generate grounded response

The model does not need to remember facts internally. It retrieves them on demand.
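The steps above can be sketched end to end with a toy "embedding". Word counts and cosine similarity stand in for a learned embedding model and a vector database, and the final generation call is left out, so this only shows the retrieve-and-inject part:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


documents = [
    "canberra is the capital of australia",
    "the ibm 1130 was introduced in 1965",
    "lora inserts low-rank matrices into transformer layers",
]
index = [(doc, embed(doc)) for doc in documents]  # stands in for the vector DB


def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"


print(build_prompt("what is the capital of australia"))
```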

RAG Benefits

Benefit	Description
No forgetting	External storage, not weights
Persistent	Survives restarts and model changes
Scalable	Add documents without retraining
Verifiable	Can cite sources

RAG Challenges

Retrieval precision (wrong docs = bad answers)
Latency (search takes time)
Index maintenance
Chunk boundaries

Cache-Augmented Generation (CAG)

Instead of retrieving from a vector DB, cache previous context or KV states.

Use cases:

Repeated knowledge tasks
Multi-turn conversations
Pre-computed context windows

Benefits over RAG:

Often faster (no embedding + search)
More deterministic
Good for structured repeated workflows

Trade-offs:

Less flexible
Cache management complexity

Engram-Style Memory

Recent proposals (e.g., DeepSeek research) introduce conditional memory modules with direct indexing.

Instead of scanning long context or searching vectors:

Memory slots indexed directly
O(1) lookup instead of O(n) attention
Separates static knowledge from dynamic reasoning

The goal: Constant-time memory access that doesn’t scale with context length.

This changes the compute story:

Don’t waste attention on “known facts”
Reserve compute for reasoning
Avoid context rot

Model Editing

A related technique: surgically patch specific facts without full fine-tuning.

Example: The model says “The capital of Australia is Sydney.” You edit the specific association to “Canberra” without retraining.

Pros:

Targeted fixes
Fast

Cons:

Side effects possible
Consistency not guaranteed

The Key Distinction

Aspect	Weight Learning	Memory Learning
Location	Parameters	External storage
Persistence	Model lifetime	Storage lifetime
Forgetting risk	High	None
Update speed	Slow (training)	Fast (database)
Survives model change?	No	Yes

When to Use What

Situation	Approach
Need new reasoning capability	Weight-based (fine-tune)
Need to know new facts	Memory-based (RAG)
Need domain expertise	Weight-based (LoRA)
Need to cite sources	Memory-based (RAG)
Frequently changing data	Memory-based (RAG/CAG)

References

Concept	Paper
RAG	Retrieval-Augmented Generation (Lewis et al. 2020)
Engram	Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025)
REALM	REALM: Retrieval-Augmented Pre-Training (Guu et al. 2020)
Model Editing	Editing Factual Knowledge (De Cao et al. 2021)

Coming Next

In Part 5, we’ll examine context engineering and recursive reasoning: ICL, RLM, and techniques that prevent context rot during inference.

The brain stays stable. The notebook grows.

February 27, 2026 • Software Wrighter

426 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Warmup (gradually increasing learning rate at start), Data Leakage (training on information unavailable at deployment), Mode Collapse (limited generative output variety), Blue/Green Deployment (switching between parallel production environments), Reward Hacking (exploiting reward function flaws).

Five ML Concepts - #24

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #24

References

Concept	Reference
Warmup	Accurate, Large Minibatch SGD (Goyal et al. 2017)
Data Leakage	Leakage in Data Mining (Kaufman et al. 2012)
Mode Collapse	Generative Adversarial Nets (Goodfellow et al. 2014)
Blue/Green Deployment	MLOps best practice (no canonical paper)
Reward Hacking	Concrete Problems in AI Safety (Amodei et al. 2016)

Today’s Five

1. Warmup

Gradually increasing the learning rate at the start of training as part of a learning rate schedule. This helps stabilize early training when gradients can be noisy.

Warmup is especially important for large batch training.

Like stretching before a sprint instead of starting at full speed.
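A minimal linear warmup sketch. The base learning rate and the 500-step warmup length are arbitrary illustrative values, and real schedules usually follow warmup with a decay phase rather than holding flat:

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: ramp up to base_lr over warmup_steps, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


for step in (0, 249, 499, 500, 1000):
    print(step, warmup_lr(step, base_lr=0.1, warmup_steps=500))
```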

2. Data Leakage

When information unavailable at deployment accidentally influences model training. This creates artificially high validation scores that don’t reflect real-world performance.

Common sources include future data, preprocessing on the full dataset, and duplicate samples.

Like memorizing test answers instead of learning the material.
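The "preprocessing on the full dataset" case is easy to show with standardization. In this toy sketch (made-up numbers), the leaky version computes the mean and standard deviation with the test point included, so the held-out value has already shaped the features:

```python
from statistics import mean, pstdev

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # last value plays the held-out test point
train, test = data[:4], data[4:]

# Leaky: statistics computed on train + test together
leaky_mean, leaky_std = mean(data), pstdev(data)

# Clean: statistics computed on the training split only, then reused for test
clean_mean, clean_std = mean(train), pstdev(train)


def standardize(values, mu, sigma):
    return [(x - mu) / sigma for x in values]


print("leaky:", standardize(test, leaky_mean, leaky_std))
print("clean:", standardize(test, clean_mean, clean_std))
```

With leaky statistics the unusual test point looks far less extreme than it would at deployment, which is the kind of optimism that inflates validation scores.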

3. Mode Collapse

When a generative model produces limited output diversity. The generator learns to produce only a few outputs that fool the discriminator.

A major challenge in GAN training that various architectures attempt to address.

Like a musician who only plays one song no matter the request.

4. Blue/Green Deployment

Maintaining two production environments and switching traffic between them. One serves live traffic while the other is updated and tested.

Enables instant rollback if problems occur.

Like having a backup stage ready so the show never stops.

5. Reward Hacking

When agents exploit reward functions in unintended ways. The agent optimizes the reward signal rather than the intended objective.

A key challenge in reinforcement learning and AI alignment.

Like gaming the grading rubric instead of learning the material.

Quick Reference

Concept	One-liner
Warmup	Gradually increasing learning rate at start
Data Leakage	Training on information unavailable at deployment
Mode Collapse	Limited generative output variety
Blue/Green Deployment	Switching between parallel environments
Reward Hacking	Exploiting reward function flaws

Short, accurate ML explainers. Follow for more.


February 26, 2026 • Software Wrighter

1231 words • 7 min read • Abstract

A browser-based IBM 1130 system emulator with authentic console panel indicator lights, keypunch, printer, and assembly game. Experience the full 1965 minicomputer ecosystem through interactive simulations. Work in progress.

TBT (5/?): IBM 1130 System Emulator - Experience 1960s Computing

The IBM 1130, introduced in 1965, was a 16-bit minicomputer that brought computing to universities and small businesses. This browser-based system emulator recreates the complete experience: console panel with authentic indicator lights, keypunch, printer, and assembly programming.

Status: Work in progress. Core features functional, enhancements planned.

Resource	Link
Live Demo	IBM 1130 System Emulator
Source	GitHub
Video	IBM 1130 System Emulator
IBM Docs	Functional Characteristics (GA26-5881)
More Docs	Bitsavers IBM 1130 Collection

The System

This isn’t just an assembly emulator—it’s a full system visualization:

Component	What It Does
Console Panel	Authentic indicator lights, toggle switches, speed control
Assembler Game	Write and execute IBM 1130 code with real-time visualization
Keypunch	IBM 029 text cards and 1442 object deck visualization
Printer	IBM 1131 console printer with greenbar paper

Console Panel

The console panel recreates the physical operator interface with all indicator light groups documented in IBM’s Functional Characteristics manual.

Register Display (6 rows × 16 positions)

Row	Register	Bits Shown	Purpose
1	IAR	15	Instruction Address Register (program counter)
2	SAR	15	Storage Address Register (memory access)
3	SBR	16	Storage Buffer Register (data word)
4	AFR	16	Arithmetic Factor Register (operand)
5	ACC	16	Accumulator (main arithmetic register)
6	EXT	16	Extension (double-precision, multiply/divide)

Right-Side Indicators

Beyond the register displays, the console shows:

Operation Register (5 bits) - Binary op-code of current instruction
Format/Tag Indicators - Long instruction format, index register selection
Cycle Control (T0-T7) - Internal timing pulses for debugging
Status Lights - Wait, Run, Fetch, Execute, Indirect Address

Control Panel Lights

Light	Purpose
DISK UNLOCK	Safe to swap 2315 disk cartridge
FILE READY	Disk drive up to speed
FORMS CHECK	Printer out of paper
RUN	CPU executing instructions
PARITY	Memory parity error
FREEZE	Fatal hardware error

Operator Controls

16-bit toggle switches for manual data entry
7-position speed knob - Single Step, SMC, INT RUN, RUN, SI, DISP, LOAD
Lamp test to verify all indicators function
Emergency stop button

Assembler Game

Learn the IBM 1130 instruction set interactively:

Complete instruction set - LD, STO, LDX, STX, A, S, AND, OR, SLA, SRA, BSC, BSI, WAIT
Memory-mapped index registers - XR1-3 at addresses 1, 2, 3 (historically accurate)
Step-by-step execution with change highlighting
Interactive examples covering arithmetic, indexing, shifts
Progressive challenges with validation

Keypunch

The keypunch simulation supports two card types:

IBM 029 Text Cards

Hollerith encoding - Standard character-to-punch mapping
Visual card display - Watch holes appear as you type
Multi-card decks - Manage multiple cards
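For readers unfamiliar with Hollerith encoding, here is a small Python sketch of the standard mapping for digits and uppercase letters. It is an illustration only, not the emulator's Rust code, and it skips special characters, whose punch patterns differ between keypunch models:

```python
def hollerith_rows(ch: str) -> list[int]:
    """Return the punched rows (12, 11, 0-9) for a digit, uppercase letter, or space."""
    if ch == " ":
        return []  # blank column, no punches
    if ch.isdigit():
        return [int(ch)]  # digits punch their own row
    # Letters combine a zone punch (12, 11, or 0) with a digit punch.
    zones = [("ABCDEFGHI", 12, 1), ("JKLMNOPQR", 11, 1), ("STUVWXYZ", 0, 2)]
    for letters, zone, first_digit in zones:
        if ch in letters:
            return [zone, letters.index(ch) + first_digit]
    raise ValueError(f"character not covered by this sketch: {ch!r}")


for ch in "IBM 1130":
    print(repr(ch), hollerith_rows(ch))
```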

IBM 1130 Object Deck (1442 Output)

Binary card visualization - Machine code punch patterns
Object deck format - Matches authentic assembler output
No character printing - Pure binary data cards

The IBM 029 Keypunch produced human-readable text cards. For binary object decks (compiled programs), the IBM 1442 Card Read-Punch would create cards with arbitrary punch patterns that don’t map to characters.

Printer

The IBM 1131 Console Printer simulation:

Greenbar paper rendering - Authentic line printer output
Typewriter-style characters - Period-appropriate appearance
Console output - System messages and program output

Technology

Component	Choice
Language	Rust
Target	WebAssembly
UI Framework	Yew
Build Tool	Trunk
Hosting	GitHub Pages

Planned Enhancements

This is a work in progress. Planned features include:

Additional challenges (10 total)
Code save/load functionality
URL sharing of programs
Breakpoints and memory watches
Keyboard shortcuts
Full 1442 Card Read-Punch integration

IBM Documentation References

Document	Description
GA26-5881	Functional Characteristics - Console panel details
GA26-5717	Operating Procedures - Operator instructions
GA26-5914	Physical Planning - System dimensions
Bitsavers Collection	Complete IBM 1130 documentation archive

Project Goals

This is an early proof-of-concept for trying out components that could be extended to produce a more realistic system of devices that could actually run programs. The modular architecture allows each peripheral (console, keypunch, printer) to be developed and refined independently.

A key goal is educational challenges that teach assembly language step by step. The assembler game provides progressive exercises that build understanding from basic load/store operations through arithmetic, indexing, and control flow.

Historical Significance

The IBM 1130 was the first computer for many programmers in the late 1960s and 1970s. Its clean architecture and accessible price point (~$32,000) made it ideal for education.

A Transitional Technology

The IBM 1130 arrived after mechanical calculators and vacuum tube computers, but before dense integrated circuits and microprocessors. This was a unique moment in computing history when machines were complex enough to be powerful, yet simple enough to be fully understood by one person.

The system shipped with complete schematics and diagnostic listings. A field engineer could use an oscilloscope to probe the pins on every transistor. The “integrated circuit” of the era was a small can with a 4×4 pin grid containing just two transistors, mounted on a pluggable card connected via a wire-wrapped backplane. When something failed, you could see it, touch it, and replace it.

Non-Volatile Core Memory

One remarkable feature: magnetic core memory was non-volatile. You could stop the system, power down overnight, come back in the morning, power up, and start your program exactly where it left off—without reloading from cards, tape, or disk.

Each bit was stored as the magnetic polarity of a tiny ferrite ring. No electricity required to maintain state. This made the 1130 remarkably resilient and practical for environments where power wasn’t guaranteed.

Notable fact: The Forth programming language was developed on the IBM 1130 by Charles Moore in the late 1960s.

Personal Experience

In the late 1970s, I worked as an IBM Customer Engineer maintaining a large number of IBM 1130 and 1800 systems used primarily by IBM manufacturing facilities in Kingston, Poughkeepsie, and East Fishkill, New York.

Field service on these machines was hands-on in ways that seem almost unimaginable today. I would often hand-assemble code on paper, converting mnemonics to binary, then enter machine code via the console toggle switches to create a small program. That program’s job? To punch another program onto a card.

I could then insert that punched card into a diagnostic deck to loop on an error condition while I used an oscilloscope and logic schematics to diagnose a failing circuit card. The blinking lights weren’t decoration—they were essential debugging tools that showed exactly what the CPU was doing at each moment.

This emulator recreates that experience: the same indicator lights, the same toggle switches, the same intimate connection between human and machine that made these systems so memorable to work with.

Experience 1960s computing in your browser. Work in progress.


February 26, 2026 • Software Wrighter

649 words • 4 min read • Abstract

Weight-based learning modifies the neural network itself. Pretraining, fine-tuning, LoRA, alignment methods, and distillation each change the brain permanently. Slow to change, but forms the stable core.

How AI Learns Part 3: Weight-Based Learning

Weight-based learning modifies the neural network itself.

It is slow. It is powerful. It is dangerous.

Resource	Link
Related	Sleepy Coder: When Fine-Tuning Fails | 5MLC #3: LoRA

The Weight-Based Methods

Diagram showing LoRA adapters, distillation flow, and alignment pipeline — Weight-based learning modifies the brain itself.

Pretraining

This creates the base model.

It encodes language structure, reasoning patterns, and general world knowledge. The process:

Trains on terabytes of text
Uses self-supervised learning (predict next token)
Runs for weeks or months
Costs millions of dollars

This learning is rarely repeated for cost reasons. The result is a foundation that everything else builds upon.

Fine-Tuning

Fine-tuning adapts models for specific tasks.

Standard Fine-Tuning

Adjust some or all weights using task-specific data.

Pros:

Can significantly change behavior
Works with small datasets

Cons:

Risk of catastrophic forgetting
Expensive if you modify all weights
Hard to undo

Supervised Fine-Tuning (SFT)

Train on instruction → response pairs.

This teaches the model to:

Follow directions
Produce helpful outputs
Maintain conversation structure

Risk: Can reduce other capabilities if data is narrow.

Preference Optimization

Instead of “correct answers,” train from comparisons: preferred vs rejected responses.

Method	Description
Reinforcement Learning from Human Feedback (RLHF)	Reward model + reinforcement learning
Direct Preference Optimization (DPO)	Trains directly on preference pairs, no separate reward model
Reinforcement Learning from AI Feedback (RLAIF)	AI-generated preferences

Pros: Strong style/safety/helpfulness steering

Cons: Can drift (“over-align”), may conflict with domain competence
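To make "train from comparisons" concrete, here is a sketch of the DPO loss for a single preference pair. The inputs are assumed to be summed log-probabilities of each response under the policy and under a frozen reference model. The numbers and beta = 0.1 are made up for illustration.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * preference margin)."""
    # How much more the policy favors the chosen response than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))

# Policy has shifted probability toward the preferred response: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss only falls when the policy raises the preferred response relative to the rejected one, measured against the reference model.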

Parameter-Efficient Fine-Tuning (PEFT)

Instead of changing all weights, inject small trainable modules.

LoRA (Low-Rank Adaptation)

Insert small low-rank matrices into transformer layers. Only train these matrices.

Benefits:

Faster training: Fewer parameters to update
Modular: Can swap adapters
Version control: Different adapters for different tasks
Lower forgetting risk: Base weights frozen
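The "fewer parameters" benefit is simple arithmetic. LoRA replaces a full d_out x d_in update with two low-rank factors, B (d_out x r) and A (r x d_in). The layer width of 4096 and rank of 8 below are assumed values for illustration:

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the low-rank update B @ A."""
    return rank * (d_out + d_in)


d = 4096  # assumed layer width
full = full_params(d, d)
lora = lora_params(d, d, rank=8)

print(f"full update: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({full // lora}x fewer)")
```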

Other PEFT Methods

Prompt tuning: Learn soft prompts
Prefix tuning: Prepend learned vectors
Adapters: Small bottleneck layers
IA³: Learned vectors that scale activations

Shared LoRA Subspaces

Multiple tasks share adapter subspaces to reduce interference.

Recent work (ELLA, Share) maintains evolving shared low-rank subspaces that:

Reduce interference between tasks
Enable continual learning
Keep memory constant

Distillation

Train a smaller model using a larger model as teacher.

Aspect	Teacher	Student
Size	Large	Small
Cost	High inference	Low inference
Knowledge	Full	Compressed

Distillation benefits:

Speeds up inference
Often improves consistency
Can reduce hallucination
Makes deployment cheaper

This is not runtime learning—it’s offline structural learning.

The Alignment Pipeline

Modern models typically go through:

Pretraining → General competence
SFT → Follow instructions
RLHF/DPO → Align with preferences
Safety fine-tuning → Reduce harmful outputs

Each step modifies weights. Each step risks forgetting previous capabilities.

Key Insight

Fine-tuning changes the brain. RAG changes the notes on the desk.

Weight-based learning is the core capability layer. It’s slow to change, expensive to update, and risky to modify—but it forms the stable foundation that everything else builds upon.

References

Concept	Paper
LoRA	LoRA: Low-Rank Adaptation (Hu et al. 2021)
RLHF	Training LMs with Human Feedback (Ouyang et al. 2022)
DPO	Direct Preference Optimization (Rafailov et al. 2023)
Distillation	Distilling Knowledge in Neural Networks (Hinton et al. 2015)
Adapters	Parameter-Efficient Transfer Learning (Houlsby et al. 2019)

Coming Next

In Part 4, we’ll explore memory-based learning: RAG, CAG, Engram, and other techniques that learn without touching weights.

Change the brain carefully.

February 26, 2026 • Software Wrighter

440 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Emergent Behavior (capabilities appearing at scale), Tool Use (AI calling external tools), Loss Surface Sharpness (flatter minima generalize better), Learning Rate Schedules (adjusting learning rate during training), Canary Deployment (gradually rolling out new models safely).

Five ML Concepts - #23

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #23

References

Concept	Reference
Emergent Behavior	Emergent Abilities of Large Language Models (Wei et al. 2022)
Tool Use	Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al. 2023)
Loss Surface Sharpness	On Large-Batch Training for Deep Learning (Keskar et al. 2016)
Learning Rate Schedules	SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter 2016)
Canary Deployment	MLOps best practice (no canonical paper)

Today’s Five

1. Emergent Behavior

Some capabilities appear only when models reach sufficient scale. These behaviors were not directly programmed but arise from learned representations.

Emergence is a key phenomenon in large language models.

Like a child learning words and then suddenly understanding full sentences.

2. Tool Use

Modern AI systems can generate structured commands to call external tools. These include search engines, calculators, or code interpreters.

This extends model capabilities beyond internal knowledge.

Like asking a librarian to look something up instead of guessing.

3. Loss Surface Sharpness

Sharp minima are sensitive to small weight changes. Flatter minima tend to be more robust and often generalize better.

Training methods that find flatter regions can improve test performance.

Like standing on a plateau instead of balancing on a narrow peak.

4. Learning Rate Schedules

Instead of keeping the learning rate constant, training often starts high and gradually reduces it. Schedules like step decay or cosine annealing improve convergence.

Warm restarts can help escape local minima.

Like running fast at first, then slowing down to finish precisely.
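A minimal cosine annealing sketch, without restarts. The learning rate bounds and step count are illustrative values:

```python
import math


def cosine_lr(step: int, total_steps: int, lr_max: float, lr_min: float = 0.0) -> float:
    """Cosine annealing: decay smoothly from lr_max to lr_min over total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


for step in (0, 250, 500, 750, 1000):
    print(step, round(cosine_lr(step, total_steps=1000, lr_max=0.1), 4))
```

A warm restart resets the step counter to zero at chosen intervals, which jumps the rate back up to lr_max.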

5. Canary Deployment

A new model version is rolled out to a small percentage of users first. If problems appear, rollout stops before affecting everyone.

Essential MLOps practice for safe production updates.

Like tasting food before serving it to all your guests.
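One common way to pick the "small percentage" is to hash a stable user ID into 100 buckets, so each user consistently sees the same model. The model names and the 5% figure below are hypothetical:

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def pick_model(user_id: str, canary_percent: int) -> str:
    # "model-v1" and "model-v2" are placeholder names for old and new versions.
    return "model-v2" if in_canary(user_id, canary_percent) else "model-v1"


users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, 5) for u in users) / len(users)
print(f"{share:.1%} of users routed to the canary")
```

Raising canary_percent widens the rollout, and setting it to 0 sends everyone back to the old model.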

Quick Reference

Concept	One-liner
Emergent Behavior	Capabilities appearing at sufficient scale
Tool Use	AI calling external tools
Loss Surface Sharpness	Flatter minima generalize better
Learning Rate Schedules	Adjusting learning rate during training
Canary Deployment	Gradually rolling out new models safely

Short, accurate ML explainers. Follow for more.


February 25, 2026 • Software Wrighter

641 words • 4 min read • Abstract

Two fundamentally different failure modes plague AI systems. Catastrophic forgetting destroys old knowledge when learning new skills. Context rot loses early instructions in long conversations. Different problems, different solutions.

How AI Learns Part 2: Catastrophic Forgetting vs Context Rot

There are two fundamentally different failure modes in modern AI systems.

They are often confused. They should not be.

Resource	Link
Related	Sleepy Coder: Routing Prevents Forgetting | RLM

The Two Failures

Split diagram showing catastrophic forgetting (weight interference) vs context rot (attention dilution) — Two different failure modes require two different solutions.

Catastrophic Forgetting (Weight-Space Failure)

When you fine-tune a model on new tasks, performance on older tasks may degrade.

This happens because gradient descent updates overlap in parameter space. The model does not “know” which weights correspond to which task. It optimizes globally.

Example: Fine-tune a model on medical text. Its ability to write code degrades. The new learning overwrote old capabilities.

Why It Happens

Neural networks store knowledge distributed across many weights. When you update those weights for Task B, you modify the same parameters that encoded Task A. The old knowledge gets overwritten.

This is the stability vs plasticity tradeoff:

Plasticity: Learn new things quickly
Stability: Retain old things reliably

You cannot maximize both simultaneously.

Solutions

Method	How It Helps
Replay	Train on old + new data
Subspace regularization	Constrain weight updates to avoid interference
Shared Low-Rank Adaptation (LoRA) spaces	Modular updates that don’t overwrite base weights
Freezing base weights	Keep foundation stable, train adapters only

Context Rot (Inference-Time Failure)

Context rot is not weight damage.

It happens when:

Prompts grow too large
Earlier instructions get diluted
Attention spreads thin
The model begins averaging patterns instead of reasoning

Example: A 50,000 token conversation. The original system prompt is still there, but the model stops following it. Earlier context gets “forgotten” even though it’s technically present.

Why It Happens

Transformer attention is finite. With limited attention heads and capacity, the model cannot attend equally to everything. As context grows, earlier tokens receive less attention weight.

This creates:

Instruction drift: Original instructions lose influence
Pattern averaging: The model reverts to generic responses
Lost coherence: Multi-step reasoning fails
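A toy calculation shows the dilution effect. Assume a single instruction token with attention score 2.0 while every other token scores 0. This is a deliberately simplified model, since real attention is learned and spread across many heads:

```python
import math


def attention_on_first(n_tokens: int, first_score: float = 2.0) -> float:
    """Softmax weight on token 0 when every other token has score 0."""
    strong = math.exp(first_score)
    return strong / (strong + (n_tokens - 1))  # exp(0) = 1 for each other token


for n in (10, 1_000, 50_000):
    print(f"{n:>6} tokens: weight on instruction = {attention_on_first(n):.5f}")
```

The instruction's score never changes, yet its share of attention shrinks as the context grows, because the softmax must sum to 1 across all tokens.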

Solutions

Method	How It Helps
Retrieval-based context	Pull relevant passages, not everything
Recursive Language Models (RLM)	Rebuild context each step
Summarization	Compress old context
Memory indexing	Constant-time lookup instead of linear attention
Structured tool calls	Offload state to external systems

The Critical Distinction

Aspect	Catastrophic Forgetting	Context Rot
Where	Weights	Prompt window
When	During training	During inference
Persists?	Permanently	Session only
Analogy	Brain damage	Working memory overload

Why This Matters

If you confuse these failure modes, you apply the wrong fix.

Forgetting problem? Don’t add more context. Fix your training.
Context rot problem? Don’t retrain. Fix your context management.

Many “AI agents that forget” discussions conflate both. Modern systems need solutions for both simultaneously.

References

Concept	Paper
Catastrophic Forgetting	Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017)
Continual Learning Survey	A Comprehensive Survey of Continual Learning (Wang et al. 2023)
ELLA	ELLA: Subspace Learning for Lifelong Machine Learning (2024)
Share	Share: Shared LoRA Subspaces for Continual Learning (2025)
RLM	Recursive Language Models (Zhou et al. 2024)

Coming Next

In Part 3, we’ll examine weight-based learning in detail: pretraining, fine-tuning, LoRA, alignment methods, and distillation.

Different failures need different fixes.

February 25, 2026 • Software Wrighter

472 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: RSFT (rejection sampling fine-tuning with filtered outputs), Model Steerability (adjusting behavior at inference time), LSTM (long short-term memory for sequences), Why More Data Beats Better Models (data scale trumps architecture tweaks), System Reliability vs Model Quality (balancing accuracy with uptime).

Five ML Concepts - #22

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #22

References

Concept	Reference
RSFT	Scaling Relationship on Learning Mathematical Reasoning (Yuan et al. 2023)
Model Steerability	Controllable Generation from Pre-trained Language Models (Zhang et al. 2023)
LSTM	Long Short-Term Memory (Hochreiter & Schmidhuber 1997)
More Data Beats Better Models	The Unreasonable Effectiveness of Data (Halevy et al. 2009)
System Reliability vs Quality	MLOps best practice (no canonical paper)

Today’s Five

1. RSFT (Rejection Sampling Fine-Tuning)

A method where many model outputs are generated, weaker ones are filtered out, and the best samples are used for further fine-tuning. It improves output quality without full reinforcement learning.

The model learns from its own best attempts.

Like practicing many attempts and studying only your best ones.
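
The filtering step can be sketched in a few lines. This is a minimal illustration, not the procedure from the cited paper: `generate` and `score` are hypothetical stand-ins for a sampling call and a reward model or answer checker, and the toy versions below just guess at an arithmetic answer.

```python
import random
from typing import Callable, List, Tuple


def rejection_sample(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one completion
    score: Callable[[str, str], float],   # hypothetical: higher is better
    num_samples: int = 8,
    keep_top: int = 2,
) -> List[Tuple[str, str]]:
    """Sample several completions and keep only the highest-scoring ones."""
    candidates = [generate(prompt) for _ in range(num_samples)]
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    # The kept (prompt, completion) pairs become supervised fine-tuning data.
    return [(prompt, c) for c in ranked[:keep_top]]


# Toy stand-ins: "completions" are guesses at 17 + 25, scored by closeness to 42.
rng = random.Random(0)
toy_generate = lambda prompt: str(rng.randint(35, 50))
toy_score = lambda prompt, completion: -abs(int(completion) - 42)

sft_pairs = rejection_sample("17 + 25 = ?", toy_generate, toy_score)
```

The fine-tuning step itself is omitted; the point is that only the filtered pairs ever reach it.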

2. Model Steerability

The ability to adjust a model’s behavior through prompts, parameters, or control mechanisms. This allows flexible behavior without retraining.

Steerable models can adapt to different tasks or styles at inference time.

Like steering a car instead of letting it move in a fixed direction.

3. LSTM (Long Short-Term Memory)

A recurrent neural network architecture with gates that regulate memory flow. It was designed to mitigate vanishing gradient problems in sequence modeling.

LSTMs decide what to remember and what to forget at each time step.

Like a notebook where you choose what to keep and what to forget.
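
To make the gates concrete, here is a sketch of one time step of a single-unit LSTM in plain Python. The weights are arbitrary illustrative values (an assumption, not trained parameters), and real implementations use vectors and matrices rather than scalars.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM; `w` maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # additive update of the cell state
    h = o * math.tanh(c)
    return h, c


# Arbitrary illustrative weights (assumptions, not trained values).
weights = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:
    h, c = lstm_step(x, h, c, weights)
```

The additive cell update `c = f * c_prev + i * g` is the part that lets information, and gradients, survive across many steps when the forget gate stays near 1.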

4. Why More Data Beats Better Models

In many cases, adding high-quality data improves performance more than small architecture improvements. Data scale often matters as much as model design.

This is sometimes called “the unreasonable effectiveness of data.”

Like practicing with many real conversations instead of perfecting one grammar rule.

5. System Reliability vs Model Quality

A slightly less accurate model that runs reliably can outperform a fragile but slightly better one. Engineers balance uptime, latency, and stability against pure accuracy.

Production systems need both correctness and dependability.

Like choosing a reliable car over a faster one that breaks down often.

Quick Reference

Concept	One-liner
RSFT	Fine-tuning on filtered best outputs
Model Steerability	Adjusting behavior at inference time
LSTM	Gated memory for sequence modeling
More Data Beats Better Models	Data scale trumps architecture tweaks
System Reliability vs Quality	Balancing accuracy with uptime

Short, accurate ML explainers. Follow for more.

February 25, 2026 • Software Wrighter

1393 words • 7 min read • Abstract

Expanding many-eyes learning with intrinsic rewards and a new web visualization. CuriousScout uses count-based novelty, OptimisticScout uses optimistic initialization. The key trade-off: diversity helps during exploration, but once Q-values converge, all scouts should follow the same optimal policy. Strategy quality matters more than diversity in simple environments.

Many-Eyes Learning: Intrinsic Rewards and Diversity

In Part 1, we demonstrated that multiple scouts dramatically improve learning in sparse-reward environments. Five scouts achieved 60% success where a single scout achieved 0%.

This post looks at how scouts explore: intrinsic rewards that drive novelty-seeking behavior, and what happens when you mix different exploration strategies.

Resource	Link
Code	many-eyes-learning
Part 1	Solving Sparse Rewards with Many Eyes
Video	Many-Eyes Learning: Watch AI Scouts Explore

Recap: The Many-Eyes Architecture

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Scout 1   │  │   Scout 2   │  │   Scout N   │
│ (strategy A)│  │ (strategy B)│  │ (strategy N)│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       v                v                v
┌─────────────────────────────────────────────────┐
│              Experience Buffer                   │
└─────────────────────────────────────────────────┘
                       │
                       v
┌─────────────────────────────────────────────────┐
│               Shared Learner                     │
└─────────────────────────────────────────────────┘

Scouts are information gatherers, not independent learners. They explore with different strategies, pool their discoveries, and a shared learner benefits from the combined experience.

New Scout Strategies

CuriousScout: Count-Based Novelty

IRPO formalizes intrinsic rewards as the mechanism that drives scout exploration. CuriousScout implements count-based curiosity:

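from collections import defaultdict
from math import sqrt

class CuriousScout(Scout):
    def __init__(self, bonus_scale: float = 1.0):
        # Visit counts per state; unseen states default to 0
        self.state_counts = defaultdict(int)
        self.bonus_scale = bonus_scale

    def intrinsic_reward(self, state):
        # Bonus shrinks as 1 / sqrt(visits + 1)
        count = self.state_counts[state]
        return self.bonus_scale / sqrt(count + 1)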

How it works:

Track how many times each state has been visited
Reward = bonus_scale / √(count + 1)
Novel states get high rewards; familiar states get diminishing returns

The intuition: A curious scout is drawn to unexplored territory. The first visit to a state is exciting (reward = 1.0). The fourth visit is mundane (reward = 0.5). This creates natural pressure to explore widely.
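
The numbers above can be checked with a standalone sketch of the same formula. It assumes the visit count is incremented after the bonus is computed, which is what makes the first visit worth 1.0; the `visit` helper is illustrative and not part of the repository.

```python
from collections import defaultdict
from math import sqrt

state_counts = defaultdict(int)
bonus_scale = 1.0


def visit(state):
    """Return the novelty bonus for this visit, then record the visit."""
    bonus = bonus_scale / sqrt(state_counts[state] + 1)
    state_counts[state] += 1  # assumption: count is updated after the bonus is computed
    return bonus


bonuses = [visit((2, 3)) for _ in range(4)]  # four visits to the same grid cell
```

The four bonuses fall from 1.0 on the first visit to 0.5 on the fourth, with 1/√2 and 1/√3 in between.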

OptimisticScout: Optimism Under Uncertainty

A different philosophy: assume unknown states are valuable until proven otherwise.

class OptimisticScout(Scout):
    def __init__(self, optimism: float = 10.0):
        self.optimism = optimism

    def initial_q_value(self):
        return self.optimism  # Instead of 0

How it works:

Initialize all Q-values to a high value (e.g., 10.0)
The agent is “optimistic” about unvisited state-action pairs
As it explores and receives actual rewards, Q-values decay toward reality

The intuition: If you’ve never tried something, assume it might be great. This drives exploration without explicit novelty bonuses.
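
A small sketch of why this drives exploration, using a standard tabular Q-learning update. The learning rate, discount, and action names are illustrative assumptions; only the optimistic value of 10.0 comes from the post.

```python
from collections import defaultdict

OPTIMISM = 10.0
ALPHA, GAMMA = 0.5, 0.9   # illustrative learning rate and discount
ACTIONS = ["up", "down", "left", "right"]

# Every unseen (state, action) pair starts at the optimistic value.
q = defaultdict(lambda: OPTIMISM)


def q_update(state, action, reward, next_state):
    """Standard tabular Q-learning update."""
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (target - q[(state, action)])


def greedy(state):
    return max(ACTIONS, key=lambda a: q[(state, a)])


# Trying "up" from the start cell yields no reward, so its value drops below 10...
q_update((0, 0), "up", 0.0, (0, 1))
# ...and the greedy choice moves to an action that is still untried.
next_choice = greedy((0, 0))
```

Even a purely greedy agent ends up cycling through untried actions, because each disappointing result makes the remaining optimistic entries look better by comparison.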

Strategy Comparison

Strategy	Mechanism	Best For
Random	Uniform random actions	Baseline, maximum coverage
Epsilon-Greedy	Random with probability ε, greedy otherwise	Balancing exploit/explore
CuriousScout	Novelty bonus for unvisited states	Systematic coverage
OptimisticScout	High initial Q-values	Early exploration pressure

The Diversity Experiment

Does mixing strategies help, or is it enough to have multiple scouts with the same good strategy?

Setup

7x7 sparse grid, 100 training episodes
All configurations use exactly 5 scouts (fair comparison)
5 random seeds per configuration to reduce run-to-run variance

Configurations

Homogeneous Random: 5 identical random scouts
Homogeneous Epsilon: 5 identical epsilon-greedy scouts (ε=0.2)
Diverse Mix: Random + 2 epsilon-greedy (ε=0.1, 0.3) + CuriousScout + OptimisticScout

Results

Configuration	Success Rate
Random baseline	7%
Homogeneous random	20%
Homogeneous epsilon	40%
Diverse mix	40%

Analysis

Finding: Strategy quality matters more than diversity in simple environments.

Epsilon-greedy (homogeneous or mixed) outperforms pure random
Diverse mix performs the same as homogeneous epsilon-greedy
Five epsilon-greedy scouts match the diverse mix and double the success rate of five random scouts (40% vs 20%)

Why doesn’t diversity help here?

In a simple 7x7 grid, the exploration problem is primarily about coverage, not strategy complementarity. Five epsilon-greedy scouts with different random seeds already explore different regions due to stochastic action selection.

Diversity likely provides more benefit in:

Complex environments with multiple local optima
Tasks requiring different behavioral modes
Environments with deceptive reward structures

Web Visualization

The web visualization demonstrates Many-Eyes Learning with real-time parallel scout movement. (The upcoming video walks through this demo—the post focuses on the underlying mechanism.)

Many-Eyes Web Visualization

How It Works

The web version uses Q-learning with a shared Q-table (simpler than DQN for clarity). All scouts contribute to the same Q-table—the core “many eyes” concept: more explorers = faster Q-value convergence.

Scout	Role	Epsilon	Behavior
Random	Baseline	1.0 (constant)	Always random, never follows policy
Scouts 1-N	Learning Agents	0.5-0.8 → 0.01	Epsilon-greedy with decay
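
The shared-table idea can be sketched on a toy problem. Everything here is an assumption made for illustration: a 6-cell corridor instead of the grid, a 0.95 decay factor, and the specific learning rate and discount. What it keeps from the demo is one always-random scout, learning scouts whose epsilon decays to 0.01, and a single Q-table that all of them write to.

```python
import random
from collections import defaultdict

GOAL, ACTIONS = 5, (-1, +1)      # toy corridor: cells 0..5, goal at the right end
ALPHA, GAMMA = 0.5, 0.9
rng = random.Random(0)
shared_q = defaultdict(float)     # one table, written to by every scout


def run_episode(epsilon, max_steps=50):
    state = 0
    for _ in range(max_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:  # greedy, breaking ties randomly
            best = max(shared_q[(state, a)] for a in ACTIONS)
            action = rng.choice([a for a in ACTIONS if shared_q[(state, a)] == best])
        next_state = min(max(state + action, 0), GOAL)
        reward = 1.0 if next_state == GOAL else 0.0
        best_next = 0.0 if next_state == GOAL else max(shared_q[(next_state, a)] for a in ACTIONS)
        shared_q[(state, action)] += ALPHA * (reward + GAMMA * best_next - shared_q[(state, action)])
        state = next_state
        if state == GOAL:
            break


learner_eps = [0.8, 0.5]                        # starting epsilons for the learning scouts
for episode in range(200):
    run_episode(1.0)                            # the always-random baseline scout
    for i, eps in enumerate(learner_eps):
        run_episode(eps)
        learner_eps[i] = max(0.01, eps * 0.95)  # decay toward 0.01

policy = {s: max(ACTIONS, key=lambda a: shared_q[(s, a)]) for s in range(GOAL)}
```

After training, the greedy policy read from the shared table moves right from every cell, even though no single scout did all the exploring.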

Exploration Modes

The UI provides a dropdown to select different exploration strategies:

Mode	Heatmap Diversity	Learning Performance
Shared Policy	Low (identical paths)	Best (lowest avg steps)
Diverse Paths	High (distinct paths)	Worse (biases override optimal)
High Exploration	High	Worst (never fully exploits)
Boltzmann	Moderate	Moderate

The Diversity vs Performance Trade-off

There’s a fundamental trade-off between visual diversity and learning performance:

Shared Policy wins on performance: The “many eyes” benefit comes from diverse exploration during learning (finding the goal faster). But once Q-values converge, all scouts should follow the same optimal policy.
Diverse Paths sacrifices performance for visuals: Scout-specific directional biases (Scout 1 prefers right, Scout 2 prefers down) create visually interesting heatmaps but suboptimal behavior.
High Exploration never converges: A fixed 50% rate of random actions means scouts never fully exploit the learned policy.

Key insight: For best learning, use Shared Policy. Use other modes to visualize how different exploration strategies affect the learning process, but expect higher average steps.

Learning Phases

Phase	Episodes	Avg Steps	Behavior
Random	1-5	~70	All scouts exploring randomly
Early Learning	5-15	40-60	Policy starts forming
Convergence	15-30	15-25	Clear optimal path emerges
Stable	30+	12-18	Near-optimal with random scout noise

Why “Average Steps to Goal”?

Success rate is coarse-grained—with 5 scouts, only 6 values are possible (0%, 20%, 40%, 60%, 80%, 100%). After ~10 episodes, all scouts typically reach the goal. Average steps shows continued policy refinement, dropping from ~70 (random) to ~8 (optimal).

Running the Visualization

./scripts/serve.sh   # Open http://localhost:3200

Yew/WASM frontend with FastAPI backend
Speed control from 1x to 100x
Replay mode to step through recorded training

What’s Next

Potential future directions:

Direction	Why It Matters
Larger environments	Test scaling to 15x15, 25x25 grids
Scout communication	Real-time sharing vs passive pooling
Adaptive intrinsic rewards	Learn the reward function (closer to full IRPO)
Multi-goal environments	Multiple sparse rewards to discover

Key Takeaways

Intrinsic rewards drive exploration. CuriousScout and OptimisticScout implement different philosophies: novelty bonuses vs optimistic initialization.
Strategy quality > diversity in simple environments. Five epsilon-greedy scouts matched the diverse mix (40% each) and doubled five random scouts (20%).
Diversity during learning, convergence after. The “many eyes” benefit comes from diverse exploration during learning. Once Q-values converge, all scouts should follow the same optimal policy.
Shared Q-table enables collective learning. All scouts contribute to one Q-table—more explorers means faster convergence.
Visual diversity costs performance. Modes like “Diverse Paths” create interesting heatmaps but suboptimal behavior. Use “Shared Policy” for best learning results.

References

Concept	Paper
IRPO	Intrinsic Reward Policy Optimization (Cho & Tran 2026)
Reagent	Reasoning Reward Models for Agents (Fan et al. 2026)
ICM	Curiosity-driven Exploration (Pathak et al. 2017)

Diverse exploration, convergent execution. Many eyes see more, but the best path is the one they all agree on.

February 24, 2026 • Software Wrighter

592 words • 3 min read • Abstract

When people say 'AI learned something,' they usually mean one of four very different things. Understanding these time scales, from milliseconds to years, is essential for building AI systems that improve safely over time.

How AI Learns Part 1: The Many Meanings of Learning

When people say, “AI learned something,” they usually mean one of at least four very different things.

Large Language Models (LLMs) do not learn in one single way. They learn at different time scales, in different locations, and with very different consequences. To understand modern AI systems—especially agents—we need to separate these layers.

Resource	Link
Related	ICL Revisited | RLM | Engram

Four Time Scales of Learning

1. Pretraining (Years)

This is the foundation.

The model trains on massive datasets using gradient descent. The result is a set of weights—billions of parameters—encoding statistical structure of language and knowledge.

This learning:

Is slow and expensive
Persists across restarts
Cannot easily be reversed
Is vulnerable to interference if modified later

Think of this as long-term biological memory.

2. Fine-Tuning (Days to Weeks)

Fine-tuning modifies the weights further, but with narrower data.

This includes:

Instruction tuning (following directions)
Alignment methods: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO)
Domain adaptation
Parameter-efficient methods like Low-Rank Adaptation (LoRA)

This is still weight-based learning.

It persists across restarts. It risks catastrophic forgetting. It modifies the brain itself.

3. Memory-Based Learning (Seconds to Minutes)

This is where many modern systems shift.

Instead of changing weights, they store information externally:

RAG (Retrieval-Augmented Generation)
CAG (Cache-Augmented Generation)
Vector databases
Engram-style memory modules

The model retrieves relevant memory per query.

The brain stays stable. The notebook grows.

This learning:

Persists across restarts
Survives model upgrades
Does not cause forgetting
Is fast

4. In-Context Learning (Milliseconds)

This is temporary reasoning scaffolding.

Information exists only in the prompt window.

It:

Does not update weights
Does not persist across sessions
Is powerful but fragile
Suffers from context rot

This is working memory.

Why This Matters

Most discussions collapse all of this into “the model learned.”

But:

Updating weights risks forgetting
Updating memory does not
Updating prompts does not persist
Updating adapters can be modular and reversible

Continuous learning systems must coordinate all four.

Persistence Comparison

Mechanism	Persists Across Chat?	Persists Across Restart?	Persists Across Model Change?
Pretraining	Yes	Yes	No
Fine-tune	Yes	Yes	No
LoRA	Yes	Yes	Rarely (adapters are tied to their base model)
Distillation	Yes	Yes	No
ICL	No	No	No
RAG	Yes	Yes	Yes
Engram	Yes	Yes	Yes
CAG	Yes	Yes	Yes

The last column matters most for agents: anything marked Yes keeps working when the underlying model is swapped out.

References

Concept	Paper
LoRA	LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
RAG	Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020)
ICL	What Can Transformers Learn In-Context? (Garg et al. 2022)
Engram	Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025)
DPO	Direct Preference Optimization (Rafailov et al. 2023)

Coming Next

In Part 2, we’ll examine the two fundamental failure modes that arise from confusing these layers: catastrophic forgetting and context rot.

Learning happens in layers of permanence.

February 24, 2026 • Software Wrighter

1173 words • 6 min read • Abstract

Personal Software continues. music-pipe-rs takes the Unix philosophy to MIDI composition: small tools connected by pipes. Start with a seed, generate motifs, transform, visualize, convert to MIDI. Deterministic output from a single seed at the pipeline head.

music-pipe-rs: Unix Pipelines for MIDI Composition

After building midi-cli-rs for quick mood-based generation, I wanted something more surgical. What if music generation worked like Unix commands—small tools connected by pipes?

Resource	Link
Code	music-pipe-rs
Related	midi-cli-rs
Next	Web Demo and Multi-Instrument

The Unix Philosophy for Music

Most generative music tools are monolithic. You get one application with a closed workflow. If you want to inspect intermediate results, you can’t. If you want to swap one transformation for another, you rebuild everything.

Unix solved this decades ago: small tools that do one thing well, connected by pipes. Each tool reads from stdin, writes to stdout. You can inspect any point in the pipeline with head, filter with grep, transform with jq.

music-pipe-rs applies this philosophy to MIDI composition.

A Pipeline in Action

seed 12345 | motif --notes 16 --bpm 120 | humanize | to-midi --out melody.mid

Four stages:

seed establishes the random seed for the entire pipeline
motif generates a melodic pattern (using the pipeline seed)
humanize adds timing and velocity variation (using the same seed)
to-midi converts the event stream to a standard .mid file

The output plays in any DAW.

Seed-First Architecture

The seed stage goes at the head of the pipeline:

# Explicit seed for reproducibility
seed 12345 | motif --notes 16 | humanize | to-midi --out melody.mid

# Auto-generated seed (printed to stderr)
seed | motif --notes 16 | humanize | to-midi --out melody.mid
# stderr: seed: 1708732845

All downstream stages read the seed from the event stream. No --seed arguments scattered across the pipeline. One seed, set once, used everywhere.

This means:

Same seed = identical output across all random stages
Different seed = different composition with same structure
Reproducibility is trivial: just save the seed number

JSONL: The Intermediate Format

Between stages, events flow as JSONL (JSON Lines). Each line is a complete event:

{"type":"Seed","seed":12345}
{"type":"NoteOn","t":0,"ch":0,"key":60,"vel":80}
{"type":"NoteOff","t":480,"ch":0,"key":60}

This format is human-readable and tool-friendly:

# See the first 10 events
seed 42 | motif --notes 8 | head -10

# Count how many NoteOn events
seed 42 | motif --notes 16 | grep NoteOn | wc -l

# Pretty-print with jq
seed 42 | motif --notes 4 | jq .

No binary formats to decode. No proprietary protocols. Just text.
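
Because every event is one line of JSON and the seed travels in the stream, a downstream stage is easy to prototype in any language. The sketch below is a hypothetical Python stand-in for a stage like humanize, not the Rust implementation: it only jitters velocity, and the `max_jitter` parameter is an assumption.

```python
import io
import json
import random


def humanize(lines, max_jitter=10):
    """Hypothetical stage: read JSONL events, seed the RNG from the Seed event,
    and jitter NoteOn velocities. Every event is passed downstream."""
    rng = random.Random()  # replaced below once the Seed event arrives
    for line in lines:
        event = json.loads(line)
        if event["type"] == "Seed":
            rng = random.Random(event["seed"])
        elif event["type"] == "NoteOn":
            jitter = rng.randint(-max_jitter, max_jitter)
            event["vel"] = min(127, max(1, event["vel"] + jitter))
        yield json.dumps(event, separators=(",", ":"))


stream = (
    '{"type":"Seed","seed":12345}\n'
    '{"type":"NoteOn","t":0,"ch":0,"key":60,"vel":80}\n'
    '{"type":"NoteOff","t":480,"ch":0,"key":60}\n'
)
first_run = list(humanize(io.StringIO(stream)))
second_run = list(humanize(io.StringIO(stream)))
```

In a real pipeline the lines would come from `sys.stdin` and be printed to stdout. Running the stage twice on the same stream gives identical output, which is the reproducibility property the seed-first design is after.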

Visualization with viz

The viz stage prints a sparkline to stderr while passing events through:

seed 12345 | motif --notes 16 | viz | humanize | to-midi --out melody.mid

Output on stderr:

▃▅▇▅▃▁▂▄▆▇▆▄▂▁▃▅

For more detail, use piano roll mode:

seed 12345 | motif --notes 16 | viz --roll

 G6 │···█············│
F#6 │·····█··········│
 F6 │····█···········│
 G5 │·██·········█···│
 F5 │···········█····│
 E5 │·········██···█·│
 C5 │█·····███····█·█│

The visualization goes to stderr; the JSONL events continue to stdout. You can inspect the music without breaking the pipeline.
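
The sparkline mode is simple to approximate. This sketch assumes one bar per note, scaled linearly between the lowest and highest pitch in the melody; the actual viz stage may scale differently.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(pitches):
    """Map each pitch to one of eight bar heights, scaled to the melody's range."""
    low, high = min(pitches), max(pitches)
    span = max(high - low, 1)  # avoid dividing by zero for a single repeated note
    return "".join(BARS[(p - low) * (len(BARS) - 1) // span] for p in pitches)


line = sparkline([60, 64, 67, 72, 67, 64, 60])  # "▁▃▅█▅▃▁"
```

A C major arpeggio up and back down renders as a symmetric peak, so the contour of a motif is visible at a glance.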

Available Stages

Stage	Type	Description
`seed`	Start	Establish random seed for pipeline
`motif`	Generate	Create melodic patterns
`euclid`	Generate	Euclidean rhythm generation
`transpose`	Transform	Shift notes by semitones
`scale`	Transform	Constrain notes to a scale
`humanize`	Transform	Add timing/velocity variation
`viz`	Inspect	Print sparkline visualization
`to-midi`	Output	Convert to .mid file

Each stage is a separate binary. Mix and match as needed.

Euclidean Rhythms

The euclid stage generates Euclidean rhythms—mathematically optimal distributions of hits across steps:

# 3 hits distributed across 8 steps (Cuban tresillo)
seed | euclid --pulses 3 --steps 8 --note 36 | to-midi --out kick.mid

# 4-on-the-floor kick pattern
seed | euclid --pulses 4 --steps 16 --note 36 | to-midi --out four-floor.mid

These patterns appear in music worldwide because they “feel right”—the spacing is as even as possible.
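
One compact way to compute such a pattern is with modular arithmetic. This is a sketch of the general idea, not necessarily what the euclid binary does; for some inputs it yields a rotation of the pattern produced by Bjorklund's algorithm.

```python
def euclid(pulses: int, steps: int) -> list:
    """Spread `pulses` hits as evenly as possible over `steps` slots."""
    if steps <= 0 or not 0 <= pulses <= steps:
        raise ValueError("need steps > 0 and 0 <= pulses <= steps")
    return [1 if (i * pulses) % steps < pulses else 0 for i in range(steps)]


tresillo = euclid(3, 8)      # [1, 0, 0, 1, 0, 0, 1, 0]
four_floor = euclid(4, 16)   # a hit on every fourth step
```

The tresillo comes out as hit, rest, rest, hit, rest, rest, hit, rest: gaps of 3, 3, and 2 steps, which is as even as three hits in eight slots can be.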

Scale Locking

The scale stage constrains notes to a musical scale:

seed 42 | motif --notes 16 | scale --root C --mode minor | to-midi --out c-minor.mid

No wrong notes. Every pitch fits the harmonic context.
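
Scale locking amounts to snapping each MIDI note number onto an allowed pitch class. A minimal sketch, assuming out-of-scale notes are moved downward (the real stage might round to the nearest scale tone instead) and that "pentatonic" means major pentatonic:

```python
SCALES = {
    "minor": [0, 2, 3, 5, 7, 8, 10],   # natural minor, in semitones above the root
    "pentatonic": [0, 2, 4, 7, 9],     # assumed to mean major pentatonic
}
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def snap_to_scale(key: int, root: str, mode: str) -> int:
    """Lower a MIDI note number until it lands on a pitch in the scale."""
    root_pc = NOTE_NAMES.index(root)
    allowed = {(root_pc + step) % 12 for step in SCALES[mode]}
    while key % 12 not in allowed:
        key -= 1  # assumption: resolve downward
    return key


locked = [snap_to_scale(k, "C", "minor") for k in [60, 61, 64, 66, 69]]
# [60, 60, 63, 65, 68]: C, C, Eb, F, Ab
```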

Layering Streams

Generate drum and melody separately, then combine:

{
    seed 100 | euclid --pulses 4 --steps 16 --note 36 --ch 9;
    seed 100 | motif --notes 16 | scale --root C --mode pentatonic;
} | to-midi --out layered.mid

Channel 9 (zero-indexed; channel 10 in MIDI’s usual 1-based numbering) is the General MIDI drum channel. Using the same seed keeps the parts coherent. Both streams merge into a single MIDI file.

Why Not Just Use midi-cli-rs?

Different tools for different needs:

Tool	Strength	Use Case
midi-cli-rs	Quick mood presets	“Give me 5 seconds of jazz”
music-pipe-rs	Compositional control	“Generate a motif, constrain to scale, add swing”

midi-cli-rs is high-level: pick a mood, get music. music-pipe-rs is low-level: build compositions from primitive operations.

Both are useful. Both work with AI coding agents.

The Personal Software Pattern

This continues the theme: build small tools that compose well. Don’t try to solve everything in one application. Let Unix handle orchestration.

The best part? Standard tools still work. head, grep, jq, wc—all participate in the pipeline. No special music knowledge required to inspect the data.

Series: Personal Software (Part 4)

Previous: midi-cli-rs: Custom Mood Packs

Next: music-pipe-rs: Web Demo

Disclaimer

You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

February 24, 2026 • Software Wrighter

447 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Prompt Injection (malicious instructions overriding AI behavior), Jailbreaks (bypassing safety constraints), GRU (gated recurrent units for sequences), Planning vs Prediction (action evaluation vs forecasting), Production Rollbacks (reverting to stable model versions).

Five ML Concepts - #21

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #21

References

Concept	Reference
Prompt Injection	Prompt Injection attack against LLM-integrated Applications (Liu et al. 2023)
Jailbreaks	Jailbroken: How Does LLM Safety Training Fail? (Wei et al. 2023)
GRU	Empirical Evaluation of Gated Recurrent Neural Networks (Chung et al. 2014)
Planning vs Prediction	Between accurate prediction and poor decision making (Zaffalon et al. 2023)
Production Rollbacks	MLOps best practice (no canonical paper)

Today’s Five

1. Prompt Injection

Malicious instructions embedded in user input that override intended system behavior. An attacker crafts text that tricks an AI into ignoring its original instructions.

This is a major security concern for LLM-integrated applications.

Like slipping a forged instruction into a trusted document.

2. Jailbreaks

Techniques that attempt to bypass safety constraints in AI systems. These attacks exploit gaps between a model’s capabilities and its safety training.

Safety training can fail due to competing objectives or mismatched generalization.

Like convincing a guard to bend the rules.

3. GRU (Gated Recurrent Unit)

A recurrent neural network unit with gates that control memory flow. GRUs decide what information to keep and what to discard at each time step.

Simpler than an LSTM (two gates, update and reset, and no separate cell state) but designed for similar sequence modeling tasks.

Like a notepad where you decide what to keep and what to erase.

4. Planning vs Prediction

Prediction forecasts likely outcomes. Planning evaluates actions across possible futures. Accurate predictions don’t guarantee good decisions—you also need to model how actions affect outcomes.

This is a key gap in many AI/ML systems.

Like knowing it will rain versus deciding whether to bring an umbrella.
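
The umbrella example can be worked through as a tiny expected-utility calculation. The utility numbers are made up for illustration; the point is that the decision needs them in addition to the forecast.

```python
P_RAIN = 0.7  # the prediction

# The plan needs more than the forecast: how each action plays out in each future.
UTILITY = {
    ("umbrella", "rain"): -1,       # mild hassle of carrying it
    ("umbrella", "dry"): -1,
    ("no umbrella", "rain"): -10,   # soaked
    ("no umbrella", "dry"): 0,
}


def expected_utility(action: str, p_rain: float) -> float:
    return p_rain * UTILITY[(action, "rain")] + (1 - p_rain) * UTILITY[(action, "dry")]


def plan(p_rain: float) -> str:
    return max(("umbrella", "no umbrella"), key=lambda a: expected_utility(a, p_rain))


decision = plan(P_RAIN)  # "umbrella": expected utility -1 versus -7
```

With the same utilities, a 5% chance of rain flips the decision to "no umbrella". A perfect forecast with the wrong utility table would still produce bad plans.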

5. Production Rollbacks

Reverting to a previous stable model version after deployment issues. When a new model causes problems in production, rolling back quickly minimizes impact.

Essential MLOps practice for maintaining system reliability.

Like reloading a saved game state when something breaks.

Quick Reference

Concept	One-liner
Prompt Injection	Malicious instructions overriding AI behavior
Jailbreaks	Bypassing safety constraints
GRU	Gated memory for sequence modeling
Planning vs Prediction	Action evaluation vs forecasting
Production Rollbacks	Reverting to stable model versions

Short, accurate ML explainers. Follow for more.

February 23, 2026 • Software Wrighter

1300 words • 7 min read • Abstract

Personal Software grows. midi-cli-rs now supports custom mood packs: TOML files that extend built-in moods with your own musical variations. No Rust required. Define tempo, key, intensity, and let the generators handle the rest.

midi-cli-rs: Extending with Custom Mood Packs

Personal Software doesn’t stop at “it works.” It evolves. After building midi-cli-rs for AI agents to generate music, I wanted more moods—without recompiling Rust every time.

The solution: a plugin system that lets anyone create custom mood packs using simple TOML files.

Resource	Link
Examples	Listen to Samples
Wiki	Plugin Documentation
Video	Plugin Moods Explainer
Code	midi-cli-rs

The Problem with Built-in Only

The original midi-cli-rs shipped with a handful of mood presets: suspense, eerie, upbeat, calm, ambient, jazz. Useful, but limited. What if you want synthwave? Chillout? Something faster or in a different key?

Hardcoding every possible mood isn’t practical. And asking users to modify Rust source code isn’t friendly.

Three Levels of Extensibility

Available	Level	What You Get	What You Change	Skill Required
✓	Built-in Moods	9 curated generators	Nothing—use as-is	None
✓	Plugin Moods	Parameter variations	TOML config files	Text editing
✗	Custom Generators	New musical patterns	Rust source code	Programming (future)

This post covers Plugin Moods—the middle tier. You can preset combinations of tempo, key, and intensity, but you’re still using the built-in generators’ musical logic. Want a “smooth-jazz” preset (slower, mellower)? Plugin mood. Want bebop or Latin jazz with different chord progressions? That requires a custom generator.

Custom generators (writing new Rust code) will be covered in a future post when the plugin editor ships.

The Plugin Architecture

Custom moods live in ~/.midi-cli-rs/moods/ as TOML files. Each file is a “mood pack” that can define multiple moods. The CLI discovers them automatically.

Here’s how it works:

~/.midi-cli-rs/
└── moods/
    ├── electronic.toml    # Your synthwave, techno, etc.
    ├── cinematic.toml     # Epic, tension, wonder
    └── seasonal.toml      # Holiday themes

Creating a Mood Pack

A plugin mood pack has two parts: pack metadata and mood definitions.

[pack]
name = "electronic"
version = "1.0.0"
author = "Your Name"
description = "Electronic music styles"

[[moods]]
name = "synthwave"
base_mood = "upbeat"
default_tempo = 118
default_key = "Am"
default_intensity = 65
description = "80s synthwave vibes"
tags = ["electronic", "retro"]

[[moods]]
name = "chillout"
base_mood = "ambient"
default_tempo = 85
default_key = "Em"
default_intensity = 40
description = "Relaxed electronic"

Each mood delegates to a built-in generator (base_mood) but overrides specific parameters. You get the musical logic of the built-in mood with your customizations applied.
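
The delegation model can be sketched as a dictionary merge. This is an illustration in Python, not the tool's Rust code, and the built-in default values are hypothetical since the post does not list them.

```python
# Hypothetical built-in defaults; the real generators' values are not given in the post.
BUILT_IN = {
    "upbeat": {"tempo": 120, "key": "C", "intensity": 70},
    "ambient": {"tempo": 70, "key": "Am", "intensity": 30},
}


def resolve_mood(mood: dict) -> dict:
    """Start from the base mood's defaults, then apply any default_* overrides."""
    base = mood["base_mood"]
    if base not in BUILT_IN:
        raise KeyError(f"unknown base_mood: {base}")
    settings = dict(BUILT_IN[base], generator=base)
    for field in ("tempo", "key", "intensity"):
        override = mood.get(f"default_{field}")
        if override is not None:
            settings[field] = override
    return settings


synthwave = resolve_mood(
    {"name": "synthwave", "base_mood": "upbeat", "default_tempo": 118, "default_key": "Am"}
)
# {'tempo': 118, 'key': 'Am', 'intensity': 70, 'generator': 'upbeat'}
```

Fields the mood leaves out fall through to the base generator's defaults, which is why a mood definition can be as short as a name and a base_mood.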

Available Base Moods

Your custom moods can extend any of the nine built-in generators:

Base Mood	Character
`suspense`	Tense, building
`eerie`	Dark, unsettling
`upbeat`	Energetic, positive
`calm`	Peaceful, slow
`ambient`	Atmospheric, textural
`jazz`	Swing, improvisation
`chiptune`	8-bit, retro gaming
`orchestral`	Classical instruments
`show`	Broadway, theatrical

Configuration Options

Each mood definition supports these overrides:

Field	Description	Example
`name`	CLI name (required)	`"synthwave"`
`base_mood`	Built-in to extend (required)	`"upbeat"`
`default_tempo`	BPM	`118`
`default_key`	Musical key	`"Am"`, `"C"`, `"Eb"`
`default_intensity`	0-100 energy level	`65`
`description`	Human-readable description	`"80s vibes"`
`tags`	Discovery tags	`["electronic", "retro"]`

How Seeds Create Variation

Seeds aren’t random—they’re deterministic variation selectors. The same mood + same seed always produces identical output. But different seeds create observable musical differences across multiple dimensions:

Parameter	Variation Range
Tempo	±15% from base
Layer inclusion	Which instruments appear
Melodic contour	16 different phrase shapes
Note density	0.6x to 1.4x
Rest probability	0% to 35% silence
Phrase length	3-8 notes
Velocity	-15 to +15 offset

The system uses hash-based mixing with unique salts for each parameter. This means adjacent seeds (42 vs 43) produce completely different outputs—no gradual transitions between seeds.

When you combine plugin moods with seed variation, you get a matrix: your custom tempo/key/intensity settings applied across different seed-driven variations of the underlying generator’s patterns.
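
Hash-based mixing with per-parameter salts might look like the following. SHA-256 and the salt strings are assumptions chosen for the sketch, since the post does not say which hash midi-cli-rs uses; the ranges are taken from the table above.

```python
import hashlib


def mix(seed: int, salt: str) -> float:
    """Hash (seed, salt) and map the result to a float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{salt}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def variation(seed: int) -> dict:
    """Derive independent per-parameter variations from one seed."""
    return {
        "tempo_scale": 0.85 + 0.30 * mix(seed, "tempo"),     # within 15% of base
        "note_density": 0.6 + 0.8 * mix(seed, "density"),    # 0.6x to 1.4x
        "rest_probability": 0.35 * mix(seed, "rest"),        # 0% to 35%
        "contour": int(mix(seed, "contour") * 16),           # one of 16 shapes
    }


a, b = variation(42), variation(43)
```

Because each parameter hashes the seed with a different salt, the parameters vary independently of one another, and neighbouring seeds such as 42 and 43 share nothing in common.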

Using Custom Moods

Once your TOML file is in place, the mood appears automatically:

# List all moods (built-in + plugins)
midi-cli-rs moods

# Generate with your custom mood
midi-cli-rs preset -m synthwave -d 5 -s 42 -o output.wav

The seed system still works—same mood + same seed = identical output.

Example: Electronic Pack

Here’s a complete pack with four electronic moods:

[pack]
name = "electronic"
version = "1.0.0"
description = "Electronic music styles"

[[moods]]
name = "synthwave"
base_mood = "upbeat"
default_tempo = 118
default_key = "Am"
default_intensity = 65

[[moods]]
name = "chillout"
base_mood = "ambient"
default_tempo = 85
default_key = "Em"
default_intensity = 40

[[moods]]
name = "techno"
base_mood = "upbeat"
default_tempo = 130
default_key = "Dm"
default_intensity = 85

[[moods]]
name = "8bit"
base_mood = "chiptune"
default_tempo = 140
default_key = "C"
default_intensity = 70

Drop this in ~/.midi-cli-rs/moods/electronic.toml and you have four new moods.

What’s Next

This plugin system handles mood variations—different tempos, keys, and intensities applied to existing generators. A future update will add a plugin editor for creating entirely new musical patterns without writing Rust.

For now, the delegation model covers most use cases: want faster jazz? Darker ambient? Major-key suspense? Create a TOML file and you’re done.

The Personal Software Pattern

This follows the Personal Software philosophy: start with something that works, then extend it as needs emerge. The plugin system wasn’t in the original design. It grew from actual use—wanting more moods without recompiling.

Good personal software leaves room to grow.

Series: Personal Software (Part 3)

Previous: midi-cli-rs: Music for AI Agents

Next: music-pipe-rs: Unix Pipelines

February 23, 2026 • Software Wrighter

456 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: VAEs (generative with structured latents), Uncertainty Estimation (know when you don't know), Interpretability (distributed representations resist explanation), Gradient Noise (mini-batch variation), Human-in-the-Loop (human oversight for critical decisions).

Five ML Concepts - #20

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #20

References

Concept	Reference
VAEs	Auto-Encoding Variational Bayes (Kingma & Welling 2013)
Uncertainty Estimation	What Uncertainties Do We Need in Bayesian Deep Learning? (Kendall & Gal 2017)
Interpretability	Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez & Kim 2017)
Gradient Noise	Stochastic Gradient Descent as Approximate Bayesian Inference (Mandt et al. 2017)
Human-in-the-Loop	Human-in-the-Loop Machine Learning (Monarch 2021)

Today’s Five

1. Variational Autoencoders (VAEs)

VAEs are probabilistic autoencoders that learn a structured latent distribution. By sampling from that distribution, they can generate new examples similar to the training data.

The key innovation is regularizing the latent space to be smooth and continuous.

Like learning not just to summarize books, but to create new ones in a similar style.
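
The latent-space regularization is usually a KL-divergence term that pulls each encoded Gaussian toward a standard normal prior. Here is a minimal sketch in plain Python, assuming a diagonal Gaussian encoder output. The function names and the example numbers are illustrative, not taken from any library or from the paper's code.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), closed form, summed over dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    In a real framework this form lets gradients flow through mu and sigma.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # hypothetical encoder output for one input
z = reparameterize(mu, log_var, rng)     # a latent sample a decoder would consume
kl_penalty = kl_to_standard_normal(mu, log_var)
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

The penalty is zero only when the encoder output matches the prior exactly, which is what keeps nearby latent points decodable into similar outputs.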

2. Uncertainty Estimation

Models can estimate how confident they should be in predictions. Some uncertainty comes from noisy data (aleatoric), and some from limited knowledge (epistemic).

Knowing when a model is uncertain enables safer decision-making.

Like a weather forecast giving seventy percent chance of rain instead of a simple yes or no.
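
One common way to separate the two kinds of uncertainty, sketched here under the assumption that you have an ensemble of classifiers, is to compare the entropy of the averaged prediction with the average entropy of each member. This is an illustrative decomposition, not necessarily the method used in the referenced paper, and the probabilities below are made up.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(member_probs):
    """Return (total, aleatoric, epistemic) uncertainty for one input.

    total     = entropy of the ensemble-averaged prediction
    aleatoric = average entropy of individual members (noise they all agree on)
    epistemic = total - aleatoric (disagreement between members)
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[j] for p in member_probs) / n for j in range(k)]
    total = entropy(mean_p)
    aleatoric = sum(entropy(p) for p in member_probs) / n
    return total, aleatoric, total - aleatoric

# All members agree the input is a coin flip: noisy data, little model doubt.
agree_noisy = decompose([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Members confidently contradict each other: limited knowledge.
disagree = decompose([[0.99, 0.01], [0.01, 0.99], [0.5, 0.5]])
```

Both inputs have the same total uncertainty, but only the second has a large epistemic share, which is the part more data could reduce.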

3. Why Interpretability Is Hard

Neural networks represent information across many interacting parameters. No single component cleanly maps to a human concept.

Distributed representations enable powerful learning but resist simple explanations.

Like trying to explain a dream by pointing to individual neurons.

4. Gradient Noise

When training with mini-batches, gradients vary from step to step. A little noise can help exploration, but too much can slow convergence.

Batch size, learning rate, and gradient clipping all influence this noise level.

Like getting slightly different directions each time you ask for help.
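
The effect of batch size on gradient noise can be seen in a small simulation. This sketch assumes a one-dimensional gradient and synthetic per-example gradients drawn from a Gaussian. The numbers are arbitrary and only meant to show the trend.

```python
import random
import statistics

def batch_gradient_std(per_example_grads, batch_size, draws, rng):
    """Spread of the mini-batch gradient estimate across many random batches."""
    estimates = [
        statistics.fmean(rng.sample(per_example_grads, batch_size))
        for _ in range(draws)
    ]
    return statistics.pstdev(estimates)

rng = random.Random(0)
# Hypothetical per-example gradients: true mean 1.0, lots of individual variation.
per_example_grads = [rng.gauss(1.0, 2.0) for _ in range(2000)]

noise_small = batch_gradient_std(per_example_grads, batch_size=8, draws=500, rng=rng)
noise_large = batch_gradient_std(per_example_grads, batch_size=128, draws=500, rng=rng)
```

Larger batches average over more examples, so the estimate varies less from step to step, roughly in proportion to one over the square root of the batch size.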

5. Human-in-the-Loop Systems

Humans review, supervise, or override model decisions in critical workflows. This improves safety and accountability in high-stakes applications.

The approach combines model efficiency with human judgment where it matters most.

Like a pilot monitoring autopilot and stepping in when necessary.

Quick Reference

Concept	One-liner
VAEs	Generative models with structured latent spaces
Uncertainty Estimation	Know when you don’t know
Interpretability	Distributed representations resist explanation
Gradient Noise	Mini-batch variation in training
Human-in-the-Loop	Human oversight for critical decisions

Short, accurate ML explainers. Follow for more.

February 22, 2026 • Software Wrighter

643 words • 4 min read • Abstract

ICL evolved from emergent surprise (2020) to mechanistic understanding (2022) to engineered capability (2026). Transformers implement implicit gradient descent during inference: they learn without weight updates. The frontier: models learning from their own feedback. Not magic. Meta-learning in plain sight.

In-Context Learning Revisited: From Mystery to Engineering

It was 2020 when GPT-3 shocked everyone. It could learn from examples in the query—without updating its weights. We called it In-Context Learning. But was it magic, or was it doing something deeper?

Resource	Link
Video	ICL Revisited
Papers	4 References

Phase 1: The Empirical Discovery (2020)

The GPT-3 paper showed that large models could perform few-shot learning. Give them examples, and they generalize. No gradient updates. No retraining. Just forward passes.

The surprising part was that scaling alone seemed to unlock it.

Paper: Language Models are Few-Shot Learners

ELI5: Show a big language model a few examples of a task in your prompt, and it figures out how to do the task—without any retraining. Nobody told it to do this. It just emerged when models got big enough.

Main idea: Scale unlocks emergent capabilities. ICL was discovered, not designed.

Phase 2: Mechanistic Explanations (2022)

By 2022, researchers began probing the internal mechanisms. Several papers proposed that transformers implement implicit meta-learning. The model appears to learn during inference by performing gradient-descent-like operations internally.

Paper: What Explains In-Context Learning in Transformers?

ELI5: When you give a transformer examples, its attention layers do something that looks like fitting a simple linear model to those examples—on the fly, during the forward pass. It’s not memorizing; it’s computing a mini-solution.

Main idea: ICL works because attention can simulate linear regression internally.

Paper: Transformers Learn In-Context by Gradient Descent

ELI5: The transformer’s forward pass is secretly doing something similar to training. The attention mechanism acts like one step of gradient descent over the examples you provided. Learning happens inside inference.

Main idea: ICL is implicit gradient descent—learning hidden inside prediction.
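
The arithmetic behind that claim can be checked in a few lines. This is a simplified sketch, not the paper's construction: it assumes a linear regression task, weights starting at zero, and unnormalized linear attention (no softmax) with the examples as keys and their labels as values. All names and numbers are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gd_one_step_prediction(xs, ys, x_query, lr):
    """Start at w = 0, take one gradient step on 0.5 * sum((w.x - y)^2), then predict."""
    dim = len(x_query)
    # The gradient at w = 0 is -sum(y_i * x_i), so w_1 = lr * sum(y_i * x_i).
    w = [lr * sum(y * x[d] for x, y in zip(xs, ys)) for d in range(dim)]
    return dot(w, x_query)

def linear_attention_prediction(xs, ys, x_query, lr):
    """Unnormalized linear attention: keys x_i, values y_i, query x_query."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # in-context examples
ys = [2.0, -1.0, 1.0]                       # labels from a hidden w* = [2, -1]
x_query = [0.5, 2.0]

pred_gd = gd_one_step_prediction(xs, ys, x_query, lr=0.1)
pred_attn = linear_attention_prediction(xs, ys, x_query, lr=0.1)
```

The two predictions coincide because both reduce to the same sum over examples. Trained transformers are messier than this, but the identity shows why a forward pass can behave like a learning step.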

Phase 3: Engineering the Effect (2023-24)

Once researchers understood that ordering and structure affect ICL, prompt design became less of an art and more of an optimization problem. The quality and arrangement of demonstrations directly shape performance.

ICL became tunable. Researchers could now deliberately improve it rather than just observe it.

Phase 4: Interactive ICL (2026)

Recent work pushes this further. Models are trained to predict natural language critiques and feedback. If a model can predict what a teacher would say, it can internalize that signal. External correction becomes an internal capability.

Paper: Improving Interactive In-Context Learning from Natural Language Feedback

ELI5: Train a model to guess what feedback a human would give. Now the model has internalized the “teacher” and can improve itself without needing the actual teacher present. Self-correction without weight updates.

Main idea: Models can learn to learn from feedback, making ICL interactive and self-improving.

Beyond Language

Newer work applies ICL to neuroscience discovery, showing that the mechanism is not limited to text tasks. It becomes a flexible reasoning substrate across domains. That’s when you know a concept has matured.

The Arc

Phase	Era	Key Insight
Discovery	2020	Emerges from scale
Explanation	2022	Implicit gradient descent
Engineering	2023-24	Prompt design as optimization
Self-improvement	2026	Learning from feedback

The Deeper Insight

In-Context Learning started as an emergent surprise. Now it’s becoming an engineered learning substrate inside transformers.

It was not magic. It was meta-learning hiding in plain sight.

References

Paper	Link
Language Models are Few-Shot Learners (GPT-3)	arXiv:2005.14165
What Explains In-Context Learning in Transformers?	arXiv:2202.12837
Transformers Learn In-Context by Gradient Descent	arXiv:2212.07677
Improving Interactive ICL from Natural Language Feedback	arXiv:2602.16066

ICL started as “whoa, it works.” Now we understand “why it works.” Next: engineering it deliberately.

February 22, 2026 • Software Wrighter

451 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Autoencoders (compress and reconstruct), Correlation vs Causation (co-occurrence isn't cause), Curriculum Learning (easy to hard), Failure Analysis (categorize errors), Covariate Shift (new inputs, same task).

Five ML Concepts - #19

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #19

References

Concept	Reference
Autoencoders	Reducing the Dimensionality of Data with Neural Networks (Hinton & Salakhutdinov 2006)
Correlation vs Causation	Causality (Pearl 2009)
Curriculum Learning	Curriculum Learning (Bengio et al. 2009)
Failure Analysis	Practical Machine Learning for Computer Vision (Lakshmanan et al. 2021)
Covariate Shift	Dataset Shift in Machine Learning (Quinonero-Candela et al. 2009)

Today’s Five

1. Autoencoders

Autoencoders are neural networks trained to compress inputs into a smaller representation and reconstruct them. The bottleneck forces the model to capture essential structure.

This learned compression is useful for dimensionality reduction, denoising, and feature learning.

Like summarizing a book into key points and then rebuilding the story from that summary.

2. Correlation vs Causation

Two variables can move together without one causing the other. Models typically learn correlations present in data, not true cause-and-effect relationships.

This matters because interventions based on correlation alone may not produce intended effects.

Like noticing umbrella sales rise with rain—umbrellas don’t cause rain.

3. Curriculum Learning

Training starts with easier examples and gradually introduces harder ones. This can improve stability and learning speed in some settings.

The approach mirrors how humans learn complex subjects incrementally.

Like teaching math by starting with addition before moving to calculus.
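
A curriculum can be as simple as sorting by a difficulty score and widening the training pool in stages. The sketch below assumes you already have such a score. Using string length as the score is a stand-in for illustration only.

```python
def curriculum_stages(examples, difficulty, num_stages):
    """Yield growing training pools: easiest slice first, full dataset last."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = max(1, len(ordered) * stage // num_stages)
        yield ordered[:cutoff]

problems = ["2+2", "12*7", "d/dx x^2", "3+5", "integral of sin(x) dx", "9-4"]
# Hypothetical difficulty proxy: longer problem text counts as harder.
stages = list(curriculum_stages(problems, difficulty=len, num_stages=3))
```

Each stage keeps the earlier, easier examples in the pool, so the model keeps rehearsing them while harder ones are introduced.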

4. Failure Analysis

Failure analysis groups model errors into categories to understand where performance breaks down. This helps target improvements instead of guessing.

Systematic error analysis often reveals actionable patterns invisible in aggregate metrics.

Like a teacher reviewing which types of questions students miss most often.
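
In practice this often starts as a simple group-by over evaluation records. The sketch below assumes each record already carries a category tag. The field names and data are hypothetical.

```python
from collections import Counter

def error_breakdown(records):
    """Map each category to (error count, error rate) over the evaluation set."""
    totals = Counter(r["category"] for r in records)
    errors = Counter(r["category"] for r in records if r["predicted"] != r["label"])
    return {cat: (errors[cat], errors[cat] / totals[cat]) for cat in totals}

records = [
    {"category": "blurry", "label": "cat", "predicted": "dog"},
    {"category": "blurry", "label": "dog", "predicted": "cat"},
    {"category": "blurry", "label": "cat", "predicted": "cat"},
    {"category": "well_lit", "label": "cat", "predicted": "cat"},
    {"category": "well_lit", "label": "dog", "predicted": "dog"},
    {"category": "well_lit", "label": "dog", "predicted": "dog"},
    {"category": "well_lit", "label": "cat", "predicted": "dog"},
]
breakdown = error_breakdown(records)
```

Overall accuracy here is 4 out of 7, which hides the fact that most errors come from one category.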

5. Covariate Shift

Covariate shift occurs when the input distribution changes between training and deployment, while the task itself remains the same. The model may underperform because it sees unfamiliar inputs.

Monitoring input distributions helps detect this shift early.

Like training a driver in sunny weather and testing them in snow.
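
One way to monitor an input feature is to compare its training and live distributions with a two-sample Kolmogorov-Smirnov statistic. This is a minimal stdlib sketch. The feature values are invented, and any alert threshold would be an assumption you tune per feature.

```python
import bisect

def ks_statistic(train_values, live_values):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(train_values), sorted(live_values)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

train_temps = [18, 20, 21, 22, 23, 24, 25, 26]   # hypothetical sunny-weather feature
same_season = [19, 20, 22, 23, 24, 25]
winter = [-5, -3, -2, 0, 1, 2]

drift_same = ks_statistic(train_temps, same_season)
drift_winter = ks_statistic(train_temps, winter)
```

A value near 0 means the live inputs look like the training inputs. A value near 1 means the two samples barely overlap.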

Quick Reference

Concept	One-liner
Autoencoders	Compress and reconstruct to learn structure
Correlation vs Causation	Co-occurrence isn’t cause
Curriculum Learning	Start easy, progress to hard
Failure Analysis	Categorize errors to guide fixes
Covariate Shift	New inputs, same task

Short, accurate ML explainers. Follow for more.

February 21, 2026 • Software Wrighter

2241 words • 12 min read • Abstract

JSON is everywhere, but it's not the only option. This post explores data formats beyond basic JSON—JSONL for streaming, JSONB for fast queries, Protocol Buffers for compact wire formats, YAML/TOML for human editing, and TOON for LLM efficiency. Each has trade-offs: pick two of readability, compactness, or speed.

JSON et al: A Deep Dive into Data Serialization Formats

JSON is everywhere. APIs. Logs. Databases. Configuration files. But it’s not alone. A whole ecosystem of formats exists—each optimizing for different tradeoffs.

This post expands on the JSON et al short, providing technical depth on each format: when it was created, where it’s specified, and what problems it solves.

The Tradeoff Triangle

Before diving in, understand the fundamental constraint. Data formats balance three competing goals:

Goal	Description
Human Readability	Can a developer read and edit it directly?
Compactness	How many bytes does it take to represent data?
Query Performance	How fast can you access specific fields?

You usually only get two. JSON optimizes readability. Protobuf optimizes compactness. JSONB optimizes query performance. No format wins everywhere.

JSON: The Ubiquitous Baseline

Created: 2001 (discovered/formalized by Douglas Crockford)
Specification: ECMA-404 (2013), RFC 8259 (2017)
File Extension: .json

JSON (JavaScript Object Notation) emerged from JavaScript’s object literal syntax but became language-agnostic. Crockford didn’t invent it—he “discovered” it already existing in JavaScript and formalized the specification.

Technical Details

Encoding: UTF-8 text (required by RFC 8259 for interchange; earlier RFCs also allowed UTF-16/32)
Data Types: Objects {}, arrays [], strings, numbers, booleans, null
Schema: None required
Comments: Not allowed in strict JSON

Strengths

Universal parser support (every language has one)
Human readable without tools
Web-native (JavaScript parses it natively)
Simple specification (fits on a business card)

Weaknesses

Verbose (field names repeated for every object)
No binary data type (must base64-encode)
No comments (frustrating for config files)
Parsing overhead (tokenization + string decoding every time)

ELI5

Like typing a long email instead of sending a terse text. Every message spells everything out—clear, but verbose.

When to Use

REST APIs, configuration (when comments aren’t needed), data interchange between systems, anywhere human readability matters more than efficiency.

JSONL / NDJSON: Streaming JSON

Created: ~2013 (formalized)
Specification: JSON Lines, NDJSON
File Extension: .jsonl, .ndjson

JSONL (JSON Lines) and NDJSON (Newline-Delimited JSON) are the same concept: one valid JSON object per line, separated by newlines.

Technical Details

{"name": "Alice", "score": 95}
{"name": "Bob", "score": 87}
{"name": "Carol", "score": 92}

No wrapping array. Each line is independently parseable.

Strengths

Streaming: Process line-by-line without loading entire file
Append-only: Add records without rewriting the file
Parallel processing: Split by line, distribute to workers
Fault-tolerant: One corrupt line doesn’t invalidate the file
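
The streaming and fault-tolerance points can be sketched with nothing but the standard `json` module. The `read_jsonl` helper is a hypothetical name, and skipping bad lines silently is one policy among several (logging or collecting them is equally reasonable).

```python
import io
import json

def read_jsonl(stream):
    """Yield (line_number, record) for each valid line; skip corrupt lines."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield line_number, json.loads(line)
        except json.JSONDecodeError:
            continue  # one bad line does not invalidate the rest

data = io.StringIO(
    '{"name": "Alice", "score": 95}\n'
    '{"name": "Bob", "score": 8\n'          # truncated, corrupt line
    '{"name": "Carol", "score": 92}\n'
)
records = list(read_jsonl(data))
```

Because the function is a generator over a file-like object, it works the same way on a multi-gigabyte log as on this three-line string.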

Weaknesses

Not valid JSON as a whole (a standard parser must be applied line by line, not to the entire file)
Still text-based (same verbosity as JSON)
No random access by index

ELI5

Like removing one comma per line to save some typing. Each line is self-contained, so you can grab and process them one at a time.

When to Use

Log files, big data pipelines (Spark, Pandas), ML datasets, event streams, anywhere you need to process records incrementally.

JSONB: Binary JSON for Databases

Created: 2014 (PostgreSQL 9.4)
Specification: Implementation-specific (no universal standard)
Storage: Database column type

JSONB isn’t a file format—it’s a database storage optimization. PostgreSQL’s JSONB differs from MongoDB’s BSON, which differs from other implementations.

PostgreSQL JSONB Details

Parsed once: Text converted to binary on INSERT
Keys sorted: Deterministic ordering for indexing
Duplicates removed: Last value wins
Offset table: O(log n) field lookup instead of O(n) text scanning

MongoDB BSON

Specification: bsonspec.org

BSON (Binary JSON) is MongoDB’s serialization format. Unlike PostgreSQL’s JSONB, BSON is a standalone binary format:

Type-prefixed values
Supports additional types (Date, Binary, ObjectId)
Length-prefixed for fast skipping
Not reliably smaller than JSON (length prefixes and explicit type tags can make some documents larger)

Strengths

Fast queries without re-parsing
Indexable (GIN indexes on JSONB in PostgreSQL)
Type coercion at storage time

Weaknesses

Not portable (implementation-specific)
Not human-readable
INSERT overhead (parsing cost upfront)

ELI5

Instead of cooking from scratch every time, you heat a pre-made meal. The prep work happens once (on INSERT), so serving (queries) is fast.

When to Use

Database storage where you query into JSON structures. PostgreSQL JSONB + GIN indexes enable fast @> containment queries.

Protocol Buffers: Google’s Schema-First Format

Created: 2001 (internal Google), 2008 (open-sourced)
Specification: developers.google.com/protocol-buffers
File Extension: .proto (schema), binary wire format

Protocol Buffers (Protobuf) is Google’s language-neutral, schema-required serialization format. It powers gRPC.

Technical Details

Schema definition:

syntax = "proto3";

message Sensor {
  int32 temperature = 1;
  int32 humidity = 2;
}

Wire format uses field numbers, not names:

Field 1: 72
Field 2: 40

Key Features

Varint encoding: Small integers use fewer bytes
Field numbers: Enable backward compatibility
Code generation: .proto → language-specific classes
No self-description: Receiver must know schema
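
To make the wire format concrete, the Sensor message with temperature 72 and humidity 40 can be encoded by hand. This sketch covers only non-negative integers with wire type 0 (varint). It is not a substitute for generated Protobuf code, and the helper names are illustrative.

```python
import json

def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_varint_field(field_number, value):
    """Tag = (field_number << 3) | wire_type, with wire type 0 for varints."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)

# Field 1 (temperature) = 72, field 2 (humidity) = 40.
wire = encode_varint_field(1, 72) + encode_varint_field(2, 40)
as_json = json.dumps({"temperature": 72, "humidity": 40}, separators=(",", ":"))
```

The message comes out as the four bytes 08 48 10 28, while the compact JSON equivalent is 32 bytes. The field names never appear on the wire, which is why the receiver needs the schema.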

Strengths

Extremely compact (3-10x smaller than JSON typically)
Fast serialization/deserialization
Strong versioning semantics
gRPC integration

Weaknesses

Requires schema agreement
Not human-readable
Tooling required for debugging
Schema evolution has rules (for example, field numbers must never be reused)

ELI5

Everyone agrees upfront what “field 1” means. You don’t waste space spelling out “temperature”—you just send the number 1 and the value. Both sides know the code.

When to Use

Microservices (gRPC), internal APIs, anywhere bandwidth and latency matter more than debuggability.

ASN.1: The Telecom Veteran

Created: 1984 (CCITT X.409; published separately as X.208 in 1988)
Specification: ITU-T X.680-X.683
Encoding Rules: BER, DER, PER, XER, and more

ASN.1 (Abstract Syntax Notation One) predates all modern formats. It defines both schema and encoding, with multiple encoding rules for different use cases.

Encoding Rules Comparison

Rule	Use Case
BER (Basic Encoding Rules)	Flexible, general purpose
DER (Distinguished Encoding Rules)	Deterministic, for cryptography
PER (Packed Encoding Rules)	Most compact, for bandwidth-constrained
XER (XML Encoding Rules)	XML-based, for interop

Where You See ASN.1

X.509 certificates (SSL/TLS certs are DER-encoded ASN.1)
LDAP (directory services)
SNMP (network management)
Telecom protocols (SS7, GSM, LTE)

Strengths

Bit-level precision
Proven over 40 years
Multiple encoding options
Formal verification possible

Weaknesses

Complex specification
Steep learning curve
Tooling can be expensive
Security vulnerabilities in parsers (historically)

ELI5

Same idea as Protobuf—everyone agrees upfront what each field number means. ASN.1 just got there 20 years earlier and handles even more edge cases.

When to Use

You probably won’t choose ASN.1 for new projects. You’ll encounter it in cryptography, certificates, and legacy telecom systems.

YAML: Human-Friendly Configuration

Created: 2001 (Clark Evans, Ingy döt Net, Oren Ben-Kiki)
Specification: yaml.org/spec/1.2.2
File Extension: .yaml, .yml

YAML (YAML Ain’t Markup Language) prioritizes human readability. Since version 1.2 it is a superset of JSON, so any valid JSON document is also valid YAML.

Technical Details

# Comments allowed!
server:
  host: localhost
  port: 8080
  features:
    - auth
    - logging

Key Features

Indentation-based: Whitespace matters
Comments: # for single-line
Anchors/aliases: &name and *name for references
Multiple documents: --- separator

Strengths

Highly readable
Comments supported
Multi-line strings without escaping
Complex data structures

Weaknesses

“Norway problem”: unquoted NO parses as boolean false under YAML 1.1 rules, which many parsers still follow
Whitespace sensitivity causes errors
Multiple ways to express same data
Security concerns (arbitrary code execution in some parsers)

ELI5

Optimized for clarity, not bandwidth. YAML is for humans editing config files—not for machines exchanging data over networks.

When to Use

Configuration files (Kubernetes, Docker Compose, CI/CD), anywhere humans edit data directly and comments help.

TOML: Minimal Configuration

Created: 2013 (Tom Preston-Werner)
Specification: toml.io
File Extension: .toml

TOML (Tom’s Obvious Minimal Language) emerged as a reaction to YAML’s complexity. It’s used by Rust (Cargo.toml), Python (pyproject.toml), and others.

Technical Details

[server]
host = "localhost"
port = 8080

[server.features]
auth = true
logging = true

Key Features

Explicit typing: Dates, times, arrays have clear syntax
Sections: [section] and [section.subsection]
No anchors: Intentionally simpler than YAML
Unambiguous: A document maps to exactly one hash table

Strengths

Easy to read and write
Unambiguous parsing
Clear error messages
Growing ecosystem support

Weaknesses

Less expressive than YAML
Nested structures can be verbose
Smaller ecosystem than JSON/YAML

ELI5

Same goal as YAML—clarity for humans, not bandwidth for machines—but with stricter rules so you make fewer mistakes.

When to Use

Configuration files where YAML’s complexity isn’t needed. Rust projects (Cargo requires Cargo.toml). Python packaging (pyproject.toml).

TOON: Token-Optimized for LLMs

Created: October 2025 (toon-format organization)
Specification: github.com/toon-format/toon (v3.0)
File Extension: .toon
Media Type: text/toon (provisional)

TOON (Token-Oriented Object Notation) is the newest format in this list, designed specifically for LLM input. It’s a lossless representation of JSON that minimizes tokens.

Technical Details

TOON combines YAML-style indentation for nested objects with CSV-like tabular layouts for uniform arrays:

users[2]{name,age}:
  Alice,25
  Bob,30

Equivalent JSON:

{"users": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]}

Key Features

Header-based: Field names declared once, values follow
Token savings: Typically about 40% fewer tokens than equivalent JSON
Lossless: Round-trips to JSON perfectly
UTF-8 always: No encoding ambiguity
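
The header-based layout is easy to sketch for the simplest case. The function below is a hypothetical toy encoder, not the reference implementation. It assumes a uniform list of flat objects whose values are plain strings or integers with no commas, quotes, or newlines, and it indents rows by two spaces as in the TOON project's own examples.

```python
def to_toon_table(key, rows):
    """Encode a uniform list of flat dicts as a TOON-style tabular block.

    Sketch only: the real format has quoting and escaping rules that are
    not handled here.
    """
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError("rows are not uniform")
    # Header declares the key, the row count, and the field names once.
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])

users = [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]
toon_text = to_toon_table("users", users)
```

The field names appear once in the header instead of once per object, which is where the token savings on large uniform arrays come from.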

Performance

Metric	JSON	TOON
Accuracy	69.7%	73.9%
Efficiency (acc/1K tokens)	15.3	26.9

Strengths

Significant token savings at scale
Better context window utilization
Lower API costs for LLM applications
Human-readable (unlike binary formats)

Weaknesses

New format (October 2025)
Limited tooling compared to JSON
Requires conversion layer for existing systems
Not yet widely adopted

ELI5

Like a table with one header row instead of repeating the column names on every single row. You declare field names once, then just list the values.

When to Use

LLM prompts with structured data, RAG applications, anywhere token efficiency matters. Especially useful for large datasets with uniform object arrays.

Implementations

TypeScript: Reference implementation
Python: toons (Rust-based, fast)
Go, Rust, .NET: Available via toon-format org

Alternatives Not in the Video

MessagePack

Created: 2008 (Sadayuki Furuhashi)
Specification: msgpack.org

Binary JSON without schema. Type-prefixed values, efficient numeric encoding.

Use when: You want JSON semantics but smaller/faster.

CBOR

Created: 2013 (IETF, RFC 7049)
Specification: RFC 8949 (2020, obsoletes RFC 7049)

Concise Binary Object Representation. Designed for constrained environments (IoT).

Use when: Resource-constrained devices, need a standard binary format.

Apache Avro

Created: 2009 (Apache, Doug Cutting)
Specification: avro.apache.org

Schema-based, row-oriented binary format. Schema embedded or stored separately. Strong schema evolution support.

Use when: Big data pipelines (Hadoop, Kafka), schema evolution is critical.

Apache Parquet

Created: 2013 (Twitter + Cloudera)
Specification: parquet.apache.org

Columnar storage format. Not for serialization—for analytics storage.

Use when: Large-scale analytics, data warehousing, Spark/Pandas workflows.

Cap’n Proto

Created: 2013 (Kenton Varda, ex-Protobuf author)
Specification: capnproto.org

Zero-copy serialization. The serialized form is the in-memory form.

Use when: Extreme performance requirements, inter-process communication.

FlatBuffers

Created: 2014 (Google)
Specification: google.github.io/flatbuffers

Zero-copy like Cap’n Proto but with better tooling. Used in games, mobile.

Use when: Games, mobile apps, anywhere memory allocation matters.

Quick Reference

Format	Year	Schema	Binary	Human-Readable	Best For
JSON	2001	No	No	Yes	APIs, interchange
JSONL	2013	No	No	Yes	Logs, streaming
JSONB	2014	No	Yes	No	Database queries
Protobuf	2008	Yes	Yes	No	Microservices
ASN.1	1984	Yes	Yes	No	Crypto, telecom
YAML	2001	No	No	Yes	Config files
TOML	2013	No	No	Yes	Simple config
TOON	2025	No	No	Yes	LLM prompts
MessagePack	2008	No	Yes	No	Fast JSON
CBOR	2013	Optional	Yes	No	IoT
Avro	2009	Yes	Yes	No	Big data

Key Takeaways

No “best” format exists. Each optimizes for different constraints.
Text formats favor humans. JSON, YAML, TOML prioritize readability over efficiency.
Binary formats favor machines. Protobuf, MessagePack, CBOR prioritize compactness and speed.
Schema formats favor correctness. Protobuf, Avro, ASN.1 catch errors at compile time.
The tradeoff triangle is real. Readability, compactness, query performance—pick two.

The question isn’t “which format wins?” The question is: what problem are you solving?

Resources

Data formats are design decisions. Choose based on your constraints, not trends.

Questions? Find me on YouTube @SoftwareWrighter.

February 21, 2026 • Software Wrighter

444 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Preference Learning (train from comparisons), Ensembling (combine models for robustness), ML Fragility (breaks on distribution shift), Epoch (one pass through data), Cost vs Quality (bigger isn't always better).

Five ML Concepts - #18

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #18

References

Concept	Reference
Preference Learning	Learning to summarize from human feedback (Stiennon et al. 2020)
Ensembling	Ensemble Methods in Machine Learning (Dietterich 2000)
ML Fragility	Dataset Shift in Machine Learning (Quinonero-Candela et al. 2009)
Epoch	Deep Learning (Goodfellow et al. 2016), Chapter 8
Cost vs Quality	Efficient Transformers: A Survey (Tay et al. 2022)

Today’s Five

1. Preference Learning

Instead of learning from fixed labels, models are trained from comparisons between outputs. This helps align model behavior with human judgments.

The approach works well when absolute quality is hard to define but relative preferences are easier to express.

Like learning to cook by asking which dish tastes better.
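
A common way to turn comparisons into a training signal is a pairwise logistic (Bradley-Terry style) loss on reward scores. The sketch below assumes a reward model has already scored both outputs. The scores are made-up numbers.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss: -log sigmoid(r_preferred - r_rejected)."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

loss_correct = preference_loss(2.0, 0.5)   # model already ranks the preferred output higher
loss_tie = preference_loss(1.0, 1.0)       # model cannot tell them apart
loss_wrong = preference_loss(0.5, 2.0)     # model ranks them the wrong way round
```

The loss only depends on the difference between the two scores, which matches the idea that relative preferences are easier to express than absolute quality.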

2. Ensembling

Ensembling combines predictions from multiple models. Different models make different errors, and combining them can improve robustness.

Common strategies include voting, averaging, and stacking models together.

Like asking several experts and averaging their opinions.
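
Voting and averaging take only a few lines each. This sketch assumes three already-trained classifiers. The predictions are invented for illustration.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label among the models' predictions."""
    return Counter(labels).most_common(1)[0][0]

def average_probs(prob_vectors):
    """Average class-probability vectors element-wise across models."""
    n = len(prob_vectors)
    return [sum(col) / n for col in zip(*prob_vectors)]

votes = ["cat", "dog", "cat"]                       # hard predictions from 3 models
probs = [[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]        # soft predictions from the same models

voted = majority_vote(votes)
averaged = average_probs(probs)
```

Averaging probabilities keeps information about how confident each model was, while voting treats every model's opinion as equally firm.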

3. Why ML Is Fragile

Models rely on statistical patterns learned from data. When those patterns shift, performance can degrade quickly.

This fragility emerges because models optimize for training distributions, not arbitrary future scenarios.

Like a spell checker that works on common words but struggles with unusual ones.

4. Epoch

An epoch is one complete pass through the training dataset. Multiple epochs allow the model to refine its weights over repeated passes.

Training typically continues for many epochs until validation performance stops improving.

Like reading a textbook from beginning to end more than once.

5. Cost vs Quality Tradeoffs

Increasing model size or compute often improves performance, but also increases cost and latency. Engineers balance quality against budget and responsiveness.

Production systems often use smaller, faster models rather than the largest available.

Like choosing between a luxury car and an economy car depending on your needs.

Quick Reference

Concept	One-liner
Preference Learning	Train from comparisons, not labels
Ensembling	Combine models for robustness
ML Fragility	Statistical models break on distribution shift
Epoch	One pass through training data
Cost vs Quality	Bigger isn’t always better in production

Short, accurate ML explainers. Follow for more.

February 20, 2026 • Software Wrighter

1063 words • 6 min read • Abstract

Personal Software via Vibe Coding: a music tool for AI agents. midi-cli-rs provides mood presets (suspense, upbeat, calm, jazz) so agents can generate complete audio compositions from simple commands. No music theory required.

midi-cli-rs: Music Generation for AI Coding Agents

AI coding agents can write code, generate images, and produce text. But what about music? When I needed background audio for explainer videos, I wanted a tool that AI agents could use directly—no music theory required.

Resource	Link
Video	midi-cli-rs Explainer
Examples	Listen to Samples
Code	midi-cli-rs

The Problem

Generating music programmatically is hard. Traditional approaches require understanding music theory, MIDI specifications, instrument mappings, and audio synthesis. That’s a lot to ask of an AI agent that just needs a 5-second intro.

I wanted something simpler: a CLI tool where an agent could say “give me 5 seconds of suspenseful music” and get a usable WAV file.

The Solution: Mood Presets

midi-cli-rs solves this with mood presets—curated musical generators that produce complete compositions from a single command:

# Generate a 5-second suspenseful intro
midi-cli-rs preset --mood suspense --duration 5 -o intro.wav

# Upbeat outro with specific key
midi-cli-rs preset -m upbeat -d 7 --key C --seed 42 -o outro.wav

Six moods are available:

Mood	Character
`suspense`	Low drones, tremolo strings, tension
`eerie`	Sparse tones, diminished harmony
`upbeat`	Rhythmic chords, energetic
`calm`	Warm pads, gentle arpeggios
`ambient`	Textural drones, pentatonic bells
`jazz`	Walking bass, brushed drums, piano trio

Each mood generates multi-layer compositions with appropriate instruments, rhythms, and harmonies. The --seed parameter ensures reproducibility—same seed, same output. Different seeds produce meaningful variations in melody contour, rhythm patterns, and instrument choices.

Melodic Variation

The presets don’t just randomize notes—they use a contour-based variation system. Changing the seed produces melodies that follow different shapes (ascending, descending, arch, wave) while staying musically coherent. This means you can generate multiple versions of a mood and pick the one that fits best.

How It Works

The tool generates MIDI programmatically, then renders to WAV using FluidSynth:

Mood Preset → MIDI Generation → FluidSynth → WAV Output

MIDI generation uses the midly crate to create standard MIDI files. Each preset generates multiple tracks with different instruments, note patterns, and dynamics.

Audio rendering calls FluidSynth as a subprocess with a SoundFont (instrument samples). This avoids LGPL licensing complications—subprocess execution doesn’t trigger copyleft.

Note-Level Control

When presets aren’t enough, you can specify exact notes:

# Note format: PITCH:DURATION:VELOCITY[@OFFSET]
midi-cli-rs generate \
    --notes "C4:0.5:80@0,E4:0.5:80@0.5,G4:0.5:80@1,C5:1:90@1.5" \
    -i piano -t 120 -o arpeggio.wav

Or use JSON for complex multi-track arrangements:

echo '{"tempo":90,"instrument":"piano","notes":[
  {"pitch":"C4","duration":0.5,"velocity":80,"offset":0},
  {"pitch":"E4","duration":0.5,"velocity":80,"offset":0.5},
  {"pitch":"G4","duration":1,"velocity":90,"offset":1}
]}' | midi-cli-rs generate --json -o output.wav
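
For agents that build note strings programmatically, it can help to see how the PITCH:DURATION:VELOCITY[@OFFSET] format breaks down. The parser below is a hypothetical sketch, not the tool's own code. It assumes the common convention that C4 is MIDI note 60 and that a missing offset means 0. Whether midi-cli-rs accepts sharps and flats in this form is also an assumption.

```python
from dataclasses import dataclass

SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

@dataclass
class Note:
    midi_pitch: int
    duration: float
    velocity: int
    offset: float

def pitch_to_midi(name):
    """Convert names like 'C4' or 'F#3' to MIDI numbers, assuming C4 = 60."""
    letter, rest = name[0].upper(), name[1:]
    accidental = 0
    if rest[:1] == "#":
        accidental, rest = 1, rest[1:]
    elif rest[:1] == "b":
        accidental, rest = -1, rest[1:]
    return 12 * (int(rest) + 1) + SEMITONES[letter] + accidental

def parse_note(spec):
    """Parse 'PITCH:DURATION:VELOCITY[@OFFSET]'; offset defaults to 0."""
    body, _, offset = spec.partition("@")
    pitch, duration, velocity = body.split(":")
    return Note(pitch_to_midi(pitch), float(duration), int(velocity), float(offset or 0))

spec = "C4:0.5:80@0,E4:0.5:80@0.5,G4:0.5:80@1,C5:1:90@1.5"
notes = [parse_note(s) for s in spec.split(",")]
```

Parsing into a small dataclass like this makes it straightforward to validate or transpose notes before handing the string to the CLI.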

Web UI

For interactive composition, there’s a browser-based interface:

midi-cli-rs serve  # Starts on http://127.0.0.1:3105

The Presets tab lets you adjust mood, key, duration, intensity, and tempo with immediate audio preview. Click the clock button to generate a time-based seed for unique but reproducible results.

The Melodies tab provides note-by-note composition with keyboard shortcuts:

a-g for note pitch
[ / ] to adjust duration
+ / - to change octave
Tab to navigate between notes

For AI Agents

The CLI is designed for AI agent usage:

Simple commands: One line generates complete audio
Reproducible: Seed values ensure consistent output
Self-documenting: --help includes agent-specific instructions
Composable: Generate tracks separately, combine with ffmpeg

# AI agent workflow
midi-cli-rs preset -m suspense -d 5 --seed 1 -o intro.wav
midi-cli-rs preset -m upbeat -d 10 --seed 2 -o main.wav
ffmpeg -i intro.wav -i main.wav -filter_complex concat=n=2:v=0:a=1 final.wav

SoundFont Quality Matters

The quality of generated audio depends heavily on the SoundFont used. SoundFonts are collections of audio samples for each instrument—a tiny SoundFont with compressed samples will sound thin and artificial, while a larger one with high-quality recordings produces professional results.

SoundFont	Size	Quality	License
TimGM6mb	~6MB	Basic	GPL v2
GeneralUser GS	~30MB	Good	Permissive
FluidR3_GM	~140MB	Very Good	MIT
MuseScore_General	~200MB	Excellent	MIT

For anything beyond quick prototypes, use a quality SoundFont. The difference is dramatic—the same MIDI file can sound like a toy keyboard or a real instrument depending on the samples.

The tool auto-detects SoundFonts in common locations (~/.soundfonts/, /opt/homebrew/share/soundfonts/, etc.), or specify one explicitly with --soundfont.

Technical Details

Built with Rust 2024 edition using permissively licensed dependencies:

Crate	Purpose
midly	MIDI file generation
clap	CLI argument parsing
serde	JSON serialization
rand	Randomization for presets
axum	Web server (for `serve` command)

FluidSynth is called as a subprocess for WAV rendering, keeping the main codebase MIT-licensed.

Try It

Listen to sample outputs, or build locally:

git clone https://github.com/softwarewrighter/midi-cli-rs.git
cd midi-cli-rs
cargo build --release
./target/release/midi-cli-rs preset -m jazz -d 5 -o jazz.wav

Requires FluidSynth for WAV output (brew install fluid-synth on macOS).

Series: Personal Software (Part 2)

Previous: cat-finder: Local ML in Rust

Next: midi-cli-rs: Custom Mood Packs

Disclaimer

You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

Watch the Video

Unmute to hear narration.

February 20, 2026 • Software Wrighter

472 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Benchmark Leakage (test data contamination), Concept vs Data Drift (changed relationships vs inputs), Weight Decay (L2 penalty for simplicity), Scaling Laws (predictable performance growth), Shadow Deployment (test alongside production).

Five ML Concepts - #17

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #17

References

Concept	Reference
Benchmark Leakage	Rethinking the Inception Architecture for Computer Vision (Szegedy et al. 2016)
Concept/Data Drift	Learning under Concept Drift: A Review (Lu et al. 2018)
Weight Decay	Decoupled Weight Decay Regularization (Loshchilov & Hutter 2019)
Scaling Laws	Scaling Laws for Neural Language Models (Kaplan et al. 2020)
Shadow Deployment	Reliable Machine Learning (Cathy Chen et al. 2022)

Today’s Five

1. Benchmark Leakage

When benchmark or test data influences training, tuning, or model selection, evaluation results become unreliable. This inflates reported performance beyond real-world capability.

Strict separation between development and evaluation data is essential for honest assessment.

Like practicing with the exact questions that will appear on the final exam.
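A first-pass leakage check can be as simple as looking for test items that also appear in the training data. This is a minimal sketch, assuming text examples and exact matches after light normalization; `normalize` and `leaked_examples` are illustrative names, and paraphrased or partial overlap would need fuzzier matching.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a duplicate."""
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    return hashlib.sha256(normalize(text).encode()).hexdigest()


def leaked_examples(train: list[str], test: list[str]) -> list[str]:
    """Return test items whose normalized text also appears in the training set."""
    seen = {fingerprint(t) for t in train}
    return [t for t in test if fingerprint(t) in seen]


train = ["What is 2 + 2?", "Name the capital of France."]
test = ["what is  2 + 2?", "Name the capital of Peru."]
leaks = leaked_examples(train, test)  # only the first test item is flagged
```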

2. Concept Drift vs Data Drift

Data drift occurs when input distributions change. Concept drift occurs when the relationship between inputs and outputs changes. Both can degrade model performance over time.

Data drift: customers buy different products. Concept drift: what “good” means has changed.

Like customers buying different products versus products changing what they mean.

3. Weight Decay

A regularization method that penalizes large weights, often implemented as L2 regularization. This encourages simpler models that generalize better.

Weight decay adds a term proportional to the squared magnitude of weights to the loss function.

Like encouraging shorter, simpler answers instead of overly complicated ones.
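The penalty term is easy to see in a few lines. This sketch assumes plain SGD and a penalty of `(lam / 2) * sum(w^2)`; the function names are illustrative. With adaptive optimizers such as Adam, the L2 penalty and decoupled weight decay are no longer equivalent, which is the point of the referenced paper.

```python
def l2_penalized_loss(data_loss: float, weights: list[float], lam: float) -> float:
    """Total loss = data loss + (lam / 2) * sum of squared weights."""
    return data_loss + 0.5 * lam * sum(w * w for w in weights)


def sgd_step(weights, grads, lr: float, lam: float):
    """One SGD step on the penalized loss; the penalty adds lam * w to each gradient."""
    return [w - lr * (g + lam * w) for w, g in zip(weights, grads)]


weights = [2.0, -4.0]
total = l2_penalized_loss(1.0, weights, lam=0.5)
# With a zero data gradient, each step shrinks every weight by the factor (1 - lr * lam).
shrunk = sgd_step(weights, grads=[0.0, 0.0], lr=0.1, lam=0.5)
```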

4. Scaling Laws

Empirical relationships showing how performance tends to improve as model size, data, or compute increase. These relationships follow predictable power-law curves.

Scaling laws help predict resource requirements for target performance levels.

Like noticing that adding horsepower often increases a car’s speed, but with diminishing returns.
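A power law is a straight line on log-log axes, so it can be fit with ordinary least squares. The sketch below uses synthetic points generated from `loss = 10 * size**(-0.1)`; those constants are made up for illustration and are not taken from the referenced paper.

```python
import math


def fit_power_law(sizes: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = a * size**(-alpha) by least squares on log(size) vs log(loss)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


# Synthetic measurements, not real data.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [10 * s ** -0.1 for s in sizes]
a, alpha = fit_power_law(sizes, losses)
predicted = a * 1e10 ** -alpha  # extrapolate one order of magnitude past the data
```

Real measurements are noisy and the fitted curve only holds within the regime it was measured in, so extrapolations like `predicted` should be treated as estimates.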

5. Shadow Deployment

Running a new model in parallel with production without affecting live user decisions. The shadow model processes real traffic but its outputs are only logged, not served.

This allows safe evaluation before full deployment.

Like a new chef preparing the same dishes in the back kitchen before serving customers.
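The serve-one, log-both pattern can be sketched as a request handler. This is an illustration only: the models are stand-in callables, `shadow_log` is an in-memory list, and a production system would typically run the shadow call asynchronously so it adds no latency.

```python
from typing import Callable

shadow_log: list[dict] = []


def handle_request(features, production_model: Callable, shadow_model: Callable):
    """Serve the production prediction; record the shadow prediction for offline comparison."""
    served = production_model(features)
    try:
        shadow = shadow_model(features)
    except Exception as exc:  # a failing shadow must never affect the live response
        shadow = f"error: {exc}"
    shadow_log.append({"input": features, "served": served, "shadow": shadow})
    return served


result = handle_request(3, production_model=lambda x: x * 2, shadow_model=lambda x: x * 2 + 1)
```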

Quick Reference

Concept	One-liner
Benchmark Leakage	Test data contaminating training/selection
Concept vs Data Drift	Changed relationships vs changed inputs
Weight Decay	L2 penalty discourages large weights
Scaling Laws	Performance scales predictably with resources
Shadow Deployment	Test safely alongside production

Short, accurate ML explainers. Follow for more.

Watch the Video

Unmute to hear narration.

February 19, 2026 • Software Wrighter

1069 words • 6 min read • Abstract

ToonTalk is a 1995 visual programming environment where you train robots by showing them what to do. I vibe coded tt-rs, a Rust/WebAssembly reimplementation with boxes, scales, birds, nests, and robots: programming by demonstration for the browser.

TBT (4/?): ToonTalk - Teaching Robots to Program

I first discovered ToonTalk during the Windows XP era—probably around 2003 or 2004. It was unlike anything I’d seen: a programming environment disguised as a video game where you trained robots by showing them what to do. The concept stuck with me for two decades.

Resource	Link
Video	ToonTalk in Rust
tt-rs Demo	Live Demo
tt-rs Repo	tt-rs

What is ToonTalk?

ToonTalk is a visual programming environment created by Ken Kahn in 1995. The “Toon” stands for cartoon—every abstract programming concept is mapped to a concrete, animated metaphor:

Concept	ToonTalk Metaphor
Variables	Boxes with numbered holes
Values	Numbers, text, images in boxes
Comparison	Scales that tip when values differ
Functions	Robots that watch and learn
Message passing	Birds that carry items to nests
Garbage collection	Trucks that haul away unused items

The design was influenced by games like The Legend of Zelda and Robot Odyssey—the kind of games that made you think while you played.

Programming by Demonstration

The core idea is radical: you don’t write code, you show a robot what to do.

Create a robot and put it in “training mode”
Perform actions while the robot watches (move items, compare values, etc.)
The robot records your actions as a program
Give the robot a box matching the training pattern—it executes the learned behavior

This is programming by demonstration. The robot generalizes from your example, matching patterns and applying transformations. It’s the same conceptual model as teaching a child: “Watch what I do, then you try.”

Three Generations

ToonTalk has existed in three forms:

Version	Era	Technology
Original ToonTalk	1995-2009	C++, 3D desktop application
ToonTalk Reborn	2014-2017	JavaScript/jQuery web app
tt-rs	2025-2026	Rust/WebAssembly/Yew

The original was a full 3D world—cities, houses, helicopters, even bombs for debugging. Ken Kahn later created ToonTalk Reborn, a simplified JavaScript version that runs in browsers.

Why I Built tt-rs

When I rediscovered ToonTalk Reborn a few years ago, I wanted to experiment with the concepts myself. But diving into a large jQuery codebase wasn’t appealing. So I did what any reasonable person would do: I vibe coded my own version in Rust.

tt-rs is a modern reimplementation using:

Rust for core logic
WebAssembly for browser execution
Yew for reactive UI
SVG/CSS for graphics and animations

It’s not a port—it’s a fresh implementation inspired by the same ideas. Building it myself lets me understand the concepts deeply and experiment with variations.

Three Learning Levels

The demo introduces concepts progressively through three levels:

Level	Concepts	Widgets
tt1	Basics	Numbers, boxes, scales, wand, vacuum
tt2	Messaging	Birds and nests for communication
tt3	Automation	Sensors (time, random) + robots

Level one covers the fundamentals: numbers with arithmetic, boxes as containers, scales for comparison, and tools for copying and removing. Level two adds asynchronous messaging—birds carry items to their paired nests. Level three brings sensors that produce values and robots that automate actions.

Current Features

The live demo includes:

Widgets:

Numbers: Rational arithmetic with +, -, *, / operators
Boxes: Configurable containers with 0-9 holes (resize with keyboard)
Text: Basic text display
Scales: Visual comparison that tips when values differ
Robot: Training mode, action recording, execution
Bird/Nest: Message passing with pairing and delivery
Sensors: Time (milliseconds) and random number generation

Tools:

Wand: Copy any widget
Vacuum: Remove widgets
Magnifier: Inspect nest message queues and robot actions

Interactions:

Drag-and-drop with visual feedback
Box joining (drop box on edge of another)
Box splitting (drop box on a number)
Contextual help panel with level-specific content
Puzzle system with animated “Show Me” demos

Robot Training

The core feature is programming by demonstration:

Click robot to enter training mode (yellow glow indicates “I’m watching”)
Perform actions while the robot records (arithmetic, copy, remove, move to box)
Click robot again to stop training
Click robot to replay—it executes the recorded sequence

The tutorials demonstrate this workflow step by step. In the “Train Robot” tutorial, you teach a robot to move a number into a box. In “Robot Sensors,” you train a robot to generate random numbers, apply modulo, and send results to a nest via a bird.
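The record-and-replay loop in the steps above can be sketched in a few lines. This is a toy model in Python, not the tt-rs implementation (which is Rust); the `Robot` class and its methods are hypothetical, and it replays actions literally without any pattern matching.

```python
class Robot:
    """Toy programming-by-demonstration robot: records actions, then replays them."""

    def __init__(self):
        self.training = False
        self.actions = []

    def start_training(self):
        self.training, self.actions = True, []

    def perform(self, action, box):
        """Apply an action to the box; remember it if the robot is watching."""
        if self.training:
            self.actions.append(action)
        return action(box)

    def stop_training(self):
        self.training = False

    def run(self, box):
        """Replay the recorded actions, in order, on a new box."""
        for action in self.actions:
            box = action(box)
        return box


robot = Robot()
robot.start_training()
box = robot.perform(lambda b: b + [1], [])         # put a 1 in the box
box = robot.perform(lambda b: [sum(b) * 2], box)   # replace contents with double the sum
robot.stop_training()
replayed = robot.run([5])                          # same two steps on a different box
```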

Interactive Tutorials

Each tutorial has two parts:

Show Me: Watch an animated demonstration where a cursor walks through the solution
Practice: Try it yourself with the same widgets

The tutorials cover:

Fill a box with numbers
Add numbers together
Copy widgets with the wand
Send messages with birds and nests
Train your first robot
Combine robots with sensors

What’s Next

The immediate priorities:

Pattern matching - Robot generalizes from specific values to “any number”
Watched execution - See robot work step-by-step with animated cursor
Persistence - Save and load workspaces

Long term, I’d like to add the 3D elements from the original—the cities, the houses, the helicopter view. But that’s a much larger project.

The Enduring Appeal

What makes ToonTalk fascinating isn’t just the visual metaphors—it’s the computational model. Under the hood, ToonTalk implements concurrent constraint logic programming. The robots are essentially guarded Horn clauses. The birds and nests implement the actor model.

Heavy concepts, but you don’t need to know any of that to use it. You just train robots by example. The abstraction is complete.

That’s why it stuck with me for twenty years. Good abstractions are rare. When you find one, it’s worth understanding deeply.

References

Resource	Link
ToonTalk Website	toontalk.com
ToonTalk on Wikipedia	Wikipedia
ToonTalk Reborn (JS)	github.com/ToonTalk/ToonTalk
ToonTalk Reborn Demo	toontalk.github.io/ToonTalk
ToonTalk Reborn Wiki	Wiki
Ken Kahn’s Page	Ken Kahn
Original Paper (1995)	ERIC - ToonTalk: An Animated Programming Environment
Ken Kahn’s Research	Academia.edu

Some ideas are worth rediscovering. ToonTalk is one of them.

Watch the Video

Unmute to hear narration.

February 19, 2026 • Software Wrighter

468 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Train/Val/Test Split (separate data roles), Overconfidence (high probability wrong predictions), Batch Normalization (stable training), Optimization vs Generalization (low train loss doesn't mean good test), A/B Testing (compare with experiments).

Five ML Concepts - #16

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #16

References

Concept	Reference
Train/Val/Test Split	Deep Learning (Goodfellow et al. 2016), Chapter 5
Overconfidence	On Calibration of Modern Neural Networks (Guo et al. 2017)
Batch Normalization	Batch Normalization: Accelerating Deep Network Training (Ioffe & Szegedy 2015)
Optimization vs Generalization	Understanding Deep Learning Requires Rethinking Generalization (Zhang et al. 2017)
A/B Testing	Controlled Experiments on the Web (Kohavi et al. 2009)

Today’s Five

1. Train / Validation / Test Split

Data is divided into training, validation, and test sets. Training learns patterns, validation tunes hyperparameters, test evaluates final performance.

Never use test data for any decisions during development—it should only be touched once.

Like practicing on homework, checking with practice tests, then taking the real exam.
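A seeded shuffle followed by two cuts is enough to show the idea. This is a minimal sketch with illustrative fractions; `split_dataset` is a hypothetical helper, and grouped or time-ordered data would need a split that respects those boundaries.

```python
import random


def split_dataset(items, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
```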

2. Overconfidence

Models can assign very high probabilities to incorrect predictions. This is often related to poor calibration and can be dangerous in high-stakes applications.

Temperature scaling and other calibration methods can help align confidence with accuracy.

Like a student who is absolutely certain of a wrong answer.
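Temperature scaling divides the logits by a constant before the softmax. The sketch below uses made-up logits and a hand-picked temperature of 2.0; in the calibration method itself, the temperature is fit on a validation set.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 flattens, T < 1 sharpens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]
confident = softmax(logits)        # top probability is about 0.94
softened = softmax(logits, 2.0)    # same ranking, lower top probability
```

The predicted class never changes, only the confidence attached to it, so accuracy is unaffected.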

3. Batch Normalization

Normalizes layer activations during training to improve stability and convergence. Each mini-batch’s activations are normalized to have zero mean and unit variance.

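The original paper attributes the benefit to reduced internal covariate shift, an explanation that later work has questioned. In practice, batch normalization often allows higher learning rates.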

Like keeping everyone on a similar pace during training so no one runs too far ahead.
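For a single feature, the training-time computation looks like this. A minimal sketch: `gamma` and `beta` stand for the learned scale and shift, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then apply scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)  # biased variance, as in training
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([2.0, 4.0, 6.0, 8.0])
```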

4. Optimization vs Generalization

Training loss can decrease while test performance does not improve. Good optimization does not guarantee good generalization.

A model can perfectly fit training data while failing on new examples—this is overfitting.

Like memorizing last year’s exam instead of understanding the subject.

5. A/B Testing Models

Comparing two model versions using controlled live traffic experiments. Users are randomly assigned to see predictions from model A or model B.

Statistical analysis determines which model performs better on real-world metrics.

Like taste-testing two recipes with real customers to see which works better.
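For a binary metric such as click-through, the statistical analysis is often a two-proportion z-test. The sketch below assumes independent users and large samples; the counts are made up for illustration.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


# Made-up counts: 10.0% vs 12.0% click-through on 10,000 users each.
z, p_value = two_proportion_z_test(success_a=1000, n_a=10000, success_b=1200, n_b=10000)
```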

Quick Reference

Concept	One-liner
Train/Val/Test	Separate data for learning, tuning, and evaluation
Overconfidence	High probability on wrong predictions
Batch Normalization	Normalize activations for stable training
Optimization vs Generalization	Low train loss ≠ good test performance
A/B Testing	Compare models with live experiments

Short, accurate ML explainers. Follow for more.

Watch the Video

Unmute to hear narration.

February 18, 2026 • Software Wrighter

796 words • 4 min read • Abstract

RSFT on easy examples made performance worse: 27% vs the 37% SFT baseline. Training distribution must match evaluation distribution. Easy examples teach shortcuts that fail on hard problems. The fix is one flag change.

Multi-Hop Reasoning (2/2): The Distribution Trap

In Part 1, a tiny 135M model achieved 75% accuracy on multi-hop reasoning. This time we scale up to 360M—and discover that RSFT on easy examples makes performance worse.

Resource	Link
Paper	KG-Guided RAG (arXiv)
Code	multi-hop-reasoning
ELI5	eli5.md
Demo	Live Demo
Explainer	Coming soon

Scaling Up: SmolLM-360M

Part 1 used the 135M model. For better reasoning traces and demo quality, we trained the 360M variant:

Model	Parameters	Platform
SmolLM-135M-Instruct	135M	MLX (macOS)
SmolLM-360M-Instruct	360M	MLX + Unsloth (cross-platform)

The 360M model produces more coherent traces and is used by the live inference demo.

The Distribution Trap

Here’s what happened when we trained RSFT on the “easy” training data:

Phase	Training Data	Accuracy	Notes
Base	—	0%	No format compliance
SFT (500 iters)	Easy (1-3 hop)	37%	Learns TRACE + ANSWER format
RSFT	Easy (1-3 hop)	27%	Worse than SFT!

RSFT on easy examples performed worse than the SFT baseline.

Why?

The training examples (1-3 hops) don’t match the evaluation distribution (4-5 hops). The model learns shortcuts that work on easy problems but fail on hard ones.

Training Distribution	Eval Distribution	Result
Easy (1-3 hop)	Hard (4-5 hop)	27% (worse)
Hard (4-5 hop)	Hard (4-5 hop)	75% (Part 1 result)

The rejection sampling “winners” from easy examples teach strategies that don’t generalize.
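A mismatch like this can be caught before training by comparing hop-count histograms. This is a sketch under assumptions: each example is represented only by its hop count, the lists are hypothetical stand-ins for the real data files, and `distribution_overlap` is an illustrative helper, not part of the repo.

```python
from collections import Counter


def hop_histogram(hop_counts):
    """Fraction of examples at each hop count."""
    counts = Counter(hop_counts)
    total = sum(counts.values())
    return {hops: n / total for hops, n in sorted(counts.items())}


def distribution_overlap(train_hops, eval_hops):
    """Shared probability mass between two hop distributions (1.0 identical, 0.0 disjoint)."""
    p, q = hop_histogram(train_hops), hop_histogram(eval_hops)
    return sum(min(p.get(h, 0.0), q.get(h, 0.0)) for h in set(p) | set(q))


# Hypothetical hop counts mirroring the ranges in the table above.
easy_train = [1, 2, 2, 3, 3, 3]
hard_eval = [4, 4, 5, 5, 5]
overlap = distribution_overlap(easy_train, hard_eval)  # no shared hop counts at all
```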

The Key Finding

Rejection sampling must match your target distribution.

This is counterintuitive. You might expect that training on more examples (even easy ones) would help. Instead:

Easy winners use shortcuts (fewer reasoning steps)
Hard eval requires full chain reasoning
Model learns the wrong patterns

The fix: train RSFT on eval.jsonl (hard examples), not train.jsonl (easy examples).

Demo Improvements

The demo now includes four interactive tabs:

Tab	Feature
Training	Animated SFT→RSFT visualization with KG scoring
Inference	Pre-recorded inference examples
Try It	Live inference with 360M model
Distribution	Interactive visualization of the key finding

Try It: Live Inference

Ask DevOps troubleshooting questions and watch the model reason:

Question: What causes TLSHandshakeError?

TRACE: TLSHandshakeError is caused by ClockSkew,
and ClockSkew leads to CertificateExpired,
and CertificateExpired is fixed by RenewCert...
ANSWER: B

The knowledge graph scores the reasoning path during training, but at inference the model reasons independently.

Cross-Platform Support

The pipeline now runs on both platforms:

Platform	Framework	Command
macOS (Apple Silicon)	MLX	`make train-360m`
Linux (NVIDIA CUDA)	Unsloth	`make train-360m-unsloth`

Unsloth advertises roughly 2x faster training with about 60% less memory on NVIDIA GPUs.

Current Status

Component	Status
SFT training (360M)	Complete
RSFT (wrong distribution)	Complete (27%)
RSFT (correct distribution)	Next step
Live demo with Try It	Complete
Cross-platform support	Complete

Next Steps

Priority	Task	Expected Result
High	Retrain RSFT on eval.jsonl	75%+ accuracy
Medium	Update demo to use corrected model	Better live inference
Medium	Curriculum learning (easy→hard)	Smoother training
Low	Larger models (1B+)	Higher ceiling

The corrected RSFT training:

# --examples now points at the hard examples
python3 -m core.rsft \
  --examples data/eval.jsonl \
  --kg data/kg.json \
  --sft-adapter data/runs/run_360m/models/sft \
  --output data/runs/run_360m/models/rsft_eval \
  --model HuggingFaceTB/SmolLM-360M-Instruct \
  --k-samples 8 \
  --max-examples 50

Lessons Learned

1. Distribution Matching is Non-Negotiable

This isn’t a minor optimization—it’s the difference between 27% and 75% accuracy. Wrong distribution = wrong winners = wrong model.

2. Easy Examples Can Hurt

More training data isn’t always better. Easy examples teach shortcuts that fail on hard problems.

3. Verify Your Pipeline

We trained a full RSFT model before realizing the distribution mismatch. Always check that training data matches eval distribution.

4. The Fix is Simple

Once identified, the fix is one flag change: --examples data/eval.jsonl instead of train.jsonl.

Resources

Training distribution matters. Easy examples teach easy shortcuts.

February 18, 2026 • Software Wrighter

775 words • 4 min read • Abstract

Part 2 of implementing the Share algorithm: after fixing critical bugs (zero-gradient saddle point, half-parameter training), routing-based coefficient selection achieves zero regressions. Result handling improved 40% to 50%. We're 60% through verifying the paper's claims.

Towards Continuous LLM Learning (2): Routing Prevents Forgetting

In Part 1, naive LoRA fine-tuning caused catastrophic forgetting. Now we’re implementing the Share algorithm properly—and we’re about 60% of the way to verifying the paper’s claims.

Resource	Link
Code	sleepy-coder
Part 1	When Fine-Tuning Fails
ELI5	eli5.md
Share Paper	arXiv:2602.06043

Paper Claims vs Implementation Status

We’re systematically verifying the claims from the Share and UWSH papers:

Paper Claim	Infrastructure	Demonstrated?
Shared basis via SVD	Complete	Yes
~100x parameter reduction	Complete (76x)	Yes
Task routing beats averaging	Tested (Exp 1b)	Partial
Prevents catastrophic forgetting	Tested (Exp 1b)	Partial
Sequential learning	Not tested	No
UWSH subspace stability	Not tested	No

Overall: ~60% complete. Infrastructure is solid. Routing tested. Sequential learning remains.

What We Built

The full Share algorithm implementation:

Phase 1: SVD-based subspace extraction from 51 LoRA adapters (60% variance threshold)
Phase 2: Coefficient-only training with frozen basis (83K params vs 1.6M full LoRA)
Phase 3: Basis merging and updates
Routing: Error pattern classification for coefficient selection

Bug Fixes That Unlocked Progress

Two critical bugs blocked proper Phase 2 training:

Bug 1: Zero-Gradient Saddle Point

Both coefficient matrices initialized to zero:

eps_beta = 0, eps_alpha = 0
→ delta_W = 0 @ 0 = 0
→ zero gradients, no learning

Fix: Dual small-random initialization.

Bug 2: Half-Parameter Training

LoRA-style initialization only trained one coefficient set:

Before: 112/224 parameters getting gradients
After:  224/224 parameters getting gradients

Fix: Both coefficient matrices need random initialization.
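Both bugs follow from the chain rule for a product of two matrices. The sketch below assumes `delta_W = eps_beta @ eps_alpha` with tiny made-up shapes (2x2 and 2x3) and an arbitrary upstream gradient; it is not the project's training code, and it only looks at the gradients of the very first step.

```python
import random


def matmul(x, y):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def coefficient_grads(eps_beta, eps_alpha, upstream):
    """Gradients of the loss w.r.t. both factors of delta_W = eps_beta @ eps_alpha."""
    grad_beta = matmul(upstream, transpose(eps_alpha))   # depends only on eps_alpha
    grad_alpha = matmul(transpose(eps_beta), upstream)   # depends only on eps_beta
    return grad_beta, grad_alpha


def nonzero_count(matrix):
    return sum(1 for row in matrix for value in row if value != 0.0)


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def small_random(rows, cols, rng, scale=0.01):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
upstream = [[1.0, -2.0, 0.5], [0.3, 0.7, -1.1]]  # arbitrary nonzero dLoss/d(delta_W), 2x3

both_zero = coefficient_grads(zeros(2, 2), zeros(2, 3), upstream)            # Bug 1
one_zero = coefficient_grads(zeros(2, 2), small_random(2, 3, rng), upstream)  # Bug 2
both_random = coefficient_grads(small_random(2, 2, rng),
                                small_random(2, 3, rng), upstream)           # the fix
```

Each factor's gradient is scaled by the other factor, so a zero factor blocks its partner's gradient: two zeros block everything, one zero blocks half, and two small random factors let every coefficient move.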

Experiment 1b: Routing Works

With gradient-trained v4 coefficients and proper routing:

Strategy	Pass Rate	BC	RH	TB	Regressions
Baseline (no LoRA)	46.7%	70%	40%	30%	–
Averaged	50.0%	70%	40%	40%	1
Routed	50.0%	70%	50%	30%	0

Result handling improved 40% → 50%. Zero regressions. This is the first positive transfer from Share coefficients.

The Forgetting Heatmap

We applied each coefficient individually to all 30 koans:

Koan         BL  mut_bc  dbl_mt  ret_lr  mis_cl  mis_hs  mis_or  opt_ok  res_me  ROUTED
bc_001-009   P   P       P       P       P       P       P       P       P       P
bc_003,5,10  .   .       .       .       .       .       .       .       .       .
rh_002       .   .       +GAIN   .       .       +GAIN   +GAIN   +GAIN   +GAIN   +GAIN
rh_008       P   -LOST   -LOST   -LOST   -LOST   -LOST   -LOST   -LOST   -LOST   P
tb_005       P   P       P       P       P       -LOST   P       P       P       P

Key finding: rh_008 regresses under every coefficient applied globally. But routing saves it by falling back to the base model when no pattern matches.

This is exactly what the Share paper predicts: task-specific coefficients improve targeted patterns without interfering with unrelated ones.
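The fallback behavior that saves rh_008 can be sketched as a small router. Everything here is hypothetical: the regex patterns are guesses at what an error-pattern classifier might match, and the coefficient names are borrowed from the heatmap column headers without knowing exactly what each one covers.

```python
import re

# Hypothetical routing table: regex over the compiler error text -> coefficient-set name.
ROUTES = [
    (re.compile(r"cannot borrow .* as mutable"), "mut_bc"),
    (re.compile(r"missing lifetime specifier"), "ret_lr"),
]


def route(error_text: str):
    """Return the matching coefficient-set name, or None to fall back to the base model."""
    for pattern, coefficient_name in ROUTES:
        if pattern.search(error_text):
            return coefficient_name
    return None


choice = route("error[E0596]: cannot borrow `x` as mutable, as it is not declared as mutable")
fallback = route("error[E0308]: mismatched types")  # no pattern matches -> base model
```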

What the Papers Claim vs What We’ve Verified

Verified

Shared basis via SVD — We extract principal components from 51 adapters. Works.
76x parameter reduction — 83K coefficient parameters vs 1.6M full LoRA. Verified.
Routing prevents forgetting — Zero regressions with routed inference. The fragile rh_008 koan survives because it falls back to base model.
Positive transfer possible — Result handling improved 40% → 50% with routed coefficients.

Not Yet Verified

Sequential learning — The core continual learning claim. Train task 1 → eval → train task 2 → eval (verify task 1 still passes). This is next.
UWSH subspace stability — Do different adapter subsets converge to similar subspaces? Grassmann distance measurement needed.

Next Experiments

Priority	Experiment	Target
High	Sequential learning curve	No degradation on prior tasks
High	Fix k_alpha=32 (paper recommends)	Match paper exactly
Medium	UWSH verification	>70% subspace overlap
Medium	Add rank update vectors	Full algorithm

The Architecture

Day:   Agent attempts to fix Rust errors
       ↓
       Successes and failures logged
       ↓
Night: Train coefficients (frozen basis)
       ↓
       83K params per task
       ↓
Eval:  Route to appropriate coefficients
       ↓
       Pattern-matched inference
       ↓
(repeat)

The key insight: train small, route smart. The shared basis captures common structure. Per-task coefficients specialize without interference.

Resources

60% of the way to verifying the papers. Sequential learning is next.

February 18, 2026 • Software Wrighter

470 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Perplexity (how surprised by data), Catastrophic Forgetting (new learning erases old), Weight Initialization (starting values matter), Curse of Dimensionality (high-D makes data sparse), Monitoring (track performance and drift).

Five ML Concepts - #15

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #15

References

Concept	Reference
Perplexity	A Neural Probabilistic Language Model (Bengio et al. 2003)
Catastrophic Forgetting	Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017)
Weight Initialization	Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010)
Curse of Dimensionality	The Elements of Statistical Learning (Hastie et al. 2009), Chapter 2
Monitoring & Drift	Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift (Rabanser et al. 2019)

Today’s Five

1. Perplexity

A metric for language models that reflects how well the model predicts the next token. Lower perplexity means better predictive performance.

Perplexity is the exponentiated average negative log-likelihood per token.

Like a test where lower scores mean you found the answers easier to guess.
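The definition translates directly into code. This sketch takes the probabilities a model assigned to the tokens that actually occurred; the numbers are made up.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-likelihood assigned to each observed token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])  # about 4: like a 4-way guess per token
confident = perplexity([0.9, 0.8, 0.95])              # close to 1: rarely surprised
```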

2. Catastrophic Forgetting

When training on new tasks causes a model to lose performance on previously learned tasks. This is a key challenge in continual learning.

Techniques like elastic weight consolidation help preserve important weights.

Like learning a new phone number and forgetting the old one.
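Elastic weight consolidation adds a quadratic penalty that anchors each weight to its old value in proportion to how important it was for the old task. A minimal sketch of that penalty, with made-up importance values standing in for the Fisher information estimates:

```python
def ewc_penalty(weights, old_weights, fisher, lam):
    """(lam / 2) * sum_i fisher_i * (w_i - w_old_i)^2, added to the new task's loss."""
    return 0.5 * lam * sum(f * (w - w0) ** 2
                           for w, w0, f in zip(weights, old_weights, fisher))


old = [1.0, -2.0]
fisher = [10.0, 0.1]  # hypothetical importances: the first weight matters far more
move_important = ewc_penalty([2.0, -2.0], old, fisher, lam=1.0)
move_unimportant = ewc_penalty([1.0, -1.0], old, fisher, lam=1.0)
```

Moving the important weight by one unit costs 100 times more than moving the unimportant one, so the new task is steered toward the weights the old task does not depend on.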

3. Weight Initialization

The starting values of model weights influence how well training progresses. Poor initialization can cause vanishing or exploding gradients.

Xavier and He initialization are common strategies for setting initial weights appropriately.

Like starting a race from a good position instead of stuck in a ditch.
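Both schemes pick the standard deviation of the initial weights from the layer's fan-in and fan-out. The sketch below shows the normal-distribution variants; the layer sizes are arbitrary.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Glorot/Xavier normal: std = sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He normal: std = sqrt(2 / fan_in), commonly paired with ReLU activations."""
    return math.sqrt(2.0 / fan_in)


def init_layer(fan_in, fan_out, std, seed=0):
    """Draw a fan_out x fan_in weight matrix from a zero-mean normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


layer = init_layer(256, 128, he_std(256))
```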

4. Curse of Dimensionality

In high-dimensional spaces, data becomes sparse and distances behave differently, making learning harder. Points that seem close in low dimensions can be far apart in high dimensions.

Feature selection and dimensionality reduction help mitigate this effect.

Like searching for a friend in a city versus across the entire universe.
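One way to see distances "behaving differently" is to measure how much farther the farthest neighbor is than the nearest one. This is a small simulation with uniform random points; the point count and dimensions are arbitrary choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(2)      # nearest neighbor is much closer than the farthest
high = relative_contrast(1000)  # all points end up at nearly the same distance
```

When every point is roughly equally far away, "nearest neighbor" carries little information, which is one reason distance-based methods struggle in high dimensions.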

5. Monitoring & Drift Detection

Continuously tracking model performance and detecting shifts in input data distributions. Production models can degrade silently without proper monitoring.

Automated alerts help catch problems before they affect users.

Like a weather station alerting you when conditions change.
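A simple drift monitor compares a live sample of one feature against a reference sample from training time. The sketch below uses the two-sample Kolmogorov-Smirnov statistic; the 0.3 alert threshold is an arbitrary illustrative value, not a recommended setting.

```python
def ks_statistic(reference, live):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = fully separated)."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for value in ref + cur:
        cdf_ref = sum(1 for x in ref if x <= value) / len(ref)
        cdf_cur = sum(1 for x in cur if x <= value) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_alert(reference, live, threshold=0.3):
    """Flag drift when the KS gap exceeds the threshold."""
    return ks_statistic(reference, live) > threshold


reference = [0.1 * i for i in range(100)]       # training-time feature values
shifted = [0.1 * i + 5.0 for i in range(100)]   # same shape, shifted by +5
alert = drift_alert(reference, shifted)
```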

Quick Reference

Concept	One-liner
Perplexity	How surprised the model is by the data
Catastrophic Forgetting	New learning erases old knowledge
Weight Initialization	Starting values affect training stability
Curse of Dimensionality	High dimensions make data sparse
Monitoring & Drift	Track performance and data changes

Short, accurate ML explainers. Follow for more.

Watch the Video

Unmute to hear narration.

February 17, 2026 • Software Wrighter

448 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: ROC/AUC (performance across thresholds), Spurious Correlations (coincidental patterns), Gradient Clipping (limit gradients for stability), Loss Landscapes (error surface over parameters), Cold Start (no history for new users).

Five ML Concepts - #14

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #14

References

Concept	Reference
ROC/AUC	An Introduction to ROC Analysis (Fawcett 2006)
Spurious Correlations	Unbiased Look at Dataset Bias (Torralba & Efros 2011)
Gradient Clipping	On the Difficulty of Training Recurrent Neural Networks (Pascanu et al. 2013)
Loss Landscapes	Visualizing the Loss Landscape of Neural Nets (Li et al. 2018)
Cold Start	Methods and Metrics for Cold-Start Recommendations (Schein et al. 2002)

Today’s Five

1. ROC / AUC

ROC curves plot true positive rate against false positive rate across all classification thresholds. AUC (Area Under the Curve) summarizes overall ranking performance in a single number.

AUC of 0.5 means random guessing; 1.0 means perfect ranking.

Like judging a student by considering every possible passing grade cutoff.
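AUC has an equivalent pairwise reading: the probability that a randomly chosen positive is scored higher than a randomly chosen negative. The sketch below computes it that way on a tiny made-up example; it is quadratic in the number of examples, so it is for illustration rather than large datasets.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auc(labels, scores)  # 3 of the 4 positive/negative pairs are ranked correctly
```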

2. Spurious Correlations

Coincidental patterns in training data that don’t reflect true relationships. Models that rely on them can fail when the coincidence disappears.

Dataset curation and diverse evaluation help detect spurious features.

Like assuming umbrellas cause rain because you always see them together.

3. Gradient Clipping

Limiting the size of gradients during backpropagation. This helps prevent exploding gradients and unstable training, especially in recurrent networks.

Clipping can be by value or by global norm.

Like putting a speed limit on a car so it doesn’t lose control.
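The two variants differ in whether the gradient keeps its direction. A minimal sketch on a flat list of gradient values:

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def clip_by_value(grads, limit):
    """Clamp each gradient independently to [-limit, limit]; this can change the direction."""
    return [max(-limit, min(limit, g)) for g in grads]


clipped_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 scaled down to norm 1
clipped_value = clip_by_value([3.0, -4.0], limit=1.0)         # each entry clamped separately
```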

4. Loss Landscapes

How model error changes across different parameter settings. Training is like navigating this surface toward regions of lower loss.

Flat minima may generalize better than sharp ones.

Like hiking through mountains searching for the lowest valley, feeling the slope beneath your feet.

5. Cold Start Problems

Difficulty predicting for new users or items with no history. Without prior data, personalization becomes difficult.

Solutions include content-based features, popularity fallbacks, or asking initial questions.

Like a librarian trying to recommend books to someone who just walked in.
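The popularity fallback is the easiest of these to sketch. The recommender below is a toy: the interaction history is made up, and the "personalized" branch is a deliberately crude co-occurrence count.

```python
from collections import Counter


def recommend(user_id, history, top_n=2):
    """Personalize when the user has history; otherwise fall back to globally popular items."""
    popularity = Counter(item for items in history.values() for item in items)
    seen = set(history.get(user_id, []))
    if not seen:
        return [item for item, _ in popularity.most_common(top_n)]  # cold start
    # Toy personalization: items liked by users who share at least one item with this user.
    neighbors = Counter(item for items in history.values()
                        if seen & set(items) for item in items if item not in seen)
    return [item for item, _ in neighbors.most_common(top_n)]


history = {"ann": ["a", "b"], "bob": ["a", "c"], "cy": ["a", "b", "d"]}
for_new_user = recommend("dana", history)  # no history, so the most popular items
for_ann = recommend("ann", history)        # unseen items from overlapping users
```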

Quick Reference

Concept	One-liner
ROC / AUC	Classifier performance across thresholds
Spurious Correlations	Coincidental patterns that don’t generalize
Gradient Clipping	Limit gradient size for stability
Loss Landscapes	Error surface over parameter space
Cold Start	No history for new users/items

Short, accurate ML explainers. Follow for more.

Watch the Video

Unmute to hear narration.

February 16, 2026 • Software Wrighter

448 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Calibration (predicted probabilities match outcomes), Shortcut Learning (exploiting spurious patterns), Early Stopping (halt when validation plateaus), Universal Approximation (NNs can fit any function), Checkpointing (save model state).

Five ML Concepts - #13

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #13

References

Concept	Reference
Calibration	On Calibration of Modern Neural Networks (Guo et al. 2017)
Shortcut Learning	Shortcut Learning in Deep Neural Networks (Geirhos et al. 2020)
Early Stopping	Early Stopping - But When? (Prechelt 1998)
Universal Approximation	Approximation by Superpositions of a Sigmoidal Function (Cybenko 1989)
Checkpointing	Training Deep Nets with Sublinear Memory Cost (Chen et al. 2016)

Today’s Five

1. Calibration

How well a model’s predicted probabilities match real-world outcomes. If a model predicts 70% confidence many times, it should be correct about 70% of those cases.

Well-calibrated models enable better decision-making under uncertainty.

Like a weather forecast that says 30% chance of rain on many days, and it rains on about 30% of those days.
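Calibration is commonly summarized with expected calibration error (ECE): bin predictions by confidence, then average the gap between confidence and accuracy in each bin. A minimal sketch with made-up predictions and ten equal-width bins:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy within each confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            ece += len(members) / total * abs(avg_conf - accuracy)
    return ece


# Ten predictions at 70% confidence, seven of them right: well calibrated.
calibrated = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)
# Ten predictions at 90% confidence, only five right: overconfident.
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```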

2. Shortcut Learning

When models rely on superficial patterns instead of meaningful features. For example, identifying cows by detecting grass and failing when cows appear indoors.

Shortcuts can inflate benchmark scores while masking poor real-world performance.

Like passing a test by memorizing answer positions instead of learning the material.

3. Early Stopping

Training is stopped when validation performance stops improving. This helps prevent overfitting by halting before the model memorizes training data.

Patience hyperparameters control how long to wait before stopping.

Like knowing when to stop practicing before you start reinforcing mistakes.
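The patience rule fits in one small function. This sketch works on a pre-recorded list of validation losses; in a real training loop the same check runs once per epoch, and the loss values here are made up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch): stop after `patience` epochs without a new best loss."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation curve: improves for three epochs, then starts to overfit.
stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8], patience=2)
```

Training stops at epoch 4, but the weights worth keeping are the ones from epoch 2, which is why early stopping is usually paired with saving the best model so far.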

4. Universal Approximation

The theorem stating that neural networks can approximate any continuous function on a bounded input domain to any desired accuracy, given enough hidden units. In practice, finding the right weights through optimization is the challenge.

The theorem guarantees existence, not learnability.

Like having enough Lego blocks to build almost any shape—assembly is still hard.

5. Checkpointing

Saving the model’s state during training. This allows recovery from interruptions and comparison across training stages.

Checkpoints also enable selecting the best model rather than just the final one.

Like saving your game progress so you can reload if something goes wrong.
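A checkpoint is just the training state written to disk in a way that survives a crash. The sketch below stores a toy state as JSON and uses a write-then-rename step; deep learning frameworks provide their own save utilities, and a real checkpoint would usually include optimizer state as well.

```python
import json
import os
import tempfile


def save_checkpoint(path, epoch, weights, val_loss):
    """Write to a temp file first, then rename, so a crash never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump({"epoch": epoch, "weights": weights, "val_loss": val_loss}, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Read a checkpoint back as a dictionary."""
    with open(path) as handle:
        return json.load(handle)


checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "best.json")
save_checkpoint(checkpoint_path, epoch=3, weights=[0.1, -0.2], val_loss=0.42)
restored = load_checkpoint(checkpoint_path)
```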

Quick Reference

Concept	One-liner
Calibration	Predicted probabilities match outcomes
Shortcut Learning	Exploiting spurious patterns
Early Stopping	Stop when validation plateaus
Universal Approximation	NNs can approximate any continuous function
Checkpointing	Save model state during training

Short, accurate ML explainers. Follow for more.

February 15, 2026 • Software Wrighter

488 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Precision vs Recall (correct positives vs finding all), OOD Inputs (data unlike training), Batch Size (examples per update), Inductive Bias (built-in assumptions), Latency vs Throughput (speed vs capacity).

Five ML Concepts - #12

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #12

References

Concept	Reference
Precision/Recall	The Truth of the F-Measure (Sasaki 2007)
OOD Detection	A Baseline for Detecting Misclassified and Out-of-Distribution Examples (Hendrycks & Gimpel 2017)
Batch Size	On Large-Batch Training for Deep Learning (Keskar et al. 2017)
Inductive Bias	Relational Inductive Biases, Deep Learning, and Graph Networks (Battaglia et al. 2018)
Latency/Throughput	Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al. 2021)

Today’s Five

1. Precision vs Recall

Precision measures how often positive predictions are correct. Recall measures how many actual positives are successfully found. Improving one often reduces the other.

The tradeoff depends on your application: spam filters favor precision, medical screening favors recall.

Like a search party: you can find everyone but raise false alarms, or be very certain and miss some people.
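
Both metrics fall out of three counts: true positives, false positives, and false negatives. A minimal sketch, assuming binary predictions; the function name and the `Option` return for empty denominators are choices made for this example.

```rust
/// Precision and recall from binary predictions and ground-truth labels.
/// A metric is `None` when its denominator is zero.
pub fn precision_recall(predicted: &[bool], actual: &[bool]) -> (Option<f64>, Option<f64>) {
    assert_eq!(predicted.len(), actual.len());
    let (mut tp, mut fp, mut fn_) = (0u32, 0u32, 0u32);
    for (&p, &a) in predicted.iter().zip(actual) {
        match (p, a) {
            (true, true) => tp += 1,   // found a real positive
            (true, false) => fp += 1,  // false alarm
            (false, true) => fn_ += 1, // missed a real positive
            (false, false) => {}
        }
    }
    let ratio = |num: u32, den: u32| if den == 0 { None } else { Some(num as f64 / den as f64) };
    // precision = TP / (TP + FP), recall = TP / (TP + FN)
    (ratio(tp, tp + fp), ratio(tp, tp + fn_))
}
```

Raising the decision threshold typically moves predictions from the `true` column to `false`, which removes false alarms (better precision) but adds misses (worse recall).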

2. OOD Inputs (Out-of-Distribution)

Data that differs significantly from what the model saw during training. Models may fail silently or produce confident but wrong answers.

Detecting OOD inputs is an active research area for safer AI deployment.

Like asking a chef trained only in Italian food to make sushi.

3. Batch Size

The number of training examples processed before updating model weights. Larger batches can be more efficient computationally, but may generalize worse.

Finding the right batch size involves balancing speed, memory, and model quality.

Like grading tests one at a time or waiting to grade a full stack.

4. Inductive Bias

The assumptions built into a model that guide how it learns from data. Without inductive bias, models cannot generalize beyond training examples.

CNNs assume spatial locality; transformers assume tokens can attend to any position.

Like expecting nearby houses to have similar prices before looking at the data.

5. Latency vs Throughput

Latency is how long a single request takes. Throughput is how many requests can be handled per second. Optimizing one often comes at the expense of the other.

Batching improves throughput but increases latency for individual requests.

Like a restaurant serving one table quickly or many tables at once.
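
The batching trade-off can be seen with a toy cost model. The sketch assumes each batch costs a fixed overhead plus a per-item cost; both the model and the numbers in the usage note are made up for illustration and do not come from a real serving system.

```rust
/// Toy cost model: a batch takes `fixed_ms + per_item_ms * batch_size` milliseconds.
/// Returns (latency in ms for every request in the batch, throughput in requests/second).
pub fn batch_stats(fixed_ms: f64, per_item_ms: f64, batch_size: u32) -> (f64, f64) {
    assert!(batch_size > 0);
    let latency_ms = fixed_ms + per_item_ms * batch_size as f64;
    let throughput_per_s = batch_size as f64 / (latency_ms / 1000.0);
    (latency_ms, throughput_per_s)
}
```

With a 10 ms overhead and 1 ms per item, a batch of 1 finishes in 11 ms at roughly 91 requests per second, while a batch of 32 takes 42 ms but reaches roughly 762 requests per second: each request waits longer, yet the system as a whole serves more.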

Quick Reference

Concept	One-liner
Precision vs Recall	Correct positives vs finding all positives
OOD Inputs	Data unlike training distribution
Batch Size	Examples per weight update
Inductive Bias	Built-in learning assumptions
Latency vs Throughput	Speed per request vs total capacity

Short, accurate ML explainers. Follow for more.

February 15, 2026 • Software Wrighter

1048 words • 6 min read • Abstract

Personal Software for education: a neural network platform where every step is visible, with no framework magic. CLI with progress bars, web UI with real-time loss charts, WASM for browser execution. Built via Vibe Coding to watch XOR training reveal why hidden layers matter.

Neural-Net-RS: An Educational Neural Network Platform

I wanted a neural network implementation where every step is visible—no framework magic hiding the math. Something I could use to teach the fundamentals, with a CLI for quick experiments and a web UI for visual demonstrations. Claude Code built it.

This is Personal Software for education: a complete neural network training platform with multiple interfaces, all from a single Rust codebase.

Resource	Link
Repo	neural-net-rs
Video	Neural-Net-RS Explainer

Why Build Your Own Neural Network?

Frameworks like PyTorch and TensorFlow are production-ready, but they hide the fundamentals. When teaching or learning, you want to see:

How weights and biases actually change during training
Why XOR needs a hidden layer when AND doesn’t
What backpropagation really computes

Neural-Net-RS exposes all of this. No autograd magic—every gradient is computed explicitly. No tensor abstractions—just matrices with clear row-major storage.

What Got Built

A modular Rust workspace with multiple interfaces to the same core:

neural-net-rs/
├── matrix/              # Linear algebra foundation
├── neural-network/      # Core ML implementation
├── neural-net-cli/      # Command-line interface
├── neural-net-server/   # REST API with SSE streaming
└── neural-net-wasm/     # WebAssembly for browser

One codebase, three ways to interact:

CLI: Train from terminal with progress bars
Web UI: Visual training with real-time loss charts
WASM: Run entirely in browser, no server needed

The Classic Problems

The platform includes 8 built-in examples that teach ML concepts progressively:

Problem	Architecture	Key Concept
AND, OR	2→2→1	Linear separability
XOR	2→3→1	Why hidden layers matter
Parity3	3→6→1	Scaling non-linearity
Quadrant	2→8→4	Multi-class classification
Adder2	4→8→3	Learning arithmetic
Iris	4→8→3	Real-world dataset
Pattern3x3	9→6→4	Visual pattern recognition

The XOR Problem

XOR is the canonical neural network problem. AND and OR are linearly separable—a single line can divide the outputs. XOR isn’t. You need a hidden layer.

AND: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1  ← One line separates
XOR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0  ← No line works

Watch XOR training and you see why neural networks are powerful: they learn to create intermediate representations that make non-linear problems separable.
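
To make the "intermediate representations" idea concrete, here is a hand-wired sketch rather than a trained network. It assumes step activations and hand-picked weights, which is not how Neural-Net-RS works (it trains a 2→3→1 sigmoid network); the point is only that two hidden neurons are enough to make XOR separable.

```rust
/// Step activation: fires when the weighted sum is positive.
fn step(x: f64) -> f64 {
    if x > 0.0 { 1.0 } else { 0.0 }
}

/// AND needs only one neuron: a single line a + b = 1.5 separates the outputs.
pub fn and_gate(a: f64, b: f64) -> f64 {
    step(a + b - 1.5)
}

/// XOR needs a hidden layer: two hidden neurons compute OR and NAND,
/// and the output neuron ANDs them together.
pub fn xor_net(a: f64, b: f64) -> f64 {
    let h_or = step(a + b - 0.5);    // 1 unless both inputs are 0
    let h_nand = step(-a - b + 1.5); // 1 unless both inputs are 1
    step(h_or + h_nand - 1.5)        // 1 only when both hidden units fire
}
```

In the hidden space (OR, NAND) the four inputs land on points that a single line can separate, which is the job the learned hidden layer does during training.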

Implementation Details

Feed-Forward with Backpropagation

pub struct Network {
    pub layers: Vec<usize>,      // [input, hidden..., output]
    pub weights: Vec<Matrix>,    // Learned connections
    pub biases: Vec<Matrix>,     // Per-neuron offsets
    pub learning_rate: f64,      // Training step size
}

Forward pass: Each layer computes activation(weights × input + bias)

Backward pass: Gradients flow backward using the chain rule, updating weights to reduce error.

The sigmoid activation function maps any input to (0, 1):

σ(x) = 1 / (1 + e^(-x))
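
The forward pass for one layer can be sketched in a few lines. This is an illustration using plain vectors rather than the project's `Matrix` type; `layer_forward` and `sigmoid_prime_from_output` are hypothetical names, not the repository's API.

```rust
/// Sigmoid activation: maps any real input into (0, 1).
pub fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// One dense layer: activation(weights × input + bias).
/// `weights` holds one row per output neuron.
pub fn layer_forward(weights: &[Vec<f64>], bias: &[f64], input: &[f64]) -> Vec<f64> {
    weights
        .iter()
        .zip(bias)
        .map(|(row, b)| {
            let z: f64 = row.iter().zip(input).map(|(w, x)| w * x).sum::<f64>() + b;
            sigmoid(z)
        })
        .collect()
}

/// Derivative of sigmoid written in terms of its output y = σ(x): σ'(x) = y(1 - y).
/// Backpropagation can reuse the stored activations instead of recomputing exp().
pub fn sigmoid_prime_from_output(y: f64) -> f64 {
    y * (1.0 - y)
}
```

The derivative form is what the backward pass multiplies by when it applies the chain rule through each layer.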

Custom Matrix Library

Educational clarity over maximum performance:

pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f64>,  // Row-major storage
}

Operations: dot product, transpose, element-wise multiply, map. Everything visible, nothing hidden.
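
Row-major storage means element (r, c) lives at index `r * cols + c`. The sketch below shows how a few of the listed operations could be built on that layout; the method names and signatures are assumptions for illustration and may differ from the actual crate.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f64>, // row-major: element (r, c) is data[r * cols + c]
}

impl Matrix {
    pub fn from_vec(rows: usize, cols: usize, data: Vec<f64>) -> Self {
        assert_eq!(data.len(), rows * cols, "shape does not match data length");
        Self { rows, cols, data }
    }

    pub fn get(&self, r: usize, c: usize) -> f64 {
        self.data[r * self.cols + c]
    }

    /// Swap rows and columns.
    pub fn transpose(&self) -> Self {
        let mut out = vec![0.0f64; self.data.len()];
        for r in 0..self.rows {
            for c in 0..self.cols {
                out[c * self.rows + r] = self.get(r, c);
            }
        }
        Self::from_vec(self.cols, self.rows, out)
    }

    /// Matrix product: (rows × cols) · (cols × other.cols).
    pub fn dot(&self, other: &Matrix) -> Self {
        assert_eq!(self.cols, other.rows, "inner dimensions must agree");
        let mut out = vec![0.0f64; self.rows * other.cols];
        for r in 0..self.rows {
            for c in 0..other.cols {
                out[r * other.cols + c] =
                    (0..self.cols).map(|k| self.get(r, k) * other.get(k, c)).sum::<f64>();
            }
        }
        Self::from_vec(self.rows, other.cols, out)
    }

    /// Apply a function to every element (used for activations).
    pub fn map(&self, f: impl Fn(f64) -> f64) -> Self {
        let data = self.data.iter().map(|&x| f(x)).collect();
        Self::from_vec(self.rows, self.cols, data)
    }
}
```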

Checkpoint System

Training can be interrupted and resumed:

# Train for 5000 epochs, save checkpoint
neural-net-cli train xor --epochs 5000 --checkpoint model.json

# Resume from checkpoint
neural-net-cli train xor --epochs 10000 --resume model.json

Checkpoints include version metadata to prevent loading incompatible models.

CLI Usage

# List available examples
neural-net-cli examples

# Train XOR with progress bar
neural-net-cli train xor --epochs 10000 --learning-rate 0.5

# Predict with trained model
neural-net-cli predict model.json --input "0,1"

# Run web UI
neural-net-cli serve --port 8080

The CLI uses indicatif for real-time progress bars:

Training XOR [=========>   ] 7500/10000 (75%) Loss: 0.0023

Web Interface

The server embeds all assets at compile time—one binary serves everything:

Training panel: Select problem, set hyperparameters, watch loss decrease
Network visualization: See layer structure and connection strengths
Prediction panel: Test the trained model interactively
Loss chart: Real-time plotting via Server-Sent Events

Two training modes:

Local (WASM): Runs entirely in browser
Remote (API): Server-side with streaming progress

Technology Choices

Component	Purpose
Rust	Performance, safety, single-binary distribution
Axum	Lightweight async web framework
wasm-bindgen	Rust → WebAssembly compilation
Indicatif	Terminal progress bars
Serde	JSON serialization for checkpoints

The WASM module is ~248KB after optimization.

Test Coverage

136+ tests across the workspace:

Matrix operations (unit tests)
Network training (integration tests)
CLI commands (integration tests)
Server endpoints (integration tests)
WASM bindings (unit tests)

Zero clippy warnings. Reproducible results via seeded RNG.

References

Resource	Link
Backpropagation	Learning representations by back-propagating errors (Rumelhart et al. 1986)
Multi-Layer Perceptron	Multilayer perceptron (Wikipedia)
XOR Problem	Perceptrons (Minsky & Papert 1969)
Weight Initialization	Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010)
Inspired by	codemoonsxyz/neural-net-rs

The Vibe Coding Process

This project grew through iterative conversation with Claude Code:

“Build a basic neural network in Rust with backpropagation”
“Add a CLI with progress bars”
“Add a web UI with real-time training visualization”
“Compile to WASM so it runs in the browser”
“Add checkpoint save/resume”
“Include classic ML examples with educational documentation”

Each request built on the previous. The AI handled architecture decisions, chose appropriate crates, and maintained test coverage throughout.

When you want to understand how neural networks actually work, sometimes you need to see every weight update. That’s what this platform provides—education through transparency.

February 14, 2026 • Software Wrighter

914 words • 5 min read • Abstract

Personal Software via Vibe Coding: I needed to find cat photos scattered across my system. Instead of cloud services or app stores, I described what I wanted to Claude Code and got a working Rust CLI tool using YOLOv8 and ONNX Runtime. Privacy-first, locally-run, and mine to modify.

Cat Finder: Personal Software via Vibe Coding

I needed to find cat photos scattered across my system. Instead of searching the app store, signing up for a cloud service, or uploading my personal photos to someone else’s servers, I asked Claude Code to build me the tool I needed. An hour later, I had it.

This is Personal Software—software that exists because you needed it, built the way you want it, running entirely under your control.

Resource	Link
Repo	cat-finder
Video	Cat Finder Explainer

The Vibe Coding Approach

Vibe Coding is about describing what you want and letting AI handle the implementation details. No boilerplate, no Stack Overflow rabbit holes, no fighting with build systems. You focus on the what, the AI handles the how.

For Cat Finder, the conversation went something like:

“I want a CLI tool that scans directories for images containing cats. Run locally, no cloud. Use YOLO for detection. Output just the file paths so I can pipe them to other commands.”

Claude Code chose the tech stack (Rust, YOLOv8n, ONNX Runtime), handled the tensor math, figured out the COCO class IDs, and produced a working tool. I guided the direction; the AI wrote the code.

Why Personal Software?

The traditional options for “find cat photos” would be:

Cloud service: Upload photos to Google/Apple/Amazon, let them scan everything, hope they respect your privacy
Desktop app: Find something in an app store, hope it does what you want, deal with subscription fees or ads
Write it yourself: Spend days learning YOLO integration, tensor formats, image preprocessing

Personal Software offers a fourth path: describe what you need, get exactly that, own the result completely.

Cat Finder runs entirely on my machine. No accounts, no uploads, no subscriptions, no ads. The code is mine to modify, extend, or share.

What Got Built

A Rust CLI tool using YOLOv8n (the nano variant) through ONNX Runtime:

Directory Traversal → Image Preprocessing → YOLO Inference → Cat Detection → Output

The Detection Pipeline

Walk directories recursively, finding image files (jpg, png, gif, webp, etc.)
Preprocess each image: resize to 640×640, normalize to 0.0-1.0, convert to NCHW tensor format
Run inference through the YOLOv8n ONNX model
Parse output for class ID 15 (cat in COCO ordering) above confidence threshold
Print matching paths to stdout for easy piping to other tools
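
Two of the steps above can be sketched without any crates: the layout conversion in preprocessing and the final class filter. This is a simplified illustration, not the tool's code. It assumes the image is already resized and stored as interleaved RGB bytes, and that the raw YOLO output has already been decoded into `(class_id, confidence)` pairs; both function names are hypothetical.

```rust
/// COCO class index for "cat" in the 80-class ordering used by YOLOv8.
pub const COCO_CAT_CLASS_ID: usize = 15;

/// Convert interleaved RGB bytes (HWC) into a planar NCHW f32 tensor
/// for a batch of one, scaling each value to 0.0-1.0.
pub fn hwc_to_nchw(pixels: &[u8], height: usize, width: usize) -> Vec<f32> {
    const CHANNELS: usize = 3;
    assert_eq!(pixels.len(), height * width * CHANNELS);
    let plane = height * width;
    let mut out = vec![0.0f32; CHANNELS * plane];
    for y in 0..height {
        for x in 0..width {
            for c in 0..CHANNELS {
                let src = (y * width + x) * CHANNELS + c; // interleaved index
                let dst = c * plane + y * width + x;      // planar index
                out[dst] = pixels[src] as f32 / 255.0;
            }
        }
    }
    out
}

/// True if any decoded detection is a cat at or above the confidence threshold.
pub fn contains_cat(detections: &[(usize, f32)], threshold: f32) -> bool {
    detections
        .iter()
        .any(|&(class_id, conf)| class_id == COCO_CAT_CLASS_ID && conf >= threshold)
}
```

Decoding the real model output into per-box class scores involves more work than shown here, which is the kind of detail the AI handled in the actual project.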

Unix Philosophy

# stdout: just paths (machine-parseable)
# stderr: logging and progress

cat-finder ~/Photos | xargs -I {} cp {} ~/CatPhotos/

This separation enables composable Unix pipelines. The tool does one thing well and plays nicely with others.

Technology Stack

Component	Purpose
Rust	Memory-safe, high-performance core
YOLOv8n	Lightweight object detection (12MB model)
ONNX Runtime	Cross-platform inference engine
clap	CLI argument parsing
ndarray	Tensor operations
walkdir	Recursive directory traversal

Total footprint: ~80MB (runtime + model + binary)

I didn’t choose this stack—Claude Code did, based on the requirements. It made good choices.

Usage

# Basic usage
cat-finder ~/Photos

# Adjust confidence threshold
cat-finder --confidence 0.5 ~/Photos

# Verbose output with timestamps
cat-finder -v -t ~/Photos

# Copy all cat photos to a new folder
cat-finder ~/Photos | xargs -I {} cp {} ~/CatAlbum/

Honest About Limitations

The README documents failure cases transparently:

Image Type	Detection Success
Clear photographs	High
Artistic/stylized images	Low
Cats in clothing	Low
Small/partial cats	Variable
Low quality/blurry	Variable

Test results: 7 of 9 cat images detected (77.8% recall). Oil paintings and anthropomorphized cats confuse models trained on photographs. This is documented, not hidden.

Bonus Features

The project grew organically based on related needs:

Duplicate Finder: A second binary for finding duplicate images using size-based filtering followed by SHA-256 checksums.

find-duplicates ~/Photos
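
The two-stage idea (cheap size check first, content hash only for same-size files) can be sketched as follows. This is an illustration under assumptions: it works on in-memory byte slices instead of files, and it uses the standard library's `DefaultHasher` as a stand-in for SHA-256 so the example needs no external crate. A real tool should use a cryptographic hash, as the post describes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash (NOT SHA-256; chosen only to avoid dependencies).
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Group names whose contents appear identical. Returns only groups of 2 or more.
pub fn find_duplicates<'a>(files: &[(&'a str, &[u8])]) -> Vec<Vec<&'a str>> {
    // Stage 1: bucket by size. Files with a unique size cannot be duplicates.
    let mut by_size: HashMap<usize, Vec<usize>> = HashMap::new();
    for (i, (_, bytes)) in files.iter().enumerate() {
        by_size.entry(bytes.len()).or_default().push(i);
    }
    // Stage 2: hash only the files that share a size with another file.
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for indices in by_size.values().filter(|v| v.len() > 1) {
        let mut by_hash: HashMap<u64, Vec<&'a str>> = HashMap::new();
        for &i in indices {
            by_hash.entry(content_hash(files[i].1)).or_default().push(files[i].0);
        }
        groups.extend(by_hash.into_values().filter(|g| g.len() > 1));
    }
    groups.sort(); // deterministic output order
    groups
}
```

The size filter matters because hashing reads every byte, while a size lookup only touches file metadata.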

Web Demo: A Flask-based interface for visual feedback with real-time progress via Server-Sent Events.

These emerged from “while you’re at it…” requests during development. Vibe coding makes feature additions nearly frictionless.

Setup

git clone https://github.com/sw-ml-study/cat-finder
cd cat-finder
./scripts/setup.sh  # Downloads model, builds project
./cat-finder ~/Photos

The Personal Software Philosophy

Privacy-first: All processing happens locally. No cloud APIs, no external services, no data leaving your machine.

Ownership: The code is yours. Modify it, extend it, share it, delete it.

Fit-for-purpose: Built for exactly what you need, nothing more, nothing less.

Transparency: Known limitations documented. No marketing spin.

References

Resource	Link
YOLOv8	Ultralytics YOLOv8 - State-of-the-art object detection
ONNX Runtime	ONNX Runtime - Cross-platform inference engine
ort crate	ort - Rust bindings for ONNX Runtime
COCO Dataset	COCO Classes - Class ID 15 = cat

You don’t always need an app store or a cloud service. Sometimes you just need to describe what you want and let an AI build it for you. That’s vibe coding. That’s personal software.

February 14, 2026 • Software Wrighter

503 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: RNN (sequential processing with memory), Chain of Thought (step-by-step reasoning), Softmax (scores to probabilities), MoE (route inputs to specialists), Distribution Shift (training vs deployment mismatch).

Five ML Concepts - #11

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #11

References

Concept	Reference
RNN	Learning representations by back-propagating errors (Rumelhart et al. 1986)
Chain of Thought	Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)
Softmax	Deep Learning (Goodfellow et al. 2016), Chapter 6
MoE	Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al. 2017)
Distribution Shift	Dataset Shift in Machine Learning (Quiñonero-Candela et al. 2009)

Today’s Five

1. RNN (Recurrent Neural Network)

Networks designed for sequential data that maintain a hidden state carrying information across time steps. This makes them useful for language, time series, and audio.

LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are improved variants that better handle long-range dependencies.

Like reading a story while keeping mental notes about characters and plot as you go.

2. Chain of Thought

A prompting technique that encourages step-by-step reasoning in language models. Instead of producing an answer immediately, the model generates intermediate steps.

This can improve performance on math, logic, and multi-step problems.

Like showing your work on a math test instead of just writing the final answer.

3. Softmax

Converts a vector of scores into a probability distribution where each output falls between zero and one, and all outputs sum to one. It is commonly used in classification models.

Softmax makes raw scores easier to interpret as probabilities.

Like turning test scores into percentages that add up to 100%.
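
A minimal softmax sketch. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow; it does not change the result because softmax is invariant to adding a constant to every score.

```rust
/// Convert raw scores into probabilities that sum to one.
pub fn softmax(scores: &[f64]) -> Vec<f64> {
    // Shift by the max so the largest exponent is exp(0) = 1.
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    exps.iter().map(|e| e / total).collect()
}
```

Without the shift, scores around 1000 would overflow to infinity; with it, `softmax(&[1000.0, 1000.0])` comes out as an even split.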

4. MoE (Mixture of Experts)

Instead of one large network, the model contains many smaller expert networks with a routing mechanism that selects which experts process each input. This allows models to scale capacity while keeping computation efficient.

Only a subset of experts activates for any given input.

Like a hospital with specialists where a receptionist directs you to the right doctor.
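
The routing step can be sketched as top-k selection over router scores. This is a simplified illustration: `route_top_k` is a hypothetical name, the scores are assumed to come from a learned gating layer, and real systems add details such as load balancing that are left out here.

```rust
/// Pick the k experts with the highest router scores and
/// renormalize their scores with softmax to get mixing weights.
/// Returns (expert index, weight) pairs, best expert first.
pub fn route_top_k(router_scores: &[f64], k: usize) -> Vec<(usize, f64)> {
    let mut idx: Vec<usize> = (0..router_scores.len()).collect();
    // Sort expert indices by descending score.
    idx.sort_by(|&a, &b| router_scores[b].total_cmp(&router_scores[a]));
    idx.truncate(k);
    // Softmax over the selected experts only.
    let max = idx.iter().map(|&i| router_scores[i]).fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = idx.iter().map(|&i| (router_scores[i] - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / total)).collect()
}
```

Only the selected experts run their forward pass, so compute grows with k rather than with the total number of experts.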

5. Distribution Shift

Occurs when deployment data differs from training data, causing a model trained on one environment to perform poorly in another. Common causes include seasonal changes, user behavior shifts, or new populations.

Monitoring for drift and retraining helps maintain performance.

Like a weather model trained on summer data struggling to predict winter storms.

Quick Reference

Concept	One-liner
RNN	Sequential processing with memory across time
Chain of Thought	Step-by-step reasoning in prompts
Softmax	Scores to normalized probabilities
MoE	Route inputs to specialized experts
Distribution Shift	Training vs deployment data mismatch

Short, accurate ML explainers. Follow for more.

February 13, 2026 • Software Wrighter

995 words • 5 min read • Abstract

When data won't fit in a context window, RLM expands the workspace instead. The MIT paper reports 87-91% accuracy on tasks where standard prompting scores 0%. My Rust implementation provides four capability levels from DSL commands to WASM sandboxing to LLM delegation.

RLM: Recursive Language Models for Massive Context

What happens when your data won’t fit in a context window? RLM expands the workspace instead of cramming everything into limited memory. This post covers the MIT paper, my Rust implementation, and six video demonstrations.

Resource	Link
Paper	arXiv:2512.24601
Code	rlm-project
Playlist	RLM Implementations

The Problem: Context Limits

Large language models have a hard limit. They can only process so much text at once.

Imagine a cookie jar that holds 100 cookies. What if you need to search through ten thousand? When you force too much in, the model forgets things—this is called context rot.

Bigger models help, but the limit always exists. We need a different approach.

The RLM Solution

Recursive Language Models flip the problem. Instead of bigger jars, use better tools.

The data stays in a context box. The model gets tools to peek inside:

Tool	Purpose
`slice`	Get a character range
`find`	Search for text
`regex`	Pattern matching
`count`	Count occurrences
`llm_query`	Ask a sub-LLM to analyze a chunk

Small, focused, deliberate. The model thinks about what it needs, then asks for just that.
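
A few of these tools are simple enough to sketch with the standard library. The code below is an illustration, not the rlm-project implementation: the `Context` type and its method signatures are assumptions, offsets are counted in characters, and `regex` and `llm_query` are left out because they need external crates or a model endpoint.

```rust
/// Holds the large text outside the model's prompt.
pub struct Context {
    text: String,
}

impl Context {
    pub fn new(text: impl Into<String>) -> Self {
        Self { text: text.into() }
    }

    /// slice: characters in the half-open range [start, end).
    pub fn slice(&self, start: usize, end: usize) -> String {
        self.text.chars().skip(start).take(end.saturating_sub(start)).collect()
    }

    /// find: character offset of the first occurrence, if any.
    pub fn find(&self, needle: &str) -> Option<usize> {
        self.text
            .find(needle)
            .map(|byte_pos| self.text[..byte_pos].chars().count())
    }

    /// count: number of non-overlapping occurrences.
    pub fn count(&self, needle: &str) -> usize {
        if needle.is_empty() { 0 } else { self.text.matches(needle).count() }
    }
}
```

A typical exchange would be a `find` to locate a keyword followed by a `slice` around the hit, so only a few hundred characters ever enter the model's prompt.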

The Results

From the MIT paper—on tasks that don’t fit in context:

Approach	Accuracy
Standard prompting	0%
RLM	87-91%

Results hold across GPT-4, Claude, Llama, Mistral, and Gemini.

My Implementation: Four Capability Levels

I built a Rust implementation with four capability levels:

Level	Name	Description
L1	DSL	Built-in commands (find, regex, count)
L2	WASM	LLM generates Rust → compiles to WebAssembly sandbox
L3	CLI	LLM generates Rust → compiles to native binary
L4	LLM	Recursive delegation to sub-LLMs

Each level trades off safety for capability:

L1 is instant but limited to predefined operations
L2 runs custom code but in a sandboxed environment
L3 breaks free for large datasets that would time out in WASM
L4 uses LLM reasoning for semantic analysis

The Video Series

Six videos demonstrate RLM in action:

1. RLM Explained

The foundational video. Covers the MIT paper, the cookie jar analogy, and benchmark results showing accuracy rising from 0% to as much as 91%.

Key insight: Expand the workspace, not the context.

2. War and Peace Demo

Can AI read all of War and Peace to find a hidden secret? The full text is 3.2 MB with 65,666 lines—way too big for any context window.

RLM finds “the password to Prince Andrei’s secret vault” in just 2 iterations using only 3,000 tokens. Compared to sending the full 3.2 MB document, that is a token savings of over 99%.

3. WASM Sandboxing

What if your LLM could write custom analysis code on the fly? Level 2 demonstrates WebAssembly sandboxing.

The LLM writes Rust code that compiles to WASM and runs in a secure sandbox. Demos include:

Error ranking in logs
Response time percentiles
Unique IP counting

Trade-offs: ASCII only, 64MB memory limit, subset of Rust.

4. Native CLI Binaries

When 5,000 lines would time out in WASM, Level 3 breaks free. Native Rust binaries process massive datasets without the sandbox’s limits.

Four CLI demos:

Error ranking: Hash map counts error types
Unique IPs: Hash set finds distinct addresses
Percentiles: Sort and index for p50/p95/p99
Word frequency: Tokenize, filter stop words, count
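
Two of these demos are short enough to sketch. The code below is an illustration of the general technique, not the code the LLM generated in the video: it uses the nearest-rank percentile definition, and it assumes the IP address is the first whitespace-separated field on each log line.

```rust
use std::collections::HashSet;

/// Nearest-rank percentile: the smallest sample with at least p% of values at or below it.
pub fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank, clamped so p = 0 returns the minimum.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Count distinct values of the first field on each line (assumed to be the client IP).
pub fn unique_first_fields(log: &str) -> usize {
    log.lines()
        .filter_map(|line| line.split_whitespace().next())
        .collect::<HashSet<_>>()
        .len()
}
```

Calling `percentile` with 50.0, 95.0, and 99.0 gives the p50/p95/p99 figures mentioned above; sorting dominates the cost, so repeated queries would normally sort once and reuse the result.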

5. Detective Mystery Demo

A murder at the manor. Seven suspects. Dozens of clues. Can an LLM solve it?

Level 4 delegates reasoning to sub-LLMs. Instead of code execution, the model calls other models to:

Analyze witness statements
Compare alibis
Draw conclusions

Watch as L4 examines each suspect and identifies the killer.

6. Large Context Processing

War and Peace is 3MB—far too large for any context window. This video shows Level 4 extracting noble family relationships from the entire novel.

The process:

L3 extracts relationship sentences (father, mother, son, daughter…)
L4 analyzes filtered data with sub-LLMs
Final output: structured family trees

Three million characters → structured family trees in ~90 seconds.

Architecture

┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   Client    │────▶│  RLM Server     │────▶│  Root LLM   │
│  /visualize │     │  (Rust/Axum)    │     │  (DeepSeek) │
└─────────────┘     └────────┬────────┘     └─────────────┘
                             │
                    ┌────────▼────────┐
                    │ Command Executor │
                    │  slice, find,   │
                    │  regex, count,  │
                    │  llm_query...   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │  Ollama  │  │  Ollama  │  │  Ollama  │
        │ (local)  │  │ (remote) │  │ (other)  │
        └──────────┘  └──────────┘  └──────────┘
              Sub-LM Pool (for llm_query)

Quick Start

cd rlm-orchestrator

# Configure providers in config.toml
export DEEPSEEK_API_KEY="your-key"

# Run the server
cargo run --bin rlm-server

# Open visualizer
open http://localhost:8080/visualize

Think of it like this:

Old way: Dump everything on the table, then dig through the mess
RLM way: Use a scoop—grab just the cookies you need

The key insight is simple: expand the workspace, not the context.

Resources

RLM Paper (arXiv:2512.24601) - Zhang, Kraska, Khattab (MIT CSAIL)
rlm-project Repository
rlm-project Wiki
RLM Implementations Playlist
ELI5: What is RLM?

When context windows aren’t enough, RLM gives your LLM tools to explore. Six videos, four capability levels, one insight: expand the workspace, not the context.

February 13, 2026 • Software Wrighter

499 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: CNN (sliding filters for image features), Encoder-Decoder (compress then generate), RAG (retrieve context before generating), Few-shot Learning (learn from prompt examples), Distillation (small student mimics large teacher).

Five ML Concepts - #10

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #10

References

Concept	Reference
CNN	ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al. 2012)
Encoder-Decoder	Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
RAG	Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020)
Few-shot Learning	Language Models are Few-Shot Learners (Brown et al. 2020)
Distillation	Distilling the Knowledge in a Neural Network (Hinton et al. 2015)

Today’s Five

1. CNN (Convolutional Neural Network)

Networks designed for image data that use small filters sliding across an image to detect edges, textures, and shapes. Early layers find simple patterns, while deeper layers recognize complex objects.

CNNs are a foundation of modern computer vision.

Like scanning a photo with a magnifying glass that learns to recognize patterns at different scales.
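
The sliding-filter operation itself is short. A minimal sketch, assuming a single-channel image, no padding, and stride 1; like most deep learning libraries it computes cross-correlation (the kernel is not flipped). The function name is chosen for this example.

```rust
/// Slide `kernel` over `image` and return the map of weighted sums ("valid" mode).
pub fn conv2d_valid(image: &[Vec<f64>], kernel: &[Vec<f64>]) -> Vec<Vec<f64>> {
    assert!(!image.is_empty() && !kernel.is_empty());
    let (ih, iw) = (image.len(), image[0].len());
    let (kh, kw) = (kernel.len(), kernel[0].len());
    assert!(ih >= kh && iw >= kw, "kernel must fit inside the image");
    (0..=ih - kh)
        .map(|y| {
            (0..=iw - kw)
                .map(|x| {
                    let mut acc = 0.0;
                    for ky in 0..kh {
                        for kx in 0..kw {
                            acc += image[y + ky][x + kx] * kernel[ky][kx];
                        }
                    }
                    acc
                })
                .collect()
        })
        .collect()
}
```

A `[-1, 1]` kernel responds only where brightness changes from left to right, which is the kind of edge detector early layers tend to learn.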

2. Encoder-Decoder

A model architecture with two parts: the encoder compresses input into a representation, and the decoder generates an output from that representation. This pattern is common in translation, summarization, and speech systems.

The representation acts as a bottleneck that captures essential information.

Like summarizing a book into notes, then writing a new version from those notes.

3. RAG (Retrieval-Augmented Generation)

Instead of relying only on learned parameters, the model retrieves relevant documents and uses them during generation. This helps ground responses in external information and can reduce hallucinations.

RAG combines the strengths of retrieval systems and generative models.

Like an open-book exam where you can look up facts instead of relying purely on memory.

4. Few-shot Learning

Adapting behavior from just a few examples provided directly in the prompt. Instead of retraining, the model infers the pattern and applies it to new inputs.

Zero-shot learning relies only on instructions, without examples.

Like learning a card game by watching a few hands before playing.

5. Distillation

Transferring knowledge from a large teacher model to a smaller student. The student learns to match the teacher’s outputs, not its internal weights.

This produces models that are smaller and cheaper while retaining much of the original capability.

Like an apprentice learning by imitating a master’s finished work, not by copying their brain.
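
"Matching the teacher's outputs" is usually expressed as a divergence between softened probability distributions. The sketch below is a minimal version, assuming a temperature-scaled softmax and KL divergence in the style of Hinton et al.; the function names are illustrative, and practical recipes often add a hard-label loss term and a temperature-squared scaling factor that are omitted here.

```rust
/// Softmax with temperature: higher temperatures flatten the distribution.
fn softmax_with_temperature(logits: &[f64], temperature: f64) -> Vec<f64> {
    let scaled: Vec<f64> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scaled.iter().map(|s| (s - max).exp()).collect();
    let total: f64 = exps.iter().sum();
    exps.iter().map(|e| e / total).collect()
}

/// KL(teacher || student) over temperature-softened outputs.
/// Zero when the student reproduces the teacher's distribution exactly.
pub fn distillation_loss(teacher_logits: &[f64], student_logits: &[f64], temperature: f64) -> f64 {
    assert_eq!(teacher_logits.len(), student_logits.len());
    assert!(temperature > 0.0);
    let p = softmax_with_temperature(teacher_logits, temperature);
    let q = softmax_with_temperature(student_logits, temperature);
    p.iter()
        .zip(&q)
        .map(|(pi, qi)| if *pi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum()
}
```

The softened teacher distribution carries information about which wrong answers are nearly right, which is part of what the student picks up beyond the hard labels.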

Quick Reference

Concept	One-liner
CNN	Sliding filters for hierarchical image features
Encoder-Decoder	Compress input, then generate output
RAG	Retrieve context before generating
Few-shot Learning	Learn from examples in the prompt
Distillation	Small student mimics large teacher

Short, accurate ML explainers. Follow for more.

February 13, 2026 • Software Wrighter

1633 words • 9 min read • Abstract

Before pixels, there were vectors. Vibe Coding classic arcade games (Asteroids, BattleZone, Tempest) in Rust/WebAssembly with wgpu rendering, from my first encounter with an IBM 2250 to playable browser demos, all built in one day with Claude Code.

TBT (3/?): Vector Graphics Games

Before pixels, there were vectors. This Throwback Thursday explores the evolution of vector graphics gaming—from military radar displays to arcade classics—and my attempt to recreate them in Rust and WebAssembly.

Resource	Link
Live Demo	Play in Browser
Video	TBT Vector Graphics Games
Games	vectorcade-games
Shared	vectorcade-shared
Fonts	vectorcade-fonts
Renderer	vectorcade-render-wgpu
Web	vectorcade-web-yew

My First Vector Display: The IBM 2250

IBM 2250 Graphics Display Unit with light pen at Brown University, October 1969. Photo credit

My first encounter with vector graphics was an IBM 2250 Graphics Display Unit—introduced in 1964, costing around $280,000 in period dollars. It connected to an IBM 1130 that acted as a graphics controller for an IBM S/370 mainframe where the graphical applications ran. At that price, nobody was playing games on it—Computer Aided Design was the killer app.

The 2250’s specifications were impressive for its era:

Specification	Value
Display	21-inch P39 phosphor CRT
Resolution	1024 × 1024 addressable points
Usable area	12” × 12” (square aspect)
Refresh rate	~40 frames/second
Input	Light pen for direct interaction
Vector drawing	Hardware character generator optional

The CRT drew lines by steering an electron beam directly—no pixel grid, no rasterization. Just pure geometry traced in phosphor glow. The green P39 phosphor had long persistence, reducing flicker but creating ghostly trails on moving objects.

The light pen was revolutionary: you could point directly at displayed geometry and the system knew which vector you were touching. Interactive graphics in 1964.

The Arcade Era

Vector displays found their way into arcades, where they defined a visual style that’s still recognizable today:

Game	Year	Innovation
Lunar Lander	1979	Physics simulation, thrust/gravity
Asteroids	1979	Wrap-around space, particle effects
BattleZone	1980	Green wireframe 3D, first-person tanks
Tempest	1981	Multi-colored vectors, pseudo-3D depth

(Note: Pong (1972) was actually a raster game using discrete logic, but its simple geometry makes it a natural fit for vector recreation.)

Each generation built on the last. White vectors on black screens gave way to green wireframes, then full color. The hardware pushed boundaries that feel primitive now but were revolutionary then.

The Vectorcade Project

Vectorcade recreates these mechanics using modern tools:

Rust for game logic and rendering
WebAssembly for browser deployment
wgpu for GPU-accelerated vector rendering
Yew for the web frontend

Multi-Repo Architecture

The project architecture emerged from a design session with ChatGPT, exploring how to structure a multi-agent development workflow. The result: a DAG of repositories, each with clear ownership boundaries:

vectorcade-shared/      (Pure Rust API contracts)
    ↓
vectorcade-fonts/       (Vector font styles)
    ↓
vectorcade-games/       (Game logic: Pong, Asteroids, etc.)
    ↓
vectorcade-render-wgpu/ (wgpu + lyon tessellation)
    ↓
vectorcade-web-yew/     (Yew web shell)

This DAG structure allows parallel development with assigned agent roles:

Agent	Repo	Focus
A	vectorcade-shared	Core API steward: minimal, stable, pure
B	vectorcade-fonts	Font stylist: 3-5 distinct vector styles
C	vectorcade-games	Game logic: Pong → Asteroids → Lunar Lander
D	vectorcade-render-wgpu	Renderer: lyon tessellation → wgpu triangles
E	vectorcade-web-yew	Integrator: UI, mobile controls, PWA

Each agent works against stable interfaces—the DrawCmd display list and Game trait—so they don’t step on each other.

The Display List Model

Games don’t render directly. They emit draw commands that the renderer interprets:

pub enum DrawCmd {
    Clear { color: Rgba },
    Line(Line2),
    Polyline { pts: Vec<[f32;2]>, closed: bool, stroke: Stroke },
    Text { pos: [f32;2], s: String, size_px: f32, color: Rgba },
    PushTransform(Transform2),
    PopTransform,
}

This keeps game logic portable. The same Asteroids code can render through wgpu on desktop, WebGPU in browsers, or even a software rasterizer.
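
The split between game and renderer can be sketched with a cut-down display list. This is an illustration only: the `DrawCmd` here is a simplified stand-in with two variants and no `Stroke`, `Rgba`, or transform types, and `ship_commands` and `segment_count` are hypothetical functions, not part of vectorcade.

```rust
/// Simplified stand-in for the real display list (two variants only).
#[derive(Debug, Clone, PartialEq)]
pub enum DrawCmd {
    Clear,
    Polyline { pts: Vec<[f32; 2]>, closed: bool },
}

/// Game side: describe a triangular ship at `pos`, rotated by `angle` radians.
/// No GPU types appear here; the game only emits data.
pub fn ship_commands(pos: [f32; 2], angle: f32) -> Vec<DrawCmd> {
    let (s, c) = angle.sin_cos();
    let local: [[f32; 2]; 3] = [[0.0, 10.0], [-6.0, -8.0], [6.0, -8.0]];
    let pts: Vec<[f32; 2]> = local
        .iter()
        .map(|[x, y]| [pos[0] + c * x - s * y, pos[1] + s * x + c * y])
        .collect();
    vec![DrawCmd::Clear, DrawCmd::Polyline { pts, closed: true }]
}

/// Renderer side: a trivial backend that only counts the line segments it would draw.
pub fn segment_count(cmds: &[DrawCmd]) -> usize {
    cmds.iter()
        .map(|cmd| match cmd {
            DrawCmd::Clear => 0,
            DrawCmd::Polyline { pts, closed } => {
                if pts.len() < 2 {
                    0
                } else if *closed {
                    pts.len()
                } else {
                    pts.len() - 1
                }
            }
        })
        .sum()
}
```

Swapping `segment_count` for a wgpu or software backend would not require touching `ship_commands`, which is the portability the display list is meant to provide.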

Vector Fonts

Classic arcade games had distinctive lettering. Vectorcade includes multiple font styles to match:

Style	Look	Games
`ATARI`	Boxy, utilitarian	Asteroids, Lunar Lander
`CINEMATRONICS`	Thin, angular	Star Castle
`MIDWAY`	Slightly rounded	Defender
`VECTOR_SCANLINE`	Broken segments	“Beam jitter” effect

Each font is pure vector geometry—no bitmaps, no texture atlases.

3D Projection

BattleZone and Tempest need 3D-to-2D projection. Instead of a full 3D renderer, Vectorcade uses a “2.5D pipeline”:

pub struct Camera3 {
    pub pos: [f32;3],
    pub yaw: f32,
    pub pitch: f32,
    pub fov_y_rad: f32,
}

pub fn project_polyline(cam: &Camera3, pts3: &[[f32;3]]) -> Vec<[f32;2]>;

Games maintain 3D geometry; the core projects it to 2D lines. Depth-based brightness gives the classic “farther = dimmer” effect.
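A rough sketch of what such a projection might look like, written in Python for brevity. The axis conventions (camera looking down +Z, yaw about Y, pitch about X), the linear brightness falloff, and the choice to simply drop points behind the camera are all assumptions; the actual implementation may differ.

```python
import math
from dataclasses import dataclass

@dataclass
class Camera3:
    pos: tuple
    yaw: float
    pitch: float
    fov_y_rad: float

def project_point(cam, p):
    """Project one 3D point to normalized 2D; None if behind the camera."""
    # Translate into camera-relative coordinates.
    x, y, z = (p[0] - cam.pos[0], p[1] - cam.pos[1], p[2] - cam.pos[2])
    # Undo camera yaw (rotation about the Y axis).
    cy, sy = math.cos(-cam.yaw), math.sin(-cam.yaw)
    x, z = cy * x + sy * z, -sy * x + cy * z
    # Undo camera pitch (rotation about the X axis).
    cp, sp = math.cos(-cam.pitch), math.sin(-cam.pitch)
    y, z = cp * y - sp * z, sp * y + cp * z
    if z <= 0.0:
        return None
    f = 1.0 / math.tan(cam.fov_y_rad / 2.0)   # focal length from vertical FOV
    return (f * x / z, f * y / z)

def project_polyline(cam, pts3):
    """Project every visible point (no segment clipping in this sketch)."""
    projected = (project_point(cam, p) for p in pts3)
    return [q for q in projected if q is not None]

def brightness(depth, far):
    """Linear 'farther = dimmer' falloff, clamped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - depth / far))
```

The perspective divide by `z` is what makes distant geometry shrink toward the center of the screen.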

Why Rust + WASM?

The combination solves several problems:

Performance: Games need consistent frame rates; Rust delivers
Portability: Same code runs native and in browsers
Safety: No dangling pointers in the game loop
Modern tooling: Cargo, wasm-pack, Trunk make deployment straightforward

The wgpu + lyon stack provides cross-platform GPU rendering with proper thick-line support (WebGL’s lineWidth is notoriously inconsistent).

Current Status

Component	Status
vectorcade-shared	Functional
vectorcade-fonts	Functional
vectorcade-games	Playable (5 demos)
vectorcade-render-wgpu	Functional
vectorcade-web-yew	Functional

The core architecture works. All five demos are playable in the browser. Polish and audio remain.

The Demos

The video showcases five demonstrations, progressing from static display to full gameplay:

1. IBM 2250 Chessboard

A static image rendered in the style of the original IBM 2250. The 2250 was mainly used for Computer Aided Design, but programmers did create games on it—this chessboard pays tribute to that era.

2. Pong (Playable)

A vector implementation of the classic. The original Pong (1972) wasn’t actually a vector game—it used discrete logic and a raster display—but some clones used vector hardware. This recreation captures the pure-geometry aesthetic.

3. Asteroids (Playable)

One of the most popular vector arcade games. Rotate, thrust, and shoot to survive. The ship and asteroids wrap around screen edges, creating the classic “infinite space” feel.
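Screen wrapping usually comes down to a modulo on each coordinate. A tiny Python sketch, assuming a hypothetical 800 by 600 playfield:

```python
def wrap(pos, width, height):
    """Wrap a position so objects leaving one edge re-enter at the opposite."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative coordinates wrap correctly without extra branches.
    return (pos[0] % width, pos[1] % height)

ship = wrap((805.0, -5.0), 800.0, 600.0)
```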

4. BattleZone (Playable)

Green wireframe 3D tanks. Drive through a battlefield, shooting enemies and dodging missiles. One of the first games with true 3D perspective—rendered entirely with vectors.

5. Tempest (Playable)

The pinnacle of vector arcade hardware. Move around the edge of geometric tubes, shooting enemies that climb up from the depths. Each level changes the tube shape and color scheme.

Implementation

Each game implements the same Game trait:

pub trait Game {
    fn metadata(&self) -> GameMeta;
    fn reset(&mut self, ctx: &mut GameCtx);
    fn update(&mut self, ctx: &mut GameCtx, dt: f32);
    fn render(&mut self, ctx: &mut GameCtx, out: &mut Vec<DrawCmd>);
}

This makes games drop-in replaceable in the web shell—no renderer changes needed.

TODO

The demos are playable but not finished. Remaining work:

GPU rendering: Switch from Canvas 2D emulation to actual wgpu GPU rendering [Ed. Completed 2/13]
Music and sound effects: Authentic arcade audio
More aggressive opponents: AI improvements for challenge
Additional levels/difficulties: Progression and replay value
More animations: Explosions, transitions, effects

Resources

Before pixels, there were vectors. Vectorcade brings them back—in Rust, for the browser, with phosphor glow optional.

Credits

Role	Credit
Director	Mike Wright
Research & Architecture	ChatGPT
vectorcade-shared	Claude Code CLI agent
vectorcade-fonts	Claude Code CLI agent
vectorcade-games	Claude Code CLI agent
vectorcade-render-wgpu	Claude Code CLI agent
vectorcade-web-yew	Claude Code CLI agent
Explainer Video	Claude Code
Blog Post	Claude Code

Timeline: First pass vibe coded in one day (February 12, 2026)

First commit: 11:08 AM PST
Last commit: 5:08 PM PST
Total commits: 52 across 4 repositories
WGPU support added February 13, 2026

References

IBM 2250 Photo: “HES IBM 2250 Console grlloyd Oct1969” by Gregory Lloyd, October 1969. Brown University Hypertext Editing System (HES) demonstration. Licensed under CC BY-SA 4.0. Used with attribution.


February 12, 2026 • Software Wrighter

781 words • 4 min read • Abstract

When multiple AI agents work together, fixed communication patterns fail at scale. DyTopo rebuilds the graph each round based on semantic similarity between what agents need and what they can offer, preventing context explosion while enabling adaptive collaboration.

DyTopo: Dynamic Topology for Multi-Agent AI

When multiple AI agents work together, how should they communicate? Fixed patterns fail at scale. DyTopo rebuilds the communication graph each round based on what agents need and what they can offer.

Resource	Link
Video	DyTopo
Paper	arXiv:2505.16128
Code	dytopo-rs

The Problem: Fixed Topologies Don’t Scale

Multi-agent systems need communication patterns. The obvious approaches have problems:

Topology	Problem
All-to-all	Context explosion—every agent reads every message
Chain	Bottlenecks—one slow agent blocks everyone
Star	Single point of failure at the hub

As agent count grows, fixed topologies either explode in messages or create chokepoints.

The DyTopo Solution: Dynamic Routing

DyTopo (Dynamic Topology) solves this by reconstructing the communication graph each round. The key insight: agents know what they need and what they can offer.

Each round, every agent emits:

Query: What information do I need?
Key: What can I contribute?

The router computes semantic similarity between all keys and queries, then builds a sparse directed graph:

score(sender → receiver) = cosine(sender.key, receiver.query)

High-scoring pairs connect. Low-scoring pairs are ignored. The result: efficient, adaptive communication.

How It Works

Round N:
  1. Manager broadcasts goal
  2. Each agent produces:
     - Query (what I need)
     - Key (what I offer)
     - Draft (my current contribution)
  3. Router embeds keys and queries
  4. Similarity matrix → sparse graph (top-K per receiver)
  5. Messages flow along edges
  6. Trace written to JSONL

The topology adapts every round. An agent working on parsing might connect to the syntax expert in round 1, then the error-handling expert in round 2.
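The scoring and top-K steps above can be sketched in a few lines of Python (the project itself is Rust). The toy two-dimensional vectors, the agent names, and the no-self-edge rule are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route(keys, queries, topk):
    """Return {receiver: [senders]} keeping the top-k senders per receiver."""
    edges = {}
    for recv, q in queries.items():
        scored = [
            (cosine(keys[send], q), send)
            for send in keys
            if send != recv               # no self-edges (an assumption)
        ]
        # Highest score first; ties broken by name for determinism.
        scored.sort(key=lambda t: (-t[0], t[1]))
        edges[recv] = [send for _, send in scored[:topk]]
    return edges

keys = {"parser": [1.0, 0.0], "syntax": [0.0, 1.0], "errors": [1.0, 1.0]}
queries = {"parser": [0.0, 1.0], "syntax": [1.0, 0.0], "errors": [1.0, 0.0]}
graph = route(keys, queries, topk=1)
```

Here the parser's query lines up with the syntax agent's key, so that is the one edge it receives.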

The Implementation: Rust, Zero Python

dytopo-rs is a fully Rust implementation with no Python dependencies:

Crate	Purpose
`dytopo-core`	Shared types (AgentId, Topology, TraceEvent)
`dytopo-embed`	Text embedding (hash-based baseline, semantic planned)
`dytopo-router`	Sparse graph construction
`dytopo-agents`	Agent implementations
`dytopo-orchestrator`	Main execution loop
`dytopo-viz`	DOT export for visualization
`dytopo-cli`	Command-line interface

Why Rust?

Zero-cost abstractions for performance-critical embedding/routing
Strong type system catches protocol mismatches at compile time
No Python dependency for baseline demos
Fearless concurrency for future parallelization

Running the Demo

cargo run -p dytopo-cli -- demo --rounds 3 --agents 5 --topk 2

This produces:

Per-round topology printed to stdout
./traces/trace_*.jsonl for machine-readable analysis
DOT files for graph visualization

Current Status

Milestone 0 is complete—the system runs end-to-end with stub agents and hash-based embeddings.

Feature	Status
Core types and traits	Done
Hash embedder (deterministic)	Done
Top-K sparse routing	Done
Stub agents with templates	Done
Orchestrator loop	Done
JSONL tracing	Done
DOT visualization	Done

Planned

Semantic embeddings (fastembed/candle)
LLM-backed agents (Ollama integration)
Inbox summarization for long conversations
Evaluation harness comparing topologies

Key Design Decisions

Why Hash Embeddings First?

The baseline uses deterministic hash-based embeddings:

Reproducible demos for debugging
No external dependencies to download
Validates the full pipeline before adding ML complexity

Semantic embeddings are planned as drop-in replacements.
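One common way to build a deterministic embedding is feature hashing. The sketch below is a Python illustration of that idea only; the SHA-256 hash, the 16-dimensional vector, and the signed bucket trick are assumptions and may not match what `dytopo-embed` does.

```python
import hashlib
import math

def hash_embed(text, dim=16):
    """Deterministic bag-of-words embedding via feature hashing."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim   # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0        # reduces collision bias
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

The same text always yields the same vector, which is what makes demo runs reproducible. The vectors carry no semantics beyond shared tokens.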

Why Sparse Graphs?

Each agent receives at most topk messages per round:

Prevents context explosion as agent count grows
Makes communication interpretable—you can trace why agents connected
Matches the paper’s approach

Why JSONL Traces?

Every event is logged to JSONL:

Append-only for streaming
Line-based for grep/filtering
Machine-parseable for analysis tools
Human-readable for debugging
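A small Python sketch of the append-then-filter workflow that JSONL enables. The event fields (`round`, `kind`, `from`, `to`) are hypothetical and not the actual trace schema.

```python
import io
import json

def append_event(stream, event):
    """Write one event as a single JSON line."""
    stream.write(json.dumps(event, sort_keys=True) + "\n")

def read_events(stream):
    """Parse every non-empty line back into a dict."""
    return [json.loads(line) for line in stream if line.strip()]

buf = io.StringIO()   # stands in for a file opened in append mode
append_event(buf, {"round": 1, "kind": "edge", "from": "syntax", "to": "parser"})
append_event(buf, {"round": 1, "kind": "message", "to": "parser"})
buf.seek(0)
events = read_events(buf)
edges = [e for e in events if e["kind"] == "edge"]
```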

Topology Comparison

The system supports multiple topology strategies for comparison:

Strategy	Description	Use Case
`dynamic`	DyTopo routing	Adaptive, sparse
`fully_connected`	All-to-all	Baseline comparison
`chain`	Sequential	Pipeline tasks
`star`	Hub-and-spoke	Centralized coordination

What’s Next

LLM Agent Support (Milestone 2)—Replace stubs with real reasoning
Semantic Embeddings (Milestone 1)—Meaningful routing decisions
Evaluation Harness (Milestone 4)—Quantify DyTopo advantages

Resources

DyTopo Paper (arXiv:2505.16128) - Li et al., 2025
dytopo-rs Repository

Dynamic topology lets agents find the right collaborators each round. No context explosion. No bottlenecks. Just efficient, adaptive communication.


February 12, 2026 • Software Wrighter

1211 words • 7 min read • Abstract

What happens when you fine-tune a model on new tasks? It forgets old ones. This post documents our implementation of the Share algorithm in Rust—using SVD-based subspace extraction to enable continual learning without catastrophic forgetting. Part 1 covers the problem and initial negative results.

Towards Continuous LLM Learning (1): Sleepy Coder - When Fine-Tuning Fails

What if your AI coding assistant could learn from its mistakes? Not just for one session, but across training cycles. We built exactly that—and fifty-one adapters later, learned the mistake was trying to teach it at all.

Resource	Link
Video	Sleepy Coder
Code	sleepy-coder
Share Paper	arXiv:2602.06043
UWSH Paper	arXiv:2512.05117
Part 2	Routing Prevents Forgetting

The Dream: Day/Night Learning

AI coding agents have a memory problem. They fix a bug today, then make the same mistake next week. Every session starts from the same frozen model. Nothing carries forward.

The idea was elegant: build an agent that improves overnight.

DAY CYCLE (Inference)
  Agent attempts to fix Rust compiler errors
  Successes and failures are logged
        ↓
NIGHT CYCLE (Training)
  Fine-tune on failure patterns using LoRA
  Create specialized adapters
        ↓
EVAL
  Test against benchmark
  Measure improvement
        ↓
(repeat)

During the day, the agent works and we log its failures—the error messages, the broken code, and the fixes that worked. Overnight, we fine-tune the model on those failures. Each morning, a new checkpoint should wake up a little better than before.

We based this on two papers from the Johns Hopkins team (Kaushik, Vaidya, Chaudhari, Chellappa, Yuille):

Share LoRA Subspaces (arXiv:2602.06043) — Learn a shared low-rank basis across tasks, then train only coefficients (76x fewer parameters per task)
UWSH (arXiv:2512.05117) — The Universal Weight Subspace Hypothesis suggests neural networks converge to shared spectral subspaces

The theory was sound. The implementation worked. The results were devastating.

The System

The Sleepy Coder agent runs in a Rust runtime, fixing compiler errors on 30 “koans” (small coding exercises) across 5 error families:

Borrow Checker: Ownership and lifetime errors
Type Bounds: Missing trait implementations
Result Handling: Option/Result conversions
Type Mismatches: Incompatible types
Missing Items: Undefined functions or modules

The base model: Qwen2.5-Coder-1.5B-Instruct — small enough to train on a single GPU, capable enough to pass most koans without any fine-tuning.

The Journey: From Hope to Reality

Chapter 1: Naive LoRA

First attempt: standard fine-tuning on failure patterns.

Metric	Before	After
Pass Rate	73.3%	60.0%
Change	—	-13.3%

Catastrophic forgetting. The model learned the new patterns but forgot how to do everything else.

Chapter 2: The Paper Chase

We found the Share paper promising “continual learning without forgetting.” The UWSH paper provided theoretical backing: neural networks naturally converge to shared low-rank subspaces.

Key insight from Share:

Train ONLY the coefficients. Keep the basis FROZEN.

This meant ~21,000 trainable parameters instead of ~1.6 million. A 76x reduction.

Chapter 3: The Proper Implementation

SVD: Singular Value Decomposition breaks a matrix into components that reveal its underlying structure. In Share, SVD finds the common “directions” that multiple LoRA adapters share—a compressed basis that captures what they have in common.

We rebuilt everything:

Phase 1: Extract shared basis from 51 adapters via SVD
Phase 2: Train only coefficient vectors (frozen basis)
Phase 3: Merge and update basis periodically

We trained 51 pattern-specific adapters. We followed the algorithm precisely.

Chapter 4: The Stubborn Seven

No matter what we tried, 7 tasks kept failing:

Task	The Problem
bc_003	Mutable borrow while immutable exists
bc_005	Double mutable borrow
bc_010	Returning reference to local data
tb_002	Missing Clone trait
tb_007	Missing Hash trait
tb_008	Missing Ord trait
rh_004	Option to Result conversion

These require deep understanding of Rust’s ownership system—something a 1.5B model can’t reliably learn.

Chapter 5: The Final Score

Approach	Pass Rate	vs Baseline	Regressions
Baseline (no training)	73.3%	—	0
Naive LoRA	60.0%	-13.3%	Many
Targeted LoRA (7 patterns)	63.3%	-10%	4+
Replay buffer	70.0%	-3.3%	2
Phase 2 coef-only (10K params)	66.7%	-6.7%	2
Share Full (Ph2+Ph3)	73.3%	0%	0

The Share algorithm did exactly what it claimed: it prevented forgetting. But it couldn’t improve beyond baseline because there was nothing to improve.

What Went Wrong

1. The Model Already Knows

The base model already passes 73% of patterns. Training on these patterns doesn’t add knowledge—it dilutes what’s there.

2. Training Causes Forgetting

Even training only on the 7 failure patterns (44 examples) caused 4 new regressions. The model’s knowledge is interconnected.

3. Averaging Destroys Specialization

The Share paper assumes task routing at inference—selecting the right coefficients for each task. We averaged coefficients, which negated any specialization.
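A toy Python illustration of why blending can erase specialization. The coefficient values and task names are invented for the example and are not measurements from the experiments.

```python
# Toy coefficient vectors for two tasks that pull in different directions.
coeffs = {
    "borrow_checker": [0.8, -0.6, 0.0],
    "type_bounds":    [-0.8, 0.6, 0.4],
}

def average(vectors):
    """Element-wise mean across all task coefficient vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def route(task):
    """Pick the task's own coefficients instead of blending."""
    return coeffs[task]

blended = average(list(coeffs.values()))   # opposing components cancel
routed = route("borrow_checker")           # specialization preserved
```

In this toy case the first two components cancel to zero under averaging, while routing keeps them intact.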

4. More Adapters Made It Worse

Adapter Count	Pass Rate
6 adapters	73.3%
51 adapters	70.0%

More adapters meant more subspace dilution when averaging. The signal got lost in the noise.

The Critical Insight

LoRA fine-tuning cannot improve a capable base model for tasks it already handles reasonably well.

The model’s knowledge is interconnected. Even 10,000 trainable parameters (0.0007% of the model) can break things. The baseline represents the ceiling, not the floor.

What We Learned

Read the room. If your base model passes 73%, maybe it doesn’t need fine-tuning. Maybe it needs better prompts.
Negative results are results. 51 failed experiments taught us more than a successful one would have.
Catastrophic forgetting is real. Small models especially can’t absorb new knowledge without losing old.
Share prevents forgetting, not ignorance. The algorithm does what it claims—it just can’t create knowledge from nothing.
Sometimes the answer is “don’t.” The best LoRA adapter for this task is no adapter.
Task routing vs averaging matters. The Share paper assumes you select coefficients based on task type, not blend them together.
AI coding agents cut corners. When implementing research papers, AI agents repeatedly stopped before completing all phases of the algorithm. I had to direct the agent to re-read the papers many times before it implemented them correctly.

Paths Forward

Since fine-tuning doesn’t work here, the alternatives are:

Approach	Tradeoff
Prompt engineering	No weight changes, limited by context
Multi-turn repair	Uses base model reasoning, slower
Larger model (7B+)	More capacity to absorb knowledge
Task routing with Share	Select coefficients, don’t average
Model ensemble	Multiple models, pick best output
Accept baseline	73% may be good enough

The Numbers

Experiments run:        51 adapters, multiple algorithms
Parameters trained:     From 10K to 1.6M per adapter
Best achieved:          73.3% (matches baseline)
Target:                 ≥76.7%
Conclusion:             Target not achievable with LoRA

Resources

Sometimes the most valuable research shows what doesn’t work. Fifty-one adapters later, we know: let sleeping models lie.


February 12, 2026 • Software Wrighter

470 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Dropout (random disabling prevents overfitting), RLHF (learn from human preferences), Inference (using trained models), Quantization (lower precision for efficiency), Flash Attention (block-wise for memory savings).

Five ML Concepts - #9

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #9

References

Concept	Reference
Dropout	Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
RLHF	Training language models to follow instructions with human feedback (Ouyang et al. 2022)
Inference	Deep Learning (Goodfellow et al. 2016), Chapter 5
Quantization	A Survey of Quantization Methods for Efficient Neural Network Inference (Gholami et al. 2021)
Flash Attention	FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al. 2022)

Today’s Five

1. Dropout

A regularization technique that randomly disables units during training. This encourages the network to rely on multiple pathways instead of memorizing patterns.

It helps reduce overfitting, especially in large models.

Like training a team where random members sit out each practice, so no one becomes a single point of failure.
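A minimal Python sketch of the "inverted" variant that most frameworks use, where surviving units are rescaled during training so nothing changes at inference. The function name and signature are illustrative.

```python
import random

def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
out = dropout([1.0] * 10, p=0.5, rng=rng)
```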

2. RLHF (Reinforcement Learning from Human Feedback)

A training approach where humans rank or compare model outputs to produce a reward signal. The model is then optimized to better match human preferences.

This technique is central to aligning language models with human intent.

Like teaching by grading essays instead of dictating every word.
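The comparison step is often turned into a reward-model loss of the form -log(sigmoid(r_chosen - r_rejected)). A small Python sketch with made-up reward values:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the chosen output scores higher."""
    diff = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-diff))   # equals -log(sigmoid(diff))

agrees = preference_loss(2.0, 0.0)     # reward model agrees with the human
disagrees = preference_loss(0.0, 2.0)  # reward model disagrees
```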

3. Inference

The process of running a trained model to make predictions on new data. Training updates the model’s parameters; inference uses them.

The distinction matters for optimization, deployment, and cost.

Like the difference between studying for an exam and actually taking it.

4. Quantization

Reducing the numerical precision used to store and compute model weights. This can shrink model size and speed up inference, sometimes with a small accuracy tradeoff.

Essential for deploying large models on limited hardware.

Like compressing a high-resolution photo into a smaller file that still looks good.
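A minimal Python sketch of symmetric int8 quantization with a single scale per tensor. Real schemes vary (per-channel scales, zero points, calibration); this shows only the basic round trip.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integers back to approximate float weights."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight is within half a quantization step of the original, which is the accuracy tradeoff mentioned above.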

5. Flash Attention

An optimized attention algorithm designed to reduce memory usage. It avoids materializing the full attention matrix by computing attention in blocks.

This enables longer sequences and faster training.

Like reading a book chapter by chapter instead of photocopying the whole thing first.
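The block-wise trick rests on an "online softmax" recurrence. The Python sketch below shows that recurrence for a single query with scalar values. It is a simplification for illustration and leaves out the GPU memory tiling that gives FlashAttention its speed.

```python
import math

def attention_full(scores, values):
    """Reference: softmax over all scores at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total

def attention_blockwise(scores, values, block=2):
    """Same result, but only one block of scores is held at a time."""
    running_max = float("-inf")
    denom = 0.0      # running softmax denominator
    acc = 0.0        # running weighted sum of values
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial sums to the new maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc *= correction
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
```

Both functions return the same number up to floating-point error, which is why the block-wise form counts as exact attention rather than an approximation.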

Quick Reference

Concept	One-liner
Dropout	Random disabling to prevent overfitting
RLHF	Learn from human preference comparisons
Inference	Using a trained model for predictions
Quantization	Lower precision for smaller, faster models
Flash Attention	Block-wise attention for memory efficiency

Short, accurate ML explainers. Follow for more.


February 11, 2026 • Software Wrighter

477 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Bias-Variance Tradeoff (balance under/overfitting), Diffusion (generate by learning to denoise), KV Cache (store past keys/values), Mixed Precision (lower precision for speed), MLA (compress attention into latent space).

Five ML Concepts - #8

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #8

References

Concept	Reference
Bias-Variance	The Elements of Statistical Learning (Hastie et al. 2009), Chapter 7
Diffusion	Denoising Diffusion Probabilistic Models (Ho et al. 2020)
KV Cache	Fast Transformer Decoding (Pope et al. 2022)
Mixed Precision	Mixed Precision Training (Micikevicius et al. 2017)
MLA	DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek-AI 2024)

Today’s Five

1. Bias-Variance Tradeoff

A fundamental tension where simpler models tend to underfit (high bias), and more flexible models can overfit (high variance). The goal is finding a balance that generalizes well to unseen data.

One of the oldest ideas in machine learning, still relevant today.

Like choosing between a ruler that only draws straight lines and one so flexible it traces every bump.

2. Diffusion Models

A generative approach that trains a model to reverse a gradual noising process. During generation, the model starts from noise and removes it step by step.

The foundation of image generators like Stable Diffusion and DALL-E 2.

Like learning to restore a photo by practicing on progressively more damaged versions.

3. KV Cache

A technique that stores attention key and value tensors from earlier tokens so they don’t need to be recomputed during generation. This significantly speeds up autoregressive inference.

Essential for efficient LLM serving.

Like keeping notes from earlier in a conversation instead of rereading everything.
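A toy single-head Python sketch of the idea. The per-token query, key, and value vectors are hard-coded here; a real model computes them by projecting hidden states.

```python
import math

def attend(query, keys, values):
    """Single-head dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        """Append this token's key/value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy (query, key, value) triples, one per generated token.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
          ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
          ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]
cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in tokens]
```

Each step adds one key and one value instead of recomputing them for the whole prefix. The cost is memory that grows with sequence length.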

4. Mixed Precision

A training strategy that uses lower-precision math for most operations, while keeping some calculations in higher precision for stability. This reduces memory use and often speeds up training with little accuracy loss.

Standard practice for modern deep learning.

Like drafting in pencil and only using ink for the final signature.
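One reason some state stays in higher precision: small updates can vanish entirely in FP16. The Python sketch below uses the standard library's half-precision packing to show the effect. Python floats are 64-bit, standing in for the FP32 master weights that frameworks typically keep.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

update = 1e-4                       # a small gradient step

# Pure FP16: the update is below half the spacing near 1.0, so it vanishes.
w16 = to_fp16(1.0)
for _ in range(100):
    w16 = to_fp16(w16 + update)

# Higher-precision master copy: updates accumulate, cast down only for use.
master = 1.0
for _ in range(100):
    master += update
w16_from_master = to_fp16(master)
```

After 100 steps the pure FP16 weight has not moved at all, while the master copy has advanced by roughly 0.01.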

5. MLA (Multi-head Latent Attention)

An attention variant that compresses key and value information into a lower-dimensional latent space. This reduces memory usage for long sequences while retaining useful context.

Used in DeepSeek-V2 and related architectures.

Like summarizing meeting notes instead of recording every word verbatim.

Quick Reference

Concept	One-liner
Bias-Variance	Balance underfitting vs overfitting
Diffusion	Generate by learning to denoise
KV Cache	Store past keys/values for fast inference
Mixed Precision	Lower precision for speed, higher for stability
MLA	Compress attention into latent space

Short, accurate ML explainers. Follow for more.


February 11, 2026 • Software Wrighter

1033 words • 6 min read • Abstract

From behavioral emulation to real implementation: integrating hash-based Engram memory with HuggingFace models. The gating mechanism is critical: it learns when to trust memory lookup and when hash collisions would add noise. Engram excels at exact-match retrieval, not generalization.

Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation

We started by training models to act like they had memory. Then we found an open source implementation that does it for real. This is what we learned.

Resource	Link
Paper	arXiv:2601.07372
Our Code	engram-poc
Reference	weagan/Engram
Video	Engram Revisited
Playlist	All Engram Videos

The Journey

Phase 1: Behavioral Emulation

Part 2 described our first approach: LoRA fine-tuning to make a model behave like it has memory. Train on patterns, and the model learns to respond consistently.

Metric	Baseline	LoRA-tuned
Accuracy	8.6%	14.1%
Improvement	-	+63% relative

It worked, but the architecture was unchanged. We were approximating Engram benefits, not implementing them.

Phase 2: The Discovery

Then we found weagan/Engram on GitHub—real hash-based memory in ~300 lines of Python:

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedEngramModule(nn.Module):
    def __init__(self, table_size=50000, d_model=512):
        # nn.Module must be initialized before parameters are assigned
        super().__init__()

        # Large learnable memory table
        self.memory_table = nn.Parameter(torch.zeros(table_size, d_model))

        # Gate decides when to trust memory
        self.gate = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid()
        )

    def forward(self, hidden_states, input_ids):
        # O(1) hash lookup (multi_head_hash is not shown in this excerpt)
        indices = self.multi_head_hash(input_ids)
        retrieved = F.embedding(indices, self.memory_table)

        # Gated injection: gate_score lies in (0, 1) and broadcasts over d_model
        gate_score = self.gate(torch.cat([hidden_states, retrieved], dim=-1))
        return hidden_states + gate_score * retrieved

The key insight: the gate decides when to trust the lookup. Not every token needs memory.

Phase 3: Integration with HuggingFace

We ported the module to work with HuggingFace models:

SmolLM-135M (frozen)
        ↓
EnhancedEngramModule (per layer)
  - 50K slot memory table
  - O(1) hash-based lookup
  - Learned gating
        ↓
Output

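A timing check is consistent with O(1) lookup per token:

Sequence Length	Lookup Time	Expected if linear in length
64 tokens	0.15 ms	-
2048 tokens	2.77 ms	4.8 ms

Total lookup time grew about 18x for 32x more tokens. That is below linear scaling, which is consistent with a constant-time lookup per token, though two timing points are evidence rather than proof.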

The Reality Check

Here’s where it gets interesting. Real Engram memory excels at some tasks and hurts others.

Where Engram Helps

Task Type	Baseline	Engram	Change
Acronym expansion	25%	75%	+200%
Element symbols	33%	67%	+103%
Long-term fact recall	90%	100%	+11%

For exact-match lookups with structured keys, Engram dominates.

Where Engram Hurts

Task Type	Baseline	Engram	Change
World capitals	83%	67%	-19%
Pattern completion	14%	11%	-21%

For tasks where the base model already knows the answer, Engram’s hash collisions add noise.

The Key Insight

Engram is a specialized tool, not a general enhancement.

Use Engram For	Don’t Use Engram For
FAQ responses	Creative generation
Terminology lookup	Novel combinations
Entity facts	Context-dependent answers
Code boilerplate	Reasoning tasks

The gating mechanism is critical: it must learn to suppress memory when it doesn’t help. Without proper gating, hash collisions inject noise into every token.

Obstacles Encountered

1. Hash Collisions

Different inputs can map to the same memory slot. The gate must learn to ignore irrelevant retrievals.

2. Parameter Explosion

50K slots × 768 dimensions × 30 layers = 1.2B additional parameters. We had to inject selectively (every 4th layer) to stay practical.
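The arithmetic behind that figure, as a short Python check. Which layers count as "every 4th" is an assumption here (layers 0, 4, 8, and so on).

```python
slots, dim, layers = 50_000, 768, 30

per_layer = slots * dim                 # parameters in one memory table
all_layers = per_layer * layers         # a table in every layer

# "Every 4th layer", assuming layers 0, 4, 8, ... carry a table.
selected = list(range(0, layers, 4))
selective = per_layer * len(selected)
```

Under that assumption, 8 of the 30 layers carry a table, which cuts the added parameters from about 1.15B to about 307M.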

3. Training Dynamics

Memory tables start at zero. They need higher learning rates (10x) to develop meaningful representations before the model learns to use them.

4. Evaluation Mismatch

Our pattern completion task wasn’t ideal for hash-based memory. Engram shines on exact-match retrieval, not generalization.

Combined Approach

The best results came from combining both methods:

Base Model (SmolLM-135M)
        ↓
EnhancedEngramModule
  - Long-term fact storage
  - O(1) lookup for known patterns
        ↓
LoRA Adapters
  - Pattern completion
  - Domain-specific behaviors
        ↓
Output

This gives you:

Long-term memory from hash tables
Pattern consistency from behavioral training
Flexibility to disable either component

What We Learned

Emulation vs Implementation: LoRA fine-tuning approximates memory behavior; hash tables implement it. Both have their place.
Gating is Essential: The learned gate prevents hash collisions from degrading performance. Never use Engram without gating.
Match Task to Tool: Hash-based memory excels at exact lookups, not pattern generalization. Use it where applicable.
Selective Application: Don’t inject Engram everywhere. Target layers and use cases where it helps.
The Gate as a Safety Valve: When the gate learns to output near-zero for a task, that’s the model telling you Engram doesn’t help there. Listen to it.

Resources

Engram Paper (arXiv:2601.07372)
engram-poc Repository - Our implementation
weagan/Engram - Reference implementation
Engram Revisited Video
Engram Video Playlist
Part 1: mHC
Part 2: Engram Introduction

Series Recap

Part	Topic	Key Insight
1	mHC	Doubly-stochastic constraints bound signal amplification
2	Engram Intro	O(1) lookup beats recomputing through attention
3	Engram Revisited	Use Engram where applicable; gate to avoid worse results

Hash-based memory is powerful but specialized. The gate decides when to use it—and when not to.


February 10, 2026 • Software Wrighter

469 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Cross-Validation (rotate held-out data), GPT (predict next token at scale), GQA (shared keys/values for efficiency), Context Window (how much the model sees), Self-Attention (each token attends to all others).

Five ML Concepts - #7

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #7

References

Concept	Reference
Cross-Validation	A Study of Cross-Validation and Bootstrap (Kohavi 1995)
GPT	Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
GQA	GQA: Training Generalized Multi-Query Transformer Models (Ainslie et al. 2023)
Context Window	Attention Is All You Need (Vaswani et al. 2017)
Self-Attention	Attention Is All You Need (Vaswani et al. 2017)

Today’s Five

1. Cross-Validation

A technique that splits data into multiple folds to evaluate model performance on data it wasn’t trained on. By rotating which data is held out, it gives a more reliable estimate of generalization.

Essential for honest model evaluation.

Like practicing with different sets of flashcards to see if you actually learned the material.
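A minimal Python sketch of k-fold index splitting. It assumes the data is already shuffled; library implementations usually offer shuffling and stratification as well.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is held out exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
```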

2. GPT

Generative Pre-trained Transformer. A family of autoregressive language models trained to predict the next token in a sequence.

Many AI assistants and chatbots are built on this approach.

Like autocomplete, but scaled up and trained on vast text data.

3. GQA (Grouped Query Attention)

An attention variant where multiple query heads share key and value projections. This reduces memory usage and can speed up inference compared to standard multi-head attention.

Widely adopted in efficient transformer architectures.

Like several students sharing one set of notes instead of copying everything separately.
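The sharing pattern is just an index mapping from query heads to key/value heads. A Python sketch with hypothetical head counts (8 query heads, 2 key/value heads):

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Map a query head to the key/value head its group shares."""
    assert n_query_heads % n_kv_heads == 0
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

mapping = [kv_head_for(h, n_query_heads=8, n_kv_heads=2) for h in range(8)]
```

With one key/value head this reduces to multi-query attention. With as many key/value heads as query heads it is standard multi-head attention.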

4. Context Window

The maximum number of tokens a model can process in a single forward pass. Larger context windows allow longer inputs, but increase memory and compute costs.

A key constraint in language model design.

Like the size of a desk that limits how many papers you can spread out at once.

5. Self-Attention

A mechanism where each token computes attention scores with other tokens in the same sequence. This lets the model weigh which parts of the input are most relevant to each position.

The core operation inside transformers.

Like everyone in a meeting deciding who to listen to based on the conversation.
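A compact Python sketch of the attention-weight computation. For simplicity the same toy vectors act as both queries and keys, which skips the learned projections a real transformer applies.

```python
import math

def attention_weights(queries, keys):
    """Row i holds how much token i attends to every token j."""
    scale = math.sqrt(len(keys[0]))
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(x, x)
```

Each row sums to one, so it can be read as how a token splits its attention across the sequence.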

Quick Reference

Concept	One-liner
Cross-Validation	Rotate held-out data for reliable evaluation
GPT	Predict next token, at scale
GQA	Shared keys/values for efficient attention
Context Window	How much the model sees at once
Self-Attention	Each token attends to all others

Short, accurate ML explainers. Follow for more.


February 9, 2026 • Software Wrighter

491 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Regularization (constraints to prevent overfitting), BERT (bidirectional masked language modeling), RoPE (position via rotation in attention), Prompting (craft inputs to steer outputs), Positional Encoding (tell model where tokens are).

Five ML Concepts - #6

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #6

References

Concept	Reference
Regularization	Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
BERT	BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018)
RoPE	RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al. 2021)
Prompting	Language Models are Few-Shot Learners (Brown et al. 2020)
Positional Encoding	Attention Is All You Need (Vaswani et al. 2017)

Today’s Five

1. Regularization

Techniques that reduce overfitting by adding constraints or penalties during training. Common examples include L2 weight decay, L1 sparsity, dropout, and early stopping.

The goal is better generalization, not just fitting the training set.

Like adding friction so a model can’t take the easiest overfit path.
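
A small sketch of one of these penalties, L2 weight decay, on a toy linear model. The data and the penalty strength `lam` are made-up values; the point is that two weight vectors can fit the data equally well while the penalty prefers the smaller one.

```python
def mse(weights, xs, ys):
    """Mean squared error of a linear model without bias."""
    preds = [sum(w * x for w, x in zip(weights, row)) for row in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(weights, xs, ys, lam=0.1):
    return mse(weights, xs, ys) + l2_penalty(weights, lam)

# Toy data with two redundant features, so many weight vectors fit perfectly.
xs = [[1.0, 1.0], [2.0, 2.0]]
ys = [2.0, 4.0]

small = [1.0, 1.0]    # fits the data with small weights
large = [3.0, -1.0]   # fits the data equally well with larger weights

print(mse(small, xs, ys), mse(large, xs, ys))
print(regularized_loss(small, xs, ys), regularized_loss(large, xs, ys))
```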

2. BERT

Bidirectional Encoder Representations from Transformers. A transformer encoder trained with masked language modeling: predicting hidden tokens using context from both sides.

It was a major step forward for many NLP tasks after its 2018 release.

Like filling in blanks by reading the whole sentence, not just the words before it.

3. RoPE (Rotary Positional Embeddings)

A way to represent token position inside attention by rotating query and key vectors as a function of position. This gives attention information about relative order and distance.

It’s widely used in modern transformer models.

Like turning a dial differently for each position so the model can tell where tokens are.
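
A minimal sketch of the rotation idea, assuming the frequency schedule from the RoFormer paper (base 10000) and made-up query and key vectors. It checks the property described above: the attention score depends on the offset between two positions, not on where they sit in the sequence.

```python
import math

def rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up 4-dimensional query and key vectors.
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.1]

score_near = dot(rotate(q, 5), rotate(k, 3))          # positions 5 and 3
score_shifted = dot(rotate(q, 105), rotate(k, 103))   # same offset, 100 tokens later
score_other = dot(rotate(q, 5), rotate(k, 1))         # different offset

print(score_near, score_shifted, score_other)
```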

4. Prompting

Crafting inputs to steer a model toward the output you want. Small changes in instructions, examples, or format can change behavior significantly.

A key skill for working effectively with language models.

Like asking a question in just the right way to get a useful answer.

5. Positional Encoding

Transformers need a way to represent token order, because attention alone doesn’t include sequence position. Different methods do this, including learned embeddings and rotary approaches like RoPE.

Without it, “the cat sat on the mat” would be indistinguishable from “mat the on sat cat the.”

Like numbering the pages of a shuffled book so you can read them in order.
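
A sketch of the fixed sinusoidal encoding from Attention Is All You Need, one of several possible schemes. The one-hot word embeddings are hypothetical stand-ins. Without positions, a sentence and its reversal look like the same set of vectors; with positions added, they differ.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Sine on even dimensions, cosine on odd dimensions, at geometrically spaced frequencies."""
    enc = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]

# Hypothetical one-hot embeddings for a three-word vocabulary.
embed = {
    "the": [1.0, 0.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0, 0.0],
    "sat": [0.0, 0.0, 1.0, 0.0],
}

def encode(words, use_positions):
    rows = []
    for pos, word in enumerate(words):
        vec = embed[word]
        if use_positions:
            pe = sinusoidal_encoding(pos, len(vec))
            vec = [v + p for v, p in zip(vec, pe)]
        rows.append(vec)
    return rows

def as_set(rows):
    """Order-insensitive view of the rows."""
    return sorted(tuple(round(x, 6) for x in r) for r in rows)

a = ["the", "cat", "sat"]
b = ["sat", "cat", "the"]
same_without = as_set(encode(a, False)) == as_set(encode(b, False))
same_with = as_set(encode(a, True)) == as_set(encode(b, True))
print(same_without, same_with)
```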

Quick Reference

Concept	One-liner
Regularization	Add constraints to prevent overfitting
BERT	Bidirectional masked language modeling
RoPE	Position info via rotation in attention
Prompting	Craft inputs to steer model outputs
Positional Encoding	Tell the model where tokens are in sequence

Short, accurate ML explainers. Follow for more.

February 8, 2026 • Software Wrighter

493 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Perceptron (single linear unit ancestor), Pre-training (learn general patterns first), Speculative Decoding (draft fast, verify in parallel), In-Context Learning (adapt from prompt examples), Latent Space (internal representations where similar things cluster).

Five ML Concepts - #5

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #5

References

Concept	Reference
Perceptron	The Perceptron: A Probabilistic Model (Rosenblatt 1958)
Pre-training	BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018)
Speculative Decoding	Fast Inference from Transformers via Speculative Decoding (Leviathan et al. 2022)
ICL	Language Models are Few-Shot Learners (Brown et al. 2020)
Latent Space	Auto-Encoding Variational Bayes (Kingma & Welling 2013)

Today’s Five

1. Perceptron

The simplest neural network: a single linear unit with weights and a bias. It computes a weighted sum and applies a threshold or activation.

It inspired modern neural networks, even though today’s models are far more complex.

Like a single voter weighing inputs before deciding yes or no.
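
A minimal sketch of a perceptron learning the logical AND function with the classic perceptron update rule. The learning rate and epoch count are arbitrary choices for this toy problem.

```python
def predict(weights, bias, inputs):
    """Weighted sum plus bias, followed by a hard threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=20, lr=0.1):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Nudge weights toward the target only when the prediction is wrong.
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(and_gate)

for inputs, target in and_gate:
    print(inputs, "->", predict(weights, bias, inputs), "(expected", target, ")")
```

AND is linearly separable, so a single unit can learn it. XOR is the well-known case it cannot.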

2. Pre-training

Training a model on a large, general dataset before adapting it to a specific task. This gives the model broad patterns that later training can refine.

BERT, GPT, and most modern LLMs use this approach.

Like going to medical school before choosing a specialty.

3. Speculative Decoding

A technique where a small, fast model proposes tokens, and a larger model verifies or rejects them in parallel. This can speed up inference without changing the larger model’s output distribution.

A key optimization for production LLM deployments.

Like a junior writer drafting text for a senior editor to approve in batches.
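
A toy sketch of the draft-and-verify loop under greedy decoding. Both "models" are hypothetical arithmetic functions, not real language models, and the bonus token a real implementation gains when every draft token is accepted is left out. It shows that the result matches what the large model would produce alone, with fewer verification passes.

```python
def target_next(context):
    """Hypothetical 'large' model: next token is (sum of context) mod 7."""
    return sum(context) % 7

def draft_next(context):
    """Hypothetical 'small' model: agrees with the target unless the sum is divisible by 5."""
    total = sum(context)
    return (total + 1) % 7 if total % 5 == 0 else total % 7

def greedy_decode(context, n_tokens):
    """Baseline: one target call per generated token."""
    out = list(context)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out

def speculative_decode(context, n_tokens, k=4):
    out = list(context)
    goal = len(context) + n_tokens
    target_passes = 0
    while len(out) < goal:
        # 1. Draft k tokens cheaply with the small model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify. A real system scores all k positions in one batched target pass.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch, drop the rest
                break
        out.extend(accepted)
    return out[:goal], target_passes

prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 12)
fast, passes = speculative_decode(prompt, 12)
print(fast == baseline, passes)
```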

4. In-Context Learning (ICL)

When a model adapts its behavior using examples in the prompt, without updating its weights. It allows flexible task behavior at inference time.

This emergent capability surprised researchers when GPT-3 demonstrated it.

Like solving a new puzzle after seeing a few worked examples.

5. Latent Space

The internal representations a model learns as it processes data. In this space, similar inputs tend to be located near each other.

It’s not a literal place, but a useful way to think about how models organize information.

Like a map where cities are arranged by similarity instead of geography.

Quick Reference

Concept	One-liner
Perceptron	Single linear unit—the neural network ancestor
Pre-training	Learn general patterns before specializing
Speculative Decoding	Draft fast, verify in parallel
ICL	Adapt from prompt examples without training
Latent Space	Internal representations where similar things cluster

In-Context Learning Revisited: From Mystery to Engineering — A deeper exploration of how ICL evolved from emergent surprise to engineered capability.

Short, accurate ML explainers. Follow for more.

February 7, 2026 • Software Wrighter

453 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Activation Functions (introduce nonlinearity), Transfer Learning (reuse knowledge across tasks), VLM (joint image-text understanding), Adam (adaptive learning rates), Superposition (many concepts in overlapping representations).

Five ML Concepts - #4

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #4

References

Concept	Reference
Activation Functions	Deep Learning (Goodfellow et al. 2016), Chapter 6
Transfer Learning	A Survey on Transfer Learning (Pan & Yang 2010)
VLM	Learning Transferable Visual Models (CLIP) (Radford et al. 2021)
Adam	Adam: A Method for Stochastic Optimization (Kingma & Ba 2014)
Superposition	Toy Models of Superposition (Elhage et al. 2022)

Today’s Five

1. Activation Functions

Functions like ReLU, sigmoid, and tanh that transform neuron outputs. They introduce nonlinearity, allowing networks to learn complex patterns beyond simple linear relationships.

Without them, any stack of layers would collapse into a single linear transformation.

Like an on-off switch that can also dim the lights.
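
A small sketch of why nonlinearity matters, using made-up 2x2 weight matrices. Two linear layers give the same result as one merged matrix; putting a ReLU between them changes the result.

```python
def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Matrix-matrix product."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(v):
    return [max(0.0, x) for x in v]

# Made-up weights for two layers.
w1 = [[1.0, -2.0], [0.5, 1.0]]
w2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

two_linear_layers = matvec(w2, matvec(w1, x))
one_merged_layer = matvec(matmul(w2, w1), x)
with_relu = matvec(w2, relu(matvec(w1, x)))

print(two_linear_layers, one_merged_layer, with_relu)
```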

2. Transfer Learning

Using knowledge a model learned on one task to improve performance on a related task. This often reduces training time and data requirements dramatically.

Pre-trained models can be fine-tuned for specific applications.

Like a chef who already knows French cooking learning Japanese cuisine faster.

3. VLM (Vision-Language Models)

Models trained to work with both images and text. They learn shared representations that connect visual and language understanding.

CLIP, GPT-4V, and LLaVA are examples of this approach.

Like someone who can look at a photo and describe what’s happening.

4. Adam

An optimizer that adapts the learning rate for each parameter using running averages of past gradients and their squares. It combines ideas from momentum and adaptive learning-rate methods such as RMSProp.

One of the most popular optimizers in deep learning.

Like a hiker who adjusts step size for each part of the trail, steep or flat.
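
A sketch of the Adam update for a single parameter, following the update rule in the Adam paper with its default betas. The objective (a simple quadratic) and the learning rate of 0.01 are illustrative choices.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)

print(round(x, 3))
```

Dividing by the root of the squared-gradient average is what gives each parameter its own effective step size.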

5. Superposition

A way neural networks represent many concepts using overlapping directions in the same space. This allows models to pack more information into fewer neurons than expected.

It’s why interpretability is hard—features aren’t neatly separated.

Like discovering a painting has hidden layers that appear under the right light.

Quick Reference

Concept	One-liner
Activation Functions	Introduce nonlinearity to enable complex patterns
Transfer Learning	Reuse knowledge from one task for another
VLM	Joint understanding of images and text
Adam	Adaptive per-parameter learning rates
Superposition	Many concepts packed into overlapping representations

Short, accurate ML explainers. Follow for more.

February 6, 2026 • Software Wrighter

524 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Loss Function (how far off predictions are), Overfitting (memorizing vs learning), Fine-tuning (specializing pre-trained models), LoRA (efficient adaptation with small matrices), Tokenization (breaking text into digestible pieces).

Five ML Concepts - #3

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #3

References

Concept	Reference
Loss Function	On Loss Functions for Deep Neural Networks in Classification (Janocha & Czarnecki 2017)
Overfitting	Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
Fine-tuning	A Comprehensive Survey on Transfer Learning (Zhuang et al. 2020)
LoRA	LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
Tokenization	Neural Machine Translation of Rare Words with Subword Units (Sennrich et al. 2015)

Today’s Five

1. Loss Function

A formula that measures how far off the model’s predictions are from the correct answers. It quantifies the gap between what the model predicted and what it should have predicted.

Training a neural network means minimizing this function.

Like a scorecard that tells the model how badly it messed up.
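
Two common loss functions as a sketch: mean squared error for regression and cross-entropy for classification. The predictions and targets are made-up numbers.

```python
import math

def mean_squared_error(predictions, targets):
    """Average squared gap between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_index])

mse = mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])

# The correct class is index 1 in both cases.
good = cross_entropy([0.1, 0.8, 0.1], true_index=1)
bad = cross_entropy([0.7, 0.2, 0.1], true_index=1)

print(round(mse, 4), round(good, 4), round(bad, 4))
```

A confident correct prediction scores a low loss; a confident wrong one scores a high loss.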

2. Overfitting

When a model learns the training data too well, including noise and outliers, and fails on new data. The model performs great on examples it has seen but poorly on anything new.

One of the most common pitfalls in machine learning.

Like memorizing the answers to a test instead of understanding the subject.

3. Fine-tuning

Taking a pre-trained model and training it further on a specific task or dataset. Instead of training from scratch, you start from a model that already understands language or images, then specialize it.

This makes powerful models accessible without massive compute budgets.

Like teaching a chef who already knows cooking to specialize in sushi.

4. LoRA (Low-Rank Adaptation)

An efficient fine-tuning method that trains a small number of added parameters instead of the full model. It adds pairs of small, trainable low-rank matrices alongside selected weight matrices (typically the attention projections) while keeping the original weights frozen.

This dramatically reduces the memory and compute needed for fine-tuning.

Like adding sticky notes to a textbook instead of rewriting the whole thing.
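
A sketch of the low-rank update for one weight matrix, with made-up numbers. `B` starts at zero (as in the LoRA paper), so the adapted layer initially behaves exactly like the frozen one. The 4096-wide layer and rank 8 in the parameter count are illustrative assumptions.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(x, w_frozen, a, b, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]

def trainable_params(d_in, d_out, rank):
    """A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank

w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 0.0, 1.0]]  # frozen 3x3 weight
a = [[0.5, -0.5, 1.0]]                                     # rank-1 A (1x3)
b = [[0.0], [0.0], [0.0]]                                  # rank-1 B (3x1), zero at start
x = [1.0, 2.0, 3.0]

y_start = lora_forward(x, w, a, b)

full = 4096 * 4096
lora = trainable_params(4096, 4096, rank=8)
print(y_start, f"{lora / full:.2%} of the full matrix")
```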

5. Tokenization

The process of breaking text into smaller units called tokens that a model can process. Most modern models use subword tokenization, splitting words into common pieces rather than individual characters or whole words.

It determines what the model actually “sees” and affects everything from vocabulary size to multilingual performance.

Like chopping sentences into bite-sized pieces a model can digest.
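
A toy sketch of byte-pair encoding (BPE), the subword method from the Sennrich et al. reference: repeatedly merge the most frequent adjacent pair of symbols. The tiny word-count corpus is made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Made-up corpus: word -> count, each word starting as a sequence of characters.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

After a few merges, the common ending "est" becomes a single token shared by "newest" and "widest".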

Quick Reference

Concept	One-liner
Loss Function	How far off the model’s predictions are
Overfitting	Memorizing the training data instead of learning the subject
Fine-tuning	Specializing a pre-trained model for a new task
LoRA	Efficient fine-tuning with small added matrices
Tokenization	Breaking text into the pieces a model actually reads

Short, accurate ML explainers. Follow for more.

February 5, 2026 • Software Wrighter

1779 words • 9 min read • Abstract

Unix invented pipes. Mainframes reinvented them for records, not bytes. This Throwback Thursday recreates CMS/TSO Pipelines in Rust with a visual debugger, demonstrating record-oriented dataflow from the 1996 Olympics web server era.

TBT (2/?): Pipelines on OS/390

Unix invented pipes. Mainframes reinvented them—for records, not bytes.

This is the second Throwback Thursday post, revisiting technologies that shaped how I think about programming. This time: CMS/TSO Pipelines, and a vibe coding project that brings them back to life in Rust for education, fun, and nostalgia.

Resource	Link
Code	pipelines-rs
Demo	Live Demo
Video	Pipelines on OS/390 #TBT

The 1996 Olympics and a Pair of Mainframes

In 1996, IBM hosted the Olympics Web Server—one of the largest public web properties at the time. Many distributed IBM systems in different regions served dynamic web pages. The logs from all of them were funneled to a pair of IBM S/390 mainframes I was in charge of, running OS/390 (formerly MVS).

When you’re processing millions of log records for statistics and forensics, you need tools that think in records, not lines. That’s where Pipelines for TSO/E came in.

Pipelines for TSO/E was the MVS/ESA port of CMS Pipelines, which ran on VM/ESA. Both let you chain stages together to filter, transform, and aggregate record-oriented data, an approach that evolved in parallel with Unix’s byte-stream pipes.

Two Traditions of Piping

Unix pipes came first: Thompson and McIlroy at Bell Labs, 1969–1974. Byte streams, file descriptors, the | operator. Brutally simple. Explosively powerful. POSIX.1-1988 standardized pipe(2), and POSIX.2 (1992) later standardized shell pipelines, though POSIX work began in the mid-1980s.

CMS Pipelines emerged on IBM mainframes in the mid-to-late 1980s. They weren’t a Unix clone—they were convergent evolution under different pressures. Where Unix piped bytes between small programs, CMS piped records through declarative stages. Pipelines for TSO/E followed in the late 1980s and early 1990s, porting CMS concepts to the MVS multi-user environment. Unlike CMS Pipelines (which ships with z/VM), the TSO/E port is typically installed separately on z/OS.

Neither tradition was “behind.” They were optimizing different dimensions:

	Unix Pipes	CMS/TSO Pipelines
Era	1969–1974	Mid-to-late 1980s
Data unit	Byte stream	Records (fixed or variable length)
Stage input	stdin (bytes)	Record buffer
Field access	`awk`, `cut` (text parsing)	Column positions (direct)
Execution	Typically a process per stage	Stages in one address space
Topology	Linear by default; fan-out/fan-in via `tee`, FIFOs, or process substitution	Multi-stream, fan-out/fan-in built in
Philosophy	Small tools, ad hoc composition	Declarative data transformation

Many datasets on mainframes are record-structured. Records can be fixed-length or variable-length. CMS and TSO/E Pipelines treat records as byte arrays—character-oriented stages assume EBCDIC text, while position/length stages are binary-safe. A fixed-length 80-byte record isn’t arbitrary text—columns 1-8 are the name, 9-18 are the department, 19-26 are the salary. You don’t parse. You just read the right columns.

Unix won culturally—cheap hardware, academic distribution, C portability. But IBM’s record-oriented pipelines were better at structured dataflow, and they anticipate or parallel patterns seen in ETL frameworks like Spark and Beam.

CMS Pipelines ships with z/VM and is still used; Pipelines for TSO/E exists for z/OS but isn’t universally installed. These are not historical curiosities—mainframes continue to process a significant share of high-value transactions, and pipelines remain an available tool for data transformation on those systems.

What a Pipeline Looks Like

CMS Pipelines uses a DSL with PIPE as the command, | to chain stages, and ? as a command terminator (it prevents the console from being used as implicit input):

PIPE CONSOLE
| FILTER 18,10 = "SALES"
| SELECT 0,8,0; 8,10,8
| CONSOLE
?

This reads input records, keeps only those where the 10-byte field at offset 18 (offsets 18–27) equals “SALES”, extracts the name fields, and writes the result. No regex. No string splitting. Just column positions.

Note: pipelines-rs uses 0-based offsets (e.g., SELECT 0,8,0). Historical CMS Pipelines uses 1-based column positions.

Compare with the Unix equivalent:

cat input.txt | awk '$3 == "SALES" {print $1, $2}'

The Unix version looks simpler—until your fields contain spaces, or your records contain non-text bytes, or you need to chain 15 stages without spawning 15 processes.

Bringing It Back in Rust (Vibe Coding)

pipelines-rs is a nostalgia-driven vibe coding project—my attempt to emulate Pipelines for TSO/E in Rust, not because it’s practical, but because these ideas deserve to be celebrated. It supports a subset of stages and features two execution models:

The Two Executors

Batched processes all records through one stage before moving to the next:

All records → Stage 1 → All records → Stage 2 → All records → Stage 3

This produces the correct output and is faster, but doesn’t demonstrate record-oriented dataflow well.

Record-At-a-Time (RAT) sends each record through the entire pipeline before reading the next:

Record 1 → Stage 1 → Stage 2 → Stage 3 → Output
Record 2 → Stage 1 → Stage 2 → Stage 3 → Output
Record 3 → Stage 1 → Stage 2 → Stage 3 → Output

RAT is the implementation shown in the video. It’s a naive approach—more buffers, more copying—but it shows the dataflow concepts clearly and enables the visual debugger. Both run in linear time (records × stages) and produce identical output for all 23 test specifications.
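
The two execution orders can be sketched in a few lines of Python. This is an illustration of the idea only, not the Rust code from pipelines-rs: the stage functions are hypothetical, each stage is modeled as a function from one record to a list of records, and stateful stages such as COUNT (which need a flush phase) are left out.

```python
def upper(record):
    return [record.upper()]

def keep_sales(record):
    return [record] if "SALES" in record else []

def duplicate(record):
    return [record, record]

def run_batched(records, stages):
    """Push every record through one stage before starting the next stage."""
    for stage in stages:
        next_records = []
        for record in records:
            next_records.extend(stage(record))
        records = next_records
    return records

def run_record_at_a_time(records, stages):
    """Push each record through the whole pipeline before reading the next one."""
    output = []
    for record in records:
        in_flight = [record]
        for stage in stages:
            in_flight = [out for rec in in_flight for out in stage(rec)]
        output.extend(in_flight)
    return output

records = ["smith sales", "jones engineer", "doe sales"]
stages = [upper, keep_sales, duplicate]

batched = run_batched(records, stages)
rat = run_record_at_a_time(records, stages)
print(batched == rat, rat)
```

For stateless stages like these, both orders visit the same (record, stage) pairs and emit records in the same order, which is why the outputs match.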

A future version will aim for fewer buffers and fewer copy operations. Whether it’s faster than Batched remains to be seen.

The 80-Byte Record

The Rust implementation supports fixed-length records only. The fundamental data type is the Record—exactly 80 bytes, matching historical punch card width. Variable-length input lines are accepted and padded to 80 bytes:

pub const RECORD_WIDTH: usize = 80;

pub struct Record {
    data: [u8; RECORD_WIDTH],
}

Fields are accessed by column position and length. No parsing, no delimiters. The data is always right where you expect it.
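
A Python sketch of the same idea, for readers who want to experiment without the Rust project. The helper names are hypothetical, and the comparison rule (trailing blanks ignored) is an assumption of this sketch rather than documented pipelines-rs behavior. The field layout follows the sample data used later in this post.

```python
RECORD_WIDTH = 80

def to_record(line):
    """Truncate or blank-pad an ASCII line to exactly RECORD_WIDTH bytes."""
    return line.encode("ascii")[:RECORD_WIDTH].ljust(RECORD_WIDTH, b" ")

def field(record, offset, length):
    """Read a field by 0-based offset and length; no delimiters involved."""
    return record[offset:offset + length]

def filter_eq(records, offset, length, value):
    """Keep records whose field equals value (assumption: trailing blanks are ignored)."""
    wanted = value.encode("ascii")
    return [r for r in records if field(r, offset, length).rstrip(b" ") == wanted]

rows = [
    ("SMITH", "JOHN", "SALES", 50000),
    ("JONES", "MARY", "ENGINEER", 75000),
    ("DOE", "JANE", "SALES", 60000),
]
# Layout: last name (8 bytes), first name (10), department (10), salary (8).
records = [
    to_record(f"{last:<8}{first:<10}{dept:<10}{salary:08d}")
    for last, first, dept, salary in rows
]

sales = filter_eq(records, 18, 10, "SALES")
names = [field(r, 0, 8).decode("ascii").rstrip() for r in sales]
print(names)
```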

Supported Stages

The current implementation supports 14 stages:

Stage	Purpose	Example
FILTER	Keep/reject records by field value	`FILTER 18,10 = "SALES"`
LOCATE	Keep records containing a pattern	`LOCATE "ERROR"`
NLOCATE	Keep records NOT containing a pattern	`NLOCATE "DEBUG"`
SELECT	Extract and reposition fields	`SELECT 0,8,0; 8,10,8`
CHANGE	Text replacement	`CHANGE "SALES" "MKTG"`
COUNT	Count records	`COUNT`
TAKE	Keep first N records	`TAKE 5`
SKIP	Skip first N records	`SKIP 2`
DUPLICATE	Repeat each record N times	`DUPLICATE 3`
LITERAL	Append a literal record	`LITERAL "--- END ---"`
UPPER/LOWER	Case conversion	`UPPER`
REVERSE	Reverse record text	`REVERSE`
HOLE	Discard all input	`HOLE`
CONSOLE	Driver stage: source or sink depending on position	`CONSOLE`

The CLI

Both executors have identical CLIs:

# Batch executor
pipe-run specs/filter-sales.pipe specs/input-fixed-80.data -v

# Record-at-a-time executor
pipe-run-rat specs/filter-sales.pipe specs/input-fixed-80.data -v

Given this input data:

SMITH   JOHN      SALES     00050000
JONES   MARY      ENGINEER  00075000
DOE     JANE      SALES     00060000
WILSON  ROBERT    MARKETING 00055000
CHEN    LISA      ENGINEER  00080000
GARCIA  CARLOS    SALES     00045000
TAYLOR  SUSAN     MARKETING 00065000
BROWN   MICHAEL   ENGINEER  00090000

And this pipeline:

PIPE CONSOLE
| FILTER 18,10 = "SALES"
| CONSOLE
?

The output is:

SMITH   JOHN      SALES     00050000
DOE     JANE      SALES     00060000
GARCIA  CARLOS    SALES     00045000
Records:  8 in -> 3 out

Exactly what I’d have gotten on OS/390 in 1996, but with Web Server log data showing client IP address, OS, browser type/version, user cookies, timestamps, URLs, and more, instead of accounting data. 😊

The Web UI for Two pipelines-rs Implementations

The web interface runs entirely in the browser via WebAssembly. It has three panels: input records with an 80-column ruler, the pipeline editor, and the output.

Tutorial Mode

The tutorial walks through each stage with examples, running pipelines automatically to show results. You can step through manually or let it auto-advance.

The Visual Debugger

The debugger is the reason RAT exists. It lets you:

Step through execution one pipe point at a time
Watch data at specific pipe points between stages
Set breakpoints to pause at specific stages
See stage state for stateful stages like COUNT

You load a pipeline, click Run, then Step to watch each record flow through each stage. The debugger highlights which stages have been reached with a green border. For COUNT and other aggregation stages, you can watch the flush phase where accumulated state becomes output.

What’s Next

The current RAT executor is intentionally naive—it uses a buffer at every pipe point and copies each record between them. A better implementation would minimize buffers and copy operations while preserving the record-at-a-time semantics.

Multi-pipe features are also planned—CMS Pipelines supported fan-out (one input, multiple output streams) and fan-in (multiple inputs merged), which enabled complex processing topologies beyond simple linear chains.

How pipelines-rs Differs from IBM Pipelines

	IBM CMS/TSO/E Pipelines	pipelines-rs
Indexing	1-based column positions	0-based offsets
Record format	Fixed or variable length, EBCDIC	Fixed 80-byte ASCII only (variable-length input padded)
Stages	Hundreds of built-in stages	14 implemented so far
Topology	Multi-stream: fan-out, fan-in, multi-pipe	Linear only (multi-pipe planned)
Environment	z/VM, z/OS mainframes	CLI (native) and browser (WASM)
Character set	EBCDIC	ASCII/UTF-8

This is a teaching tool and nostalgia project, not a production replacement.

Implementation Details

Metric	Value
Language	Rust (2024 edition)
Web UI	Yew framework, compiled to WASM
Stages	14 implemented
Test Specs	23 pipeline specifications
Tests	60+ (including batch/RAT equivalence)
License	MIT
Live Demo	sw-comp-history.github.io/pipelines-rs

Credits

Role	Who
Concept & direction	Mike Wright
Content creation	Claude (Anthropic)
Editorial review	ChatGPT (OpenAI)

Mainframe ideas, modern tools. Follow for more.

February 5, 2026 • Software Wrighter

985 words • 5 min read • Abstract

Which small AI fits your laptop? Benchmarking Phi-2, Gemma-2B, and SmolLM on the 2-3B efficient frontier. Phi-2 achieves 61.7% MMLU with only 2.7B parameters, beating models 5x larger through synthetic textbook training. Data quality beats parameters.

Small Models (6/6): Which Small AI Fits YOUR Laptop?

Maximum AI capability on minimum hardware. The 2-3B efficient frontier.

This is Part 6 (the finale) of the Small Models, Big Brains series. We’re benchmarking the best small models to help you choose the right one for your laptop.

Resource	Link
Code	efficient-llm
Phi-2	microsoft/phi-2
Gemma	ai.google.dev/gemma
SmolLM	HuggingFace Blog
Video	Which Small AI Fits YOUR Laptop?

The Efficient Frontier

In finance, the “efficient frontier” is the set of optimal portfolios offering the highest expected return for a given level of risk.

In AI, it’s the models offering the best capability for a given size.

The Contenders

Model	Params	Source	Key Strength
Phi-2	2.7B	Microsoft	Reasoning, synthetic data
Gemma-2B	2B	Google	Distillation, multilingual
SmolLM2-1.7B	1.7B	HuggingFace	11T tokens, fast inference
SmolLM3-3B	3B	HuggingFace	Dual reasoning, 6 languages

Benchmark Results

Actual measurements on Apple Silicon (M-series) from efficient-llm:

Model	MMLU	GSM8K	HumanEval	Speed (CPU)	Memory
Phi-2	61.7%	57.0%	50.0%	7.1 tok/s	5.2GB
Gemma-2B	38.9%	18.0%	90.0%	8.5 tok/s	4.7GB
SmolLM2	55.6%	*	*	3.7 tok/s	3.2GB

*SmolLM2 GSM8K/HumanEval scores are omitted because the measured results reflect prompt format incompatibility, not capability.

The Key Insight: Data Quality Beats Parameters

Phi-2 achieves 61.7% MMLU with only 2.7B parameters.

For comparison:

Llama-2-7B: ~46% MMLU
Llama-2-13B: ~55% MMLU

Phi-2 beats models 5x its size. The secret? Synthetic textbook training.

Microsoft generated high-quality educational content specifically designed to teach reasoning. Quality data > quantity data > model size.

Model Profiles

Phi-2: The Reasoning Champion

Strengths: Math, logic, code understanding
Weakness:  Less conversational
Best for:  Technical tasks, chain-of-thought

Phi-2 was trained on “textbook quality” synthetic data. It thinks like a textbook explains.

Gemma-2B: The Distillation Expert

Strengths: Multilingual, edge deployment
Weakness:  Lower benchmark scores
Best for:  Production apps, Google ecosystem

Google distilled knowledge from larger models into this compact package. Great tooling and documentation.

SmolLM2-1.7B: The Speed Demon

Strengths: Fastest inference, smallest footprint
Weakness:  Prompt format sensitivity
Best for:  Memory-constrained environments

HuggingFace trained on 11T tokens—massive overtraining like TinyLlama but at a slightly larger scale.

SmolLM3-3B: The Balanced Choice

Strengths: Dual reasoning modes, 6 languages
Weakness:  Newest, less battle-tested
Best for:  General-purpose small model needs

The latest from HuggingFace, designed to be the go-to small model.

Decision Framework

├── Need best reasoning?           → Phi-2
├── Need instruction following?    → SmolLM2 or SmolLM3
├── Need multilingual?             → Gemma-2B or SmolLM3
├── Memory constrained (<4GB)?     → SmolLM2 + INT4
├── Need Google ecosystem?         → Gemma-2B
├── General purpose?               → SmolLM3
└── Maximum quality per byte?      → Phi-2

Running the Benchmarks

git clone https://github.com/softwarewrighter/efficient-llm
cd efficient-llm

# Setup
uv venv && source .venv/bin/activate
uv pip install torch transformers accelerate bitsandbytes datasets tqdm

# HuggingFace login (required for Gemma)
huggingface-cli login

# Download and benchmark
python download_models.py
python benchmark_quality.py
python benchmark_speed.py
python benchmark_memory.py

# Interactive demos
python demo_reasoning.py
python demo_code.py
python demo_chat.py

Hardware Requirements

Setup	Models You Can Run
4GB RAM	SmolLM2 (INT4)
8GB RAM	All models (INT4)
16GB RAM	All models (FP16)
Apple Silicon	All models (MPS)

Implementation Details

Metric	Value
Primary Language	Python
Source Files	7 `.py` files
Estimated Size	~1.4 KLOC
Framework	Transformers, PyTorch
Build System	uv / pip
Key Features	MMLU/GSM8K/HumanEval benchmarks, demos

Good for you if: You want to benchmark 2-3B models, compare quality vs speed tradeoffs, or run interactive comparisons between Phi-2, Gemma, and SmolLM.

Complexity: Low. Similar structure to billion-llm. Standalone Python scripts for each benchmark and demo. Requires HuggingFace authentication for Gemma access.

Series Recap

Over six parts, we’ve explored the cutting edge of small model research:

Part	Model/Topic	Key Insight
1	TRM (<1K params)	Iteration beats scale
2	MobileLLM (350M)	Offline AI is practical
3	HRM (27M)	Hierarchy enables reasoning
4	BDH	Sparsity enables interpretability
5	1B models	The efficiency sweet spot
6	2-3B models	Data quality beats parameters

Key Takeaways

Data quality beats parameter count. Phi-2 proves careful curation outperforms brute scaling.
The 2-3B range is remarkably capable. These models handle real tasks, not just demos.
Each model has its niche. Match the model to your use case.
Quantization makes everything accessible. INT4 lets you run 3B models on 4GB RAM.
The frontier keeps moving. SmolLM3 is a recent release. Better models are coming.

What We’ve Learned

Small models aren’t a compromise—they’re a different optimization target. When you can’t throw compute at a problem, you’re forced to be clever:

Recursive reasoning (TRM)
Mobile-optimized architectures (MobileLLM)
Hierarchical decomposition (HRM)
Sparse interpretable activations (BDH)
Overtraining on quality data (TinyLlama, Phi-2)

These techniques will eventually feed back into large models too. Small model research isn’t a dead end—it’s the frontier.

Part 6 of 6 in the Small Models, Big Brains series. Thanks for following along!

Have questions? Find me on YouTube @SoftwareWrighter or Discord.

February 5, 2026 • Software Wrighter

446 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Gradient Descent (walk downhill to minimize error), Attention (focus on what matters), DPO (align from preference pairs), Learning Rate (step size tradeoff), Temperature (dial between predictable and creative).

Five ML Concepts - #2

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #2

References

Concept	Reference
Gradient Descent	An overview of gradient descent optimization algorithms (Ruder 2016)
Attention	Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2014)
DPO	Direct Preference Optimization (Rafailov et al. 2023)
Learning Rate	Cyclical Learning Rates (Smith 2015)
Temperature	On the Properties of Neural Machine Translation (Cho et al. 2014)

Today’s Five

1. Gradient Descent

A general optimization method used across machine learning. It improves a model by taking small steps in the direction that reduces error the most.

Many learning algorithms rely on it, especially neural networks.

Like walking downhill in fog, adjusting each step based on the slope beneath your feet.
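
A minimal sketch of gradient descent on a one-dimensional function. The objective, starting point, and step size are arbitrary choices for illustration.

```python
def gradient_descent(grad, start, lr=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. downhill."""
    x = start
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2. Its gradient is 2 * (x - 3), and the minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 6))
```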

2. Attention

A mechanism that lets models weigh different parts of the input by importance. Instead of treating everything equally, attention highlights what matters most.

This was key to breakthroughs in translation and language models.

Like reading a sentence and focusing more on the important words.

3. DPO (Direct Preference Optimization)

A method for aligning language models with human preferences. Unlike RLHF, it trains directly on preference comparisons and avoids an explicit reward model.

This simplifies training while achieving comparable alignment.

Like learning preferences by observing choices, not by designing a scoring system.
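
A sketch of the DPO loss for a single preference pair, following the formula in the DPO paper. The log-probabilities are made-up numbers; in practice they come from the model being trained and a frozen reference copy.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Arguments are sequence log-probabilities under the policy and the frozen reference."""
    chosen_gain = policy_chosen - ref_chosen        # how much more the policy likes the chosen answer
    rejected_gain = policy_rejected - ref_rejected  # same for the rejected answer
    margin = beta * (chosen_gain - rejected_gain)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Before training the policy equals the reference, so the margin is zero.
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)

# After some training the policy favors the chosen answer more than the reference does.
improved = dpo_loss(-10.0, -14.0, -12.0, -11.0)

print(round(untrained, 4), round(improved, 4))
```

No reward model appears anywhere: the preference pair and the reference log-probabilities are enough to compute the loss.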

4. Learning Rate

Controls how large each update step is during training. Too large and learning becomes unstable. Too small and training is slow or gets stuck.

One of the most important hyperparameters to tune.

Like choosing how fast to walk downhill without losing balance.
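
A small sketch of the tradeoff on the toy objective f(x) = (x - 3)^2. The three learning rates are illustrative values picked to show slow progress, good progress, and divergence.

```python
def final_error(lr, steps=50):
    """Distance from the minimum at x = 3 after a fixed number of gradient steps."""
    x = 0.0
    for _ in range(steps):
        x -= lr * 2 * (x - 3)   # gradient of (x - 3)^2 is 2 * (x - 3)
    return abs(x - 3)

too_small = final_error(0.001)   # barely moves in 50 steps
reasonable = final_error(0.1)    # converges
too_large = final_error(1.1)     # overshoots further on every step

print(too_small, reasonable, too_large)
```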

5. Temperature

A parameter that controls randomness during text generation. Low temperature favors predictable, high-probability outputs. Higher temperature increases variety and surprise.

A tradeoff between consistency and creativity.

Like adjusting a dial from cautious to adventurous.
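
A sketch of how temperature reshapes the next-token distribution: logits are divided by the temperature before the softmax. The logits are made-up values.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperature concentrates probability on the top token; higher temperature spreads it across the alternatives.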

Quick Reference

Concept	One-liner
Gradient Descent	Walk downhill to minimize error
Attention	Focus on what matters in the input
DPO	Align models from preference pairs directly
Learning Rate	Step size that balances speed and stability
Temperature	Dial between predictable and creative

Short, accurate ML explainers. Follow for more.

February 4, 2026 • Software Wrighter

839 words • 5 min read • Abstract

One billion parameters: the sweet spot for AI. Big enough to reason, small enough to run anywhere. Comparing TinyLlama, Llama-3.2-1B, StableLM, and Pythia with LoRA fine-tuning in minutes and speculative decoding for 2-3x speedups.

Small Models (5/6): Max AI Per Watt

One billion parameters. The sweet spot for AI.

Big enough to reason. Small enough to run anywhere. Maximum capability per watt.

This is Part 5 of the Small Models, Big Brains series, comparing four models at the 1B parameter point.

Resource	Link
Code	billion-llm
TinyLlama	jzhang38/TinyLlama
Llama 3.2	ai.meta.com/llama
Pythia	EleutherAI/pythia
Video	Max AI Per Watt

Why One Billion?

Range	Reality
Below 1B	Models struggle with complex reasoning
Above 1B	Hardware requirements increase significantly
At 1B	Maximum capability per watt

1B parameters is where you get:

Real language understanding
Ability to follow instructions
Fine-tuning in minutes on a laptop
Deployment anywhere (phone, Raspberry Pi, browser)

The Contenders

Model	Params	Key Strength	Training Data
TinyLlama	1.1B	Overtrained on 3T tokens	Community
Llama-3.2-1B	1B	Official Meta ecosystem	Meta
StableLM-1.6B	1.6B	Multilingual, 2T tokens	Stability AI
Pythia-1B	1.08B	154 research checkpoints	EleutherAI

TinyLlama: The Overtraining Champion

TinyLlama breaks the rules. The Chinchilla scaling laws suggest training tokens should scale with parameters. TinyLlama trains on roughly 100x more tokens than that compute-optimal budget.

Chinchilla-optimal for 1B: ~30B tokens
TinyLlama actual:          3T tokens (3,000B)

The result? A tiny model that punches well above its weight.

Benchmarks

From the billion-llm repository:

Model	MMLU	HumanEval	Speed	Memory
TinyLlama	25.3%	12.2%	Fast	2.2GB
Llama-3.2-1B	32.1%	18.5%	Fast	2.4GB
StableLM-1.6B	30.8%	15.1%	Medium	3.2GB
Pythia-1B	26.4%	10.3%	Fast	2.2GB

Llama-3.2-1B leads on quality. TinyLlama offers the best value when you factor in the open training recipe.

LoRA Fine-Tuning in Minutes

All these models can be fine-tuned on a laptop using LoRA:

cd billion-llm
python finetune_demo.py --model tinyllama --epochs 3

LoRA adds small trainable adapters without modifying base weights:

Base Model (frozen): 1.1B parameters
LoRA Adapters:       ~4M parameters (0.4%)
Training time:       5-10 minutes on M1 Mac
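
Where does a number like "a few million adapter parameters" come from? A rough sketch, assuming the standard LoRA setup where each adapted weight matrix of shape (d_out x d_in) gets two small matrices of rank r. The hidden size, layer count, number of adapted matrices, and rank below are hypothetical placeholders, not the settings used in finetune_demo.py.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical configuration, for illustration only.
hidden = 2048          # assumed hidden size
layers = 22            # assumed number of transformer blocks
adapted_per_layer = 4  # assumed: four square projections adapted per block
rank = 8               # assumed LoRA rank

adapter_params = layers * adapted_per_layer * lora_param_count(hidden, hidden, rank)
base_params = 1.1e9
print(f"adapter parameters: {adapter_params:,}")
print(f"fraction of base:   {adapter_params / base_params:.2%}")
```

With these assumed values the adapters land in the low millions, the same order of magnitude as the figure above. The exact count depends on the rank and on which layers are adapted.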

Speculative Decoding: 2-3x Speedup

Use a fast 1B model to draft tokens, verify with a slower 7B model:

Draft (1B):   "The quick brown fox" → [jumps, over, the, lazy]
Verify (7B):  Accept [jumps, over, the] → Reject [lazy] → Generate [sleepy]

The 1B model generates candidates quickly. The 7B model only needs to verify, not generate from scratch.

python speculative_demo.py

Results: 2-3x speedup on autoregressive generation.
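
The draft-then-verify loop can be sketched in a few lines. This is a toy version using greedy verification and lookup tables as stand-ins for the two models; it is not the code in speculative_demo.py, and sampling-based variants use a probabilistic accept/reject rule instead of the exact match shown here.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding.

    draft_next / target_next map a token list to the next token.
    Returns the tokens accepted this round.
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target model checks each position. A real implementation scores
    #    all k positions in one batched forward pass; this sketch loops.
    accepted = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token == expected:
            accepted.append(token)
        else:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            break
    return accepted

# Toy stand-ins for the 1B and 7B models (hypothetical lookup tables).
DRAFT = ["jumps", "over", "the", "lazy"]
TARGET = ["jumps", "over", "the", "sleepy"]
prefix = ["The", "quick", "brown", "fox"]

def draft_next(tokens):
    return DRAFT[len(tokens) - len(prefix)]

def target_next(tokens):
    return TARGET[len(tokens) - len(prefix)]

print(speculative_step(prefix, draft_next, target_next))
```

The speedup comes from step 2: verifying several drafted tokens costs the large model about one forward pass, instead of one pass per token.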

Hardware Requirements

Setup	What You Can Run
CPU only	All models (slower, INT4 quantized)
4GB VRAM	All models (INT4 quantized)
8GB VRAM	All models (FP16)
Apple Silicon	All models (MPS acceleration)
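
A quick back-of-the-envelope check on these tiers: weight memory is roughly parameter count times bytes per parameter. The sketch below counts weights only and ignores activations, KV cache, and framework overhead, so treat the numbers as lower bounds.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("TinyLlama", 1.1e9), ("StableLM-1.6B", 1.6e9)]:
    for dtype in ("fp16", "int4"):
        print(f"{name:14s} {dtype}: {weight_memory_gb(params, dtype):.2f} GB")
```

The FP16 figures line up with the Memory column in the benchmark table, and INT4 cuts them to about a quarter.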

Quick Start

git clone https://github.com/softwarewrighter/billion-llm
cd billion-llm

# Setup
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

# Download models
python download_models.py

# Run benchmarks
python benchmark.py

# Interactive comparison
python demo_chat.py --compare tinyllama llama3.2-1b

Which Model Should You Choose?

├── Need Meta ecosystem compatibility? → Llama-3.2-1B
├── Need multilingual support?         → StableLM-1.6B
├── Need research reproducibility?     → Pythia-1B (154 checkpoints)
├── Need maximum performance/size?     → TinyLlama
└── Just getting started?              → Any of them work!

Implementation Details

Metric	Value
Primary Language	Python
Source Files	8 `.py` files
Estimated Size	~1.4 KLOC
Framework	Transformers, PyTorch
Build System	uv / pip
Key Features	Benchmarking, LoRA fine-tuning, speculative decoding

Good for you if: You want to benchmark small LLMs, learn LoRA fine-tuning, experiment with speculative decoding, or compare models head-to-head.

Complexity: Low. Clean Python scripts with HuggingFace Transformers. Each script is standalone—run benchmarks, chat demos, or fine-tuning independently. Well-documented with shell scripts for common tasks.

Key Takeaways

1B is the efficiency sweet spot. Below this, capability drops. Above, hardware costs rise.
Overtraining works. TinyLlama proves you can compensate for size with data.
LoRA makes fine-tuning accessible. Customize models on consumer hardware.
Speculative decoding is free speed. Use small models to accelerate large ones.
All roads lead to open weights. Every model here has downloadable weights, though license terms vary.

What’s Next

Part 6 explores the 2-3B efficient frontier—Phi-2, Gemma, and SmolLM pushing the limits of small model capability.


February 4, 2026 • Software Wrighter

411 words • 3 min read • Abstract

Five ML concepts in under 30 seconds each: Backpropagation (learning by flowing error backward), Transformers (attention over all tokens), Mamba (linear-time sequence modeling), Hallucination (confident nonsense), and Embeddings (meaning as coordinates).

Five ML Concepts - #1

5 machine learning concepts. Under 30 seconds each.

Resource	Link
Papers	Links in References section
Video	Five ML Concepts #1

References

Concept	Reference
Backprop	Learning representations by back-propagating errors (Rumelhart, Hinton, Williams 1986)
Transformer	Attention Is All You Need (Vaswani et al. 2017)
Mamba	Mamba: Linear-Time Sequence Modeling (Gu & Dao 2023)
Hallucination	Survey of Hallucination in NLG (Ji et al. 2023)
Embedding	Word2Vec (Mikolov et al. 2013)

Today’s Five

1. Backpropagation

Short for backward propagation of errors. It's how neural networks learn: error flows backward through the network to work out how much each weight should change.

Without it, modern deep learning wouldn’t be practical.

Think of it like retracing your steps to see which earlier choices caused the mistake.
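
A minimal sketch of the idea with a single neuron, written out by hand so the chain rule is visible. The input, target, and learning rate are arbitrary illustrative values.

```python
# One neuron: prediction = w * x + b, loss = (prediction - target) ** 2
x, target = 2.0, 10.0
w, b = 1.0, 0.0
learning_rate = 0.05

losses = []
for step in range(20):
    # Forward pass
    prediction = w * x + b
    error = prediction - target
    losses.append(error ** 2)

    # Backward pass: chain rule from the loss back to each parameter
    d_loss_d_pred = 2 * error
    grad_w = d_loss_d_pred * x   # d(prediction)/dw = x
    grad_b = d_loss_d_pred       # d(prediction)/db = 1

    # Gradient descent update
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.2e}")
```

Deep networks do the same thing layer by layer: each layer receives the gradient from the layer after it and passes its own gradient further back.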

2. Transformer

The architecture behind GPT, Claude, and most modern language models. Instead of processing words one at a time, transformers use attention to weigh relationships between all tokens.

This enables parallel training and rich context awareness.

Like understanding a sentence by seeing how every word relates to every other.
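
Here is a stripped-down sketch of scaled dot-product attention in plain Python. It is a simplification: real transformers apply learned projections to get queries, keys, and values, and run several heads in parallel. The three toy token vectors are made up for illustration.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Score this query against every key, then normalize to weights.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values)) for i in range(dim)])
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token vectors
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how much one token attends to every other token. Every row can be computed independently, which is what makes training parallel.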

3. Mamba (State Space Models)

A newer alternative to transformers that processes sequences in linear time instead of quadratic.

This allows scaling to very long documents with much lower memory use.

Like a smart conveyor belt that carries forward only what matters.
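
The linear-time part can be sketched as a simple recurrence: one fixed-size state, updated once per token. This is a heavily simplified illustration with a fixed scalar state; Mamba itself uses vector-valued states and makes the update parameters depend on the input, which this sketch does not attempt.

```python
def linear_scan(inputs, a=0.9, b=0.1):
    """Run h_t = a * h_{t-1} + b * x_t over a sequence in one pass."""
    state = 0.0
    outputs = []
    for x in inputs:
        # Constant work and constant memory per token.
        state = a * state + b * x
        outputs.append(state)
    return outputs

outputs = linear_scan([1.0, 0.0, 0.0, 0.0])
print([round(h, 4) for h in outputs])
```

Doubling the sequence length doubles the work, while the state stays the same size. Attention, by contrast, compares every token with every other token.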

4. Hallucination

When a model generates confident-sounding nonsense. It happens because language models predict plausible next words, not true facts.

They optimize for likelihood, not correctness.

Like a student who writes confidently without verifying sources.

5. Embedding

Turning words, images, or concepts into vectors of numbers. Similar meanings end up close together in this space.

This lets math capture semantic relationships.

Think of it as a coordinate system for meaning.
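
A small sketch of "close together" using cosine similarity. The three vectors are hand-written toy values, not output from a real embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.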

Quick Reference

Concept	One-liner
Backprop	Learn by flowing error backward
Transformer	Attention over all tokens at once
Mamba	Linear-time sequence modeling
Hallucination	Confident nonsense from likelihood optimization
Embedding	Meaning as coordinates in vector space

Short, accurate ML explainers. Follow for more.


February 3, 2026 • Software Wrighter

842 words • 5 min read • Abstract

LLMs are black boxes. Baby Dragon Hatchling uses brain-inspired sparse coding with 80% sparsity, making only 20% of neurons active per token. When fewer neurons fire, each one carries interpretable meaning. Train it on Shakespeare and actually see what's happening inside.

Small Models (4/6): This AI Has a Visible Brain

LLMs are black boxes. Baby Dragon Hatchling (BDH) is different—a brain-inspired language model with sparse, interpretable activations.

Train it on Shakespeare and actually see what’s happening inside.

This is Part 4 of the Small Models, Big Brains series, exploring interpretability through sparsity.

Resource	Link
Paper	Pathway (Sparse Coding)
Original Code	pathwaycom/bdh
Fork (with tools)	softwarewrighter/bdh
Video	This AI Has a Visible Brain

The Black Box Problem

Modern neural networks are opaque:

Billions of parameters
Dense activations everywhere
No clear mapping from neurons to concepts
“It works, but we don’t know why”

This isn’t just an academic concern. We’re deploying AI systems we don’t understand.

Baby Dragon Hatchling: A Different Approach

BDH takes inspiration from biological brains, which use sparse coding:

Biological Brains	Dense Neural Networks
~1-5% neurons active	~100% neurons active
Energy efficient	Computationally expensive
Interpretable patterns	Distributed, opaque
Robust to noise	Brittle

Sparse Activations

BDH enforces 80% sparsity—only 20% of neurons are active for any given token.

Dense Network:    [████████████████████] 100% active
BDH:              [████░░░░░░░░░░░░░░░░]  20% active

This constraint forces the network to learn meaningful, localized representations.
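
To get a feel for what 20% active looks like, here is a sketch using top-k masking, one common way to impose activation sparsity. This is an assumption for illustration; BDH's own mechanism for reaching sparsity may differ, and the activation values are made up.

```python
def top_k_sparsify(activations, keep_fraction=0.2):
    """Zero all but the largest `keep_fraction` of activations."""
    k = max(1, round(len(activations) * keep_fraction))
    # Indices of the k largest values.
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    keep = set(ranked[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]

dense = [0.1, 2.3, 0.4, 1.7, 0.2, 0.05, 0.9, 0.3, 0.6, 0.0]  # hypothetical layer output
sparse = top_k_sparsify(dense)
active = [i for i, a in enumerate(sparse) if a != 0.0]
print("active neurons:", active)
```

With only a handful of nonzero entries per token, listing which neurons fired becomes a readable summary instead of a wall of numbers.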

Training on Shakespeare

The demo trains BDH on Shakespeare’s works:

Training Progress:
Epoch 1:   Loss 0.86
Epoch 50:  Loss 0.54
Epoch 100: Loss 0.38
Epoch 200: Loss 0.22

Loss drops from 0.86 to 0.22, so the model still learns under the sparsity constraint.

Seeing Inside the Model

With sparse activations, you can actually inspect what neurons mean:

# Which neurons fire for "love"?
activations = model.forward("love")
active_neurons = activations.nonzero()

# Neuron 47: fires for emotional words
# Neuron 112: fires for abstract nouns
# Neuron 203: fires for relationship terms

When only 20% of neurons fire, each one carries interpretable meaning.

Running the Code

The bdh repository is a fork of Pathway’s original with added inspection tools:

git clone https://github.com/softwarewrighter/bdh
cd bdh
pip install -r requirements.txt

# Train on Shakespeare
python train.py --dataset shakespeare --sparsity 0.8

# Inspect activations
python inspect.py --model checkpoint.pt --text "To be or not to be"

GPU recommended (Nvidia or Apple Silicon) for reasonable training times.

Why Sparsity Enables Interpretability

Dense Networks

Every neuron participates in every computation. The “meaning” of any single neuron is distributed across all inputs it ever sees.

Input: "cat"  → All neurons contribute → Output
Input: "dog"  → All neurons contribute → Output
Input: "love" → All neurons contribute → Output

Trying to understand one neuron means understanding everything.

Sparse Networks

Only a small subset of neurons fire for each input. Neurons develop specialization.

Input: "cat"  → Neurons [12, 47, 89] fire → Output
Input: "dog"  → Neurons [12, 52, 89] fire → Output
Input: "love" → Neurons [47, 112, 203] fire → Output

Neuron 12 might mean “animal.” Neuron 47 might mean “emotional/living.” You can actually trace meaning.

Comparison with Other Sparse Architectures

Model	Sparsity Type	Purpose
Mixture of Experts	Routing sparsity	Efficiency
Top-k attention	Attention sparsity	Memory
BDH	Activation sparsity	Interpretability

BDH’s sparsity is specifically designed for understanding, not just efficiency.

Implementation Details

Metric	Value
Primary Language	Python
Source Files	9 `.py` files
Estimated Size	~1.5 KLOC
Framework	PyTorch
Build System	pip / requirements.txt
GPU Support	CUDA, MPS (Apple Silicon)

Good for you if: You want to experiment with sparse neural architectures, study interpretability techniques, or train small language models with visible internals.

Complexity: Low-Moderate. Standard PyTorch project structure. The sparse activation mechanism is well-documented. Fork includes additional inspection tools not in the original.

Key Takeaways

Sparsity enables interpretability. When fewer neurons fire, each one means more.
Brain-inspired design works. Biological neural coding principles transfer to AI.
Interpretability doesn’t require sacrifice. BDH learns effectively despite constraints.
We can build AI we understand. Black boxes aren’t inevitable.

Current Limitations

Early research stage
Smaller scale than production models
Training requires more epochs
Not yet competitive with dense models on benchmarks

But the principle is sound: constraint breeds clarity.

What’s Next

Part 5 dives into the 1B parameter sweet spot—comparing TinyLlama, Llama 3.2, StableLM, and Pythia.


February 3, 2026 • Software Wrighter

1473 words • 8 min read • Abstract

Single explorer: 0% success. Five explorers: 60% success. Sparse rewards are an information problem, not a compute problem. Using multiple scouts with different exploration strategies, we gather diverse discoveries that benefit a shared learner.

Solving Sparse Rewards with Many Eyes

Single explorer: 0% success. Five explorers: 60% success.

Learning often fails not because models are slow, but because they see too little. In sparse-reward environments, a single explorer is likely to miss the rare feedback entirely. The solution? Put many eyes on the problem.

Resource	Link
Related	Intrinsic Rewards and Diversity
Papers	IRPO · Reagent
Code	many-eyes-learning
ELI5	eli5.md
Video	Given enough eyeballs…

The Problem: Sparse Rewards Create Blindness

As IRPO formalizes: in sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal.

A 7x7 grid with a single goal demonstrates this perfectly:

Random agent success rate: ~9%
With limited training (75 episodes), a single learner exploring alone never finds the goal

This isn’t a compute problem. It’s an information problem.

Challenge	Effect	Paper Connection
Rare rewards	Weak gradient signal	IRPO’s core problem statement
Single explorer	Limited coverage	Why multiple scouts help
Random exploration	Misses valuable states	Why intrinsic rewards matter
No feedback structure	Can’t distinguish “almost right” from “nonsense”	Reagent’s motivation

The Solution: Many Eyes

Instead of one explorer, use multiple scouts—independent exploratory agents that gather diverse information.

┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   Scout 1   │  │   Scout 2   │  │   Scout N   │
│ (strategy A)│  │ (strategy B)│  │ (strategy N)│
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │
       v                v                v
┌─────────────────────────────────────────────────┐
│              Experience Buffer                   │
└─────────────────────────────────────────────────┘
                       │
                       v
┌─────────────────────────────────────────────────┐
│               Shared Learner                     │
└─────────────────────────────────────────────────┘

Each scout explores with its own strategy. Their discoveries are aggregated to improve a shared learner.
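
The pooling step in the diagram can be sketched with a toy environment. This is not the many-eyes-learning code: the 1-D corridor, the random-walk scouts, and the function names are hypothetical, and unlike the experiments below it does not hold the total step budget fixed.

```python
import random

def run_scout(seed, length=10, max_steps=30):
    """Random walk on a 1-D corridor; reward 1.0 only at the far end."""
    rng = random.Random(seed)
    position, transitions = 0, []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])
        next_position = min(max(position + action, 0), length - 1)
        reward = 1.0 if next_position == length - 1 else 0.0
        transitions.append((position, action, reward, next_position))
        position = next_position
        if reward:
            break
    return transitions

def pool_experience(num_scouts):
    """Collect every scout's transitions into one shared buffer."""
    buffer = []
    for seed in range(num_scouts):
        buffer.extend(run_scout(seed))
    return buffer

for n in (1, 5, 20):
    buffer = pool_experience(n)
    hits = sum(1 for _, _, reward, _ in buffer if reward > 0)
    print(f"{n:2d} scouts: {len(buffer)} transitions, {hits} rewarded")
```

The shared learner would then train on the pooled buffer. The more scouts contribute, the better the odds that at least one rewarded transition is in there.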

Results

On a 7x7 sparse grid with 75 training episodes:

Method	Success Rate
Random baseline	9%
Single scout	0%
Many eyes (3 scouts)	40%
Many eyes (5 scouts)	60%

Same total environment steps. Dramatically better outcomes.

Why It Works

Single Scout Fails Because:

In IRPO terms: sparse reward → sparse gradient signal → no learning.

Random exploration rarely reaches the goal (~9%)
Insufficient successful trajectories
DQN can’t learn from sparse positive examples
The policy gradient has near-zero magnitude

Many Eyes Succeeds Because:

IRPO’s key insight: multiple exploratory policies manufacture signal.

More coverage: Different scouts explore different regions (intrinsic rewards drive novelty-seeking)
More discoveries: Higher probability of reaching goal (scouts find extrinsic reward)
Signal routing: Scout discoveries update the shared learner (surrogate gradient in IRPO, experience pooling in many-eyes)
Better gradients: Aggregated experience provides meaningful learning signal

Scout Strategies (Intrinsic Rewards)

IRPO uses intrinsic rewards to drive exploration. The many-eyes-learning project implements several strategies:

Strategy	Intrinsic Motivation	IRPO Connection
Epsilon-greedy	Random action with probability ε	Simple exploration noise
Curious	Bonus for novel states: `1/√(count+1)`	Count-based intrinsic reward
Optimistic	High initial Q-values	Optimism under uncertainty
Random	Pure random baseline	Maximum entropy exploration

# CuriousScout intrinsic reward (simplified)
from math import sqrt

def intrinsic_reward(self, state):
    # Novelty bonus: shrinks as a state is visited more often
    count = self.state_counts[state]
    return self.bonus_scale / sqrt(count + 1)

Scouts can be homogeneous (same strategy, different seeds) or heterogeneous (different strategies). IRPO supports swapping intrinsic reward functions—many-eyes makes this concrete with pluggable scout types.

Running the Demo

git clone https://github.com/softwarewrighter/many-eyes-learning
cd many-eyes-learning

# Setup
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Interactive CLI demo
python experiments/cli_demo.py

# Full experiment
python experiments/run_experiment.py --episodes 75 --scouts 1 3 5

# Generate plots
python experiments/plot_results.py

Results appear in ~5-10 minutes on a laptop.

Diversity Experiment

Does diversity of strategies matter, or just number of scouts?

Configuration	Success Rate
5 random scouts	20%
5 epsilon-greedy scouts	40%
5 diverse scouts (mixed strategies)	40%

Finding: In simple environments, strategy quality matters more than diversity. Epsilon-greedy beats random regardless of diversity.

Key Insight

The problem isn’t that learning is slow. The problem is that learning is blind.

Many eyes make learning better.

Implementation Details

Metric	Value
Primary Language	Python
Source Files	~12 `.py` files
Estimated Size	~1.5 KLOC
Framework	PyTorch, NumPy
Platform	CPU (no GPU required)

Good for you if: You want to understand exploration in RL, experiment with sparse-reward environments, or see a clean implementation of scout-based learning.

Complexity: Low-Moderate. Clean codebase with CLI demos. Runs on a laptop in minutes.

Design Philosophy

The project prioritizes clarity over performance:

Single-file implementations where practical
Minimal dependencies
Sequential mode is first-class (parallel optional)
Reproducible experiments with fixed seeds

Simplifications from IRPO

Full IRPO computes Jacobians to route gradients from exploratory policies back to the base policy. Many-eyes-learning simplifies this:

IRPO	Many-Eyes-Learning
Jacobian chain rule	Experience pooling
Surrogate gradient	Standard DQN updates
Learned intrinsic rewards	Hand-designed strategies

The core insight remains: scouts explore with intrinsic motivation, discoveries benefit the shared learner. The math is simpler, the demo runs on a laptop, and the concept is clear.

Key Takeaways

Sparse rewards create information bottlenecks. Learning fails not from lack of compute, but lack of signal.
More eyes = more information. Multiple scouts increase coverage and discovery rate.
Quality matters more than diversity. In simple environments, mixed strategies matched but did not beat five epsilon-greedy scouts (40% each), and both beat random.
Same compute, better outcomes. Many-eyes improves sample efficiency, not wall-clock speed.

The Papers Behind Many-Eyes

This project builds on two recent papers that attack the same fundamental problem: sparse rewards starve learning of signal.

IRPO: Intrinsic Reward Policy Optimization

IRPO (Cho & Tran, UIUC) formalizes the scouts concept mathematically.

The core insight: In sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal. Learning stalls.

IRPO’s solution:

┌─────────────────────────────────────────────────┐
│  1. Train exploratory policies (scouts)         │
│     using INTRINSIC rewards                     │
├─────────────────────────────────────────────────┤
│  2. Scouts discover EXTRINSIC rewards           │
│     through exploration                         │
├─────────────────────────────────────────────────┤
│  3. Route extrinsic signal back to base policy  │
│     via surrogate gradient (Jacobian chain)     │
└─────────────────────────────────────────────────┘

IRPO Concept	What It Means
Intrinsic rewards	“Explore what’s new” - reward novelty
Exploratory policies	Scouts driven by intrinsic motivation
Surrogate gradient	Trade bias for signal - approximate gradient that actually has magnitude
Base policy	The learner that benefits from scout discoveries

How many-eyes-learning demonstrates this:

Scouts implement intrinsic motivation (CuriousScout uses count-based novelty bonuses)
Multiple exploration strategies create diverse coverage
Aggregated experience routes discoveries to the shared DQN learner
Simplified gradient routing - we pool experiences rather than compute full Jacobians

Reagent: Reasoning Reward Models for Agents

Reagent (Fan et al., CUHK/Meituan) takes a different approach: make feedback richer and more structured.

The problem with sparse rewards: They can’t distinguish “almost right, failed at the end” from “complete nonsense.” Both get the same zero reward.

Reagent’s solution: Build a Reasoning Reward Model that emits:

Signal	Purpose
`<think>`	Explicit reasoning trace
`<critique>`	Targeted natural-language feedback
`<score>`	Overall scalar reward

This provides dense-ish supervision without hand-labeling every step.

How many-eyes-learning relates:

Both papers recognize sparse rewards as an information problem
Reagent enriches the reward signal; IRPO multiplies the exploration
Many-eyes takes the IRPO path: more explorers finding the sparse signal
Future work could combine both: scouts + richer feedback per trajectory

The Shared Meta-Lesson

Both papers are saying the same thing:

Sparse signals are a tragedy. Let’s smuggle in richer ones.

IRPO: via intrinsic-reward exploration gradients
Reagent: via language-based reward feedback

Many-eyes-learning demonstrates the IRPO intuition in a simple, visual, reproducible way.

Resources

Sparse rewards are an information problem. Many eyes provide the solution.


February 2, 2026 • Software Wrighter

661 words • 4 min read • Abstract

Teaching Claude to play tic-tac-toe and trash talk using Model Context Protocol (MCP). A Rust server exposes 6 tools via JSON-RPC over stdio, proving MCP standardizes AI tool integration across any compatible language model.

MCP: Teaching Claude to Play (and Trash Talk)

Claude learned to play tic-tac-toe. And trash talk. Using one protocol that works with any MCP-compatible agent.

Resource	Link
Code	game-mcp-poc
MCP Spec	modelcontextprotocol.io
Video	Claude Plays Tic-Tac-Toe

The Problem

Language models are stuck in text. They can’t click buttons, make moves, or interact with real systems. Every integration is custom—different for Claude, GPT, Gemini.

The Solution: MCP

Model Context Protocol is a standard way for models to use tools. Define your tools once, they work with Claude, GPT, or any MCP-compatible agent.

The protocol is simple:

JSON-RPC 2.0 over stdio
No HTTP needed
Clean request/response cycle

The Demo: Trash Talkin’ Tic Tac Toe

This proof-of-concept implements 6 MCP tools:

Tool	Purpose
`view_game_state`	See the board, players, status
`get_turn`	Whose turn is it?
`make_move`	Play a square (row, col)
`taunt_player`	Send trash talk to opponent
`restart_game`	Start a new game
`get_game_history`	All moves with timestamps

The AI calls tools, the server responds. Claude can play a full game AND talk trash—all through the same protocol.
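
For a sense of what travels over stdio, here is a sketch of a tool-call request. The `tools/call` method and the `name`/`arguments` params follow the MCP specification; the `row` and `col` argument names are an assumption based on the table above, since the server's actual input schema is not shown here.

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request that invokes an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical arguments: play the center square.
request = make_tool_call(1, "make_move", {"row": 1, "col": 1})

# On the stdio transport, each message is a single line of JSON.
line = json.dumps(request)
print(line)
```

The server replies with a JSON-RPC response carrying the same `id`, which is how the client matches results to requests.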

Architecture

┌─────────────────────────────────────────────┐
│            Claude Code (AI)                 │
│              (MCP Client)                   │
└──────────────────┬──────────────────────────┘
                   │ JSON-RPC 2.0 via stdio
                   ▼
┌─────────────────────────────────────────────┐
│         MCP Server (Rust Binary)            │
│  ┌───────────────────────────────────────┐  │
│  │  6 Tools: view, turn, move, taunt,   │  │
│  │           restart, history            │  │
│  └───────────────────────────────────────┘  │
│                   ▼                         │
│  ┌───────────────────────────────────────┐  │
│  │      SQLite (game.db)                 │  │
│  │  • Games • Moves • Taunts             │  │
│  └───────────────────────────────────────┘  │
└─────────────────────────────────────────────┘
         ▲                           ▲
         │ REST API                  │ MCP
         │                           │
    Browser (UI)              AI Agent
    (Yew/WASM)              (Claude Code)

Running It

git clone https://github.com/sw-game-dev/game-mcp-poc
cd game-mcp-poc

# Development mode (with hot-reload)
./scripts/dev.sh

# Or production build
./scripts/build.sh
./scripts/serve.sh

The server runs on http://localhost:7397 serving:

REST API for UI interactions
MCP endpoint for AI agents
SSE for real-time updates
Yew/WASM frontend

Configuring Claude Code

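Add to ~/.config/claude-code/mcp.json (the config location can differ between Claude Code versions; a project-level .mcp.json uses the same mcpServers layout):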

{
  "mcpServers": {
    "tic-tac-toe": {
      "command": "/path/to/game-mcp-poc/target/release/game-mcp-server",
      "args": [],
      "env": {
        "GAME_DB_PATH": "/path/to/game.db"
      }
    }
  }
}

Restart Claude Code, then:

You: "Let's play tic-tac-toe! Show me the board."
You: "I'll take the center."
You: "Your turn!"
You: "Can you taunt me?"

Implementation Details

Metric	Value
Language	Rust 2024 Edition
Frontend	Yew + WebAssembly
Database	SQLite
Tests	175+ passing
LOC	~2,500 (backend) + ~1,500 (tests)
Binary Size	~8 MB

Good for you if: You want to learn MCP, build AI-tool integrations, or see a production-quality Rust game server.

Complexity: Moderate. Clean architecture with TDD. Requires Rust toolchain and understanding of JSON-RPC.

Key Takeaways

MCP standardizes AI tools. Define once, works with any compatible model.
JSON-RPC over stdio is elegant. No HTTP complexity for local tools.
Rust + WASM = fast everywhere. Same language for server and (via Yew) frontend.
Trash talk is essential. Games without taunting are just… exercises.

Resources

MCP turns language models into tool users. This demo proves it works—and that AI can talk trash.


February 2, 2026 • Software Wrighter

789 words • 4 min read • Abstract

27 million parameters beats o3-mini on ARC. The Hierarchical Reasoning Model separates planning from execution, echoing dual-process theory. It achieves 40% on one of the hardest reasoning benchmarks, where most LLMs score under 5%.

Small Models (3/6): Planner + Doer = Genius

27 million parameters beats o3-mini on ARC.

One of the hardest reasoning benchmarks. Most LLMs score under 5 percent. This tiny model scores 40 percent.

This is Part 3 of the Small Models, Big Brains series, exploring the Hierarchical Reasoning Model (HRM)—a brain-inspired architecture that separates planning from execution.

Resource	Link
Paper	Hierarchical Reasoning Model
Original Code	sapientinc/HRM
Visualization	viz-hrm-ft
Video	Planner + Doer = Genius

The ARC Challenge

The Abstraction and Reasoning Corpus (ARC) tests:

Abstract reasoning
Pattern matching
Spatial logic
Puzzles requiring real thinking

These aren’t problems you can memorize. Each puzzle is unique, requiring genuine understanding of the underlying pattern.

Why LLMs Struggle

Challenge	LLM Limitation
Novel patterns	Can’t rely on training data
Spatial reasoning	Text-based thinking is linearized
Multi-step logic	Each step compounds errors
Abstraction	Pattern matching isn’t generalization

Meet HRM: The Hierarchical Reasoning Model

HRM uses just 27 million parameters but achieves remarkable results by mimicking how the brain thinks: plan first, then act.

Two-Module Architecture

┌─────────────────────────────────────┐
│           PLANNER                   │
│   Thinks slow and abstract          │
│   Sets goals and strategies         │
└─────────────┬───────────────────────┘
              │ Goals
              ▼
┌─────────────────────────────────────┐
│            DOER                     │
│   Moves fast                        │
│   Takes concrete actions            │
└─────────────────────────────────────┘

Module	Speed	Function
Planner	Slow	Abstract thinking, goal setting
Doer	Fast	Concrete actions, execution

This mirrors dual-process theory from cognitive psychology: System 1 (fast, intuitive) and System 2 (slow, deliberate). The Doer plays the System 1 role and the Planner plays System 2.

Results

Benchmark	HRM (27M)	o3-mini	GPT-4
ARC	40%	<40%	<5%
Hard Mazes	99%	-	~0%
Complex Sudoku	99%	-	-

A 27M parameter model outperforming models 1000x larger on reasoning tasks.

The Visualization Tool

The viz-hrm-ft repository provides a React app to visualize HRM’s reasoning process:

Watch the Planner form strategies
See the Doer execute actions
Visualize the feedback loop between modules
Simulate fine-tuning on BabyAI tasks

git clone https://github.com/softwarewrighter/viz-hrm-ft
cd viz-hrm-ft
npm install
npm start

Why Hierarchy Works

Traditional Flat Models

Input → [Single Network] → Output

Everything happens in one pass. Complex problems overwhelm the network.

Hierarchical Models

Input → [Planner] → Strategy
                  ↓
Strategy → [Doer] → Action
                  ↓
Action → [Environment] → Feedback
                       ↓
Feedback → [Planner] → Refined Strategy
                     ↓
                    ...

The Planner doesn’t worry about details. The Doer doesn’t worry about strategy. Each module focuses on what it does best.
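
The two-speed loop can be sketched as plain control flow. This is a toy illustration of the planner-doer pattern on a number line, with hand-written rules standing in for HRM's learned neural modules. Every name and parameter here is hypothetical.

```python
def planner(position, goal, max_jump=3):
    """Slow module: pick a nearby subgoal on the way to the final goal."""
    delta = max(-max_jump, min(max_jump, goal - position))
    return position + delta

def doer(position, subgoal):
    """Fast module: take one concrete step toward the current subgoal."""
    if position < subgoal:
        return position + 1
    if position > subgoal:
        return position - 1
    return position

def solve(start, goal, doer_steps_per_plan=3, max_plans=20):
    position, trace = start, []
    for _ in range(max_plans):                # slow outer loop
        if position == goal:
            break
        subgoal = planner(position, goal)
        for _ in range(doer_steps_per_plan):  # fast inner loop
            position = doer(position, subgoal)
        trace.append((subgoal, position))
    return position, trace

final, trace = solve(start=0, goal=7)
print(final, trace)
```

The planner runs once for every few doer steps, and each new plan starts from wherever the doer actually ended up. That is the feedback arrow in the diagram above.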

Key Insights

Separation of concerns scales. Splitting planning from execution lets each module specialize.
Iteration enables refinement. The Planner-Doer loop allows course correction.
Small can beat big. 27M parameters with good architecture beats 100B+ with brute force.
Brain-inspired design works. Mimicking cognitive architecture yields better results.

Comparison with Part 1 (TRM)

Aspect	TRM	HRM
Parameters	<1,000	27M
Architecture	Think-Act cycles	Planner-Doer hierarchy
Strength	Maze solving	Abstract reasoning
Key insight	Iteration	Hierarchical decomposition

Both use recursive reasoning, but HRM adds hierarchical structure for more complex tasks.

Implementation Details

Metric	Value
Primary Language	TypeScript
Source Files	26 `.ts`/`.tsx`, 7 `.js`
Estimated Size	~4 KLOC
Framework	React
Build System	npm / Create React App
Visualization	Canvas-based rendering

Good for you if: You want to visualize neural reasoning processes, build interactive ML demos, or learn React with a real project.

Complexity: Low-Moderate. Standard React/TypeScript project. No ML training code—this is a visualization tool for understanding the HRM architecture. Easy to extend with new visualizations.

Key Takeaways

Plan, then act. Separating strategy from execution mirrors effective human thinking.
Hierarchy enables complexity. Multi-level reasoning handles problems flat networks can’t.
Architecture > Scale for reasoning tasks.
ARC remains unsolved by brute-force scaling—clever architectures are the path forward.

What’s Next

Part 4 explores Baby Dragon Hatchling (BDH)—a brain-inspired model with visible, interpretable activations.


February 2, 2026 • Software Wrighter

705 words • 4 min read • Abstract

Implementing DeepSeek's Engram paper on conditional memory. Instead of recomputing common patterns through O(n^2) attention, Engram provides O(1) lookup for cached results. Our LoRA-based behavioral approximation achieves 58% loss reduction in 10 seconds.

DeepSeek Papers (2/3): Engram - Conditional Memory for Transformers

DeepSeek publishes papers. I implement them. This paper tackles another fundamental transformer problem: redundant computation.

This post covers my implementation of Engram (Conditional Memory via Scalable Lookup)—running on both Apple Silicon and NVIDIA GPUs.

Resource	Link
Paper	arXiv:2601.07372
Code	engram-poc
Video 1	Engram Part 1
Video 2	Engram Part 2

The Problem: Redundant Computation

LLMs waste compute reconstructing patterns they’ve seen before:

Style rules repeated across files
Common code idioms re-derived each call
Boilerplate knowledge injected repeatedly

Attention computes everything from scratch every time. For recurring patterns, this is wasteful.

The Engram Solution: O(1) Lookup

Engram introduces conditional memory as a complementary sparsity axis. Instead of recomputing common patterns through attention, look them up in O(1) time.

Think of it as a cache for the model’s learned patterns:

Without Engram	With Engram
Recompute pattern every call	Look up cached result
O(n²) attention	O(1) deterministic lookup
Implicit knowledge	Explicit, inspectable memory
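
Why is lookup O(1)? A toy sketch: key a table on the last few tokens and fetch whatever is stored there. This is an illustration only. The paper's memory lives inside the model and is more involved than a dictionary of strings, and the class and method names below are hypothetical.

```python
class NGramMemory:
    """Toy conditional memory: key on the last n tokens, return a stored continuation."""

    def __init__(self, n=2):
        self.n = n
        self.table = {}

    def _key(self, tokens):
        # Only the trailing n tokens matter, however long the context is.
        return tuple(tokens[-self.n:])

    def store(self, context, continuation):
        self.table[self._key(context)] = continuation

    def lookup(self, context):
        # Hash-table access: average cost does not grow with context length.
        return self.table.get(self._key(context))

memory = NGramMemory(n=2)
memory.store(["status", "for", "Not", "Found"], "404")

hit = memory.lookup(["HTTP", "status", "for", "Not", "Found"])
miss = memory.lookup(["something", "else"])
print(hit, miss)
```

A hit returns immediately without scanning the context. A miss returns nothing, and that is where the model falls back to ordinary attention.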

The PoC Approach

The full Engram paper describes in-model memory. The engram-poc repo approximates the benefits through behavioral fine-tuning:

Pattern Injection: Training data encodes lookup-like patterns
LoRA Adapters: Learn to recognize and consistently respond
Evaluation: Compare baseline vs tuned model

Pattern Categories

The PoC includes 131 patterns across 4 categories:

Category	Examples
Code Idioms	`for i in range(` → `len(items)):`
Factual Recall	`HTTP status for 'Not Found'?` → `404`
Format Transforms	`snake_case: getUserName` → `get_user_name`
Error Fixes	`Fix: if x = 5:` → `if x == 5:`

Results

Training on SmolLM-135M-Instruct:

Metric	Value
Training Examples	337
Training Time	~10 seconds (M-series Mac)
Loss Reduction	58.2% (4.34 → 1.82)

Behavioral change:

Prompt: Complete: for i in range(

Baseline:     "Here is a Python function that implements this approach..."
Engram-tuned: "len(items)):"

The tuned model produces direct, pattern-completing responses instead of verbose explanations.

Running the Engram Demo

git clone https://github.com/softwarewrighter/engram-poc
cd engram-poc

# Apple Silicon
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
./scripts/run_all.sh

# NVIDIA GPU (separate directory)
cd unsloth-nvidia
uv venv && source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu124
uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
./scripts/run_all.sh

Implementation Details

Metric	Value
Primary Language	Python
Source Files	24 `.py`, 10 `.sh`, 6 `.yaml`
Estimated Size	~3.0 KLOC
Frameworks	MLX-LM, Unsloth
Platforms	Apple Silicon, NVIDIA CUDA
Key Features	LoRA fine-tuning, pattern evaluation, interactive demo

Good for you if: You want to experiment with LoRA fine-tuning, understand behavioral pattern injection, or compare MLX vs Unsloth workflows.

Complexity: Moderate. Includes extensive documentation and video recording guides. Pattern data is human-readable YAML.

Key Takeaways

Engram reduces redundant computation. O(1) lookup for recurring patterns beats recomputing through attention.
LoRA makes experimentation accessible. Fine-tune small models in seconds on a laptop.
Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, with different tooling for each.
Deepseek publishes useful research. Their papers address real problems with practical solutions.

What’s Next

Part 3 will cover Engram Revisited—what happened when we moved from behavioral emulation to real hash-based memory implementation. Spoiler: it works, but not everywhere.

Resources

Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.


February 1, 2026 • Software Wrighter

692 words • 4 min read • Abstract

A 135M parameter model goes from 0% to 75% accuracy in 5 minutes. Using knowledge graph-guided training with rejection sampling, we teach multi-hop reasoning with scaffolding during training, then remove it at inference.

Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs

A tiny 135M parameter model goes from 0% to 75% accuracy in 5 minutes of training. The secret? Knowledge graph-guided training with rejection sampling.

Resource	Link
Paper	KG-Guided RAG (arXiv)
Code	multi-hop-reasoning
ELI5	eli5.md
Demo	Live Demo
Video	LLM with Training Wheels
Part 2	The Distribution Trap

The Problem: Multi-Hop Reasoning

LLMs struggle with questions requiring multiple reasoning steps. “What’s the fix for a crash caused by a corrupted config file on a system running outdated firmware?” requires connecting several facts:

Corrupted config → need config reset
Outdated firmware → need firmware update
Crash context → check dependencies between these fixes

Standard fine-tuning teaches pattern matching. Multi-hop reasoning requires following logical chains.

The Paper’s Approach

Learn with training wheels, remove them after learning completes.

Knowledge Graph-Guided RAG from Princeton proposes using knowledge graphs during training to score reasoning quality—then removing the graph at inference.

The key insight: train with scaffolding, test without it.

My Implementation

The repo implements this for a software troubleshooting domain:

Component	Details
Knowledge Graph	~200 entities, ~600 edges (symptoms, causes, fixes)
Training Data	MCQs with 1-3 hop paths
Eval Data	MCQs with 4-5 hop paths (harder)
Model	SmolLM-135M-Instruct
Framework	MLX (Apple Silicon native)

The Training Pipeline

┌─────────────────────────────────────────┐
│  1. SFT: Learn output format            │
│     TRACE: <reasoning>                  │
│     ANSWER: A|B|C|D                     │
├─────────────────────────────────────────┤
│  2. RSFT: Rejection Sampling FT         │
│     - Generate multiple answers         │
│     - Score with knowledge graph        │
│     - Keep only correct traces          │
│     - Train on winners                  │
└─────────────────────────────────────────┘
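
The RSFT step in the diagram can be sketched as a sample-and-filter loop. This is a simplified illustration: scoring is reduced to checking the final answer against the gold label, whereas the repo scores with the knowledge graph. The `toy_sampler` function is a hypothetical stand-in for sampling from the SFT model.

```python
import random


def rejection_sample(question, gold, sample_fn, k=8, seed=0):
    """Draw k candidate (trace, answer) pairs and keep those with the gold answer."""
    rng = random.Random(seed)
    candidates = [sample_fn(question, rng) for _ in range(k)]
    return [(trace, answer) for trace, answer in candidates if answer == gold]


def toy_sampler(question, rng):
    # Hypothetical model call: guesses one of the four options at random.
    answer = rng.choice("ABCD")
    return (f"TRACE: guessed {answer}", answer)


question = "Which fix applies?"
winners = rejection_sample(question, gold="B", sample_fn=toy_sampler, k=32)

# The surviving samples become ordinary fine-tuning rows in the SFT output format.
sft_rows = [
    {"prompt": question, "completion": f"{trace}\nANSWER: {answer}"}
    for trace, answer in winners
]
```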

The Reward Function

The knowledge graph scores outputs during training:

R_corr: +1.0 correct answer, -2.0 incorrect
R_path: Entity coverage (did the trace mention relevant nodes?)
P_spam: -0.5 penalty for repeating entities (prevents gaming)

At inference, the graph is removed. The model must reason from learned patterns.
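
A minimal sketch of how the three terms might combine. Only the +1.0, -2.0, and -0.5 constants come from the list above. The exact form of R_path (fraction of path entities mentioned) and the trigger for P_spam (any entity appearing more than twice) are assumptions made for illustration.

```python
def reward(trace, answer, gold, path_entities):
    """Toy reward = R_corr + R_path + P_spam for one sampled output."""
    text = trace.lower()
    r_corr = 1.0 if answer == gold else -2.0
    counts = [text.count(entity.lower()) for entity in path_entities]
    # R_path (assumed form): fraction of graph-path entities mentioned at least once.
    r_path = sum(c > 0 for c in counts) / len(path_entities) if path_entities else 0.0
    # P_spam (assumed form): flat penalty if any entity appears more than twice.
    p_spam = -0.5 if any(c > 2 for c in counts) else 0.0
    return r_corr + r_path + p_spam


path = ["corrupted config", "config reset"]
good = reward("corrupted config implies config reset", "A", "A", path)
spam = reward("config reset config reset config reset", "A", "A", path)
wrong = reward("no idea", "C", "A", path)
```

The function needs `path_entities` from the graph, so it can only run during training. At inference there is nothing to score against.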

Results

Phase	Accuracy	Training Time
Base model	0%	-
After SFT	30%	~2 min
After RSFT	75%	~3 min

The critical finding: distribution matching matters.

Training on easy examples (1-2 hops) hurt performance on hard eval (4-5 hops). Training on examples matching the eval distribution achieved 75%.

Running It

git clone https://github.com/softwarewrighter/multi-hop-reasoning
cd multi-hop-reasoning

# Setup (Apple Silicon)
make setup-mlx

# Full pipeline
make train

Results appear in ~5 minutes on an M-series Mac.

Implementation Details

Metric	Value
Primary Language	Python
Source Files	12 `.py` files
Estimated Size	~1.5 KLOC
Framework	MLX, Transformers
Platform	Apple Silicon (MLX native)

Good for you if: You want to understand knowledge graph-guided training, experiment with rejection sampling fine-tuning, or see how small models can learn reasoning patterns.

Complexity: Moderate. Clean codebase with Make targets for each step. Requires understanding of fine-tuning concepts.

Key Takeaways

Scaffolded training works. Use structured feedback during training, remove it at inference.
Distribution matching matters. Train on examples that match your eval distribution.
Small models can reason. 135M parameters is enough for 75% accuracy on 4-5 hop questions.
MLX makes iteration fast. Full pipeline runs in 5 minutes on a MacBook.

Resources

Knowledge graphs as training wheels—helping small models learn to reason, then letting go.


February 1, 2026 • Software Wrighter

765 words • 4 min read • Abstract

AI in your pocket, no internet required. Pocket Eliza++ runs MobileLLM-350M on Android via llama.cpp and JNI, creating a privacy-first therapist chatbot. The 260MB quantized model achieves ~10 tokens/second on mid-range phones.

Small Models (2/6): AI in Your Pocket

AI on your phone. All day. No internet required.

This is Part 2 of the Small Models, Big Brains series. Today we’re putting a language model in your pocket with Pocket Eliza++—a modern AI therapist that runs completely offline on Android.

Resource	Link
Paper	MobileLLM (ICML 2024)
Code	pocket-llm
Runtime	llama.cpp
Video	AI in Your Pocket

Why Offline Matters

Benefit	Description
Privacy	Data never leaves your device
Speed	No network latency
Cost	No API fees
Offline	Works without internet
Battery	Efficient on-device inference

Cloud AI is convenient, but sometimes you want a conversation that stays on your device.

MobileLLM: Meta’s Edge Champion

MobileLLM is Meta’s sub-500M parameter model optimized specifically for on-device inference.

Architecture Optimizations

Technique	Benefit
Deep-thin design	More layers, fewer parameters per layer
SwiGLU activation	Better performance than ReLU
Embedding sharing	Saves 30% of parameters
Grouped-query attention	Faster inference

The result: a 260MB quantized model (Q4_K_M) that runs smoothly on phones.

Pocket Eliza++

[Image: Eliza taking notes]

The original ELIZA (1966) used pattern matching to simulate a Rogerian therapist. Pocket Eliza++ uses the same therapeutic approach but with actual language understanding.

Therapeutic Design

The system prompt instructs the model to:

Ask one short question at a time
Never repeat questions
Vary question types (feelings, motivations, specifics)
Never give advice or explanations

It’s a reflective listener, not a problem solver.
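
The rules above could be assembled into a system prompt along these lines. The wording of the prompt is hypothetical, since the app's actual prompt text is not shown here, and the `acceptable_reply` post-check is an extra illustration rather than something the app is known to do.

```python
RULES = [
    "Ask one short question at a time",
    "Never repeat questions",
    "Vary question types (feelings, motivations, specifics)",
    "Never give advice or explanations",
]


def build_system_prompt(rules):
    """Join the rules into one system prompt string (wording is hypothetical)."""
    lines = ["You are a reflective listener in the style of a Rogerian therapist."]
    lines += [f"- {rule}." for rule in rules]
    return "\n".join(lines)


def acceptable_reply(reply, history):
    """Cheap check: exactly one question, and not one already asked."""
    text = reply.strip()
    asked = {h.strip().lower() for h in history}
    return text.endswith("?") and text.count("?") == 1 and text.lower() not in asked


prompt = build_system_prompt(RULES)
ok = acceptable_reply("How did that make you feel?", history=[])
repeat = acceptable_reply("How did that make you feel?", history=["how did that make you feel?"])
```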

Technical Stack

┌─────────────────────────────────┐
│     Kotlin + Jetpack Compose    │  UI Layer
├─────────────────────────────────┤
│            JNI Bridge           │
├─────────────────────────────────┤
│           llama.cpp             │  Inference Engine
├─────────────────────────────────┤
│    MobileLLM-350M (Q4_K_M)      │  Model (260MB)
└─────────────────────────────────┘

Model: MobileLLM-350M quantized to Q4_K_M (260MB GGUF)
Runtime: llama.cpp compiled for Android via NDK
Interface: Kotlin + Jetpack Compose
Bridge: JNI bindings connect Kotlin to native llama.cpp

Building the App

# Clone the repository
git clone https://github.com/softwarewrighter/pocket-llm
cd pocket-llm/android-demo

# Clone llama.cpp into native source
git clone https://github.com/ggerganov/llama.cpp.git \
    app/src/main/cpp/llama.cpp

# Download the model (260MB)
mkdir -p app/src/main/assets
curl -L -o app/src/main/assets/MobileLLM-376M-Q4_K_M.gguf \
    "https://huggingface.co/pjh64/MobileLLM-350M-GGUF/resolve/main/MobileLLM-376M-Q4_K_M.gguf"

# Build and install
./gradlew assembleDebug
adb install -r app/build/outputs/apk/debug/app-debug.apk

Build Requirements

Requirement	Value
Target SDK	35 (Android 15)
Min SDK	28 (Android 9.0)
ABI	arm64-v8a
NDK	CMake for native build
Kotlin	2.0.0

Quick CLI Demo

Don’t want to build the Android app? Test with Ollama:

pip install -r requirements.txt
ollama pull smollm:360m
python3 eliza.py

Performance

On a mid-range Android phone (Snapdragon 7 series):

First token: ~500ms
Generation: ~10 tokens/second
Memory: ~400MB RAM
Battery: Minimal impact for short sessions

Implementation Details

Metric	Value
Languages	Kotlin (UI), Python (CLI), C++ (JNI)
Source Files	6 `.kt`, 4 `.py`, 2 `.cpp`
Estimated Size	~1.3 KLOC
Android Target	SDK 28+ (Android 9.0)
Build System	Gradle + CMake (NDK)
Key Dependency	llama.cpp (vendored)

Good for you if: You want to deploy LLMs on Android, learn JNI/NDK integration, or build privacy-focused mobile AI apps.

Complexity: Moderate-High. Requires Android Studio, NDK setup, and understanding of JNI bridges. The llama.cpp integration is the tricky part; the Kotlin UI is straightforward Jetpack Compose.

Key Takeaways

Sub-500M models are phone-ready. MobileLLM proves useful AI fits in your pocket.
llama.cpp is the universal runtime. Same engine runs on Mac, Linux, Windows, and Android.
Privacy doesn’t require sacrifice. Offline AI can still be conversational and helpful.
Quantization is essential. Q4_K_M brings 350M parameters down to 260MB with minimal quality loss.

What’s Next

Part 3 explores the Hierarchical Reasoning Model (HRM)—a 27M parameter model that beats o3-mini on abstract reasoning.


February 1, 2026 • Software Wrighter

760 words • 4 min read • Abstract

Implementing Deepseek's mHC (Manifold-Constrained Hyper-Connections) paper. Using Sinkhorn-Knopp iteration to create doubly-stochastic matrices, mHC maintains training stability at 48 layers where standard hyper-connections explode. Cross-platform validation on Apple Silicon and NVIDIA.

Deepseek Papers (1/3): mHC - Training Stability at Any Depth

Deepseek publishes papers. I implement them. This paper tackles a fundamental transformer problem: training stability in deep networks.

This post covers my implementation of mHC (Manifold-Constrained Hyper-Connections)—running on both Apple Silicon and NVIDIA GPUs.

Resource	Link
Paper	arXiv:2512.24880
Code	mHC-poc
ELI5	eli5-mHC.md
ELI4	eli4-mHC.md
Video 1	mHC Demo
Video 2	mHC Explained
Video 3	mHC Results

The Problem: Deep Networks Explode

Residual connections revolutionized deep learning. Skip connections let gradients flow through hundreds of layers. But there’s a catch.

Standard residual connections:

output = layer(input) + input

This works, but the signal accumulates. With many layers, small amplifications compound into instability.

Hyper-Connections (HC) tried to fix this by learning connection weights:

output = α₁ × layer(input) + α₂ × input

Better expressiveness, but learned weights can still cause explosion. At 48 layers, HC becomes unstable.

The mHC Solution: Doubly-Stochastic Constraints

mHC constrains the connection weights using Sinkhorn-Knopp iteration—a mathematical technique that ensures weights form a doubly-stochastic matrix.

What does “doubly-stochastic” mean?

All entries are non-negative
Each row sums to 1
Each column sums to 1

This bounds the total signal flow. No matter how deep the network, amplification stays controlled.

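import torch

# Sinkhorn-Knopp iteration (simplified); assumes all entries are strictly positive
def make_doubly_stochastic(weights, iterations=5):
    for _ in range(iterations):
        # keepdim=True keeps the sums broadcastable along the axis being normalized
        weights = weights / weights.sum(dim=0, keepdim=True)  # Column normalize
        weights = weights / weights.sum(dim=1, keepdim=True)  # Row normalize
    return weights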
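
To see why the constraint bounds amplification, the following NumPy sketch multiplies 48 random positive mixing matrices with and without the projection. It is a toy experiment, not the repo's depth sweep: the 4x4 matrix size, the random initialization, and the use of the largest singular value as a stand-in for the gain proxy are all assumptions.

```python
import numpy as np


def sinkhorn(weights, iterations=50):
    """Alternately normalize columns and rows of a strictly positive matrix."""
    w = np.asarray(weights, dtype=np.float64)
    for _ in range(iterations):
        w = w / w.sum(axis=0, keepdims=True)  # each column sums to 1
        w = w / w.sum(axis=1, keepdims=True)  # each row sums to 1
    return w


rng = np.random.default_rng(0)
depth, width = 48, 4
raw_product = np.eye(width)
ds_product = np.eye(width)
for _ in range(depth):
    mix = np.exp(0.5 * rng.standard_normal((width, width)))  # positive, unconstrained
    raw_product = mix @ raw_product
    ds_product = sinkhorn(mix) @ ds_product

raw_gain = np.linalg.norm(raw_product, 2)  # largest singular value of the product
ds_gain = np.linalg.norm(ds_product, 2)
```

A doubly-stochastic matrix has a largest singular value of exactly 1, and a product of such matrices is again doubly stochastic. So `ds_gain` stays at about 1 regardless of depth, while `raw_gain` grows geometrically.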

Results: Stability at Depth

The mHC-poc repo stress-tests this with a depth sweep:

Depth	Baseline	HC	mHC
12 layers	Stable	Stable	Stable
24 layers	Struggling	Stable	Stable
48 layers	Oscillating	Explodes	Stable

At 48 layers:

HC gain proxy: 10²⁷ (catastrophic amplification)
mHC gain proxy: 10⁻⁰·⁶ (bounded, healthy)

HC’s final loss at 48 layers: 5.54 (never learns)
mHC’s final loss at 48 layers: 0.0002 (perfect convergence)

Cross-Platform Validation

The implementation runs on both Apple Silicon (MLX) and NVIDIA (PyTorch/CUDA):

Metric	MLX (Apple)	CUDA (NVIDIA)
Gain Proxy, log₁₀ (24L)	-0.6	-0.602
Gradient Stability	Stable	Stable
NaN Events	0	0

Near-identical results (-0.6 vs -0.602, with zero NaN events on either side) indicate the Sinkhorn-Knopp projection behaves the same on both platforms.

Running the mHC Demo

git clone https://github.com/softwarewrighter/mHC-poc
cd mHC-poc

# Apple Silicon (MLX)
uv venv && source .venv/bin/activate
uv pip install -r mlx/requirements.txt
bash scripts/run_depth_sweep.sh

# NVIDIA (CUDA)
cd cuda
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt
bash scripts/run_cuda_depth_sweep.sh

Results go to runs/ with plots showing loss, gradient norms, and gain proxy across depths.

Implementation Details

Metric	Value
Primary Language	Python
Source Files	29 `.py`, 3 `.sh`, 10 `.yaml`
Estimated Size	~2.5 KLOC
Frameworks	MLX, PyTorch
Platforms	Apple Silicon, NVIDIA CUDA
Key Features	Depth sweep, cross-platform validation, visualization

Good for you if: You want to understand mHC’s stability benefits, compare MLX vs PyTorch implementations, or experiment with residual connection variants.

Complexity: Moderate. Well-documented with ELI5 explanations in docs/. Requires understanding of residual connections and matrix constraints.

Key Takeaways

mHC solves deep network instability. Doubly-stochastic constraints bound signal amplification at any depth.
Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, validated to produce matching results (gain proxy -0.6 vs -0.602).
Deepseek publishes useful research. Their papers address real problems with practical solutions.

What’s Next

Part 2 covers Engram—Deepseek’s approach to reducing redundant computation through conditional memory.

Resources

Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.


January 31, 2026 • Software Wrighter

703 words • 4 min read • Abstract

The best LLMs score zero on hard mazes. A model with 976 parameters scores 85%. The Tiny Recursive Model uses think-act cycles with deep supervision, proving iteration beats scale for tasks requiring backtracking and spatial reasoning.

Small Models (1/6): 976 Parameters Beat Billions

The best large language models score zero on hard mazes. A model with under 1,000 parameters scores 85 percent.

This is Part 1 of the Small Models, Big Brains series, exploring how tiny models with clever architectures outperform massive ones on specific tasks.

Resource	Link
Paper	Tiny Recursive Model
Code	train-trm
Video	976 parameters is more than billions?!

Why LLMs Fail at Mazes

Large language models generate one token at a time. They cannot backtrack. One wrong move and the entire solution fails.

Maze solving requires:

Exploring dead ends
Backtracking when stuck
Maintaining spatial awareness
Planning multiple steps ahead

Plain autoregressive generation, with no search or revision step, is poorly suited to these requirements.

Meet TRM: The Tiny Recursive Model

The Tiny Recursive Model uses under 1,000 parameters. Instead of being bigger, it thinks in loops.

Input → Think → Act → Think → Act → ... → Output

A simple two-layer network that iterates until the solution emerges.

The Architecture

TRM alternates between two phases:

Phase	Purpose
Think	Update internal latent state by processing input, current answer, and previous state
Act	Update the answer based on the refined latent state

This process repeats for multiple cycles, progressively improving the output.

TRMConfig {
    input_dim: 5,
    output_dim: 5,
    hidden_dim: 16,
    latent_dim: 16,
    l_layers: 2,      // Network depth
    h_cycles: 3,      // Outer think-act cycles
    l_cycles: 4,      // Inner think cycles
}
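
The think-act loop can be sketched in a few lines of NumPy. This is a simplified illustration rather than the repo's Rust implementation. Each phase here is a single linear map (with tanh on the think step) instead of the two-layer network in the config, `hidden_dim` is not used, and the weight names `w_think` and `w_act` are hypothetical.

```python
import numpy as np


def trm_forward(x, params, h_cycles=3, l_cycles=4, latent_dim=16):
    """Run think-act cycles and return the answer after each outer cycle."""
    y = np.zeros_like(x)      # current answer (output_dim equals input_dim here)
    z = np.zeros(latent_dim)  # latent state
    answers = []
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            # Think: refine the latent from input, current answer, and previous latent.
            z = np.tanh(params["w_think"] @ np.concatenate([x, y, z]))
        # Act: update the answer from the refined latent.
        y = params["w_act"] @ z
        answers.append(y)
    return answers


rng = np.random.default_rng(0)
input_dim, latent_dim = 5, 16
params = {
    "w_think": 0.1 * rng.standard_normal((latent_dim, 2 * input_dim + latent_dim)),
    "w_act": 0.1 * rng.standard_normal((input_dim, latent_dim)),
}
answers = trm_forward(rng.standard_normal(input_dim), params)
```

With `h_cycles: 3` and `l_cycles: 4`, the same small set of weights is applied 12 times in the think phase and 3 times in the act phase.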

The Secret: Deep Supervision

The key insight isn’t just recursion—it’s supervising every step, not just the final answer.

Traditional training:

Input → [black box] → Final Output → Loss

TRM training:

Input → Step 1 → Loss₁
      → Step 2 → Loss₂
      → Step 3 → Loss₃
      → ...
      → Final  → Loss_n

Every iteration gets feedback. The model learns to make progress at each step.
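
In code, deep supervision amounts to adding up one loss term per intermediate answer. A minimal sketch, assuming mean squared error and an unweighted sum (the post does not state which loss or weighting the repo uses):

```python
import numpy as np


def deep_supervision_loss(step_outputs, target):
    """Sum a mean-squared-error term over every intermediate answer."""
    per_step = [float(np.mean((out - target) ** 2)) for out in step_outputs]
    return sum(per_step), per_step


target = np.array([1.0, 0.0])
steps = [
    np.array([0.0, 0.0]),  # step 1: far from the target
    np.array([0.5, 0.0]),  # step 2: closer
    np.array([1.0, 0.0]),  # final: exact
]
total, per_step = deep_supervision_loss(steps, target)
```

Traditional training would use only the last entry of `per_step`. Here the early steps contribute too, so the gradient rewards progress at every iteration.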

Results

Model	Maze Accuracy
GPT-4	~0% on hard mazes
Claude	~0% on hard mazes
TRM (976 params)	85%

Iteration beats scale.

Running the Code

The train-trm repo provides a complete Rust implementation:

# Clone and build
git clone https://github.com/softwarewrighter/train-trm
cd train-trm
./scripts/build.sh --release

# Train a model
./scripts/train.sh --epochs 1000 --lr 0.01

# Evaluate
./scripts/eval.sh

# Or launch the web UI
cargo install --locked trunk
./scripts/web-serve.sh

The web UI includes interactive maze visualization with solution paths and real-time training charts.

Implementation Details

Metric	Value
Primary Language	Rust
Source Files	21 `.rs` files
Estimated Size	~2.5 KLOC
Also Includes	HTML (web UI), Shell scripts
Build System	Cargo + Trunk (WASM)
Dependencies	ndarray, serde, clap, wasm-bindgen

Good for you if: You want to learn Rust ML from scratch, experiment with recursive architectures, or need a web-based training visualization.

Complexity: Moderate. Clean Rust code with good documentation. The neural network is implemented from scratch (no PyTorch/TensorFlow), making it educational but requiring Rust familiarity.

Key Takeaways

Parameter count isn’t everything. Architecture and training strategy matter more for certain tasks.
Recursion enables backtracking. By iterating, TRM can explore and refine solutions.
Deep supervision accelerates learning. Feedback at every step, not just the end.
Task-specific models excel. TRM isn’t a general-purpose LLM—it’s optimized for maze-like reasoning.

What’s Next

Part 2 explores MobileLLM and running AI completely offline on your Android phone.


January 30, 2026 • Software Wrighter

1013 words • 6 min read • Abstract

Introduction to Software Wrighter Lab: a blog, YouTube channel, and GitHub repos exploring AI coding agents, systems programming in Rust, and practical ML implementations. Written by Mike Wright, a software engineer with 40+ years of experience from mainframes to modern AI.

Welcome to Software Wrighter Lab

Welcome to Software Wrighter Lab—a blog, YouTube channel, Discord server, and GitHub repos for exploring the intersection of AI coding agents, systems programming, and practical machine learning.

I’m Mike Wright, a software engineer with over four decades of experience, currently focused on AI-assisted development with Rust and WebAssembly.

Quick Links
YouTube	@SoftwareWrighter
GitHub	softwarewrighter
Discord	SW Lab

Contents:

About Me
Programming Languages
What This Blog Covers
Why “Software Wrighter”?
What to Expect
Current Projects
Technology Stack
Get Involved
What’s Next

About Me

I’ve been writing code professionally for over 35 years—an Emacs user since 1989, still going strong.

My background spans mainframes to startups:

IBM Data Processing Division - MVS Dynamic Reconfiguration and Standalone Dump (SADUMP)
IBM T.J. Watson Research - Advisory Programmer on MVS Batch Pipes, Automatic Restart Manager, Java Record I/O, and IMS Data Sharing
Forte Software / Sun Microsystems - Senior Programmer on Forte 4GL/Conductor/Fusion, Open Enterprise Service Bus, and Glassfish
Startups - Individual contributor and management roles including LogiCoy (Open ESB), Likestream (Facebook Clojure App), Guidewire (Platform), Illumio (Network Security Web UI), and Signifyd (Gradle/Docker performance tuning)

Areas I’ve worked in: mainframe O/S development, EAI/B2B middleware, platform engineering, build/release engineering, and embedded programming.

Programming Languages

Over the years, I’ve written production code in:

Era	Languages
Mainframe	APL, Assembler (S/370, S/390), IBM PL/S, PL/AS, PL/X, CMS/TSO Pipelines
Systems	C, C++
Enterprise	Java, Forte 4GL, Guidewire Gosu, Groovy
Web/Modern	JavaScript, TypeScript, Go, Clojure, ClojureScript
Current	Elisp, JavaScript, Kotlin, Python, Rust, WebAssembly

Each language taught me something different about how to think about problems. APL taught me array thinking. Assembler taught me what the machine is actually doing. CMS/TSO Pipelines taught me dataflow composition (an area I plan to revisit in Throwback Thursday posts). Lisp (via Clojure) taught me functional composition. Rust is teaching me ownership and fearless concurrency.

I’m a lifelong learner. When Rust emerged as a modern systems language, I dove in. When AI coding agents became capable enough to be genuine collaborators, I started exploring how they change the way we build software.

This blog and the accompanying YouTube channel document that exploration.

What This Blog Covers

Software Wrighter Lab focuses on three main areas:

1. AI Coding Agents

How do tools like Claude Code, Cursor, and other AI assistants actually perform on real projects? I build the same applications with different agents to compare their strengths and weaknesses.

Vibe coding comparisons (Claude vs GLM, different models)
Practical workflows (parallel coding with git worktrees, hooks, custom scripts)
Tool development (guardian-cli, proact, ralphy)

2. Machine Learning Research Implementation

When interesting ML papers come out, I implement them to understand how they work. The goal isn’t to compete with research labs—it’s to learn by building.

Recent implementations include:

Tiny Recursive Model (TRM) - Under 1,000 parameters solving mazes
Hierarchical Reasoning Model (HRM) - Planner-Doer architecture for abstract reasoning
MobileLLM - Running LLMs offline on Android
Deepseek papers (mHC, Engram) - Novel architectures for efficient inference
MIT’s Recursive Language Model - Implemented in Rust with WASM

3. Rust, WebAssembly, and Practical Tools

Rust is my language of choice for new projects. Combined with WebAssembly, it enables building tools that run anywhere—CLI, browser, or embedded.

Topics include:

Rust/Yew/WASM web applications
Visualization (Three.js, d3.js, pure CSS approaches)
Video production tools (TTS, lip sync, explainer generation)
Developer utilities (installation scripts, repo assistants, modularizers)

Why “Software Wrighter”?

A “wright” is a craftsperson—someone who builds things. A wheelwright builds wheels. A playwright builds plays.

A Software Wrighter builds software, with attention to craft.

The name reflects my belief that good software comes from treating programming as a craft: learning continuously, choosing tools deliberately, and building things that work well and last.

What to Expect

Posts on this blog will typically include:

Links to papers, repos, and videos (above the fold)
Implementation details (language, LOC, complexity assessment)
Working code you can clone and run
Honest assessments of what works and what doesn’t

I’m not trying to sell you anything. This is a lab notebook—a record of experiments, some successful, some not.

Current Projects

As of February 2026, I’m actively working on:

Project	Description	Status
Small Models, Big Brains	6-part series on efficient LLMs	Publishing
Deepseek papers	mHC and Engram implementations	In progress
Explainer pipeline	AI-generated video production	Ongoing
RLM implementations	Recursive Language Models in Rust	Complete

Technology Stack

Most of my current work uses:

Layer	Technology
Systems	Rust
Web	Yew, WASM, TypeScript
ML	Python, PyTorch, HuggingFace
AI Agents	Claude Code, Cursor
Video	OBS, FFmpeg, TTS tools

Get Involved

If any of this resonates with you:

Subscribe to the YouTube channel for video content
Star repos on GitHub that interest you
Join the Discord server to discuss

I’m always interested in discussing these topics with other engineers exploring similar territory.

What’s Next

The first content series, Small Models, Big Brains, starts tomorrow. It’s a 6-part deep dive into small language models that outperform much larger ones on specific tasks:

TRM: 976 parameters beating GPT-4 on mazes
MobileLLM: AI running offline on your phone
HRM: 27M parameters beating o3-mini on abstract reasoning
BDH: A language model with visible, interpretable activations
Billion-parameter models: The efficiency sweet spot
The 2-3B efficient frontier: Phi-2, Gemma, SmolLM

Each post maps to a YouTube video, a GitHub repo, and working code you can run yourself.

Thanks for reading. Let’s build something interesting.

Mike Wright
Software Wrighter LLC
San Francisco Bay Area

January 29, 2026 • Software Wrighter

1138 words • 6 min read • Abstract

My first program was a horse race game in APL on an IBM mainframe in 1972. This Throwback Thursday post recreates it using GNU APL, exploring array-oriented programming and the ideas that shaped languages from J to NumPy.

TBT (1/?): My First Program Was a Horse Race

My first program was a horse race. Written in APL. On a mainframe. In 1972.

This is the first Throwback Thursday post—a series where I revisit the technologies, languages, and ideas that shaped how I think about programming.

Resource	Link
Code	apl-horse-race
Demo	Live Demo
GNU APL	gnu.org/software/apl
Video	Greek Code, No Lowercase #TBT

APL: A Programming Language

APL was created by Kenneth Iverson at IBM in the 1960s. The name literally means “A Programming Language”—Iverson was a mathematician who designed it as a notation for describing algorithms before it became an actual programming language.

What made APL special:

Feature	Description
Array-oriented	Operations work on entire arrays, not single values
Symbolic notation	Greek letters and mathematical symbols as operators
Interactive	REPL-style development decades before it was common
Terse	Complex operations in a few characters

APL programs look like nothing else:

POS←POS+?5⍴3

This single line adds random values (1-3) to all five horse positions simultaneously. No loops. No iteration. The operation just happens across the entire array.

The IBM 2741 Experience

In 1972, APL\360 ran on IBM mainframes. You accessed it through terminals like the IBM 2741—essentially a modified Selectric typewriter with a special APL typeball.

[Image: APL typeball for the IBM Selectric]

The typeball had all the APL glyphs: ⍴ ⍳ ∇ ⎕ ← ⌈ ⌊ ⍵ ⍺ ∘ ⊃ ⊂ and dozens more. You physically typed these symbols. The keyboard layout was completely different from anything you’d seen before.

When you made an error, there was no backspace in the modern sense. You’d overstrike characters or start the line over. Programs were stored in workspaces, saved to tape or disk.

The terminal printed on paper. Every interaction left a physical record.

The Horse Race Program

Horse race simulations were popular APL demonstrations. They showed off several things:

Random number generation (? roll operator)
Array operations (updating all positions at once)
Character graphics (crude but effective visualization)
Interactive output (watching the race unfold)

Here’s the verbose version from the repo:

∇ RACE;HORSES;POS;FINISH;ROUND;_
  HORSES←'LUCKY  ' 'THUNDER' 'SHADOW ' 'COMET  ' 'BLAZE  '
  POS←5⍴0
  FINISH←15
  ROUND←0
  ⎕←'══════════════════════════════════════════'
  ⎕←'           THE RACE IS ON!'
  ⎕←'══════════════════════════════════════════'
LOOP:ROUND←ROUND+1
  ⎕←'--- ROUND ',(⍕ROUND),' ---'
  POS←POS+?5⍴3
  SHOWHORSES
  →DONE×⍳∨/POS≥FINISH
  →LOOP
DONE:⎕←'WINNER: ',((⊃(POS=⌈/POS)/⍳5)⊃HORSES),'!'
∇

Key APL Idioms

Array creation:

POS←5⍴0    ⍝ Create array of 5 zeros

The ⍴ (rho) operator reshapes. 5⍴0 means “reshape 0 into a 5-element array.”

Random numbers:

?5⍴3       ⍝ Roll 5 dice, each 1-3

The ? operator is “roll”—like rolling dice. ?5⍴3 rolls five 3-sided dice.

Finding the winner:

(⊃(POS=⌈/POS)/⍳5)⊃HORSES

This reads right-to-left:

⌈/POS — maximum of all positions
POS=⌈/POS — boolean mask: which horses are at max?
/⍳5 — compress: keep only those indices
⊃ — take the first one
⊃HORSES — select that horse’s name

One line. No loops. Pure array thinking.
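
For readers more at home in NumPy, the same idioms translate almost symbol for symbol. This is a rough equivalent for comparison, not code from the repo. The `race` function and its `seed` parameter are introduced here for illustration, and the display routine is left out.

```python
import numpy as np

HORSES = ["LUCKY", "THUNDER", "SHADOW", "COMET", "BLAZE"]
FINISH = 15


def race(seed=0):
    rng = np.random.default_rng(seed)
    pos = np.zeros(5, dtype=int)                 # POS←5⍴0
    rounds = 0
    while not (pos >= FINISH).any():             # ∨/POS≥FINISH
        pos = pos + rng.integers(1, 4, size=5)   # POS←POS+?5⍴3 (values 1 to 3)
        rounds += 1
    winner = HORSES[int(np.argmax(pos))]         # first index at the maximum
    return winner, rounds, pos


winner, rounds, pos = race()
```

`np.argmax` returns the first index at the maximum, which matches taking the first element of the compressed index list in the APL expression.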

The Idiomatic Version

APL programmers pride themselves on terseness. The idiomatic version does the same thing in fewer characters:

HORSES←'LUCKY  ' 'THUNDER' 'SHADOW ' 'COMET  ' 'BLAZE  '

∇ SHOW;I
  I←1
N:⎕←(I⊃HORSES),'│',((I⊃POS)⍴'░'),'▓'
  I←I+1
  →N×⍳I≤5
∇

∇ RACE;POS;_
  POS←5⍴0
  ⎕←'THE RACE IS ON!'
L:_←⎕DL 0.3
  POS←POS+?5⍴3
  SHOW
  ⎕←''
  →L×⍳~∨/POS≥15
  ⎕←'WINNER: ',(⊃(POS=⌈/POS)/⍳5)⊃HORSES
∇

The entire program fits on a single screen. This was the APL aesthetic: powerful ideas expressed concisely.

Running It Today

GNU APL implements ISO 13751 (Extended APL) and runs on modern systems:

# macOS
brew install gnu-apl

# Arch Linux
yay -S gnu-apl

# Run the race
git clone https://github.com/sw-comp-history/apl-horse-race
cd apl-horse-race
apl -f src/race.apl

Sample output:

══════════════════════════════════════════
           THE RACE IS ON!
══════════════════════════════════════════

--- ROUND 1 ---
LUCKY   │▓▓▓◄
THUNDER │▓▓◄
SHADOW  │▓◄
COMET   │▓▓▓◄
BLAZE   │▓▓◄

The horses advance randomly until one crosses the finish line.

What APL Taught Me

APL shaped how I think about programming in ways that persist today:

1. Think in arrays, not loops.

When I see a problem, I ask: can this be expressed as an operation on a whole collection? Languages like NumPy, R, and Julia carry this forward.

2. Notation matters.

Good notation can make complex ideas simple. Bad notation obscures them. APL’s symbols were controversial, but they made array operations visible in ways that verbose syntax doesn’t.

3. The REPL is powerful.

Interactive development—type an expression, see the result immediately—was central to APL decades before it became fashionable again with Jupyter notebooks and modern REPLs.

4. Terseness has value.

Not obfuscation for its own sake, but the ability to see an entire algorithm at once. When your program fits on one screen, you can reason about the whole thing.

APL’s Legacy

APL influenced many languages:

Language	Year	APL Influence
J	1990	Iverson’s ASCII-only redesign
K/Q	1993	Powers financial systems at Kx
A+	1988	Morgan Stanley’s open-source APL
BQN	2020	Modern APL with cleaner semantics
NumPy	2006	Array operations in Python
R	1993	Vector operations for statistics

The ideas live on, even if the glyphs don’t.

Implementation Details

Metric	Value
Primary Language	APL
Source Files	2 `.apl` files
Lines of Code	~50 lines total
Runtime	GNU APL
Also Includes	Documentation, PNG samples for Unicode issues

Good for you if: You want to understand array programming origins, learn basic APL, or experience what programming felt like in the 1970s.

Complexity: Low. The program is intentionally simple—a teaching example, not production code. The repo includes extensive documentation explaining every line.

Why Throwback Thursday?

Programming didn’t start with Python and JavaScript. Every abstraction we use today was invented by someone solving a real problem.

TBT posts will explore:

Languages that shaped my thinking (APL, Lisp, Forth)
Technologies that were ahead of their time (CMS/TSO Pipelines, dataflow)
Ideas worth revisiting with modern tools

Understanding where we came from helps us see where we’re going.

Resources

Have your own “first program” story? Find me on YouTube @SoftwareWrighter.
