Abstract

    Embedded (3/?): How Much of Forth Can Be Forth? A Kernel Self-Hosting Spectrum

    How much of a Forth kernel can be written in Forth instead of assembly? The question has an obvious answer (“as much as possible”) and a less obvious answer (“it depends on which phase of the bootstrap you’re in”). This post walks through four points along that spectrum for the COR24 Forth kernel: two phases shipped, a third in progress with its first subsets landing, and a fourth on the horizon.

    It’s a deep dive — every movement of a word from .s to .fth changes the bootstrap ordering, the primitive set, and what the next movement looks like. Forth is an unusually good language for showing its own self-extending nature, and the phases in sequence read like one of Escher’s drawings: each hand sketching the other.

    Why this matters — Self-hosting is the final test that a language is expressive enough for systems work. Moving Forth words from assembly into Forth itself shows exactly where the irreducible floor is: the primitives that must be machine code. Everything above that floor can, in principle, live in .fth source.

    Resource Link
    Play in Browser COR24 Forth Demo — three tabs: forth.s (phase 1), forth-in-forth (phase 2, default), forth-on-forthish (phase 3 in progress)
    Forth Interpreter (CLI) sw-embed/sw-cor24-forth
    Web Demo sw-embed/web-sw-cor24-forth
    Approach Comparison docs/future.md
    Phase 2 Status forth-in-forth/docs/status.md
    Closed Issues #1 hashed dictionary · #2 DO/LOOP & friends
    Prior Post Embedded (2/?): COR24 Assembly Emulator
    Comments Discord

    The Four Approaches

    A single axis: what fraction of the kernel is hand-written assembly, and what fraction is Forth? Four labeled points along it:

    # Name Directory Where the kernel comes from
    1 All-asm kernel ./ (repo root) Hand-written .s
    2 Tiered Forth on a slimmed kernel ./forth-in-forth/ Hand-written .s + hand-written .fth
    3 Minimal-primitive kernel ./forth-on-forthish/ Smaller .s (a Forth-ish primitive substrate) + larger .fth
    4 Self-hosted via cross-compiler ./forth-from-forth/ Hand-written Forth compiler emits the .s

    The preposition family — in / on-ish / from — signals what the kernel is to the Forth code on top of it:

    • In approach 2, Forth runs in a slimmed asm host.
    • In approach 3, the substrate is so reduced it’s barely asm any more — Forth runs on something that’s already Forth-ish.
    • In approach 4, the kernel itself comes from Forth (Forth source emits the .s).

    Phase 1: All-Asm Kernel — Where We Started

    forth.s as a single self-contained file. Every word is assembly, including IF/THEN, ., WORDS, .S, \, (, and so on. About 3000 lines of asm, 3879 bytes assembled. Still the canonical kernel for the web frontend and the existing reg-rs regression tests.

    This was the right starting point. A single-file kernel is easy to debug, easy to load, and explicit about every mechanism. The cost: it doesn’t show Forth’s most characteristic feature — self-extension — because everything is already defined in asm. There’s no moment where Forth makes itself bigger.

    Phase 2: forth-in-forth — Shipped Today

    forth-in-forth/kernel.s plus four tiered .fth files: core/minimal.fth, lowlevel.fth, midlevel.fth, highlevel.fth. The kernel keeps only what must be asm — the threading layer, ALU primitives, hardware I/O, the dict-text triplet (WORD/FIND/NUMBER), the outer loop (INTERPRET/QUIT), and :/;. Everything else moved to .fth.

    The migration happened in 11 subsets, each a single commit:

    Subset Description Commit
    1 Baseline fib example, demo, reg-rs test 86edf74
    2 Scaffold forth-in-forth/ directory 94e76b2
    3 Move IF/THEN/ELSE/BEGIN/UNTIL to Forth 686c65f
    4 Move \ and ( to Forth (add EOL! primitive) 71e1627
    5 Stack & arith helpers in core/lowlevel.fth 7d0037c
    6 = and 0= via XOR ce57489
    7 CR SPACE HEX DECIMAL to Forth 06a8dca
    8 . to Forth (hide asm .) 12de5b1
    9 DEPTH / .S to Forth (add SP@ primitive) d65ae26
    10 WORDS VER SEE to Forth (add ', >NAME) c908615
    11 repl.sh and see-demo.sh 8c9104a

    The net movement was 18 words out of asm, 3 new asm primitives in, and 19 brand-new Forth words added on top:

    Category Words
    Moved asm → Forth (18) IF, THEN, ELSE, BEGIN, UNTIL, \, (, =, 0=, CR, SPACE, HEX, DECIMAL, ., DEPTH, .S, WORDS, VER
    New asm primitives (3) ['] (needed for Forth IF/THEN to compile BRANCH/0BRANCH at compile time), EOL! (needed for Forth \ to end the input line), SP@ (needed for Forth DEPTH/.S to inspect the stack pointer)
    New Forth words (19) NIP, TUCK, ROT, -ROT, 2DUP, 2DROP, 2SWAP, 2OVER, 1+, 1-, NEGATE, ABS, /, MOD, 0< (lowlevel); ', PRINT-NAME, >NAME, SEE (highlevel)

    The headline numbers after phase 2:

    Category Before After Δ
    asm dictionary entries 65 50 −15
    asm lines (kernel.s) 2852 2239 −613 (−22%)
    Assembled binary bytes 3879 2786 −1093 (−28%)
    Forth colon defs (core/*.fth) 0 37 +37
    Total vocabulary visible at REPL 62 86 +24

    Forth words, by tier:

    Tier Count Words
    minimal.fth 9 BEGIN UNTIL IF THEN ELSE 0= = ( \
    lowlevel.fth 15 NIP TUCK ROT -ROT 2DUP 2DROP 2SWAP 2OVER 0< 1+ 1- NEGATE ABS / MOD
    midlevel.fth 5 CR SPACE HEX DECIMAL .
    highlevel.fth 8 DEPTH .S ' PRINT-NAME WORDS VER >NAME SEE

    SEE SQUARE now prints DUP * ;. SEE CUBE prints DUP SQUARE * ;. The machinery for decompiling a colon definition lives in Forth, because SEE itself is Forth. That’s the self-extending story the all-asm kernel couldn’t tell.

    Why Phase 2 Stopped Where It Did

    Three categories of word resist moving to Forth, and together they explain the ~50 asm primitives left:

    1. Threading-layer primitives are below Forth’s level. NEXT, DOCOL, EXIT, LIT, BRANCH, 0BRANCH, EXECUTE define how threaded code runs. They can’t themselves be threaded code — the CPU has to jump to them.
    2. Some primitives need hardware/ALU/memory access. +, @, !, KEY, EMIT, LED!, SW? ultimately compile to native instructions. Forth can wrap them, but something has to execute the actual add, lw, sw, or memory-mapped UART access.
    3. Bootstrap-phase primitives need to exist before any .fth source loads. WORD, FIND, NUMBER, :, ;, INTERPRET, QUIT are all used by the outer interpreter that reads .fth source. They could be Forth in principle — but only if a smaller bootstrap interpreter runs first. Phase 2 sidesteps the recursion by keeping them in asm.
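Category 1 is easiest to see in a toy model. Below is a minimal Python analogy of an indirect-threaded inner interpreter — illustrative only, not the COR24 implementation; the names mirror the post but the structure is a sketch. The dispatch loop plays the role of NEXT, and it has to be host code: it is the thing that makes threaded code run at all.

```python
# Toy indirect-threaded interpreter (illustrative analogy, not kernel.s).
stack, rstack = [], []

def make_colon(body):                 # a colon def: DOCOL + thread of xts
    return ("colon", body)

def prim(fn):                         # a primitive: native code
    return ("prim", fn)

DUP  = prim(lambda: stack.append(stack[-1]))
STAR = prim(lambda: stack.append(stack.pop() * stack.pop()))

SQUARE = make_colon([DUP, STAR])      # : SQUARE DUP * ;

def execute(xt):
    ip, thread = 0, [xt]
    while ip < len(thread):           # this loop IS "NEXT" -- it cannot
        kind, payload = thread[ip]    # itself be threaded code
        ip += 1
        if kind == "prim":
            payload()
        else:                         # DOCOL: save IP, enter the body
            rstack.append((ip, thread))
            ip, thread = 0, payload
        while ip == len(thread) and rstack:   # EXIT: restore caller's IP
            ip, thread = rstack.pop()

stack.append(5)
execute(SQUARE)
print(stack)                          # [25]
```

The `prim` entries are the irreducible floor: everything in `make_colon` bodies is data the loop walks, but the loop and the lambdas are host-level code.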

    Phase 3 doesn’t dodge category 3. It attacks it head-on.

    What the Web Port Taught Us

    Building the phase 2 tab in web-sw-cor24-forth — alongside the phase 1 forth.s tab, and now joined by a phase 3 forth-on-forthish tab — surfaced two categories of learning: performance and vocabulary.

    The performance thread spans three hash functions, a 1-entry lookaside cache, an adaptive web pump-loop, and a build-time bootstrap snapshot (infrastructure shipped but gated off for measurement cleanliness). The vocabulary thread surfaced when the phase-2 tab’s Forth sat side-by-side with standard Forth idioms in teaching material. Both threads shipped fixes, some explicit deferrals, and one feature-flagged fast-path that stays off until the kernel-side work finishes.

    Making It Fast, Part 1: Hashing FIND — Three Attempts

    The obvious suspect for slow bootstrap was FIND: a linear O(N) walk of the LATEST link chain, called for every token in every .fth line. At 90 dictionary entries (50 asm + 40 Forth colon defs), the constant factor should add up. That hypothesis drove three successive attempts, documented in detail in docs/hashing-analysis.md.

    A glance at the first-letter distribution explains why the first attempt was in trouble:

    First char Words (count)
    S SWAP STATE SW? SP@ SEE-CFA SEE SPACE (7)
    E EMIT EXIT EXECUTE EOL! ELSE (5)
    D DROP DUP DEPTH DUMP-ALL DECIMAL (5)
    C C@ C! C, CREATE CR (5)
    B BRANCH BASE BYE BEGIN (4)
    2 2DUP 2DROP 2SWAP 2OVER (4)

    Only ~43 distinct first-letter classes across 90 words. Any scheme that hashes on first char alone saturates long before 256 buckets.

    Attempt 1 — First-char buckets (shipped)

    A 256-bucket first-character table (tracked in sw-cor24-forth#1, commit a3a63f0). Populated at _start by walking LATEST newest-first, maintained by do_create on every new header, with linear fallback on bucket miss. Correctness held — all reg-rs tests pass, SEE, DUMP-ALL, every example produced identical UART output.

    The measurement was humbling:

    CLI speedup on fib-demo compile: ~0% within timestamp resolution. cor24-run reports instruction timestamps rounded to 10K. Last UART TX for fib complete: 61.17M inst with hash vs 61.17M inst without.

    Profiling showed why. FIND is only ~0.3% of compile time. The other 99.7% splits between KEY’s UART busy-wait (spinning while cor24-run delivers the next input byte) and the threaded-code overhead of Forth-defined IMMEDIATE words (IF, BEGIN, UNTIL, \, (). Shrinking FIND from ~250 inst to ~50 inst per lookup saves ~200K inst, which disappears into the 61M total.

    Still, WASM might behave differently. And with EMIT/EXIT, OVER/OR, and similar first-letter twins all sharing buckets, the fallback was doing more work than it needed to. Time for a better hash.

    Attempt 2 — len-seeded mult33 (shipped, with a detour)

    An offline collision analysis ran nine hash functions against all 90 known dictionary words, at bucket sizes 64/128/256/512. Full data lives in docs/hashing-analysis.md; the summary:

    Hash function 64 128 256 512
    first_char 47 47 47 47
    len + first + last 47 34 34 34
    len*31 + first + last 42 28 17
    djb2 44 29 17
    mult33 (no seed) 44 31 21
    fnv1a 44 28 17
    len-seeded mult33 34 25 11 9
    2-Round XMX 23 15 8

    Len-seeded mult33 (h = length; for c in name: h = h*33 + c) won at 256 buckets with 11 collisions — a 35% improvement over the closest classical competitor. The length seed perturbs the initial state so short words spread out early in the iteration.
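The winner is small enough to sketch in Python as a cross-check. The word list below is a short illustrative sample — the full 90-word analysis lives in docs/hashing-analysis.md, so the collision count printed here applies to this sample only:

```python
MASK24 = (1 << 24) - 1

def len_seeded_mult33(name: str) -> int:
    h = len(name)                     # length seed perturbs initial state
    for c in name.encode():
        h = (h * 33 + c) & MASK24     # classic *33 step, 24-bit truncated
    return h

def collisions(words, buckets):
    seen = {}
    for w in words:
        seen.setdefault(len_seeded_mult33(w) % buckets, []).append(w)
    return sum(len(v) - 1 for v in seen.values())

# Sample drawn from the first-letter table above, not the full dictionary.
sample = ["DUP", "DROP", "SWAP", "OVER", "EMIT", "EXIT", "EXECUTE",
          "2DUP", "2DROP", "IF", "THEN", "ELSE", "BEGIN", "UNTIL"]
print(collisions(sample, 256))
```

Note how the length seed separates EMIT from EXIT before the first character is even mixed in.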

    The rollout itself was instructive. Commit 485f36f landed mult33 without the full example-suite check and broke the web agent. Commit ab9817f reverted. Commit 9bd4b10 re-landed it properly — all 15 examples byte-identical to the first-char version on CLI, then tested on WASM. WASM verdict: works, but wall-clock still not fast enough. A better hash doesn’t rescue a cold-boot that spends the majority of its time not in FIND at all.

    That measurement effectively closed out hashing as a standalone fix. If bootstrap speed mattered on WASM — and it did, because the “forth-in-forth” tab felt visibly slower than the forth.s tab — something more fundamental than a hash swap was needed. The “Build-Time Bootstrap Dump” section below describes that answer.

    Attempt 3 is still worth running, though, for reasons specific to this ISA.

    Attempt 3 — 2-Round 24-bit XMX (shipped)

    The updated docs/hashing.txt design notes — a Gemini-assisted survey of 2025–2026 hashing research — surface three recent developments that change the tradeoffs:

    • Krapivin’s optimal open addressing (2025). Probe sequence keeps lookups near-constant-time even at 99% table occupancy. Probe 2 becomes (index + (hash >> 12) + 1) & mask instead of +1 — a tiny asm change that avoids the clustering cliff classical linear probing hits when tables fill.
    • Learned / data-aware hashing. For a static Forth core with a known vocabulary at build time, a perfect-hash-function generator can emit a hash with zero collisions on the core dictionary, lookup collapsing to a single multiply-shift.
    • SSHash cache-locality hashing (2024–2026). Order-preserving hashing for short strings (Forth word names are shaped like bioinformatics k-mers). Keeps related words physically close in RAM so the CPU prefetcher stays effective.
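The probe tweak from the first bullet is small enough to model. A Python sketch of the modified probe sequence (the shipped table still uses linear fallback on bucket miss; this only illustrates the design note):

```python
MASK = 255                                  # 256-bucket table

def probes(h, n=4):
    # Probe 1 is the bucket itself; later probes step by a hash-derived
    # stride, (h >> 12) + 1, instead of the classical +1. Sketch of the
    # Krapivin-inspired tweak described above, not the shipped asm.
    idx = h & MASK
    out = [idx]
    for _ in range(n - 1):
        idx = (idx + (h >> 12) + 1) & MASK
        out.append(idx)
    return out

# Two hashes that collide in the same bucket diverge on the second probe:
print(probes(0x00ABCD))
print(probes(0xFFABCD))
```

Because the stride is derived from the high bits of the hash, two colliding keys walk different probe chains instead of fighting over the same run of slots.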

    For COR24’s constraints — 24-bit words, ~8 GPRs, sometimes no hardware multiplier — the pick was 2-Round 24-bit XMX (Xor, Multiply, Xor), which shipped in commit fdae7dd:

    \ R0 = running hash (24 bits, native word size)
    \ R1 = next character (or temp during avalanche)
    \ R2 = MAGIC = 0xDEADB5, loaded once
    \ Per character:
    XOR              \ R0 ^= R1            (mix char into hash)
    24_BIT_MUL       \ R0 *= R2            (native 24-bit truncation)
    DUP 12 RSHIFT    \ R1 = R0 >> 12
    XOR              \ R0 ^= R1            (spread high bits into low bits)
    

    Two registers, no overflow waste (every bit of the 24-bit GPR carries signal), and the h ^ (h >> 12) avalanche step is the most bit-distributing operation tested. In the collision analysis XMX tied mult33’s worst-bucket depth at 256 (3) and beat it at 512 (2 vs 3). Per-character cost: ~10 COR24 ops vs ~4 for mult33 (roughly 2.5× slower per char), but for typical 4-character word names that’s ~24K extra instructions across a full bootstrap — noise against 61M.
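A Python model of the same per-character step makes the avalanche easy to inspect. The initial state is an assumption here — the listing above only shows the per-character loop — and MAGIC is taken from the comment in the asm:

```python
MASK24 = (1 << 24) - 1
MAGIC = 0xDEADB5          # from the asm listing above

def xmx24(name: str) -> int:
    h = 0                 # assumption: zero initial state
    for c in name.encode():
        h ^= c                        # XOR: mix char into hash
        h = (h * MAGIC) & MASK24      # MUL: native 24-bit truncation
        h ^= h >> 12                  # XMX avalanche: spread high -> low
    return h
```

The final `h ^ (h >> 12)` step is invertible on 24 bits (the top 12 bits pass through unchanged), so the avalanche never collapses distinct states into one.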

    All 15 example files and scripts/see-demo.sh produce byte-identical UART output vs the first_char baseline. Verified correctness, shipped, moved on.

    Making It Fast, Part 2: A 1-Entry Lookaside Cache

    A better hash function still does compute_hash → bucket probe → name compare for every token. Most colon-def bodies repeat words back-to-back (DUP DUP, DROP DROP, a word used twice in the same definition). Why recompute?

    Commit 4ea2f79 added a 1-entry lookaside cache (classic memento pattern). After every successful FIND, the kernel stashes a single triple — (full 24-bit XMX hash, CFA, flag) — in fixed memory. The next FIND that produces the same full hash skips the bucket probe and the name compare entirely, pushes the cached (cfa, flag), and returns.

    Property Choice Why
    Cache size 1 entry Simplest possible memento; covers the “same word twice” pattern which is the common case.
    Cache key Full 24-bit XMX hash (not just the 8-bit bucket index) 24-bit keyspace is effectively collision-free across a 90-word dict. False positives (returning the wrong CFA on a spurious hit) are astronomically unlikely.
    Cache update In find_push_flag just before the NEXT jump Reads flag + CFA off the data stack via mov fp, sp; lw rX, off(fp) without disturbing DS. Three sws to store flag/cfa/hash.
    Cache NOT-FOUND? No Would cause incorrect stale hits when the user later defines the previously-failed word. Only successful lookups are cached.
    Invalidation Implicit — cfa=0 slot treated as empty; overwritten on next successful FIND Simple and correct; a user-defined FORGET that removes the cached word would need to clear the slot, but that isn’t currently implemented.

    Binary size went from 3893 → 3981 bytes (+88 bytes of asm). All 15 example files and scripts/see-demo.sh remained byte-identical.

    CLI measurement once again showed no improvement — cor24-run timestamps quantize to 10,000 cycles, and the per-FIND savings (~30–50 inst per cache hit, ~15–25K across ~1000 lookups) are below that resolution. But this is a measurement-infrastructure limitation, not evidence the cache does nothing: WASM wall-clock has millisecond resolution over a multi-second bootstrap, and that’s where the cumulative savings of hash + lookaside become visible.
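The cache logic fits in a few lines of Python as an analogy (the shipped version is three sw instructions in kernel.s, keyed on the full 24-bit XMX hash; the helper hash and dict below are stand-ins):

```python
def h24(name):                          # stand-in hash (len-seeded *33)
    h = len(name)
    for c in name.encode():
        h = (h * 33 + c) & 0xFFFFFF
    return h

cache = {"hash": None, "cfa": None, "flag": None}

def find(name, dictionary):
    h = h24(name)
    if cache["hash"] == h:              # hit: skip probe + name compare;
        return cache["cfa"], cache["flag"]   # a false hit needs a full
                                             # 24-bit collision -- rare
    entry = dictionary.get(name)        # stand-in for the bucket probe
    if entry is None:
        return None                     # NOT-FOUND is never cached: a
                                        # later definition of this name
                                        # would make a stale hit wrong
    cache.update(hash=h, cfa=entry[0], flag=entry[1])
    return entry

words = {"DUP": (0x123, 0), "IF": (0x456, 1)}
assert find("DUP", words) == find("DUP", words)   # second call is a hit
```

The NOT-FOUND rule matters more than it looks: caching misses would break the moment a user defines the previously-unknown word.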

    Implementation history

    Commit Hash Notes
    a3a63f0 first_char First hash landed. 47 collisions. Poor distribution but correct.
    485f36f len-seeded mult33 First try at a better hash. Pushed without full test suite; web agent reported broken.
    ab9817f (revert) Reverted to first_char after bug report.
    9bd4b10 len-seeded mult33 Re-landed after all 15 examples went byte-identical. WASM-tested: works, still not fast enough.
    fdae7dd 2-Round XMX Per hashing.txt recommendation for 24-bit GPR ISAs. Shipped.
    4ea2f79 XMX + 1-entry lookaside Memento-pattern cache on top of XMX; +88 bytes. Shipped.

    Making It Fast, Part 3: The Web Tab Goes Snappy

    With the kernel-side hash + lookaside work landing, the web side had its own journey. The web agent tried two approaches in parallel — one shipped disabled, the other turned out to be the real winner.

    The adaptive pump-loop (shipped, the actual fix)

    web-sw-cor24-forth/src/repl.rs runs the emulator in batches between UART byte feeds. The old scheme was a fixed 20k instructions per byte — but for cheap-byte cases (where a single input byte triggers maybe 500 instructions of compile work before the next KEY poll), that meant burning ~19,500 cycles spinning in key_poll waiting for the next byte that the scheduler hadn’t delivered yet.

    Commit f757800 reworked the pump to inspect the CPU’s PC each iteration and adapt:

    Knob Old New Why
    Sub-batch size Fixed 20k inst PUMP_TINY = 2k when PC is at a key_poll with bytes to feed; PUMP_BIG = 50k elsewhere Stop wasting cycles spinning in key_poll on cheap bytes; let real compile work run longer when it has actual work to do
    Tick batch BOOTSTRAP_BATCH = 500k BOOTSTRAP_BATCH = 600k per tick Small bump; more work per scheduler wake
    Tick interval TICK_MS = 25 everywhere TICK_MS_BOOT = 5 during bootstrap; TICK_MS_INTERACTIVE = 25 once ready Cut scheduler overhead during the one phase where it matters

    The pump rework was the biggest single win. Combined with the kernel-side XMX + lookaside work, it dropped the phase-2 tab’s cold-bootstrap from ~10 seconds to subjectively instant.
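The batch-size decision itself is tiny; a Python sketch of it (the shipped logic lives in Rust in web-sw-cor24-forth/src/repl.rs; the constants come from the table, the at-key_poll predicate is a stand-in):

```python
PUMP_TINY, PUMP_BIG = 2_000, 50_000   # sub-batch sizes from the table

def batch_size(pc, key_poll_pc, byte_pending):
    # Tiny batch while the CPU spins at key_poll with input waiting, so
    # the next byte lands quickly; big batch when real compile work runs.
    if pc == key_poll_pc and byte_pending:
        return PUMP_TINY
    return PUMP_BIG

assert batch_size(0x40, 0x40, True) == PUMP_TINY   # spinning, byte ready
assert batch_size(0x88, 0x40, True) == PUMP_BIG    # real work in flight
```

All the win comes from not spending ~19,500 of every 20,000 instructions inside key_poll.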

    The build-time snapshot (infrastructure shipped, gated off)

    The snapshot idea — run the cold bootstrap natively at build time, embed a 64 KB memory + registers blob via include_bytes!, restore on load — is actually implemented: build.rs does the native bootstrap and writes fif_snapshot.bin, src/snapshot.rs parses and restores it, a localStorage cache keys on a content hash of kernel.s + core/*.fth so edits auto-invalidate.

    But it’s shipped with a runtime feature flag, SNAPSHOT_CACHE_ENABLED = false. The reason is honest: with the pump-loop fix alone making the tab feel instant, turning on the snapshot would contaminate kernel-side perf measurements. Any future change to the hash, lookaside, or threading-layer primitives needs to be benchmarked against the slow-path boot, not the pre-warmed one. The flag flips on once the kernel-side optimization work is fully wrapped.

    This also means the originally-planned CLI pre-load-and-dump-to-binary is now formally deferred. The rationale, recorded in docs/plan.md: it’s the biggest expected WASM win, but the same deliverable — a kernel that starts in the ready state, without replaying bootstrap — is exactly what phase 4 (forth-from-forth/) produces as its build artifact. Two paths to the same destination; doing both is wasteful. Revisit if the hash + lookaside + pump-loop stack proves insufficient once the snapshot flag is flipped on.

    The speedups that shipped, stacked

    Speedup Mechanism Where it helps Status
    First-char hashed FIND 256-bucket table + _start populate Any host, marginal on CLI Shipped (a3a63f0); CLI 0% gain
    Len-seeded mult33 hash Drop-in compute_hash subroutine Any host, marginal on CLI Shipped (9bd4b10 after revert ab9817f); WASM still slow
    2-Round 24-bit XMX hash ~10 ops/char XMX avalanche WASM (cheaper bit ops) and denser dictionaries Shipped (fdae7dd)
    1-entry FIND lookaside cache Memento keyed by full 24-bit hash Same-word-twice patterns in compile Shipped (4ea2f79); +88 bytes
    Adaptive web pump-loop PC-aware PUMP_TINY / PUMP_BIG batches; shorter boot tick Web bootstrap; “biggest single win” Shipped (f757800)
    Build-time snapshot + localStorage cache build.rs + snapshot.rs in web crate Web, skipping cold boot entirely Shipped, gated (SNAPSHOT_CACHE_ENABLED=false)
    CLI pre-load-and-dump-to-binary Native bootstrap → .bin → cor24-run --load-state CLI scripts, CI Deferred — phase 4 produces the same artifact

    Net effect: the live phase-2 tab boots as fast as the phase-1 tab now, even though it’s still doing the full “language builds itself” cold bootstrap — the snapshot fast-path isn’t even on.

    A measurement footnote

    CLI perf numbers look identical across all hash variants. cor24-run reports instruction timestamps quantized to 10,000 cycles; the per-FIND savings of XMX (~200 inst × 1000 lookups) and the lookaside (~30–50 inst × dozens of hits) both land below that resolution. The four commits of the CLI iteration — a3a63f0 → 9bd4b10 → fdae7dd → 4ea2f79 — all report 61.17M instructions for fib-demo compile. That’s not the optimizations doing nothing; it’s the measurement infrastructure lacking the resolution to show it. WASM wall-clock at millisecond granularity over a multi-second boot is the authoritative metric, and there the stacked speedups are very visible.

    The Vocabulary Feels Thin — and Fills In

    FIB and the existing examples already worked on forth-in-forth before any of this — nothing was missing for correctness. What the web tab made obvious, once the phase-2 REPL sat next to standard Forth idioms in teaching material, was how much more ergonomic the same demos would read with a fuller vocabulary.

    The FIB print loop used to look like:

    : FIB ... ;
    : FIBS 0 BEGIN DUP FIB . 1 + DUP 21 = UNTIL DROP ;
    FIBS
    

    Every hand-rolled BEGIN/UNTIL counter is a small tax. In a fuller Forth the same thing reads as:

    : FIB ... ;
    21 0 ?DO I FIB . LOOP
    

    Not shorter by much — but no setup, no sentinel variable, no DROP at the end. Several files in forth-in-forth/examples/ collapsed to one-liners once the vocabulary filled in.

    The additions shipped into both the phase-2 and phase-3 kernels (scoped there — the phase 1 forth.s kernel stays as-is for its existing users):

    Group Shipped Landed in How
    Extra BEGIN-style flow AGAIN, WHILE, REPEAT 3b4f541 Pure Forth in core/minimal.fth, built on 0BRANCH/BRANCH. No new primitives.
    Defining words CONSTANT, VARIABLE 3b4f541 Pure Forth in core/lowlevel.fth, layered on CREATE + ,DOCOL + LIT. DOES> parked for later.
    Counted loops DO, LOOP, ?DO, I, UNLOOP 92cef7f New RS primitives (DO), (LOOP), (?DO), I, UNLOOP in kernel; IMMEDIATE Forth wrappers in core/lowlevel.fth. Matching Forth examples 15-again.fth through 19-do-loop.fth.

    RS layout inside a DO loop body:

    top    [ index ]
           [ limit ]
    deeper [ caller IP ]
    

    Standard Forth convention — UNLOOP must precede an EXIT from inside a loop to restore the caller’s IP. The (LOOP) and (?DO) primitives stash the IP in the frame-pointer register during the compare, because this ISA’s ceq rejects fp as an operand and sw rejects fp as a source; that frees r2 as a scratch register for the limit/index work.
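A toy Python model of that frame discipline shows the layout in action — an analogy, not the kernel code; (DO) pushes limit then index, I copies the top, and (LOOP) bumps and compares:

```python
# Toy model of the return-stack frame that (DO)/(LOOP) maintain.
rstack = []

def do_runtime(limit, index):
    rstack.append(limit)               # deeper slot
    rstack.append(index)               # top of RS

def loop_runtime():
    rstack[-1] += 1                    # bump index
    if rstack[-1] == rstack[-2]:       # index reached limit?
        rstack.pop(); rstack.pop()     # drop the frame (UNLOOP's job
        return False                   # when exiting early) and fall
    return True                        # through; else branch back

# 21 0 ?DO I ... LOOP  -- collect I on each iteration:
out = []
do_runtime(21, 0)
while True:
    out.append(rstack[-1])             # I: copy index off the RS top
    if not loop_runtime():
        break
print(out[:5], "...", out[-1])
```

An early EXIT from inside the body would leave limit/index stranded above the caller IP — exactly why UNLOOP must pop the frame first.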

    A handful of additional conveniences (+LOOP, J, LEAVE, DOES>, RECURSE, PICK, ROLL, ?DUP, MIN/MAX, <=/>=/<>) are left for follow-up work. What’s shipped is enough for the demos the browser tab shows side-by-side with the phase-1 kernel.

    Live demos in the web UI (AGAIN, CONSTANT, DO LOOP, VARIABLE) are already wired into both the phase-2 and phase-3 tabs, sharing one demo constant via the refactored component in src/repl.rs.

    The general lesson: a language that only feels thin once it’s compared against a fuller one benefits from that comparison. Good that it surfaced before phase 3 cemented the primitive set.

    Phase 3: forth-on-forthish — First Subsets Shipping

    ./forth-on-forthish/ scaffolded with a copy of phase 2’s kernel and core — the current phase-2 kernel with XMX hash + 1-entry lookaside carried forward (commit 4f5e8ab), verified byte-identical to baseline on all 15 examples. Phase 3 work starts on the optimized substrate, not the pre-hash version. Then the first two subsets landed on top of it:

    • Subset 12 (79f4350): the ,DOCOL primitive. Wraps the existing do_colon_cfa as a named dict entry, exposing the 6-byte far-CFA template emission so a Forth : can build headers without touching asm. A first attempt at Forth : / ; in a new core/runtime.fth also landed and was reverted — it hit the classic SMUDGE-bit problem, where the ; at the end of : ; ... ; IMMEDIATE resolves to the in-progress new ; because FIND has no way to skip “being-compiled” entries. Three unblocking options were documented: an asm tweak to :/; that sets/clears HIDDEN, dedicated HIDE-LATEST/UNHIDE-LATEST primitives, or modifying CREATE to always hide.
    • Subset 13 (a98b4b8): Forth : and ; shipping. Went with the “asm sets/clears HIDDEN inline” option — colon_thread now runs do_hide_latest between do_colon_cfa and do_rbrac (sets bit 6 on the new entry so FIND skips it during the rest of the definition). do_semi clears HIDDEN on LATEST before compiling EXIT and zeroing STATE. A new core/runtime.fth tier, loaded first (before minimal.fth), defines:
    : : CREATE ,DOCOL LATEST @ 3 + DUP C@ 64 OR SWAP C! ] ;
    : ; ['] EXIT , LATEST @ 3 + DUP C@ 191 AND SWAP C! 0 STATE ! ; IMMEDIATE
    

    No \ comments in runtime.fth: \ is defined in minimal.fth, which loads after. An initial draft that included comments parsed them as code.

    All 15 examples/*.fth produce the same functional behavior as the first-char hash baseline; the only new UART output is two extra " ok" lines for the two new runtime.fth definitions. The phase-3 kernel now has Forth : and Forth ; — every new colon definition from here on uses the Forth implementations.
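The HIDDEN-bit dance is easy to model. A toy Python sketch of why FIND must skip the in-progress entry while ; is being redefined (illustrative; the real FIND walks link fields and flag bytes, not a Python list):

```python
# Toy model of the SMUDGE/HIDDEN problem from subsets 12-13.
HIDDEN = 0x40                          # bit 6, as in the post

dictionary = []                        # newest last; FIND walks newest-first

def create(name):
    dictionary.append({"name": name, "flags": HIDDEN})   # born hidden
    return dictionary[-1]

def reveal(entry):
    entry["flags"] &= ~HIDDEN          # what do_semi does on LATEST

def find(name):
    for e in reversed(dictionary):     # newest-first walk
        if e["name"] == name and not e["flags"] & HIDDEN:
            return e
    return None

old_semi = create(";"); reveal(old_semi)   # the existing asm ;
new_semi = create(";")                     # Forth ; being compiled
assert find(";") is old_semi               # FIND skips the hidden entry
reveal(new_semi)
assert find(";") is new_semi               # after do_semi unhides it
```

Without the HIDDEN bit, the second `find(";")` during compilation would resolve to the half-built new entry — the exact failure that forced the subset-12 revert.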

    The remaining subsets push further into the primitive set:

    Specific moves enabled by the new substrate:

    Word(s) Strategy New primitive needed Status
    : and ; : : CREATE ,DOCOL ... ] ; plus a tricky ; that compiles EXIT and toggles STATE; both flip HIDDEN inline on LATEST ,DOCOL + HIDDEN-bit handling in colon_thread / do_semi Shipped (subsets 12, 13)
    WORD Forth loop over KEY into a known word buffer WORD-BUFFER (or a fixed address) Planned
    FIND Walk LATEST @ with @, C@, =, AND None — uses existing primitives Planned
    NUMBER Digit-parsing on top of *, +, <, BASE @ None Planned
    INTERPRET / QUIT BEGIN ... UNTIL loops over WORD / FIND / EXECUTE / NUMBER None Planned
    *, /MOD, - +-loops or NEGATE + None — can drop the asm versions Planned
    AND / OR / XOR Derivations from a single bit-primitive NAND (replaces 3 primitives with 1) Planned
    DUP / SWAP / OVER / >R / R> SP@-based memory operations on the data stack SP!, RP@, RP! (already have SP@) Planned

    After the refactor the irreducible asm primitives are approximately:

    NEXT  DOCOL  EXIT  LIT  BRANCH  0BRANCH  EXECUTE
    +  NAND  @  !  C@  C!  KEY  EMIT  SP@  RP@  SP!  RP!
    LED!  SW?  HALT
    

    About 22 primitives, ~600–800 asm lines (vs. ~2240 today). Projected progression:

    Approach asm lines Forth lines asm primitives Self-hosting
    1: all-asm ~2983 0 ~65 100% asm
    2: today 2239 161 50 93% asm
    3: forth-on-forthish ~700 ~600 ~22 54% asm
    4: forth-from-forth 0 hand-written ~1000 Forth ~22 emitted 0% hand-written asm

    The Phased Plan

    Phase 3 breaks into subsets the same way phase 2 did:

    Subset Size Scope Status
    12 small Add ,DOCOL primitive Shipped (79f4350)
    13 medium Forth : and ; via core/runtime.fth + inline HIDDEN-bit management Shipped (a98b4b8)
    14 medium Add SP!/RP@/RP!; move DUP/SWAP/OVER/>R/R> to Forth Next
    15 medium Move *, /MOD, - to Forth as loops; AND/OR/XOR from a new NAND primitive Planned
    16 large Move WORD/FIND/NUMBER/INTERPRET/QUIT to Forth — after this, kernel matches approach 3 (~700 asm lines) Planned

    Subset 16 is the scary one. The outer interpreter written in Forth is slow — every text token goes through Forth-coded dictionary walking instead of asm. Estimates: ~10× slower text-input path, but compiled colon definitions run at nearly the same speed.

    Known Tradeoffs

    Phase 3 isn’t free. The comparison from phase 2 to phase 3:

      Phase 2 (today) Phase 3 (target)
    Asm lines to maintain 2239 ~700
    Asm primitive count 50 ~22
    WORD/FIND speed asm (fast) Forth (~10× slower)
    : and ; speed asm Forth (slightly slower compile)
    Bootstrap complexity Low Higher — careful .fth load ordering required
    Retargeting effort Rewrite ~2240 lines of asm Rewrite ~700 lines of primitives + rebuild

    The payoff is dramatic: the kernel becomes easy to retarget to a different ISA, the language story becomes much cleaner (Forth doing Forth’s job), and phase 4 becomes tractable because the primitive set is already small and orthogonal.

    Phase 4: forth-from-forth — On the Horizon

    ./forth-from-forth/. Write a Forth-to-COR24-asm compiler in Forth. Run it on a host Forth (either a separate Forth, or phase 3’s kernel) to emit kernel.s. After bootstrap, no hand-written .s exists; kernel.s is a build artifact.

    The cross-compiler has three pieces:

    • Instruction encoder: each COR24 opcode → bytes.
    • Primitive registry: each Forth primitive defined as a small Forth word that emits the asm body. E.g. : prim-+ asm-pop-r2 asm-pop-r0 asm-add-r0-r2 asm-push-r0 asm-next ;.
    • Linker: lays out the dict chain and writes the final .s.

    This is the standard pattern behind eForth, JonesForth, and several ITSY-style projects. Roughly 500–1000 lines of cross-compiler Forth plus a runtime specification.
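A sketch of what the primitive-registry piece could look like, written in Python as a stand-in for the eventual cross-compiler Forth. The mnemonics below are illustrative, not verified COR24 syntax:

```python
# Sketch of a primitive registry: each primitive is an entry that emits
# its own asm body as text (stand-in mnemonics, hypothetical names).
def asm(lines):
    return "\n".join(lines)

PRIMITIVES = {
    "+": lambda: asm([
        "pop r2",        # second operand off the data stack
        "pop r0",        # first operand
        "add r0, r2",    # the one native op Forth can't express itself
        "push r0",
        "next",          # fall into NEXT
    ]),
}

def emit_kernel(prims):
    out = []
    for name, body in prims.items():
        out.append(f"; primitive {name}")
        out.append(body())
    return "\n".join(out)

print(emit_kernel(PRIMITIVES))
```

Retargeting then really is just swapping the strings each registry entry emits — the dictionary layout and linker logic stay untouched.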

    At that point the kernel’s .s becomes a build artifact, not source. Retargeting to a different ISA means swapping the instruction encoder module. The self-hosting story goes from “Forth is written in asm, except for the words that aren’t” to “Forth is written in Forth, including the compiler that produces the kernel.”

    Estimated work from phase 3 to phase 4: ~2–3 weeks. Risk: medium — instruction-encoding bugs are silent.

    Why Ship the Phases As Separate Directories?

    Each phase is a snapshot. ./ is the canonical reference (stays untouched). ./forth-in-forth/ is today. ./forth-on-forthish/ is where work happens next. ./forth-from-forth/ is future. Keeping them as sibling directories means:

    • Regression tests for the original kernel keep passing against ./.
    • The web frontend can keep pointing at ./ while the next phase stabilizes.
    • Each phase documents its own subset ordering and status (e.g., forth-in-forth/docs/status.md).
    • The comparison tables from phase to phase stay honest — you can diff the asm line counts, binary sizes, and primitive tables directly.

    It also matches the language-building pattern used across other COR24 languages: build a reference, keep it, and iterate new variants beside it.

    Vibe Coding the Migration

    Every subset in phase 2 was a short conversation: “Here’s the current kernel; move = and 0= to Forth, deriving them from XOR. Add a minimal.fth line and a test.” An AI agent proposed the edits, I reviewed and ran the regression harness, and the subset shipped as one commit. Eleven subsets in a day. That pace is only possible because each move is small, each test is fast, and the kernel stays buildable at every step.

    The reward for that discipline is visible in the commits: every subset is a single logical change, every status.md update is a diff, and SEE FIB on the REPL reads back the Forth definition the AI agent wrote an hour earlier. Forth’s self-extending nature and vibe coding’s tight loop fit each other well — the language is already expected to grow incrementally, and the agent’s output is exactly one .fth addition at a time.

    What to Watch Next

    • forth-on-forthish/ subset 14 — stack-pointer primitives (SP!, RP@, RP!) and moving DUP/SWAP/OVER/>R/R> into Forth on top of SP@.
    • The first visible win in phase 3: kernel.s drops below 2000 lines. Likely around subset 15.
    • Subset 16 — the big one. WORD/FIND/NUMBER in Forth; asm line count drops by hundreds.
    • Eventually, ./forth-from-forth/ gets scaffolded, and the question becomes which Forth hosts the first cross-compile run.

    Hashing References

    In-repo docs for the attempt sequence: docs/hashing-analysis.md (measurement-driven comparison of 9 hash functions) and docs/hashing.txt (Gemini-assisted survey of classical through 2025–2026 techniques).

    Key external references:

    Topic Link Why it matters here
    Krapivin optimal open addressing (2025) Quanta Magazine · arXiv:2501.02305 New probe sequence keeps hash tables near-constant-time to 99% fill. Directly informs the secondary-probe formula in the attempt-3 design.
    Perfect hash functions Wikipedia · CMPH library · GNU gperf Build-time generator for zero-collision lookup over the static kernel vocabulary (~90 words).
    Learned index structures (2018) Kraska et al., “The Case for Learned Index Structures” Foundational paper on replacing static hash functions with data-aware models. Inspires the “one hash for the known set, another for user defs” split.
    SSHash (order-preserving short-string hashing) jermp/sshash Cache-local hashing for short strings — Forth word names are the same shape as bioinformatics k-mers.
    xxHash / XXH3 xxhash.com · Cyan4973/xxHash Current speed gold standard for non-cryptographic hashing. Benchmark baseline even when we can’t use it directly (its wide internal state needs more registers than the 24-bit GPR-limited ISA offers).
    FNV-1a Fowler/Noll/Vo hash — Wikipedia Classic short-string hash; one of the attempt-2 candidates, tied for second at 17 collisions.
    djb2 hash Dan Bernstein cdb docs · hash discussion Another attempt-2 candidate; the xor variant, h = h*33 ^ c. Inspired the len-seeded mult33 winner.
    PJW / ELF hash PJW hash — Wikipedia Historical precedent for shift-based rolling hashes used in compilers and linkers.
    JonesForth git.annexia.org/jonesforth Reference Forth implementation covering dictionary layout tradeoffs.
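For concreteness, here is what a length-seeded mult33 hash could look like on a 24-bit word. The post doesn't reproduce the attempt-2 winner's exact formula, so seeding with the name length and masking every step to 24 bits are assumptions:

```python
# Length-seeded multiply-by-33 hash over a word name, kept in a 24-bit word.
MASK24 = (1 << 24) - 1

def mult33(name):
    h = len(name)                     # seed with the name's length
    for ch in name.encode("ascii"):
        h = ((h * 33) ^ ch) & MASK24  # djb2-style xor step, masked to 24 bits
    return h
```

Seeding with the length means same-spelling prefixes of different lengths ("DUP" vs "DU") start from different states, which is exactly the collision pressure short Forth names create.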

    Resources

    Project GitHub Live Demo
    Forth Interpreter (CLI) sw-embed/sw-cor24-forth
    Web Demo (browser UI for the interpreter above) sw-embed/web-sw-cor24-forth COR24 Forth
    Issue: hashed dictionary #1
    Issue: DO/LOOP etc. #2
    Approach Comparison Doc docs/future.md
    Phase 2 Status forth-in-forth/docs/status.md
    COR24 Assembler sw-embed/sw-cor24-assembler
    COR24 Demo Hub sw-embed/web-sw-cor24-demos Demo Hub

    Forth sketches itself the way Escher’s hands do — each version a clean line drawing, each one pointing at the next.

    Part 3 of the Embedded series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1899 words · 10 min read

    Saw (7/?): Prolog, Many-Agent Isolation, Self-Hosting Assembler, and MLPL

    Seventh Sharpen the Saw update. Last time the theme was independence—agents coordinating without stepping on each other, tools testing other tools, compilers vendoring their dependencies. This week the theme is controlled scale: adding more agents, more languages, and more layers of the stack, but with infrastructure that keeps growth reliable instead of chaotic.

    Four threads, one idea: the way to scale vibe-coding isn’t to run harder—it’s to build the platform underneath so that running harder stays safe.

    Why Sharpen the Saw? — The name comes from Covey’s Habit 7: stop cutting long enough to sharpen the blade. This series tracks weekly investment in the tools themselves—agent orchestration, testing infrastructure, compiler toolchains—so the feature work on top goes faster.

    Resource Link
    Rust-to-Prolog Demos sw-vibe-coding.github.io/rust-to-prolog
    Repos & Live Demos Table below
    Language-Building Pattern language-building-tech.md
    Prior Post Saw (6/?): Agent Coordination, Fuzzing Tests, Vendoring, and Emacs Graphics
    Comments Discord

    Rust-to-Prolog: From Lion and Unicorn to a Full Demo Set

    “The Lion lies on Mondays, Tuesdays, and Wednesdays… the Unicorn lies on Thursdays, Fridays, and Saturdays…” Smullyan’s Alice-in-the-Forest-of-Forgetfulness puzzles are a canonical showcase for Prolog: facts about when each creature tells the truth, rules for what a statement implies given the day, and a query—“what day is it?”—that the engine answers by backward-chaining through the constraints. No procedural search code; just facts, rules, and unification. That’s the puzzle behind this week’s image—and liar.pl in the demo set.
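The day-of-week constraint is small enough to brute-force outside Prolog, too. A Python sketch of the classic version, in which both creatures claim "yesterday I was lying" (the exact statements in liar.pl may differ):

```python
# Brute-force the day-of-week puzzle: keep the days on which both
# statements are consistent with each creature's lying schedule.
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
LION_LIES = {"Mon", "Tue", "Wed"}
UNICORN_LIES = {"Thu", "Fri", "Sat"}

def consistent(day):
    yesterday = DAYS[(DAYS.index(day) - 1) % 7]
    for lying_days in (LION_LIES, UNICORN_LIES):
        claim = yesterday in lying_days        # "yesterday I was lying"
        telling_truth = day not in lying_days
        if claim != telling_truth:             # liars may only assert falsehoods
            return False
    return True

solutions = [d for d in DAYS if consistent(d)]
print(solutions)                               # → ['Thu']
```

Prolog gets the same answer by unification and backtracking instead of an explicit loop, which is the point of the demo.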

    Rust-to-Prolog (live demo) is a reference Prolog implementation in Rust. The in-browser demo ships a curated set of classic Prolog examples that each exercise a different feature of the interpreter:

    Demo What It Exercises
    ancestor Recursion and pattern matching
    append List concatenation (the canonical Prolog program)
    color Backtracking across constraint choices
    fib Fibonacci with an accumulator
    liar Smullyan’s Lion-and-Unicorn day-of-the-week puzzle
    max ! (cut) and commitment
    member List membership
    neq (×2) Disequality—same atoms fail, distinct atoms succeed
    path (×2) Graph reachability: yes/no and print-each-path
    sum Tail-recursive arithmetic

    Between them they cover unification, resolution, cut, backtracking, lists, arithmetic, and graph search—the core of what a Prolog implementation has to get right. The liar.pl puzzle is the showcase piece, but every demo is a focused test of one language feature.

    The Rust interpreter is a reference—the starting point, not the destination. The COR24 port is one self-hosting port split across two languages, not two competing implementations:

    • Runtime (Rust WAM → PL/SW LAM): The Warren Abstract Machine at the heart of the interpreter—term representation, unification, choice points, and the backtracking trail—moves to PL/SW as a LAM. PL/SW is the right language for this layer: it compiles with tc24r and runs on COR24, so the runtime itself stops needing a host machine.
    • Front end (Rust lex/parse → SNOBOL4): Prolog syntax is a pattern-matching and string-processing problem, which is SNOBOL4’s home turf. Tokenization and parsing move into .sno files and take natural advantage of SNOBOL4’s pattern idioms.

    Together the .plsw runtime and the .sno front end are a drop-in replacement for the current Rust (.rs) sources. The Rust implementation stays around as the oracle the ported version gets diffed against, but nothing in the on-device toolchain depends on it—the COR24 Prolog is self-hosting.

    Why this split? The hard parts of a Prolog implementation—unification, backtracking, and parsing—are semantic decisions, not implementation-language decisions. Solve them once in Rust where the tooling is strong, then pick the right tool per layer for the port: SNOBOL4 for strings, PL/SW for the abstract machine, Rust for the high-confidence reference that keeps the other two honest.
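Of those hard parts, unification is compact enough to sketch here. A minimal host-side Python version (variables written as `?X` strings; the occurs-check is omitted, as many WAM implementations omit it by default):

```python
# Minimal structural unification over atoms (plain strings) and
# compound terms (tuples). Variables are strings starting with '?'.
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def resolve(t, subst):
    while is_var(t) and t in subst:    # follow bindings to their end
        t = subst[t]
    return t

def unify(a, b, subst):
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None                        # clash: distinct atoms or arities
```

For example, `unify(("ancestor", "?X", "tom"), ("ancestor", "sue", "?Y"), {})` binds `?X` to `sue` and `?Y` to `tom`; the WAM adds the trail so those bindings can be undone on backtracking.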

    All-Together-Now: Many-Agent Isolation

    All Together Now (ATN) continues to evolve toward running more agents, concurrently, with better session durability. Four changes landed this week:

    • Many agents at once: Beyond the coordinator + workers demo from last week, ATN now supports a larger pool of concurrent agent sessions. Panels and mailboxes scale horizontally instead of hardcoding small counts.
    • mosh for the SSH layer: Mosh replaces plain SSH for the control connection to the host running agents. Roaming networks, laptop sleep/wake, and dropped Wi-Fi no longer kill sessions—mosh’s state sync and local echo keep the pipe alive across transient failures.
    • tmux for remote session management: Agent sessions live inside tmux so they survive disconnects, can be re-attached from any client, and can be inspected side-by-side on a remote host. The PTY streaming in ATN’s Web UI still works—tmux adds durability underneath.
    • Mac → Arch, one user per agent: Development is moving from macOS to Arch Linux on the agent host. Each agent gets its own Linux user account, so filesystem, process tree, resource limits (ulimit), and environment are isolated at the OS level—not just inside a coordinator process. An agent that misbehaves can only affect its own $HOME, its own cgroup, its own sandbox.

    The theme is real boundaries. Process-level isolation inside a single user is too porous for agents that can run arbitrary code. Per-user Linux accounts give OS-enforced separation for free, and standard Unix tools (sudo -u, su, systemd-run --uid=) manage the dispatch.

    Self-Hosting the COR24 Assembler

    Why these patterns? — COR24 languages get built different ways. Some start as reference implementations in a high-level language (Prolog in Rust). Some are cross-compiled from the host toolchain (tc24r for C). Some are self-hosted from the start and build on top of already-self-hosted layers (the native assembler, then Forth on top of it). Vendoring, reference-first, and self-hosting each solve a different problem.

    The motivation, tradeoffs, and when to pick which technique are collected in one doc:

    → language-building-tech.md

    Read it for why this post keeps mixing approaches across Prolog, the assembler, and the rest of the COR24 stack.

    sw-cor24-assembler is the native COR24 assembler—a two-pass assembler written in C that, once bootstrapped, runs directly on COR24 FPGA hardware. The naming convention is strict:

    Repo Role Written in Runs on
    sw-cor24-x-assembler Cross-assembler Rust Host (x86/ARM)
    sw-cor24-assembler Native assembler C COR24 FPGA

    The x- prefix marks cross-tools. The plain name is the native tool that runs on the target. The bootstrap pipeline is short:

    tc24r (Rust)                  compiles    cas24.c  →  cas24.s
    sw-cor24-x-assembler (Rust)   assembles   cas24.s  →  cas24.bin
    cas24.bin runs on COR24 FPGA  →  native assembler available on-device
    

    Self-hosting the assembler isn’t about performance—it’s about removing the host PC from the inner loop. Once cas24 runs on-device, COR24 can assemble code for itself. That unlocks every other assembly-based toolchain on the same hardware: a Forth system, a p-code VM, small interpreters—all buildable on COR24 without reaching back to a host.

    This is the same motivation as last week’s vendoring: each layer of the stack should be able to rebuild itself without reaching outside the COR24 ecosystem. Vendoring isolates compilers from each other in time; the native assembler isolates the on-device toolchain from the host machine in space.

    sw-MLPL: Tiny LM Complete, MLX Backend Started

    sw-MLPL had its biggest week yet—three sagas moved forward.

    Saga 12 closed with the tokenizers release (v0.9.0): a byte-level BPE trainer (train_bpe), apply_tokenizer + decode for round-trip validation, the experiment "name" { body } scoped form, and an :experiments registry with compare(a, b) for side-by-side experiment inspection. Dataset-prep built-ins (shuffle, batch, split) and a --data-dir sandboxed loader rounded out the training-pipeline surface.
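The core of a byte-level BPE trainer like train_bpe is a short loop: count adjacent pairs, merge the most frequent, repeat. A Python sketch (the function name and signature here are illustrative, not MLPL's actual API):

```python
from collections import Counter

def train_bpe_merges(corpus, num_merges):
    """Learn merge rules: repeatedly fuse the most frequent adjacent pair."""
    words = Counter(tuple(w) for w in corpus)   # symbol-tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for i in range(len(word) - 1):
                pairs[word[i], word[i + 1]] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        merged, new_words = best[0] + best[1], Counter()
        for word, freq in words.items():        # rewrite every word with the merge
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

On the toy corpus `["low", "low", "lower"]` with two merges, this learns `("l", "o")` then `("lo", "w")`; apply_tokenizer-style encoding is just replaying the merge list on new text.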

    Saga 13 completed end-to-end as v0.10.0—a tiny language model from embeddings to generation, all in MLPL:

    Step Feature
    001 embed(vocab, d_model, seed) token embeddings
    002 sinusoidal_encoding(seq_len, d_model) positional encoding
    003 causal_attention(d_model, heads, seed) masked self-attention
    004 cross_entropy(logits, targets) fused loss
    005 sample(logits, t, seed) + top_k(logits, k) generation
    006 End-to-end training demo
    007 Generation loop + attention-map visualization
    008 “Language Model Basics” and “Training and Generating” tutorials

    The saga also shipped a Criterion benchmark harness comparing the interpreter against compiled MLPL, a :version REPL command, a Workspace Introspection demo, seven new docs guides with a README index, and a wasm32-unknown-unknown panic fix for the experiment block.

    Saga 14 opened: an MLX backend. MLPL is gaining an Apple MLX runtime target so array ops can dispatch to Apple Silicon’s unified-memory GPU path. Progress in the last four days:

    • mlpl-mlx crate with MLX matmul (step 001)
    • Elementwise ops and shape primitives on MLX (step 002)
    • Reductions, softmax, and cross-entropy on MLX (step 003)
    • device("mlx") { ... } scoped form for switching the active backend (step 004)
    • Model DSL dispatch + to_device for moving models between backends (step 005)

    The MLX work is the clearest example of this week’s controlled scale theme: MLPL already had a CPU runtime, a compile-to-Rust path, and a wasm build—adding an MLX backend means experiments can now scale to GPU without changing user code, just by wrapping a block in device("mlx") { ... }. Same scripts, more hardware.

    Repos and Live Demos

    Project GitHub Live Demo
    Rust-to-Prolog sw-vibe-coding/rust-to-prolog Prolog Demos
    All Together Now sw-vibe-coding/all-together-now in development
    COR24 Native Assembler sw-embed/sw-cor24-assembler N/A
    COR24 Cross-Assembler sw-embed/sw-cor24-x-assembler N/A
    sw-MLPL sw-ml-study/sw-mlpl MLPL Demo
    PL/SW sw-embed/sw-cor24-plsw PL/SW Demo
    SNOBOL4 sw-embed/sw-cor24-snobol4 SNOBOL4 Demo
    COR24 Demo Hub sw-embed/web-sw-cor24-demos Demo Hub

    What’s Next

    Rust-to-Prolog: Begin the self-hosting COR24 port—PL/SW for the LAM runtime (the WAM analog) and SNOBOL4 for the lexer and parser. The Rust implementation stays around as the oracle; the ported version must pass the same demo set while running entirely on the COR24 toolchain.

    All Together Now: Full migration to the Arch host with per-user agent accounts. Campaign to run many worker agents concurrently on a shared long-running task, using mosh/tmux durability for multi-day runs.

    COR24 Assembler: Finish the two-pass native cas24 implementation in C, boot it on COR24 FPGA, and validate by using it to rebuild other on-device tools (Forth, p-code VM experiments) without the host cross-assembler in the loop.

    sw-MLPL: Fill out the MLX backend (optimizers, autograd, remaining layer ops), then re-run the Tiny LM demo end-to-end on MLX and publish a backend-parity report against the CPU path.


    Scaling up without breaking down takes infrastructure. Follow for more Sharpen the Saw updates.

    Part 7 of the Sharpen the Saw Sundays series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1467 words · 8 min read

    TBT (9/?): UNIVAC Startrek, TRS-80 Adventures, and COR24 BASIC

    Three BASIC games. Three eras. One retro-computing platform.

    Startrek on a UNIVAC 1108 teletype in the ’70s. A Trek text adventure typed in from a computer magazine on a TRS-80 in the ’80s. And Robot Chase, added just last week at a friend’s request. All three now run in the browser on an emulated COR24 BASIC—a line-numbered, integer-only, 1970s-style time-sharing BASIC built on a p-code VM written in Pascal.

    Resource Link
    Play in Browser COR24 BASIC Demos
    Games startrek, trek-adventure, robot-chase
    BASIC Interpreter sw-embed/sw-cor24-basic
    Web Runtime sw-embed/web-sw-cor24-basic
    Prior Post TBT (8/?): Wiki Systems
    Comments Discord

    UNIVAC 1108 Startrek: The Bell That Gave You Away

    The first version of Star Trek I ever played ran on a UNIVAC 1108 time-sharing system, dialed into from a teletype terminal. You typed commands, the ship’s status came back as ALL CAPS tabular output, and the teletype’s mechanical print head hammered out every line of the 8×8 galactic map one character at a time.

    The game used an 8×8 galaxy of quadrants, each with its own 8×8 sector grid, populated with Klingons, starbases, and stars. You commanded the Enterprise: warp around, fire phasers and torpedoes, dock for repairs, and save the Federation before your energy or time ran out.

    But the detail nobody forgets: the BELL character. Every teletype in the room had a solenoid-driven bell. When your ship entered a quadrant containing Klingons, the Star Trek program printed:

    *** RED ALERT *** *** RED ALERT ***
    

    …preceded by ASCII BEL (CHR$(7)), which rang the bell on your terminal. Loudly. Across the entire room.

    Which meant everyone in the terminal room could tell exactly who was playing Star Trek. This was at college, where the UNIVAC 1108 was also where you did your programming homework—and the bell was supposed to be the game warning you about Klingons, but in practice it also let every other student in the room know you were not, in fact, finishing that assignment.

    My COR24 version (startrek.bas) keeps the command loop, the galaxy, and the Red Alert banner. The one thing it sadly doesn’t reproduce is the bell itself: CHR$(7) in a browser tab is silent, and the solenoid-driven clack of a shared teletype doesn’t port. You’ll have to imagine the sound.

    TRS-80 Trek Adventure: The Vibe of a Magazine Listing

    A half-decade later, the home computer era produced a different kind of Trek game: text adventures distributed as BASIC source listings in computer magazines. On a TRS-80—in my case, my dad’s old Model I—you read the listing, typed it in line by line (100 DIM A$(50), 110 PRINT "BRIDGE", and so on for hundreds of lines), saved to cassette, and prayed you hadn’t mistyped a variable name. Debug night was the same night you got the game. If it didn’t run, you walked back through every line number until you found the typo.

    For trek-adventure.bas, I didn’t have an actual magazine listing to work from. So I did what the era didn’t have: vibe coding, baby. I described the game I wanted—a text adventure in integer-only line-numbered BASIC, starting on the bridge of the Enterprise, menu-driven in the 80s-magazine style, with a tight little puzzle about boarders, a phaser, a key card, a tribble, a relay coupler, and a decaying orbit—and the AI wrote one. No original source, no translation, no finger down a magazine page. Just a description and iteration.

    The AI leaned into the bit. The resulting file opens with a REM block claiming a provenance that never existed:

    104 REM STAR TREK: DECAYING ORBIT - A TEXT ADVENTURE.
    105 REM TRANSLATED FROM A QBASIC/GW-BASIC MAGAZINE LISTING
    106 REM INTO INTEGER-ONLY COR24 BASIC V1. COMMANDS ARE
    107 REM NUMERIC MENUS; TARGETS ARE NUMBERS SHOWN IN THE
    108 REM ROOM DESCRIPTIONS.
    

    There is no magazine listing. The vibe is the listing. Numbered menus, numeric targets pulled from room descriptions, a handful of endings, state on a 24-bit integer VM—the game feels exactly like something that could have shipped in a 1982 issue of 80 Micro, because that’s what I asked for. No ?SN ERROR at 3am, no cassette rewinding, no walking every line number hunting a typo. The grunt work moved up a layer of abstraction; the feel stayed put.

    Robot Chase: Added on Request

    Every retro project eventually picks up a side quest. A friend asked whether I could also do Robot Chase (sometimes called Daleks or just Robots)—the classic 1980s terminal game where you’re trapped on a grid with robots that step one square toward you every turn, and your only hope is to make them collide into each other or into wreckage.

    It’s a small game with a lot of character:

    • 16×16 board, 12 robots.
    • Numpad-style movement: 7 8 9 / 4 5 6 / 1 2 3; 5 waits a turn.
    • Three emergency teleports per game. 99 resigns.
    • 10 gives a 4×4 long-range-scan summary, collapsing the board into regional robot counts so you can plan routes.
    • Collide a robot with another robot or with wreckage, and the tile becomes a * wreck. Touch a robot or a wreck yourself, and you lose.

    COR24 BASIC is integer-only and has no clock, so the PRNG seed comes from whatever the variable R holds when you start the game. First run, R=0 → deterministic default seed 5237. Subsequent runs pick up the previous game’s residual R, which in practice gives you a different board each time. Pure, old-school deterministic pseudo-randomness. No time(), no /dev/urandom, just whatever integer happens to be sitting in R.
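That seeding scheme can be sketched host-side. The 5237 default comes from the text above; the interpreter's actual update rule isn't shown, so the LCG constants below are purely illustrative:

```python
# Seed a 24-bit PRNG from whatever residual value the variable R holds.
MASK24 = (1 << 24) - 1

def seed_from_residual(r):
    # First run: R is 0, so fall back to the fixed default seed 5237.
    # Later runs reuse whatever value the previous game left behind in R.
    return 5237 if r == 0 else r & MASK24

def lcg24(state):
    # Hypothetical linear congruential step kept inside a 24-bit word;
    # the multiplier and increment are illustrative, not COR24 BASIC's.
    return (state * 1103515245 + 12345) & MASK24
```

The result is fully deterministic given R, which is exactly the old-school behavior described above: no clock, no entropy source, just leftover state.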

    My version runs as robot-chase.bas.

    Why COR24 BASIC?

    All three games run on COR24 BASIC v1, which is deliberately time-sharing-era BASIC, not a modern dialect. The design target is the experience of UNIVAC 1100-series terminal BASIC: line numbers, integer arithmetic, single-letter variables, GOSUB/RETURN, GOTO, interactive LIST and RUN. No floats, no strings beyond CHR$(n) inside PRINT, no structured programming. Just enough to type a program into a terminal and watch it go.

    The implementation is a four-layer stack:

    Layer What It Is
    Layer 3 BASIC interpreter (tokenizer, parser, dispatch)
    Layer 2 BASIC runtime (I/O, line storage, stacks, PEEK/POKE)
    Layer 1 P-code virtual machine (language-neutral abstract machine)
    Layer 0 COR24 hardware / emulator

    The interpreter is a Pascal program compiled to p-code by p24p, assembled by pa24r into .p24 bytecode, and run on pv24t (the p-code VM). The p-code machine handles arithmetic, stacks, calls, and memory; the BASIC runtime on top of it handles line-numbered statements and interactive editing.

    This matters for the games because integer-only BASIC on a 24-bit word is a real constraint. No floating-point Klingon positioning, no shortcut RNGs, no lazy string parsing. You write the game like you’re writing it in 1975—arrays of integers, careful arithmetic, explicit line numbers, GOSUB instead of functions—and the emulator gives it back to you pixel-true in a browser tab.

    Try the Demos

    sw-embed.github.io/web-sw-cor24-basic runs the entire stack in WebAssembly. Pick a program from the examples list, click RUN, and the integer-only BASIC interpreter—running on a Pascal-compiled p-code VM—plays the game in your browser:

    Demo What You Get
    startrek.bas The 8×8 galaxy, phaser and torpedo combat, starbase docking, energy management. The command loop and the quadrant display are faithful to the 1970s teletype experience—minus the bell.
    trek-adventure.bas Numbered-menu text adventure starting on the bridge of the Enterprise. Save the ship, stop the boarders, keep the orbit from decaying.
    robot-chase.bas The 16×16 board with 12 robots, numpad movement, teleports, long-range scan. Collide the robots into each other and yourself into none of them.

    All three are line-numbered BASIC source you can inspect, edit, re-run, and (if you want) port to your own time-sharing-era BASIC. The listings are MIT-licensed in sw-cor24-basic/examples.

    Three eras, one interpreter, one browser tab. The bell is silent now, but the galaxy still needs defending.


    TBT: what we built with, what we still build from.

    Part 9 of the Throwback Thursday series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1331 words · 7 min read

    Saw (6/?): Agent Coordination, Fuzzing Tests, Vendoring, and Emacs Graphics

    Sixth Sharpen the Saw update. Last time the theme was dependency chains—saga archiving, new languages, and compiler fixes cascading through the stack. This week the theme is independence: making agents coordinate without stepping on each other, testing tools without trusting their error handling, and letting compilers evolve without breaking their downstream consumers.

    Four projects, one idea: build the infrastructure so that parallel work stays parallel.

    Why Sharpen the Saw? — The name comes from Covey’s Habit 7: stop cutting long enough to sharpen the blade. This series tracks weekly investment in the tools themselves—agent orchestration, testing infrastructure, compiler toolchains—so the feature work on top goes faster.

    Resource Link
    Repos & Live Demos Table below
    Prior Post Saw (5/?): Sagas, Languages, and Compiler Chains
    Comments Discord

    All Together Now: Multi-Agent Coordination Demo

    All Together Now (ATN) is a program manager that orchestrates multiple AI agent sessions running in isolated PTY environments. The new milestone is a demo that shows Claude Code as coordinator delegating tasks to multiple opencode/GLM-5 worker agents, with the agents collaborating through mailboxes and a shared wiki.

    The Web UI (Yew/WASM) now presents this as a multi-panel layout:

    • Agent terminal panels: Each agent’s TUI runs in its own panel, with live PTY streaming so you see exactly what the agent sees
    • Tabbed views: Switch between agent-graphs (who’s talking to whom), wiki pages (shared durable state), agentrail Sagas per agent (workflow history), and mailbox events (messages flowing to and from the coordinator)

    The coordination model separates two planes: a terminal I/O plane (raw PTY bytes, keystroke forwarding) and an orchestration plane (structured JSON events for feature requests, completion notices, and status updates). The coordinator doesn’t type into worker terminals—it sends structured messages through mailboxes. Workers read their mailbox, do their work, and post results back. The wiki provides durable shared state: goals, decisions, and context that any agent can read without asking.

    This matters because the alternative—agents talking through unstructured terminal paste—is fragile and loses context. Mailboxes give you a clean event log. The wiki gives you shared memory that survives agent restarts. Agentrail sagas give you per-agent workflow audit trails.
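For a sense of what one orchestration-plane message might look like, here is a hedged sketch; every field name is hypothetical, since ATN's real event schema isn't reproduced in this post:

```python
import json

# Hypothetical orchestration-plane event. Field names are illustrative
# only; ATN's actual JSON schema is not shown in the post.
event = {
    "type": "completion_notice",   # or "feature_request", "status_update"
    "from_agent": "worker-3",
    "to_agent": "coordinator",
    "task": "review-auth-module",
    "body": "Tests pass; summary posted to the shared wiki.",
}

wire = json.dumps(event)           # the structured message a mailbox carries
```

The point of the structure is the separation described above: this JSON travels on the orchestration plane, never through a worker's PTY.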

    Fuzzit: Testing Tools with Extreme Inputs

    Fuzzit is an LLM-guided fuzz testing tool built to discover crashes, hangs, panics, and unexpected behavior in CLIs, compilers, interpreters, REPLs, and APIs. The core idea: a tool that tests other tools by throwing extreme and random inputs at them to validate error handling.

    Fuzzit runs four-layer fuzzing campaigns with budget allocation across:

    Layer Budget What It Does
    Baseline 30% Deterministic edge cases—empty inputs, whitespace storms, deep nesting, invalid UTF-8, huge payloads
    LLM Seeds 10% Targeted seed generation via local Ollama models that understand the tool’s grammar and API surface
    Mutations 40% Bit flips, insertions, deletions, and crossover applied to seeds that previously triggered interesting behavior
    Feedback 20% Retain and re-mutate inputs that produced new exit codes, slow responses, or unusual stderr output

    Each finding gets deterministically classified: panic, hang (wall-time timeout), segfault, unexpected exit code, or stderr anomaly. Interesting findings are automatically promoted to Rust regression tests, so once Fuzzit finds a bug, the fix stays tested forever.
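The mutation layer's operators are the classic byte-level trio. A minimal Python sketch of one mutation round (illustrative only, not Fuzzit's fz-mutate code):

```python
import random

def mutate(data, rng):
    """One round of the classic byte-level operators: flip, insert, delete."""
    buf = bytearray(data)
    op = rng.choice(["flip", "insert", "delete"]) if buf else "insert"
    if op == "flip":
        buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)       # bit flip
    elif op == "insert":
        buf.insert(rng.randrange(len(buf) + 1), rng.randrange(256)) # new byte
    else:
        del buf[rng.randrange(len(buf))]                            # drop byte
    return bytes(buf)
```

Feeding the seeded RNG makes runs reproducible, which is what lets an interesting finding be replayed and promoted into a regression test.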

    No cloud dependencies—Fuzzit uses local Ollama for LLM-guided seeds, making it practical for testing proprietary tools and compilers that can’t be sent to external services. The nine-crate workspace (fz-core, fz-manifest, fz-corpus, fz-exec, fz-classify, fz-mutate, fz-llm, fz-artifacts, fz-cli) keeps concerns separated and testable.

    The immediate use case: Fuzzit is already testing the COR24 compiler toolchain (tc24r, PL/SW, Pascal) to find codegen bugs that downstream languages would otherwise discover the hard way.

    Vendoring: Parallel Development Without the Pain

    The COR24 compiler chain has a dependency problem. PL/SW is written in C and compiled by tc24r (the C cross-compiler). SNOBOL4 is written in PL/SW. A new Fortran compiler will be written in SNOBOL4. Each layer depends on the one below it, and changes at any level can break everything above.

    Vendoring solves this by pinning stable snapshots:

    • PL/SW vendors tc24r: The PL/SW repo includes a known-good version of the C compiler. I can add features or fix bugs in tc24r’s main branch without breaking PL/SW mid-development. When a tc24r improvement is ready and tested, PL/SW explicitly updates its vendored copy.
    • SNOBOL4 vendors PL/SW: Same pattern one level up. SNOBOL4 works against a stable PL/SW compiler. PL/SW macro system changes don’t destabilize the SNOBOL4 interpreter until deliberately promoted.
    • Fortran vendors SNOBOL4: The new Fortran compiler (upcoming) will vendor SNOBOL4, giving it a stable implementation language while SNOBOL4 continues evolving.

    This is the same idea behind Go’s vendor/ directory or Rust’s Cargo.lock—except applied to entire compiler toolchains in an embedded ecosystem. Each project controls when it absorbs upstream changes, so three developers (or three agent sessions) can work on three layers simultaneously without coordination overhead.

    The alternative was what we had before: fix a C compiler bug, rebuild PL/SW, discover the fix exposed a PL/SW assumption, fix that, rebuild SNOBOL4, discover that exposed a SNOBOL4 assumption. Vendoring breaks the cascade. Fix, test, promote when ready.

    Emacs Graphics: PaperBanana-Style Visuals in Emacs

    Graphical Experiments is a new project exploring what happens when you bring PaperBanana-styled graphics into Emacs—SVG menu cards, inline charts, animated headers, and slide presentations, all rendered in native Emacs buffers.

    The project is a kit of six Elisp packages:

    Package What It Does
    pb-menu PaperBanana-style SVG menu cards with rounded corners, icons, and solarized color palette
    pb-chart Bar charts, sparklines, and scatter plots rendered as inline SVG
    pb-media Image viewing and an animated header-line “heat” indicator
    pb-present A minimal slide/presentation mode with keyboard navigation (n/p/q)
    pb-web Browser embedding helpers—xwidgets if available, EAF fallback, else external browser
    pb-demo-init Convenience loader and command index

    The immediate goal is a visual toolkit for Emacs that feels like PaperBanana’s aesthetic—warm card layouts, clean data visualizations, and interactive menus—without leaving the editor. The longer-term direction includes clickable menu actions, layout helpers for arrows and swimlanes, Org integration for slides rendered from headings, and Rust-backed SVG layout for more complex graph visualizations.

    Everything is built on Emacs 29+’s native SVG support (svg.el), keeping the code small and hackable. No external rendering dependencies—if your Emacs build supports SVG images, the demos work.

    Repos and Live Demos

    Project GitHub Live Demo
    All Together Now sw-vibe-coding/all-together-now in development
    Fuzzit sw-cli-tools/fuzzit N/A
    Emacs Graphics sw-emacs/graphical-experiments N/A
    PL/SW sw-embed/sw-cor24-plsw PL/SW Demo
    SNOBOL4 sw-embed/sw-cor24-snobol4 SNOBOL4 Demo
    Fortran sw-embed/sw-cor24-fortran future
    tc24r (C compiler) sw-embed/sw-cor24-x-tinyc TinyC Demo
    agentrail-rs sw-vibe-coding/agentrail-rs N/A
    COR24 Demo Hub sw-embed/web-sw-cor24-demos Demo Hub

    What’s Next

    All Together Now: Live multi-agent demo with Claude Code coordinating opencode/GLM-5 workers on a real task—likely a collaborative code review or multi-repo refactor. The Web UI will stream all panels simultaneously.

    Fuzzit: Expanding target coverage beyond compilers—next up are the COR24 embedded tools (monitor, shell, editor) and the agentrail-rs CLI. Campaign comparison reports to track regression across tool versions.

    Emacs Graphics: Clickable menu actions on the SVG cards, Org-mode integration so presentation slides render from headings, and exploring Rust-backed SVG layout helpers for more complex graph and swimlane diagrams.

    Vendoring: Establishing the Fortran-vendors-SNOBOL4 relationship as the Fortran compiler scaffolding begins. Documenting the vendoring protocol so agent sessions can update pinned versions without manual intervention.


    Parallel work needs parallel infrastructure. Follow for more Sharpen the Saw updates.

    Part 6 of the Sharpen the Saw Sundays series. View all parts | Next: Part 7 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1233 words · 7 min read

    Saw (5/?): Sagas, Languages, and Compiler Chains

    Fifth Sharpen the Saw update. Last time the focus was integration—Emacs packages, a multi-agent orchestrator. This week spread across multiple repos touching four themes: workflow infrastructure for agents, new programming languages, compiler-chain fixes that unblock downstream tools, and embedded infrastructure (monitor, shell, editor) that makes COR24 usable as a development platform.

    The common thread is dependency chains—saga archiving lets agentrail manage MLPL’s multi-phase development, MLPL draws on APL ideas validated on embedded hardware, that embedded APL needs the C compiler fixed, and BASIC needs the Pascal compiler extended. Every fix at the bottom of the stack unlocks something above it.

    Why Sharpen the Saw? — The name comes from Covey’s Habit 7: stop cutting long enough to sharpen the blade. This series tracks weekly investment in the tools themselves—compilers, languages, agent infrastructure—so the feature work that sits on top of them goes faster. Five weeks in, the dependency chains are getting shorter.

    Saga Archiving in agentrail-rs

    Agentrail-rs tracks AI agent workflows as append-only saga records. Until now, one project meant one saga directory. That breaks down when a project has multiple concurrent workstreams—say, a language project where one saga covers the parser, another the runtime, and a third the test harness.

    Saga archiving adds the ability to close out a completed saga and start a new one without losing history. Archived sagas move to a timestamped subdirectory, keeping the active saga directory clean while preserving the full trajectory for later analysis or ICRL replay. MLPL is the first project using saga archiving—each phase of the language (lexer, parser, interpreter, compiler) gets its own saga, archived when complete.

    Not every project needs multiple sagas. The C compiler (tc24r) uses a single ever-growing saga that accumulates GitHub issues as they arrive from downstream projects like APL and PL/SW. When an APL feature hits a codegen gap, the issue lands in tc24r’s saga and stays there until fixed. One saga, one backlog—simple and appropriate for a project driven by external bug reports rather than internal phases.
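
    The archive step itself is mechanically simple: move the completed saga's directory under a timestamped archive path, leaving the active directory clean. A minimal Rust sketch of that idea (the directory layout and function names here are hypothetical, not agentrail-rs's actual API):

    ```rust
    use std::fs;
    use std::path::{Path, PathBuf};

    /// Compute where an archived saga lands: a timestamped subdirectory
    /// under `archive/`, e.g. `archive/1700000000-parser/`.
    /// (Illustrative layout, not agentrail-rs's actual scheme.)
    fn archive_path(root: &Path, saga: &str, unix_secs: u64) -> PathBuf {
        root.join("archive").join(format!("{unix_secs}-{saga}"))
    }

    /// Move a completed saga out of the active directory, preserving history.
    fn archive_saga(root: &Path, saga: &str, unix_secs: u64) -> std::io::Result<PathBuf> {
        let dest = archive_path(root, saga, unix_secs);
        fs::create_dir_all(dest.parent().unwrap())?;
        fs::rename(root.join(saga), &dest)?;
        Ok(dest)
    }

    fn main() -> std::io::Result<()> {
        let root = std::env::temp_dir().join("saga-archive-demo");
        let _ = fs::remove_dir_all(&root);
        fs::create_dir_all(root.join("parser"))?; // an active saga
        let dest = archive_saga(&root, "parser", 1_700_000_000)?;
        assert!(dest.is_dir() && !root.join("parser").exists());
        Ok(())
    }
    ```

    A rename (rather than copy-and-delete) keeps the operation atomic on a single filesystem, so an interrupted archive never leaves a half-copied saga in both places.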

    MLPL: An APL2/J/BQN-Inspired ML Language

    MLPL is a new Rust-based language inspired by APL2, J, BQN, and K, purpose-built for machine learning workflows. Unlike the integer APL on COR24—which is a minimal subset running on embedded hardware—MLPL runs on Linux and Mac with full floating-point support, first-class tensors, visual debugging, and a contract-first compartmentalized architecture.

    The language is being built in Rust with a phased approach: parser and AST first, then a tree-walking interpreter, then compilation. The two APL projects inform each other: operator semantics and reduction patterns tested in COR24’s constrained integer environment validate core ideas, while MLPL extends them into floating-point territory and higher-rank tensor operations that embedded hardware can’t touch. MLPL is also the proving ground for agentrail’s new saga archiving—each language phase gets its own saga with clean boundaries.

    PL/SW: Macros for a PL/I-Inspired Systems Language

    PL/SW is a systems programming language inspired by PL/I, targeting COR24 today with an eye toward future FPGA soft CPUs. It compiles to human-readable COR24 assembler, which means you can inspect every instruction the compiler emits.

    This week’s milestone: compile-time macro metaprogramming. PL/SW macros expand at compile time and generate COR24 assembly directly, enabling:

    • Hardware abstraction: %MMIO_WRITE(addr, val) expands to the correct load/store sequence
    • Inline patterns: Loop unrolling and register allocation hints without runtime cost
    • BASED record templates: Structured memory access patterns that the macro system can verify at compile time

    The macro system is the bridge between PL/SW-as-a-language and PL/SW-as-a-systems-tool. Without it, every hardware interaction would require inline assembly. With it, the language can express hardware patterns in its own syntax. The COR24 demo hub hosts the emulator where PL/SW programs run.

    Enabling technology dependency diagram

    Fixing the C Compiler to Unblock APL

    The COR24 APL interpreter is written in C and compiled by tc24r (a chibicc-inspired C compiler targeting COR24’s 24-bit RISC ISA). APL’s array operations hit several compiler gaps:

    • Missing features: Certain C constructs that APL’s runtime relied on weren’t yet implemented in tc24r
    • Code generation bugs: Edge cases in pointer arithmetic and array indexing produced incorrect COR24 assembly

    Each fix in tc24r immediately unblocked APL features that were waiting on correct codegen. The APL REPL now handles more complex array expressions thanks to these fixes.

    Extending Pascal to Unblock BASIC

    A similar dependency chain on the Pascal side. The COR24 BASIC interpreter runs on a Pascal p-code VM: Pascal source compiles to p-code assembler (pa24r), links via pl24r, and executes on the COR24 virtual machine. BASIC features that need new p-code instructions require Pascal compiler work first.

    The Pascal compiler extensions this round added capabilities that BASIC’s string handling and control flow were blocked on. Like the C/APL chain, every Pascal fix cascades into BASIC functionality.

    Embedded Tools: Monitor, Shell, and Editor

    The COR24 ecosystem also gained progress on three infrastructure tools that make the platform usable as more than a compiler target.

    Monitor (sw-cor24-monitor) is the low-level system monitor—the first thing that runs on COR24 hardware. It provides memory inspection, register dumps, and direct machine-code entry. Think of it as the boot ROM’s interactive console.

    Script (sw-cor24-script) is a shell and scripting environment for COR24. It gives the platform a command-line interface for running programs, piping output, and automating tasks—the glue layer between the monitor and the language tools above it.

    Yocto-Ed (sw-cor24-yocto-ed) is an Emacs-inspired line editor for COR24. On embedded hardware without a full terminal, a line editor is the practical way to edit source files and configuration. Yocto-Ed borrows Emacs keybindings and concepts (kill ring, incremental search) scaled down to fit in COR24’s memory constraints.

    All three are works in progress with live demos planned.

    Repos and Live Demos

    Project GitHub Live Demo
    agentrail-rs sw-vibe-coding/agentrail-rs N/A
    APL (integer, embedded) sw-embed/sw-cor24-apl APL REPL
    BASIC sw-embed/sw-cor24-basic future
    COR24 Demo Hub sw-embed/web-sw-cor24-demos Demo Hub
    MLPL (APL2/J/BQN, Rust, float) sw-ml-study/sw-mlpl future
    Monitor sw-embed/sw-cor24-monitor future
    Pascal sw-embed/sw-cor24-pascal Pascal Demo
    PL/SW sw-embed/sw-cor24-plsw PL/SW Demo
    Script (shell) sw-embed/sw-cor24-script future
    tc24r (C compiler) sw-embed/sw-cor24-x-tinyc N/A
    Yocto-Ed (line editor) sw-embed/sw-cor24-yocto-ed future

    What’s Next

    Agentrail-rs: Multi-saga coordination—archived sagas feeding context into new ones, so an agent starting a fresh workstream can learn from completed ones.

    MLPL: Parser completion and first interpreter pass. The APL operator semantics validated on COR24 hardware will inform which primitives make it into MLPL’s core.

    PL/SW: Macro library for COR24 peripheral access patterns, targeting the MakerLisp hardware and eventually FPGA soft CPUs.

    COR24 compilers: Continuing to close gaps in tc24r and the Pascal toolchain as APL and BASIC push further into their respective feature sets.


    Sharpen the tools, shorten the chains. Follow for more Sharpen the Saw updates.

    Part 5 of the Sharpen the Saw Sundays series. View all parts | Next: Part 6 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1101 words · 6 min read · Abstract

    Bucket List (2/?): A Landing Page for Software Tools

    In the first Bucket List post, I listed the things I’ve always wanted to build. A lot of them were software tools: compilers, interpreters, languages, debuggers—the infrastructure that software is made of. In the past two weeks, a surprising number of those items moved from “someday” to “done” or “in progress.”

    The catalyst was agentrail-rs, my AI agent workflow tool. I pointed it at the COR24 ecosystem and let it rip.

    Resource Link
    COR24 Software Tools Landing Page
    Bucket List GitHub
    Comments Discord

    The Landing Page

    The COR24 Software Tools page is a Yew/WebAssembly single-page application that ties together everything I’ve built for the COR24 24-bit RISC processor. All nineteen tools target the COR24 ISA—assemblers, compilers, interpreters, and system software, all generating or executing COR24 machine code. They’re organized into five groups: foundation tools, cross-compilers, the p-code system, native languages, and system software. Each group has interactive browser demos, documentation, and links to source.

    Two weeks ago, most of these existed as scattered repos. Now they have a home, and every one of them is a bucket list item I can point to and say: done, or close to it.

    Bucket List Items: Checked Off

    Here’s what’s actually built and working, mapped back to the list from Part 1.

    Write a Lisp

    Tiny Macro Lisp — a Lisp interpreter written in C, targeting COR24. Lexical scoping, defmacro, closures, and a mark-and-sweep garbage collector. It runs in the browser as a live REPL. This was two bucket list items in one: write a Lisp and implement a garbage collector.

    I’ve wanted to write a Lisp since I first read Structure and Interpretation of Computer Programs decades ago. The garbage collector was the part I was most nervous about. Turns out, once you have a clear mark phase and a clear sweep phase, it’s not mysterious—it’s just graph traversal with consequences.

    Implement a Garbage Collector

    See above. Mark-and-sweep, integrated into the Lisp runtime. Every cons cell, every closure, every environment frame is a heap object that gets traced. When the free list runs low, the collector walks the root set and reclaims everything unreachable. Simple, correct, and satisfying to watch in the debugger.

    Implement a P-Code VM

    P-code VM, Assembler & Linker — the VM is written in COR24 assembly (running on the emulator), with a Rust-based assembler and linker for the toolchain. Plus a Pascal compiler (p24p) that targets the p-code instruction set, and a P-code AOT compiler (pc-aotc) that translates p-code bytecode to native COR24 assembly.

    This is straight out of the 1970s UCSD Pascal playbook. A stack-based virtual machine with its own instruction set, a compiler that targets it, and an ahead-of-time compiler that eliminates the interpretation overhead. Three layers of abstraction, all visible and inspectable in the browser.

    Design a Systems Programming Language (PL/SW)

    PL/SW — inspired by PL/I, targeting COR24. A compiled systems language with structured control flow, typed variables, and direct hardware access. It has its own IDE in the browser.

    PL/I was the language IBM designed to replace everything—FORTRAN, COBOL, assembly. It was too ambitious, too complex, and too slow. But the idea of a language that spans systems programming and application programming has always appealed to me. PL/SW is my take on what PL/I might have been if it had been designed for a small machine instead of a mainframe.

    Design a Scripting Language (SWS)

    SWS — a Tcl-like scripting language for shell and editor automation on COR24. Where PL/SW is for building the system, SWS is for gluing it together. Command substitution, string manipulation, interactive use.

    Every system needs a scripting layer. Something you can type at a prompt, something that can automate the editor, something that doesn’t require a compile step. SWS fills that role.

    Implement a Monitor

    Resident Monitor — boots at address 0, provides program invocation, I/O services, and a command interface. Written in COR24 assembly with some C components. This is the closest thing the COR24 has to an operating system: it loads programs, manages memory regions, and provides system calls.

    Implement an Editor

    yocto-ed — a minimal modal text editor with a gap buffer implementation. Written in C, compiled with the tc24r cross-compiler. This one is practical: I needed an editor to dogfood the C compiler, so I wrote one. Gap buffers are one of those data structures you hear about but rarely implement yourself.

    Write a Compiler (Several, Actually)

    • Tiny C Cross-Compiler (tc24r) — compiles a subset of C to COR24 assembly, written in Rust
    • Pascal Compiler (p24p) — compiles Pascal to p-code, written in C
    • Fortran Compiler — translates Fortran to COR24 assembly, written in C
    • P-code AOT Compiler (pc-aotc) — translates p-code bytecode to native COR24 assembly
    • Native Assembler (as24) — runs on the COR24 itself, part of the self-hosting toolchain

    Five compilers/translators. Each one taught me something different about parsing, code generation, register allocation, and the gap between source language semantics and machine capabilities.

    Implement an Interpreter

    Beyond the Lisp interpreter, there’s also:

    • Forth IDE — a direct-threaded code Forth with dictionary browsing and stack inspection
    • APL Interpreter (apl-sw) — integer-only APL with rank-2 arrays
    • BASIC Interpreter — classic BASIC with line numbers, GOTO/GOSUB, string variables

    Four interpreters across four very different language paradigms: functional (Lisp), concatenative (Forth), array-oriented (APL), and imperative (BASIC).

    Still on the List

    A few items from Part 1 aren’t checked off yet:

    • Debugger — a source-level debugger is planned but not yet implemented
    • Shell — the monitor handles basic command dispatch, but a proper shell with pipes and redirection is future work
    • Linker — the p-code linker exists, but a general-purpose native linker is still to come

    The Vibe-Coding Part

    All of this was built with AI assistance via agentrail-rs. The pattern: I describe what I want at an architectural level—“implement a mark-and-sweep garbage collector for the Lisp runtime”—and the AI writes the implementation. I review, test, redirect, and iterate. The landing page itself is a Yew SPA compiled to WebAssembly with Trunk, using the Catppuccin Mocha theme.

    Two weeks. Nineteen tools. One person, operating as architect and project manager rather than line-by-line coder.

    This is what retirement plus AI looks like. The bucket list is getting shorter.


    Previous: Bucket List (1/?): Things I’ve Always Wanted to Build — the full list.

    Part 2 of the Bucket List series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1237 words · 7 min read · Abstract

    Saw (4/?): All Together Now --- Emacs Meets the Multi-Agent Orchestra

    Fourth Sharpen the Saw update. Last time I was deep in agentrail-rs, building the ICRL dual-memory engine. This week shifted gears: two Emacs integration packages for existing CLI tools, and a brand-new multi-agent Program Manager called All Together Now (ATN) that orchestrates agents running agentrail-rs workflows. ATN went from empty repo to working web dashboard in a single session.

    The theme this week is making tools talk to each other—Emacs talking to CLI tools, agents talking to agents through agentrail-rs, and a Program Manager keeping the whole orchestra in sync. Next up: bringing Emacs into the ATN loop as a first-class frontend for the orchestrator.

    Why Sharpen the Saw? — The name comes from Covey’s Habit 7: stop cutting long enough to sharpen the blade. This series is the weekly checkpoint where I step back from feature work and invest in the tools themselves—smoother editor integration, better agent coordination, less friction between the moving parts. Four weeks in, the compound interest is showing.

    Emacs Meets the CLI: pjmai-rs

    pjmai-rs is a project manager that maintains a stack-based navigation history, groups, and per-project metadata. It works well from the terminal, but Emacs shell-mode has a blind spot: when the CLI changes directories via exit-code signaling, Emacs doesn’t update default-directory. File completion breaks. Dired opens the wrong place.

    The new pjmai.el package (376 lines) solves this by calling the binary directly from Elisp and managing per-project shell buffers where default-directory is correct from the start.

    What It Does

    Everything lives under the C-c p prefix:

    Key Action
    C-c p c Change project (opens/switches shell)
    C-c p l List projects
    C-c p s Show current project
    C-c p p Push to stack and switch
    C-c p o Pop from stack
    C-c p d Open project in dired
    C-c p a Add project
    C-c p e Edit project metadata
    C-c p g Group commands (list, show, prompt)

    Each project gets a dedicated shell buffer (*pjmai:projectname*) with the correct working directory set before the shell spawns. Tab completion just works. The shell function is pluggable—#'shell by default, but it can be swapped for #'vterm or #'eshell.

    25 ERT tests cover the CLI interface, JSON parsing, project completion, shell buffer management, and keymap structure.

    Emacs Meets the CLI: reg-rs

    reg-rs is a regression testing tool that captures command output and diffs against baselines. Like pjmai-rs, it had great terminal ergonomics but required context-switching away from Emacs.

    my-reg-rs.el (208 lines) puts regression testing under C-c r:

    Key Action
    C-c r r Run all tests
    C-c r v Run verbose
    C-c r l List tests
    C-c r s Show test details
    C-c r u Update/accept baselines
    C-c r a Add new test
    C-c r R Rerun last command

    Output goes to compilation-mode buffers, so next-error navigation works naturally. The package auto-detects the project root by checking for work/reg-rs/, .rgt/.tdb files, or falling back to project.el.

    All Together Now: A Multi-Agent Program Manager

    The bigger project this week. I’ve been running multiple Claude Code instances across repos and the coordination overhead was becoming the bottleneck—switching tabs, manually checking wikis, copying context between agents. All Together Now (ATN) is a Program Manager that owns the agent terminals and provides a unified control plane.

    The Architecture

    ATN runs as an Axum HTTP server that spawns N agents via portable-pty, streams their terminal output through SSE to a browser dashboard, and maintains a shared wiki for coordination state.

    All Together Now architecture: Browser Dashboard connects via SSE and REST to an Axum Server with PTY Manager and Wiki Store, which manages Agents and Wiki Files

    Four Phases in One Session

    Phase What Tests
    0+1 PTY session management—spawn, read/write, Ctrl-C, transcripts 5 integration tests
    2 Minimal web UI—SSE streaming, xterm.js terminal widget Working end-to-end
    3 Multi-agent dashboard—N agents, per-agent state machine, responsive grid 3-agent demo (alice, bob, carol)
    4 Wiki integration—REST CRUD, ETag-based CAS, seeded coordination pages 8 unit tests

    29 tests pass across the workspace. Zero clippy warnings.

    The Six Crates

    Crate Lines Role
    atn-core ~300 Domain types: AgentConfig, AgentState, events, routing
    atn-pty ~500 PTY sessions, serialized writer queues, state tracker
    atn-server ~270 Axum HTTP/SSE server, static UI
    atn-ui ~200 Yew WASM components (dashboard, wiki browser)
    atn-wiki ~300 File-backed wiki with CAS from wiki-rs
    atn-trail ~200 Agentrail integration for workflow tracking

    Why PTY Ownership Matters

    The key insight: if the Program Manager owns the pseudo-terminals, it can:

    1. Stream output to a web dashboard without agents knowing
    2. Inject commands into agent sessions (serialized, no interleaving)
    3. Detect state by parsing output (prompt markers, idle timeouts, question detection)
    4. Log transcripts for debugging and replay

    The serialized writer queue per agent prevents input corruption when multiple sources (human, coordinator, macros) write to the same terminal.
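
    The queue can be as small as a channel plus one owning thread. A sketch of the pattern in Rust (a String stands in for the real PTY writer; names are illustrative, not ATN's actual code):

    ```rust
    use std::sync::mpsc;
    use std::thread;

    /// One thread owns the write end of the terminal; every input source
    /// (human, coordinator, macro) sends *complete* commands through a channel,
    /// so bytes from two sources can never interleave mid-command.
    fn serialized_writer() -> (mpsc::Sender<String>, thread::JoinHandle<String>) {
        let (tx, rx) = mpsc::channel::<String>();
        let writer = thread::spawn(move || {
            let mut pty = String::new(); // stand-in for the real PTY writer
            for cmd in rx {
                pty.push_str(&cmd); // each command lands atomically
                pty.push('\n');
            }
            pty
        });
        (tx, writer)
    }

    fn main() {
        let (tx, writer) = serialized_writer();
        for source in ["human: ls", "coordinator: /status", "macro: :wq"] {
            tx.send(source.to_string()).unwrap();
        }
        drop(tx); // close the channel so the writer thread finishes
        let transcript = writer.join().unwrap();
        assert_eq!(transcript.lines().count(), 3);
    }
    ```

    Cloning the sender gives each source its own handle, while the channel guarantees whole-message ordering on the single consumer side.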

    Wiki as Coordination Layer

    ATN seeds five coordination pages on startup:

    Page Purpose
    Coordination/Goals Team objectives
    Coordination/Agents Who is doing what (auto-updated)
    Coordination/Requests Inter-agent feature/bug requests
    Coordination/Blockers Dependency tracking
    Coordination/Log Append-only event timeline

    The wiki uses ETag-based Compare-and-Swap from the wiki-rs project, so concurrent agent writes get conflict detection instead of silent data loss.

    What’s Next

    ATN Phase 5: Message routing—agents write JSON to an outbox, the PGM routes push events to the correct target agent or escalates to human review.

    Emacs packages: Phase 2 additions—transient menus for discoverability, completion annotations showing project paths and languages.

    Emacs as ATN frontend: The pjmai-rs and reg-rs Emacs packages prove the pattern—call a Rust binary from Elisp, parse structured output, manage buffers. The same approach will give Emacs users a native ATN interface: agent status, wiki edits, and command injection without leaving the editor.

    Tying it together: ATN + agentrail-rs integration, where each agent’s workflow progress is visible in the dashboard (and eventually in Emacs) and skills/experiences flow between sessions.


    Better tools, better integration. Follow for more Sharpen the Saw updates.

    Part 4 of the Sharpen the Saw Sundays series. View all parts | Next: Part 5 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1479 words · 8 min read · Abstract

    TBT (8/?): wiki-rs --- Six Wikis, One Engine, Thirty Years of History

    I set out to demonstrate the history of wiki implementations—six storage architectures spanning thirty years, from Ward Cunningham’s flat files to git-backed commits. I ended up with a new approach to coordinating AI agents.

    wiki-rs started as a throwback project: vibe-code the evolution of wiki storage to have my own private wikis. But when I needed multiple AI agents to share state during a complex multi-repo refactoring, the server-based wikis turned out to be the answer—with a Compare-and-Swap API for safe concurrent edits.

    Resource Link
    Live Demo wiki-rs (3 client-side wikis)
    Source GitHub
    Video wiki-rs: Six Wikis, One Engine
    Agent Wiki Sample synced page
    Comments Discord

    The Throwback

    I was a reader of Ward Cunningham’s original WikiWikiWeb in the late 1990s. The concept was radical at the time: a website where any visitor could edit any page, with no approval process, no gatekeeping. Pages linked to each other with CamelCase words. If a page didn’t exist, the link showed up differently—click it, and you created it. The entire system ran on flat files.

    In the early 2000s at Sun Microsystems, I started installing wikis for my teams. The first was TiKi, a Ruby-based wiki—CGI scripts, flat-file storage, pre-Rails era. It was fragile but functional. Later I moved to VQWiki, a Java servlet-based wiki that could deploy as a WAR file and supported both file and database storage. VQWiki was reliable enough for engineering teams to depend on.

    Along the way I used TiddlyWiki for personal projects—an entire wiki in a single HTML file, no server required. And these days I use GitHub Wikis for public projects, which are just git-backed markdown repositories.

    Each of these represents a different answer to the same question: where do the pages live?

    The Storage Question

    Every wiki engine has to answer this:

    Era Engine Storage Trade-off
    1995 WikiWikiWeb Flat files Simple, no dependencies, no versioning
    ~2002 TiKi (Ruby) CGI + flat files Easy deployment, fragile under load
    ~2002 VQWiki (Java) Servlet + file/DB hybrid Reliable, but heavyweight
    2004 TiddlyWiki Single HTML file Zero server, but limited scalability
    Modern GitHub Wiki Git repository Full versioning, but requires git

    The storage architecture determines everything about a wiki: how it scales, how it versions, how it deploys, whether it needs a server, and how portable the data is.

    wiki-rs: Six Approaches in Rust

    I wanted to build all of these approaches in one codebase to see how they compare. wiki-rs implements six wiki variants, all sharing the same UI and wiki engine, differing only in storage:

    Variant Storage Server Required? Demo
    Ephemeral In-memory HashMap No Live
    Browser Memory localStorage No Live
    Export/Import JSON file download/upload No Live
    Server File Axum + flat .md files Yes Local
    Server DB Axum + SQLite Yes Local
    Server Git Axum + git commits Yes Local

    The three client-side wikis run entirely in the browser via WebAssembly—no server, no installation. The three server wikis use Axum and require a local backend.

    Shared Engine, Pluggable Storage

    The architecture uses two storage traits:

    • WikiStorage (sync) — for WASM frontends where async isn’t available
    • AsyncWikiStorage (async) — for server backends

    Each wiki variant is a thin wrapper (~30 lines) that implements the appropriate trait and calls the shared render_wiki() entry point. The wiki engine—parsing, rendering, link resolution, editing—is identical across all six.
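
    The split shows up in just a few lines. A Rust sketch of the sync side (trait and method names are guesses at the shape, not wiki-rs's actual definitions; only render_wiki() is named in the post):

    ```rust
    use std::collections::HashMap;

    /// Synchronous storage trait, as used by the WASM frontends.
    /// (Hypothetical signatures; wiki-rs's actual trait may differ.)
    trait WikiStorage {
        fn load(&self, page: &str) -> Option<String>;
        fn save(&mut self, page: &str, body: &str);
    }

    /// The ephemeral variant: a thin in-memory wrapper.
    #[derive(Default)]
    struct EphemeralStorage {
        pages: HashMap<String, String>,
    }

    impl WikiStorage for EphemeralStorage {
        fn load(&self, page: &str) -> Option<String> {
            self.pages.get(page).cloned()
        }
        fn save(&mut self, page: &str, body: &str) {
            self.pages.insert(page.to_string(), body.to_string());
        }
    }

    /// The shared engine only ever sees the trait, never the backend.
    fn render_wiki(store: &dyn WikiStorage, page: &str) -> String {
        store.load(page).unwrap_or_else(|| format!("{page} does not exist yet"))
    }

    fn main() {
        let mut store = EphemeralStorage::default();
        store.save("HomePage", "Welcome to [[SandBox]]");
        println!("{}", render_wiki(&store, "HomePage"));
    }
    ```

    Swapping in the localStorage, file, SQLite, or git backend means writing another ~30-line impl of the same trait; the engine code is untouched.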

    The full codebase is 11 crates in a Cargo workspace, totaling ~2,600 lines of Rust.

    Wiki Engine Features

    The engine handles the essentials:

    • Wiki links: [[PageName]] and [[PageName|display text]]
    • Red links: nonexistent pages show as red; clicking creates the page
    • Markdown: headings, bold, italic, code blocks, lists (via pulldown-cmark)
    • Page aging: five visual tiers (Fresh, Recent, Stale, Old, Ancient) based on when a page was last edited—complete with yellowing, parchment gradients, and folded-corner effects
    • Sub-wiki theming: five color themes detected by page title prefix (e.g., Tech/Rust gets the Ocean theme)
    • XSS protection: raw HTML filtered out; wiki links inside backticks aren’t expanded
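
    The two link forms reduce to a single split on the pipe. A sketch of the parse (a hypothetical helper, not the engine's actual code):

    ```rust
    /// Parse a [[...]] wiki link into (target, display).
    /// `[[PageName]]` displays as itself; `[[PageName|text]]` displays as `text`.
    fn parse_wiki_link(link: &str) -> Option<(&str, &str)> {
        let inner = link.strip_prefix("[[")?.strip_suffix("]]")?;
        Some(match inner.split_once('|') {
            Some((target, display)) => (target, display),
            None => (inner, inner), // no pipe: target doubles as display text
        })
    }

    fn main() {
        assert_eq!(parse_wiki_link("[[HomePage]]"), Some(("HomePage", "HomePage")));
        assert_eq!(
            parse_wiki_link("[[Tech/Rust|the Rust notes]]"),
            Some(("Tech/Rust", "the Rust notes"))
        );
        assert_eq!(parse_wiki_link("not a link"), None);
    }
    ```

    The target half also feeds red-link detection (does the page exist?) and sub-wiki theming (does the title carry a prefix like Tech/?).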

    Import: VQWiki and TiddlyWiki

    Since I have old wiki content in both VQWiki and TiddlyWiki formats, the project includes markup converters for both:

    • VQWiki importer: converts VQWiki’s custom markup (!!! headings, '''bold''', [link|url]) to standard wiki markdown
    • TiddlyWiki importer: extracts tiddlers from TiddlyWiki HTML files and converts their markup

    Both converters have test suites validating the markup transformations.

    What I Learned

    Building six variants of the same wiki clarified the trade-offs:

    Ephemeral is great for demos and testing. No persistence means no state bugs, but close the tab and everything’s gone.

    Browser localStorage is surprisingly useful for personal wikis. No server, data persists across sessions, and the 5-10 MB limit is plenty for text. The limitation is portability—the data lives in one browser on one machine.

    Export/Import solves portability. Download the wiki as JSON, email it, upload it elsewhere. But there’s no real-time versioning.

    Server File is the closest to the original WikiWikiWeb. Flat .md files that you can read, grep, and back up with any tool. Simple and transparent, but no built-in versioning.

    Server SQLite adds transactions, queries, and atomic operations. The trade-off is opacity—your wiki is inside a database file, not human-readable files.

    Server Git is the most powerful. Every edit is a git commit with full history, diff, blame, and branch support. But it’s also the most complex and has the highest overhead per edit.

    From Throwback to AI Coordination

    A pattern I follow with these projects: think of a cool technology I used in the past, figure out how to recreate it in some demonstrable way, and think about how it could benefit from AI features—or how an AI agent could benefit from a modern tool based on the technology.

    While working on an ambitious multi-repo project with multiple AI agents, I needed to act as coordinator between agents to implement a major refactoring. Each agent worked in its own repo, but they had shared dependencies, sequencing constraints, and status updates that needed to flow between them. I was the bottleneck—manually relaying context from one agent session to another.

    I wondered if there was a way to delegate this coordination to an AI agent. And then I realized: my server-based wikis were already designed to share structured information. A wiki could serve as the shared state layer—goals, dependencies, requests, status, context—all on editable pages that any agent could read and update.

    The problem: multiple agents editing the same wiki pages simultaneously will corrupt each other’s work. So I added a Compare-and-Swap (CAS) API to the wiki server. Each edit includes the page’s current version hash. If the page changed since the agent last read it, the write is rejected and the agent must re-read, merge, and retry. This gives you serialized concurrent edits without locking—the same pattern databases use for optimistic concurrency.
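
    The CAS check itself fits in a handful of lines. A Rust sketch of the idea (an in-process store with a hash standing in for the ETag; the real wiki-rs API performs the same check over HTTP):

    ```rust
    use std::collections::hash_map::DefaultHasher;
    use std::collections::HashMap;
    use std::hash::{Hash, Hasher};

    /// Version hash for a page body — the moral equivalent of an ETag.
    fn etag(body: &str) -> u64 {
        let mut h = DefaultHasher::new();
        body.hash(&mut h);
        h.finish()
    }

    /// A page store with optimistic concurrency: every write must present
    /// the version hash it last read; a stale hash is rejected.
    #[derive(Default)]
    struct CasStore {
        pages: HashMap<String, String>,
    }

    impl CasStore {
        /// Read a page along with its current version hash.
        fn read(&self, page: &str) -> (String, u64) {
            let body = self.pages.get(page).cloned().unwrap_or_default();
            let tag = etag(&body);
            (body, tag)
        }
        /// Write succeeds only if `expected` matches the live hash;
        /// on conflict the caller must re-read, merge, and retry.
        fn write(&mut self, page: &str, expected: u64, body: &str) -> Result<u64, u64> {
            let (_, live) = self.read(page);
            if live != expected {
                return Err(live); // conflict: somebody edited since our read
            }
            self.pages.insert(page.to_string(), body.to_string());
            Ok(etag(body))
        }
    }

    fn main() {
        let mut store = CasStore::default();
        let (_, tag) = store.read("Coordination/Requests");
        let tag = store.write("Coordination/Requests", tag, "alice: bump tc24r").unwrap();
        // A writer still holding the pre-edit hash is rejected, not silently merged:
        assert!(store.write("Coordination/Requests", etag(""), "bob: clobber").is_err());
        // Re-read, merge, retry succeeds with the fresh hash:
        assert!(store.write("Coordination/Requests", tag, "merged requests").is_ok());
    }
    ```

    Over HTTP the same handshake maps onto ETag and If-Match headers, with a 412-style rejection signaling "re-read and retry."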

    Then I needed a way to monitor and document what the agents were doing. So I added a tool to export the CAS wiki as a snapshot to a GitHub Wiki. Now the coordination state is visible, versioned, and browsable on GitHub—a living record of how the agents collaborated.

    During early testing, one agent overwrote another agent’s request on a shared page—a classic lost-update problem. The affected agent eventually noticed (its request had vanished), but the damage was done. That’s exactly what CAS prevents at the API level. But it also showed that structural serialization isn’t enough—agents can still make semantic conflicts even when their writes don’t collide. So I asked the wiki-rs agent to add a feature to help serialize semantic changes too, ensuring agents merge intent rather than just bytes.

    This is where throwback meets frontier: a thirty-year-old concept (the wiki), rebuilt in Rust, extended with concurrency primitives, and put to work as infrastructure for multi-agent AI coordination.

    Quality

    The project was built with a TDD red/green/refactor process:

    • 50 integration tests across unit, API, and Playwright browser tests
    • Zero clippy warnings
    • 69/69 on the sw-checklist quality gates

    The wiki is thirty years old and still the simplest way to organize knowledge. What’s on yours?

    Part 8 of the Throwback Thursday series. View all parts | Next: Part 9 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1201 words · 7 min read · Abstract

    ML Frontier #04: Is Chain of Thought Real?

    Fourth ML Frontier episode. In 2022, Chain of Thought changed how we think about AI reasoning. By 2026, the question has shifted from “how to make CoT better” to “is it real reasoning at all?”

    Resource Link
    Papers 10 papers (2024–2026)
    Video ML Frontier 4: Is CoT Real?
    Comments Discord

    What Chain of Thought Promised

    Wei et al. (2022) showed that prompting language models to “think step by step” dramatically improved performance on math, logic, and multi-step reasoning tasks. The idea was simple: instead of jumping straight to an answer, generate intermediate reasoning steps. The model explains its work, and the answer improves.

    This became the foundation for everything from coding assistants to scientific reasoning pipelines. But a deeper question was always lurking: are those reasoning steps real?

    The Faithfulness Problem

    Recent research shows models can produce plausible-looking reasoning steps that don’t reflect their actual internal computation. The chain of thought looks like reasoning—it has logical connectives, intermediate conclusions, references to the problem—but the model may have arrived at the answer through entirely different internal pathways.

    This is the faithfulness gap. A model’s visible reasoning trace can be:

    • Post-hoc rationalization — constructing a justification after already deciding the answer
    • Biased by surface features — following patterns in the prompt rather than the problem structure
    • Unfaithful to internal state — the actual computation happening in the model’s hidden layers doesn’t match the text it generates

    “Reasoning Models Don’t Always Say What They Think” (arXiv 2505.05410) shows that even models specifically trained to reason via CoT produce traces that are often unfaithful to their actual decision process. A March 2026 follow-up, “Reasoning Models Struggle to Control their Chains of Thought” (arXiv 2603.05706), goes further: models can’t reliably steer or suppress their reasoning traces even when instructed to. And “Diagnosing Pathological Chain-of-Thought in Reasoning Models” (arXiv 2602.13904) catalogs specific failure modes where CoT reasoning becomes actively pathological.

    The implication is uncomfortable: when a model shows you its “thinking,” you can’t assume it’s showing you how it actually thinks.

    CoT Is Task-Dependent

    The research also reveals that Chain of Thought isn’t universally beneficial. It helps with:

    • Math and arithmetic — multi-step calculations benefit from intermediate results
    • Multi-hop logic — problems requiring sequential deductions
    • Complex planning — tasks with dependencies between steps

    But CoT can actually hurt performance on:

    • Pattern recognition — tasks where the answer is immediate/intuitive
    • Simple classification — forcing step-by-step reasoning adds noise
    • Tasks with misleading structure — when the “obvious” reasoning path leads away from the correct answer

    A 2024 meta-analysis, “To CoT or not to CoT?” (arXiv 2409.12183), confirms this systematically: CoT provides negligible or negative benefit on many task types including commonsense reasoning and factual retrieval. The blanket advice of “always use Chain of Thought” is wrong. The right approach depends on the task.

    Adaptive Reasoning: Knowing When to Think

    The field is moving toward conditional reasoning—models that decide when to think step by step and when to skip it. Wang and Zhou (2024) showed that chain-of-thought reasoning can emerge from models without explicit prompting, suggesting the capability is latent rather than purely prompt-dependent.

    This points toward a future where models dynamically allocate reasoning effort:

    • Simple questions get immediate answers
    • Complex questions trigger step-by-step decomposition
    • Ambiguous questions get clarifying sub-questions

    “s1: Simple Test-Time Scaling” (arXiv 2501.19393) demonstrates this with budget-forcing—a simple mechanism to control how much reasoning a model performs at test time, truncating or extending thinking adaptively. “Outcome-Based RL Provably Leads Transformers to Reason” (arXiv 2601.15158) shows that RL training can teach transformers when reasoning is needed, not just how to reason. The model itself becomes the judge of how much thinking a problem requires.
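
    As a toy model of budget forcing: if the visible trace exceeds the budget, truncate it and force an answer; if it falls short, append a continuation cue (the s1 paper describes appending “Wait”). This is a sketch of the idea, not the s1 implementation:

    ```rust
    /// Toy budget-forcing: cap or extend a model's visible "thinking" trace.
    /// `thinking` holds the reasoning segments generated so far; `budget` is
    /// the allowed count. Names and token values here are illustrative.
    fn budget_force(thinking: &[&str], budget: usize) -> Vec<String> {
        let mut out: Vec<String> = thinking.iter().map(|t| t.to_string()).collect();
        if out.len() > budget {
            // Over budget: truncate the trace and force an answer.
            out.truncate(budget);
            out.push("<answer>".to_string());
        } else if out.len() < budget {
            // Under budget: append a continuation cue so the model keeps
            // reasoning instead of answering early.
            out.push("Wait".to_string());
        }
        out
    }

    fn main() {
        assert_eq!(budget_force(&["s1", "s2", "s3"], 1), vec!["s1", "<answer>"]);
        assert_eq!(budget_force(&["s1"], 3), vec!["s1", "Wait"]);
        println!("ok");
    }
    ```

    The point is that the control knob is purely test-time: no retraining, just editing what the model is allowed to emit before answering.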

    Latent Reasoning: Thinking Without Showing

    Some of the most interesting current work explores latent reasoning—models that reason internally without generating visible steps. Instead of producing a text trace, the model uses its internal representations to perform multi-step computation within the forward pass.

    This connects to research on:

    • Implicit chain of thought — reasoning encoded in hidden states rather than output tokens
    • Pause tokens — giving models extra computation steps without requiring text output
    • Internal scratchpads — dedicated hidden-state computation that doesn’t appear in the response

    COCONUT (arXiv 2412.06769) demonstrates this concretely: LLMs reason using continuous hidden states as “thoughts” instead of generating discrete tokens. Two 2026 papers push further: “Latent Chain-of-Thought as Planning” (arXiv 2601.21358) decouples reasoning from verbalization entirely, and “Latent Reasoning with Supervised Thinking States” (arXiv 2602.08332) trains models to reason through supervised internal states.

    If latent reasoning works at scale, it could offer the accuracy benefits of CoT without the token cost or the faithfulness problem—because there’s no visible trace to be unfaithful.

    CoT as Part of an Ecosystem

    Chain of Thought is no longer a standalone technique. It’s one component in a broader ecosystem:

    Component Role
    CoT Step-by-step decomposition
    Tool use Offload computation to external systems
    Reflection Self-critique and error correction
    Planning loops Multi-turn strategy with backtracking
    Reinforcement learning Reward signals for reasoning quality

    This makes CoT the bridge concept between language models as predictors (next token), as reasoners (multi-step logic), and as agents (goal-directed behavior). Understanding where CoT works and where it breaks is essential for building systems that combine all three.

    The Open Questions

    Question Status
    Is CoT faithful to internal computation? Evidence says often not
    When should models use CoT? Task-dependent; adaptive approaches emerging
    Can models reason without visible steps? Latent reasoning research is promising
    Does CoT scale with model size? Yes, but with diminishing returns on simple tasks
    Will CoT survive as a technique? Likely evolves into adaptive/latent forms

    Papers

    Date Paper Link
    Jan 2022 Chain-of-Thought Prompting Elicits Reasoning in Large Language Models arXiv 2201.11903
    Sep 2024 To CoT or not to CoT? CoT Helps Mainly on Math and Symbolic Reasoning arXiv 2409.12183
    Dec 2024 Training LLMs to Reason in a Continuous Latent Space (COCONUT) arXiv 2412.06769
    Jan 2025 s1: Simple Test-Time Scaling arXiv 2501.19393
    May 2025 Reasoning Models Don’t Always Say What They Think arXiv 2505.05410
    Jan 2026 Outcome-Based RL Provably Leads Transformers to Reason arXiv 2601.15158
    Jan 2026 Latent Chain-of-Thought as Planning arXiv 2601.21358
    Feb 2026 Latent Reasoning with Supervised Thinking States arXiv 2602.08332
    Feb 2026 Diagnosing Pathological Chain-of-Thought in Reasoning Models arXiv 2602.13904
    Mar 2026 Reasoning Models Struggle to Control their Chains of Thought arXiv 2603.05706

    Is the reasoning real, or just a good story? Follow for more ML Frontier episodes exploring research at the edge.

    Part 4 of the Machine Learning Frontier series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • Abstract (1357 words, 7 min read)

    Saw (3/?): agentrail-rs — From Walking Skeleton to Dual Memory

    Third Sharpen the Saw update. Last time I mentioned agentrail-rs was evolving from avoid-compaction, a saga-based context checkpoint tool. This weekend it went from walking skeleton to a working ICRL pipeline with dual memory, distillation, domain executors, and a hybrid orchestrator loop.

    I also vibe-coded a C compiler, several Wiki implementations for a future TBT post, and kept sharpening the saw with incremental development on agentrail-rs, applying ideas from papers on ICL, ICRL, and XSkill. The plan for this week: develop the domain-specific Layer 2 parts of agentrail-rs, using three new projects as test cases—running the new C compiler inside a browser, developing a Macro Lisp in C, and running that Macro Lisp inside a browser. This requires agentrail to carry development skills for C, Rust, Lisp, and Web UI.

    The Problem agentrail Solves

    AI coding agents lose operational knowledge between sessions. An agent might figure out the right sequence of commands for a complex task—TTS generation, video compositing, file manipulation—then lose that knowledge when the session ends. Next time, it starts from scratch. Sometimes it succeeds. Sometimes it improvises and fails.

    agentrail-rs gives agents structured handoffs, deterministic step execution, and in-context reinforcement learning so they succeed on first attempts instead of guessing.

    What’s Working: Beyond the Walking Skeleton

    Phase 0 through Phase 5 (partial) are implemented. The CLI has 8 commands that manage the full saga lifecycle:

    agentrail init my-project       # Create a new saga
    agentrail plan                  # Show the step sequence
    agentrail next                  # Get instructions for the next step
    agentrail begin step-name       # Mark a step as in-progress
    agentrail complete step-name    # Mark a step as done
    agentrail status                # Show saga state
    agentrail history               # Show completed steps
    agentrail abort                 # Cancel the saga

    Everything persists to a .agentrail/ directory: TOML configs, JSON trajectories, JSONL session snapshots. The 24 integration tests all pass, and pre-commit quality gates enforce formatting, lints, and test coverage.

    Two-Layer Architecture

    The architecture separates the generic engine from domain-specific knowledge:

    Layer 1 (this repo) — task-agnostic inference-time learning:

    • Workflow state machine (sagas with typed steps)
    • Dual memory following the XSkill pattern: skills (strategic workflow documents) and experiences (tactical per-run records)
    • ICRL injection: retrieve successful experiences and inject them into agent prompts
    • Distillation: analyze experience batches, generate and update skill documents

    Layer 2 (separate repos, future) — domain-specific knowledge:

    • Per-domain repos (e.g., agentrail-domain-media, agentrail-domain-rust)
    • Skill documents, curated experience libraries, executor implementations, validators
    • Optional knowledge graphs for reward signals

    The separation means the engine never changes when you add a new domain. You just create a new Layer 2 repo with the right skill files and executors.
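
    That separation can be pictured as a registry keyed by domain name. The trait and type names below are hypothetical stand-ins, not agentrail-rs's actual API:

    ```rust
    use std::collections::HashMap;

    /// Hypothetical executor trait: Layer 2 crates implement this.
    trait Executor {
        fn run(&self, step: &str) -> Result<String, String>;
    }

    /// A toy media-domain executor, standing in for a Layer 2 repo.
    struct MediaExecutor;
    impl Executor for MediaExecutor {
        fn run(&self, step: &str) -> Result<String, String> {
            Ok(format!("media domain handled `{step}`"))
        }
    }

    /// Layer 1 engine: knows nothing about domains beyond the registry.
    struct Registry {
        executors: HashMap<String, Box<dyn Executor>>,
    }

    impl Registry {
        fn new() -> Self {
            Registry { executors: HashMap::new() }
        }
        // Registering a domain is the only integration point; the
        // engine's code is untouched when a new domain is added.
        fn register(&mut self, domain: &str, exec: Box<dyn Executor>) {
            self.executors.insert(domain.to_string(), exec);
        }
        fn execute(&self, domain: &str, step: &str) -> Result<String, String> {
            self.executors
                .get(domain)
                .ok_or_else(|| format!("unknown domain: {domain}"))?
                .run(step)
        }
    }

    fn main() {
        let mut reg = Registry::new();
        reg.register("media", Box::new(MediaExecutor));
        assert!(reg.execute("media", "encode").is_ok());
        assert!(reg.execute("rust", "build").is_err());
        println!("ok");
    }
    ```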

    The XSkill Connection

    The dual-memory pattern comes directly from the XSkill paper (arXiv 2603.12056). Their ablation analysis shows that removing either skills or experiences hurts performance—you need both.

    In agentrail-rs:

    Memory Type What It Stores How It’s Used
    Skills Structured workflow documents for a class of tasks Injected into agentrail next to give the agent a strategic playbook
    Experiences Tactical records from past runs (what worked, what failed) Injected into agentrail next to show the agent what succeeded before

    When you run agentrail next, it retrieves relevant skills and past trajectories and includes them in the output. The agent sees both how to approach this kind of task (skill) and what actually worked last time (experience).
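
    A minimal sketch of what that combined injection could look like (hypothetical types and formatting; the real `agentrail next` output differs):

    ```rust
    /// Hypothetical simplified memory records.
    struct Skill { title: String, playbook: String }
    struct Experience { task: String, outcome: String }

    /// Assemble the agent-facing output for a step: strategic skill
    /// text first (how to approach this class of task), then tactical
    /// past results (what actually worked before).
    fn next_prompt(step: &str, skills: &[Skill], exps: &[Experience]) -> String {
        let mut out = format!("NEXT STEP: {step}\n");
        for s in skills {
            out.push_str(&format!("SKILL [{}]: {}\n", s.title, s.playbook));
        }
        for e in exps {
            out.push_str(&format!("EXPERIENCE [{}]: {}\n", e.task, e.outcome));
        }
        out
    }

    fn main() {
        let skills = vec![Skill {
            title: "tts".into(),
            playbook: "generate narration, then normalize audio".into(),
        }];
        let exps = vec![Experience {
            task: "tts run #12".into(),
            outcome: "succeeded on first attempt".into(),
        }];
        let p = next_prompt("generate narration", &skills, &exps);
        assert!(p.contains("SKILL [tts]"));
        assert!(p.contains("EXPERIENCE [tts run #12]"));
        println!("ok");
    }
    ```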

    Research Foundations

    The architecture maps to specific research:

    Research How It’s Applied
    ICRL (Decision Transformer, Reflexion, Voyager) Agents learn from trajectory examples in context, not weight updates
    XSkill (dual memory) Skills + experiences, both necessary
    Knowledge Graphs as Reward Models Graph edges as verifiable reward signals (Phase 4)
    Sleepy Coder experiment LoRA fine-tuning couldn’t beat baseline, validating inference-time approach

    The Sleepy Coder result was pivotal. I’d spent weeks trying to fine-tune a small model for my specific agent tasks. The fine-tuned model performed worse than the base model with good prompts. That’s what pushed me toward ICRL: don’t change the model’s weights, change what it sees in context.

    Implementation Progress

    Phase Description Status
    0 Walking skeleton (CLI, persistence, tests) Done
    1 ICRL core loop (task types, trajectory retrieval, experience recording) Done
    2 Dual memory (Skill/Experience types, injection, distill command) Done (2a, 2d)
    3 Domain repo support (registry, executors, validators) Done (partial)
    4 Knowledge graph validation (graph-based rewards) Planned
    5 Hybrid orchestrator (auto-advance deterministic steps, escalate semantic work) Done (partial)

    Phase 1 added task_type to step configs, trajectory retrieval in agentrail next, and experience recording with --reward/--actions flags on complete. Phase 2a introduced the Skill type with TOML storage and injection into next output. Phase 2d added the distill command that analyzes experience batches to generate skill documents. Phase 3 brought the domain registry, executor trait, and validator trait. Phase 5 implemented the hybrid orchestrator loop where deterministic steps auto-advance and semantic work gets escalated to the agent.
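
    The Phase 5 hybrid loop reduces to a dispatch on step kind. A sketch with hypothetical enums (not the real agentrail-rs types):

    ```rust
    /// Hypothetical step classification.
    enum StepKind { Deterministic, Semantic }

    /// What the orchestrator decided to do with a step.
    enum Action { AutoAdvanced(String), Escalated(String) }

    /// Hybrid orchestrator decision: deterministic steps run directly
    /// via an executor; semantic steps are handed back to the agent.
    fn orchestrate(name: &str, kind: StepKind) -> Action {
        match kind {
            StepKind::Deterministic => {
                Action::AutoAdvanced(format!("ran `{name}` via executor"))
            }
            StepKind::Semantic => {
                Action::Escalated(format!("agent must perform `{name}`"))
            }
        }
    }

    fn main() {
        assert!(matches!(
            orchestrate("compile", StepKind::Deterministic),
            Action::AutoAdvanced(_)
        ));
        assert!(matches!(
            orchestrate("write-docs", StepKind::Semantic),
            Action::Escalated(_)
        ));
        println!("ok");
    }
    ```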

    What remains: Phase 2b/2c (enriched Experience type, experience retrieval by embedding), Phase 3 completion (first real domain repo), and Phase 4 (knowledge graph rewards).

    Vibe Coding Projects

    agentrail-rs is being developed by building out these vibe-coding projects, each a real test case for domain-specific skills and experiences.

    Project Link
    cor24-rs sw-embed/cor24-rs — COR24 assembly emulator (Rust, WASM, embedded)
    tc24r sw-vibe-coding/tc24r — C compiler for COR24 (C, compiler design, browser)
    wiki-rs sw-vibe-coding/wiki-rs — Wiki implementations (Rust, web UI)

    What’s Next: Domain-Specific Layer 2

    The engine (Layer 1) is functional. The next challenge is building real Layer 2 domain repos and proving the architecture works on actual projects. This week I’m testing it against three new projects:

    1. A C compiler running in a browser — requires WebAssembly compilation skills
    2. A Macro Lisp implemented in C — requires C development and language implementation skills
    3. The Macro Lisp running inside a browser — combines all of the above with Web UI skills

    This is a deliberate stress test. Each project demands different domain expertise: C, Rust, Lisp, and Web UI. If agentrail-rs can carry skills and experiences across these domains and help the agent succeed on first attempts, the architecture works. If not, I’ll learn where it breaks.

    Crate Layout

    The project is a Cargo workspace (edition 2024) with clean separation:

    Crate Role
    agentrail-core Domain types: SagaConfig, StepConfig, Skill, Trajectory, HandoffPacket, JobSpec
    agentrail-store File-based persistence (.agentrail/), skill and trajectory storage
    agentrail-cli Binary with 8 commands + distill
    agentrail-exec Deterministic step executors with domain registry
    agentrail-validate Output validators with domain registry

    Papers

    Date Paper Link
    Feb 2025 OmniRL: In-Context RL Across Multiple Tasks arXiv 2502.02869
    Jan 2026 Knowledge Graphs are Implicit Reward Models arXiv 2601.15160
    Mar 2026 XSkill: Continual Learning from Experience and Skills arXiv 2603.12056
    Mar 2026 An Alternative Trajectory for Generative AI arXiv 2603.14147

    Better tools, better agents. Follow for more Sharpen the Saw updates.

    Part 3 of the Sharpen the Saw Sundays series. View all parts | Next: Part 4 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • Abstract (1943 words, 10 min read)

    COR24-RS: Learn Assembly in Your Browser

    Learning assembly language feels like climbing an impossible staircase. Each step reveals another layer of complexity—registers, memory addressing, calling conventions. But it doesn’t have to be intimidating. The right tools make the invisible visible.

    This post introduces cor24-rs, a browser-based emulator for the COR24 instruction set architecture. Write assembly, step through instructions, watch registers change. All in your browser, no installation required.

    Resource Link
    Live Demo COR24 Assembly Emulator
    Source GitHub
    Video Browser-Based Assembly: COR24 RISC Emulator in Rust
    MakerLisp makerlisp.com (COR24 creators)
    COR24 Soft CPU FPGA Implementation
    COR24 Dev Board Hardware Kit
    Comments Discord

    What is COR24?

    COR24 (C-Oriented RISC 24-bit) is a soft CPU architecture designed by MakerLisp. The design priorities were simplicity, speed, and a good impedance match to C compilers on low-density FPGAs—no legacy requirements, no committee compromises. The “C-Oriented” in the name is literal: architectural decisions were informed by what a practical C compiler needs from this class of processor.

    Origin Story

    MakerLisp developed COR24 as a replacement for the eZ80, which was the best option they could find for their class of small embedded problems. During the pandemic-era chip shortage, mass-market microcontrollers became unavailable, so they designed their own CPU for FPGAs. The result: a 24-bit RISC architecture that runs at 101 MHz on inexpensive Lattice FPGAs—a simple, fast, and rational alternative built from the ground up with no legacy baggage.

    The CPU is written in Verilog and released under the MIT license. It’s both a practical embedded solution for small computing problems and an excellent architecture for learning CPU fundamentals. You can build your own hardware implementation or use the browser emulator to explore the architecture.


    Architecture Overview

    COR24 keeps things simple. Three general-purpose registers, five special-purpose registers, one condition flag, and instructions that are 1, 2, or 4 bytes long.

    Registers

    Register Purpose
    r0 General purpose / return value
    r1 General purpose / return address
    r2 General purpose
    fp Frame pointer (special)
    sp Stack pointer (special)
    z Constant zero (compare instructions only)
    iv Interrupt vector (special)
    ir Interrupt return (special)

    Only r0, r1, and r2 are truly general-purpose. The named registers (fp, sp, z, iv, ir) have dedicated roles. The z register provides a constant zero accessible only in compare instructions (e.g., ceq r0,z; clu z,r0; cls r0,z)—it is not a general-purpose register and cannot be used in mov, ALU, or load/store instructions. The architecture uses a separate condition flag (C), set by compare instructions and tested by branch instructions.
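
    The compare-sets-flag / branch-tests-flag split can be modeled in a few lines. This is a sketch of the semantics only, assuming cls and clu are signed and unsigned less-than respectively (as their names suggest), not the emulator's implementation:

    ```rust
    /// Minimal model of COR24's single condition flag C:
    /// compares set it, branches test it.
    struct Flags { c: bool }

    fn ceq(f: &mut Flags, a: u32, b: u32) { f.c = a == b; } // equal
    fn cls(f: &mut Flags, a: i32, b: i32) { f.c = a < b; }  // signed less-than (assumed)
    fn clu(f: &mut Flags, a: u32, b: u32) { f.c = a < b; }  // unsigned less-than (assumed)

    fn brt(f: &Flags) -> bool { f.c }  // branch taken if C is set
    fn brf(f: &Flags) -> bool { !f.c } // branch taken if C is clear

    fn main() {
        let mut f = Flags { c: false };
        ceq(&mut f, 0, 0);   // e.g. `ceq r0,z`: is r0 zero?
        assert!(brt(&f));
        cls(&mut f, -1, 0);  // signed: -1 < 0 sets C
        assert!(brt(&f));
        clu(&mut f, 5, 2);   // unsigned: 5 < 2 clears C
        assert!(brf(&f));
        println!("ok");
    }
    ```

    Separating comparison from branching keeps both instruction classes short, which is how register-only operations fit in one byte.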

    Memory Model

    • 24-bit address space (16 MB addressable)
    • Byte-addressable with little-endian ordering
    • Memory-mapped I/O at 0xFF0000 - 0xFFFFFF
    • Stack grows downward (standard convention)
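
    The memory model above implies a 3-byte little-endian word fetch. A minimal sketch (illustrative, not the emulator's actual code):

    ```rust
    /// Read a 24-bit little-endian word from byte-addressable memory,
    /// per the COR24 memory model: low byte first, 3 bytes per word.
    fn load_word(mem: &[u8], addr: usize) -> u32 {
        (mem[addr] as u32)
            | ((mem[addr + 1] as u32) << 8)
            | ((mem[addr + 2] as u32) << 16)
    }

    fn main() {
        // Bytes 0x34, 0x12, 0x00 at address 0 encode the word 0x001234.
        let mem = [0x34u8, 0x12, 0x00, 0xFF];
        assert_eq!(load_word(&mem, 0), 0x001234);
        println!("ok");
    }
    ```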

    Instruction Categories

    Category Instructions
    Arithmetic add, sub, mul
    Logic and, or, xor
    Shifts shl, sra, srl
    Compare ceq, cls, clu
    Branch bra, brf, brt
    Jump jmp, jal
    Load la, lc, lcu, lb, lbu, lw
    Store sb, sw
    Stack push, pop
    Move mov, sxt, zxt

    Instructions are 1, 2, or 4 bytes (never 3). Register-only operations are compact (1 byte). Loading 8-bit constants (sign- or zero-extended) uses 2-byte instructions (lc, lcu). Loading full 24-bit values—whether addresses or integers that don’t fit in 8 bits—requires 4-byte instructions (la). Note: data words are 3 bytes (24-bit), but instruction encoding never uses 3 bytes.

    The Dev Board

    COR24 dev board detail showing S2, D2, Reset, and Power

    The COR24-TB dev board exposes the CPU’s I/O in a hands-on layout. The S2 button is a user switch—press it and the CPU sees an input event your assembly code can poll or respond to. D2 is a user LED wired to a memory-mapped output address, so your code can toggle it directly with a store instruction. The board breaks out UART connectors for serial communication, with hardware support for an internal interrupt when data arrives—meaning your program doesn’t have to busy-wait on the serial port. Beyond these, the board has six additional GPIO pins intended for a four-wire SPI interface and a two-wire I2C bus. MakerLisp is actively developing bit-bang I2C support (temperature sensor reading is next), followed by an I2C real-time calendar clock, a 4-position 7-segment display via SPI, and SD card access via SPI. A Reset button and Power LED round out the essentials.

    The emulator models the S2 button, D2 LED, and UART with interrupt support, so programs written for the browser run the same way on real hardware.


    The Browser Emulator

    cor24-rs brings COR24 to the web using Rust compiled to WebAssembly. No downloads, no setup—just open the page and start coding.

    Features

    • Three Tabs - Assembly, C, and Rust pipelines, all running on the same COR24 CPU
    • Interactive Assembly Editor - Syntax highlighting, error messages, line numbers
    • Step-by-Step Execution - Execute one instruction at a time with log-scale speed control
    • Register & Memory Viewer - Watch CPU state change in real-time with highlighted changes
    • Instruction Trace - Last 100 executed instructions visible in the web UI
    • 11 Assembler Examples - Pre-loaded programs including Blink LED, Fibonacci, Countdown, Variables, and Assert
    • 12 Rust Pipeline Demos - From simple add to UART echo with interrupts
    • 2 C Pipeline Examples - Fibonacci and Sieve of Eratosthenes via MakerLisp’s CC24 compiler
    • Coding Challenges - Test your assembly skills with suggested exercises
    • ISA Reference - Complete instruction documentation inline with CPU state, interrupts, and memory map
    • Interactive Tutorial - Comprehensive introduction covering registers, instructions, I/O, and idioms
    • Self-Test Mode - ?selftest URL parameter runs all 15 examples automatically with pass/fail reporting
    • Animated Tours - ?showme-asm, ?showme-c, ?showme-rust walk through each pipeline
    • Realistic UART Timing - TX busy for 10 cycles per character; dropped characters when writing without polling

    Example: Fibonacci

    Here’s a recursive Fibonacci implementation in COR24 assembly:

    _fib:
            push    fp              ; Save frame pointer
            push    r2              ; Save r2
            push    r1              ; Save return address
            mov     fp,sp           ; Set up frame
            add     sp,-3           ; Local variable space
            lw      r2,9(fp)        ; Load argument n
    
            lc      r0,2            ; Load constant 2
            cls     r2,r0           ; Compare n < 2
            brf     L17             ; Branch if false
    
            lc      r0,1            ; Return 1
            bra     L16             ; Jump to epilogue
    
    L17:
            mov     r0,r2           ; r0 = n
            add     r0,-1           ; r0 = n - 1
            push    r0              ; Push argument
            la      r0,_fib         ; Load fib address
            jal     r1,(r0)         ; Call fib(n-1)
            add     sp,3            ; Clean up argument
            sw      r0,-3(fp)       ; Save result
    
            mov     r0,r2           ; r0 = n
            add     r0,-2           ; r0 = n - 2
            push    r0              ; Push argument
            la      r0,_fib         ; Load fib address
            jal     r1,(r0)         ; Call fib(n-2)
            add     sp,3            ; Clean up argument
            lw      r1,-3(fp)       ; Load fib(n-1)
            add     r0,r1           ; r0 = fib(n-1) + fib(n-2)
    
    L16:
            mov     sp,fp           ; Restore stack
            pop     r1              ; Restore return address
            pop     r2              ; Restore r2
            pop     fp              ; Restore frame pointer
            jmp     (r1)            ; Return
    

    This demonstrates the full calling convention: prologue/epilogue, argument passing via stack, and recursive calls.


    Command Line Tools

    Beyond the browser emulator, cor24-rs includes CLI tools for local development:

    # Assemble and run in the debugger
    cor24-dbg program.s
    
    # Or assemble and run directly
    cor24-run program.s
    
    # With LED visualization
    cor24-run program.s --leds
    

    The CLI debugger (cor24-dbg) supports breakpoints, step execution, UART I/O, LED/button simulation, and instruction trace. The --uart-never-ready flag forces TX to never clear, useful for testing polling behavior.


    Rust to COR24 Pipeline (Experimental)

    The project includes experimental support for compiling Rust to COR24:

    Rust (.rs) → WASM (.wasm) → COR24 Assembly (.s) → Binary
                ↑               ↑
             rustc          wasm2cor24
            (standard)       (this project)
    

    Write embedded Rust with #![no_std], compile to WebAssembly, then translate to COR24 assembly. The wasm2cor24 translator handles the stack-based IR conversion.

    This approach leverages Rust’s existing toolchain—no compiler modifications needed. The wasmparser crate handles WASM parsing, and COR24’s stack-oriented design maps reasonably well from WASM’s stack machine.
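
    As a toy illustration of the stack-IR conversion, here is a hypothetical translation of two constants and an add. The real wasm2cor24 output, register allocation, and instruction selection will differ:

    ```rust
    /// Toy WASM-like stack ops.
    enum Op { Const(i32), Add }

    /// Translate a stack-IR sequence into COR24-flavored assembly text.
    /// Purely illustrative: it models the "stack machine -> push/pop"
    /// mapping, keeping the evaluation stack on the CPU stack.
    fn translate(ops: &[Op]) -> Vec<String> {
        let mut asm = Vec::new();
        for op in ops {
            match op {
                Op::Const(n) => {
                    asm.push(format!("lc      r0,{n}")); // load 8-bit constant
                    asm.push("push    r0".to_string()); // push operand
                }
                Op::Add => {
                    asm.push("pop     r1".to_string()); // right operand
                    asm.push("pop     r0".to_string()); // left operand
                    asm.push("add     r0,r1".to_string());
                    asm.push("push    r0".to_string()); // push result
                }
            }
        }
        asm
    }

    fn main() {
        let asm = translate(&[Op::Const(2), Op::Const(3), Op::Add]);
        assert_eq!(asm.len(), 8);
        assert!(asm.last().unwrap().contains("push"));
        println!("ok");
    }
    ```

    A real translator would avoid many of these push/pop pairs by keeping hot stack slots in r0-r2, but the naive mapping shows why a stack-oriented ISA is a comfortable target for WASM.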


    Why Learn Assembly?

    Even if you never write production assembly, understanding it changes how you think about code:

    1. Performance intuition - Know what your high-level code compiles to
    2. Debugging - Read crash dumps and disassembly when things go wrong
    3. Security - Understand buffer overflows, ROP chains, exploitation
    4. Embedded systems - Some hardware requires low-level access
    5. Appreciation - Respect the layers beneath your abstractions

    COR24 is simple enough to fit in your head but realistic enough to represent real CPU design patterns. And unlike purely educational architectures, it’s also a practical platform for small embedded problems—the kind of work that used to require an eZ80 or similar microcontroller.


    Implementation Details

    The emulator core is written in Rust, compiled to WebAssembly via Trunk. Key components:

    Module Purpose
    cpu/state.rs CPU state management (registers, memory, flags)
    cpu/executor.rs Instruction execution engine with realistic UART timing
    cpu/decode_rom.rs Instruction decode ROM (extracted from hardware Verilog)
    assembler.rs Two-pass assembler with as24-compatible syntax enforcement
    challenge.rs Coding challenge definitions
    selftest.rs Automated test runner for all 15 examples
    app.rs Yew-based web application (3 tabs, animated tours)

    The decode ROM is particularly interesting—it’s extracted directly from the hardware Verilog implementation, ensuring the emulator matches the real CPU behavior exactly.


    Try It Yourself

    Live Demo: sw-embed.github.io/cor24-rs

    The demo includes:

    • Pre-loaded example programs
    • Interactive tutorials
    • Coding challenges with automated verification
    • Complete ISA reference

    Start with the “Hello World” example, then work through the challenges. By the time you complete them, you’ll understand registers, memory, stack operations, and function calls.


    Key Takeaways

    1. COR24 is a real CPU - Designed for FPGAs, runs at 101 MHz, MIT licensed, practical for embedded work
    2. cor24-rs makes it accessible - Browser-based, no installation required
    3. Assembly isn’t scary - With good tools, you can see every step
    4. Rust + WASM works - The entire emulator compiles to a web application
    5. Simple doesn’t mean toy - COR24’s design prioritizes C compiler compatibility and practical embedded I/O, not just teaching


    Assembly language is the ground truth. Everything else is abstraction.

    Part 2 of the Embedded series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • Abstract (1003 words, 6 min read)

    Bucket List (1/?): Things I've Always Wanted to Build

    Everyone has a list. Mine has been accumulating for decades—things I wanted to learn, build, or understand, but never had the time. A career in software engineering is consuming. You solve hard problems every day, but they’re someone else’s hard problems. Your own curiosities wait.

    Two things changed. I retired. And AI coding agents got good enough to make ambitious solo projects feasible.

    Resource Link
    Bucket List GitHub
    Comments Discord

    The Two Enablers

    Retirement gave me the time. No more deadlines, no more sprints, no more meetings about meetings. Just curiosity and a compiler.

    AI gave me reach. The AI writes the code. I give direction, goals, test criteria, architectural restrictions. I prioritize. I use AI to help me research what’s possible, what’s been done historically, how to approach things. I’m working at a higher level than coding—reprising the roles I held during my career: team lead, architect, product manager, engineering manager. I trust but verify the work of AI designers, planners, coders, testers, and technical writers.

    Projects that would have required a team, or a semester, or a level of domain expertise I didn’t have—now one person can attempt them. Not by typing faster, but by operating at the right level of abstraction.

    Together, retirement and AI unlocked the list.

    The List (Abridged)

    I’ve organized these loosely by category, but the real list is messier than this. Some items are decades old. Some I added last month. Some I’ve already done and blogged about. Some I’m actively working on. Many are still waiting.

    Embedded and Hardware

    Program microcontrollers in Rust (no_std). Not just blink-an-LED—real sensor networks, real communication protocols. I’ve been doing this with BMP280 pressure sensors and I2C multiplexers, building arrays of dozens of sensors for a patent proof-of-concept.

    Learn to program FPGAs. I’ve always been fascinated by hardware description languages—the idea that you’re not writing instructions, you’re describing circuits. This connects directly to another item on the list…

    Build a ternary computer. Base-3 computing. Three-valued logic instead of binary. This is a real project—I’m in planning mode, starting with emulation, with the goal of implementing it on an FPGA. Why? Because balanced ternary is mathematically elegant, and because “why not” is a valid engineering motivation when you’re retired.

    Program Arduinos, Raspberry Pis, ESP32-C3s, ESP32-C6s, and various other 8-bit, 16-bit, and 32-bit microcontrollers. I want to understand the full spectrum, not just the popular ones.

    Compilers and Languages

    Write my own C compiler. Sort of. Take an existing small C compiler and modify it for a custom ISA, adding features along the way. I never got to take a compiler class in college, and I’ve regretted it ever since. Every time I’ve used a compiler—which is every day of my career—I’ve been using a tool I didn’t fully understand. Time to fix that.

    Implement ToonTalk. A visual programming language I wrote about in the Throwback Thursday series. The original was built for teaching children to program through animated characters and spatial metaphors. I want to see if the concept can be modernized.

    Emulate Everything

    I’ve always been drawn to instruction set architectures. The idea that you can simulate an entire computer in software—every register, every memory access, every instruction decode cycle—is endlessly satisfying.

    The collection so far includes the IBM 1130 (a 1960s minicomputer I have personal history with), the MakerLisp COR24 (a modern 24-bit RISC for FPGAs), and plans for the RCA 1802, TI-1000, IBM 360/370/390, and RISC-V I32. Each one teaches something different about computer architecture.

    Machine Learning and AI

    Fine-tune an open-weights LLM. Not use one—train one. Understand the full pipeline: dataset preparation, tokenizer choices, LoRA adapters, evaluation. I’ve written about small models and neural net internals, but I want the hands-on experience of taking a base model and shaping its behavior.

    Creative and Media

    Program Blender 3D to create physics-based animations. Not art—engineering visualizations. Simulating how things move, collide, flow.

    Generate sound effects procedurally. No samples, no recordings—synthesize sounds from parameters. Explosions, rain, footsteps, all from math.

    Generate music. I’ve already built midi-cli-rs and music-pipe-rs, Unix-pipeline tools for algorithmic composition. There’s more to explore here.

    Practical Home Projects

    These are the ones my family actually cares about:

    • Cat tracker — know where the cats are without searching the house
    • Senior monitoring — help a family member with memory issues stay safe, without being intrusive
    • Spam call blocker — something smarter than a blocklist
    • Turkey deterrent — they invade the property regularly and are remarkably persistent
    • Wildfire detection — early warning for a fire-prone area
    • Automated exterior sprinkler system — part wildfire defense, part garden automation

    Each of these is a real project, not a hypothetical. Some are in progress. Some are in the planning stage. All of them combine embedded hardware, software, and problem-solving in ways that make them genuinely fun.

    Why Blog About It?

    Three reasons.

    Accountability to myself. Writing about what I’m doing forces me to think clearly about it. If I can’t explain it, I don’t understand it well enough.

    Sharing the approach. The combination of retirement + AI + decades of software experience creates an unusual vantage point. I’m not a student learning for the first time, and I’m not an expert in most of these domains. I’m an experienced engineer exploring unfamiliar territory with modern tools.

    The list itself. Maybe other people have similar lists. Maybe seeing someone actually working through theirs is encouraging.

    What’s Next

    Future Bucket List posts will cover a few items at a time, mixing categories. I’ll share what I’ve learned, what surprised me, what’s harder or easier than expected, and where AI helped or didn’t. No particular order. No schedule. Just whatever I’m working on.


    Everyone’s list is different. This is mine. What’s on yours?

    Part 1 of the Bucket List series. View all parts | Next: Part 2 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 2047 words11 min readAbstract

    TBT (7/?): reg-rs - Regression Testing from C++ to Java to Rust

    You ship a fix. Tests pass. Three weeks later a customer reports that a flag you didn’t touch now produces different output. Nothing in your test suite catches it because your tests check behavior, not output. What you needed was a regression test—a snapshot of what the command actually produced, compared against what it produces now.

    What is reg-rs?

    reg-rs (regress) is a CLI tool that captures the output of shell commands—stdout, stderr, and exit code—as “golden” baselines, then re-runs those commands and compares the results to detect regressions. Think of it as snapshot testing for any command-line tool.

    Quick Start with Aliases

    reg-rs ships with shell aliases that make the workflow fast. Source them once (or add to your .zshrc/.bashrc):

    source /path/to/reg-rs/bin/source-rg.sh
    

    Then the full create-run-check-update cycle looks like this:

    # 1. Create the thing you're testing
    echo 'echo "Hello, World!"' > greet.sh
    
    # 2. Create a regression test — captures the output as the baseline
    adrg greet 'bash greet.sh'
    
    # 3. Run the test — compares current output against the saved baseline
    rnrg greet
    #   PASS
    
    # 4. See the results
    lsrg greet
    #   PASS   greet   bash greet.sh
    
    # 5. Now change greet.sh (simulate a code change)
    echo 'echo "Hey there!"' > greet.sh
    
    # 6. Run the test again
    rnrg greet
    #   FAIL
    
    # 7. See what changed
    shrg greet -vv
    #   baseline: "Hello, World!"
    #   latest:   "Hey there!"
    #   diff:     - Hello, World!
    #             + Hey there!
    
    # 8. You decide this change is intentional — accept the new output
    uprg greet
    
    # 9. Run again — passes with the new baseline
    rnrg greet
    #   PASS
    

    The aliases: adrg (add), rnrg (run), lsrg (list), shrg (show), uprg (update/rebase), rmrg (remove), rsrg (reset), strg (status server), hlrg (help). Tab completion is included.

    Or use the full commands: reg-rs create, reg-rs run, reg-rs list, reg-rs show, reg-rs rebase, etc.

    Text-Based Test Format (.rgt)

    A recent major change: tests are now stored as plain text files instead of binary SQLite databases. Each test has up to three files:

    File Purpose Git-tracked?
    test.rgt TOML spec (command, timeout, metadata) Yes
    test.out Expected stdout baseline Yes
    test.err Expected stderr baseline (absent if empty) Yes
    test.tdb Runtime cache (latest results, diffs) No

    An .rgt file looks like:

    command = "git --version"
    timeout = 10
    exit_code = 0
    desc = "Version string format check"
    expects = "Prints version in semver format"
    

    The .out file is just the golden output, plain text:

    git version 2.44.0
    

    This makes tests git-friendly—baselines show up in diffs, code reviews, and blame. The .tdb cache is gitignored; it only stores runtime results for reporting. If you have existing .tdb tests, reg-rs migrate (or mgrg) converts them to the new format.

    Detecting a Regression

    Here’s a concrete example of reg-rs catching a version change:

    # Set up a baseline
    echo 'version 1.0.0' > version.txt
    adrg version_test 'cat version.txt'
    
    # Run it again — passes, output matches
    rnrg version_test
    # PASS
    
    # Simulate a change
    echo 'version 2.0.0' > version.txt
    
    # Run again — regression detected
    rnrg version_test
    # FAIL
    
    # See exactly what changed
    shrg version_test -vv
    # stdout differences:
    #   - version 1.0.0
    #   + version 2.0.0
    
    # Intentional change? Accept the new baseline
    uprg version_test
    

    Dogfooding: reg-rs Tests Itself

    reg-rs uses itself to regression-test its own CLI. The test directory contains golden baselines for every subcommand’s help output. After any code change, rnrg checks that no help text, flag names, or usage strings changed accidentally. The demo scripts that exercise this workflow run automatically as part of cargo test.

    Monitor: Web Dashboard

    reg-rs status -p test
    # or: strg test
    

    This launches an Axum web server on port 4740 with a live dashboard. The landing page shows summary counts (pass/fail/pending) across all projects, updating in real time via Server-Sent Events (SSE)—no polling, no page refresh. The SSE stream sends JSON payloads and the client updates the DOM directly, so you see pass counts climb and pending counts drop as each test completes.

    The detail view has collapsible sections for failures, passes, and pending tests. Failed tests show inline character-level diffs—expected baseline in green, actual output in yellow—so you can see exactly what changed and decide whether to investigate or rebase. A JSON API at /api/status is available for programmatic access.
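    The SSE wire format behind that live update is simple: each event arrives on a `data:` line carrying a JSON payload. A sketch of the client-side parsing (the payload fields here are illustrative, not the dashboard's actual schema):

```python
import json

def parse_sse(stream_text):
    """Extract JSON payloads from Server-Sent Events 'data:' lines."""
    events = []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

# Illustrative payloads; the real dashboard's JSON schema may differ.
stream = (
    'data: {"pass": 12, "fail": 1, "pending": 3}\n'
    '\n'
    'data: {"pass": 13, "fail": 1, "pending": 2}\n'
)
events = parse_sse(stream)
print(events[-1]["pending"])  # 2
```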

    Motivation

    I’ve used regression testing tools for over 25 years, starting with regress at Forte Software in Oakland around 2000. The idea is simple and powerful: capture what a command produces, then verify it hasn’t changed. When I started learning Rust in 2020, I created a private implementation called rtt1. I’ve now forked and open-sourced it as reg-rs under an MIT license, with AI features already implemented: --describe generates test commands from natural language, and analyze triages failures using Claude. The PRD and subject study document the full roadmap and real-world testing patterns.

    The Throwback

    In 2000, I was working at Forte Software in Oakland, California. Forte had a C/C++-based regression testing tool called regress. The concept was straightforward: run a command, save the output, run it again later, diff the results. Simple, but it caught real bugs that unit tests missed—the kind where the output format changed, or an error message got reworded, or a flag silently started behaving differently.

    Sun Microsystems acquired Forte, and since Sun was focused on Java, I wrote jregress over the next year—a clean-room implementation, not a port. It was partly a learning exercise, partly practical: the Java development teams and QA needed a regression tool that lived in their ecosystem, and writing it in Java meant I could maintain it myself and add features as QA requested them. Oracle acquired Sun in 2010, and as far as I know, jregress is still being maintained and used there today. There may have been an attempt to open-source it, but I haven’t found it online.

    Era Tool Language Context
    2000 regress C/C++ Forte Software, Oakland
    ~2001 jregress Java Sun Microsystems (clean-room rewrite)
    2010+ jregress Java Oracle (still maintained?)
    2020 rtt1 Rust Private learning project
    2026 reg-rs Rust Open-sourced, MIT license

    The concept hasn’t changed in 25 years. What’s changed is the tooling around it: Rust gives you single-binary distribution, text-based .rgt files make tests git-friendly, clap gives you a polished CLI with shell aliases and tab completion, and Axum gives you a live monitoring dashboard with SSE. The next evolution is AI—using language models to generate test cases, explain regressions, and maintain baselines as code evolves.

    Advanced Features

    The basics—add, run, list/show—cover simple cases. Real CLI tools present harder challenges: non-deterministic output, interactive prompts, binary files, and slow test suites. reg-rs has features for all of these.

    Taming Non-Deterministic Output

    CLI output often contains timestamps, temp paths, PIDs, and version strings that change between runs. reg-rs provides two mechanisms to handle this.

    --preprocess (-P): Pipe stdout/stderr through a shell command before diffing:

    # Mask temp directory paths (macOS resolves /tmp to /private/var/...)
    adrg my_test 'my_tool run' \
      -P "sed 's|/tmp/[^ ]*|<TMPDIR>|g; s|/private/var/[^ ]*|<TMPDIR>|g'"
    

    --diff-mode (-M): Built-in normalization for common formats:

    # JSON: sorts keys and pretty-prints before diffing
    adrg api_test 'curl -s localhost:8080/status' -M json
    
    # Lines-unordered: sorts lines before diffing
    adrg completions 'mytool complete commands' -M lines-unordered
    
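    What the json mode's normalization amounts to, sketched in Python (an illustration of the idea, not reg-rs's actual code): canonicalize both sides before diffing so key order no longer produces spurious failures.

```python
import json

def normalize_json(text):
    # Sort keys and pretty-print so semantically equal JSON compares equal.
    return json.dumps(json.loads(text), sort_keys=True, indent=2)

a = '{"status": "ok", "uptime": 42}'
b = '{"uptime": 42, "status": "ok"}'  # same data, different key order

print(normalize_json(a) == normalize_json(b))  # True
```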

    Command Timeouts

    Interactive CLIs that prompt for input will hang indefinitely in non-interactive shells. The --timeout flag (in seconds) makes them fail fast:

    adrg pjmai_help 'pjmai-rs --help' --timeout 10
    

    Self-Documenting Tests

    Tests can carry their own documentation, stored in the .rgt file:

    adrg pjmai_help 'pjmai-rs --help' --timeout 10 \
      --desc "Verifies help text is stable" \
      --expects "Standard clap help output" \
      --flaky-note "None - deterministic"
    

    These metadata fields appear in failure reports at -vv verbosity and are consumed by the analyze subcommand for AI-powered triage.

    Parallel Execution

    The --parallel flag runs all matching tests concurrently, one thread per test:

    rnrg pjmai --parallel
    

    Each test has its own independent files, so there are no concurrency conflicts.

    Testing Binary Output

    Not all CLI tools produce text. favicon generates PNG and ICO images—binary output where line diffs are meaningless. The subject study documents approaches including SHA-256 checksums, base64 encoding, and hybrid strategies for visual comparison.
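    One approach from the subject study, sketched here (an illustration, not a built-in reg-rs mode): reduce the binary artifact to a SHA-256 hex digest and diff the digest as text.

```python
import hashlib

def digest(data: bytes) -> str:
    # A text-diffable fingerprint of binary output.
    return hashlib.sha256(data).hexdigest()

baseline = digest(b"\x89PNG\r\n\x1a\n fake image bytes")
latest   = digest(b"\x89PNG\r\n\x1a\n fake image bytes")
print("PASS" if latest == baseline else "FAIL")  # PASS
```

    The trade-off: a digest tells you that the output changed, but not where; the study's hybrid strategies exist to recover that visibility.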

    AI Features

    Several AI features are implemented (not just planned). All require ANTHROPIC_API_KEY.

    AI-Powered Test Creation (--describe)

    Describe what you want to test in natural language instead of writing the shell command:

    reg-rs create -t status -D "show git status of current directory"
    # AI generates: git status
    

    Add --context (-C) to feed the AI existing help text for better command generation:

    reg-rs create -t pjmai_list \
      -D "test the list subcommand with no projects" \
      -C "pjmai-rs --help"
    

    AI Failure Analysis (analyze)

    When tests fail, the analyze subcommand sends the original output, latest output, and diff to Claude for triage:

    reg-rs analyze -p my_failing_test
    

    It classifies failures as true regressions, flaky tests, environmental changes, or stale baselines—helping you decide whether to investigate or rebase.

    Getting Started

    Clone and build:

    git clone https://github.com/sw-cli-tools/reg-rs.git
    cd reg-rs
    cargo build --release
    

    Set up aliases:

    source bin/source-rg.sh
    

    Create your first test:

    adrg hello 'echo hello world'
    rnrg hello
    lsrg hello
    

    References

    Resource Link
    reg-rs Repository github.com/sw-cli-tools/reg-rs
    User Guide docs/user-guide.md
    Subject Study Testing CLI tools with reg-rs
    PRD Product Requirements

    Historical Context

    Era Resource Notes
    2000 Forte Software / Sun regress was an internal C/C++ regression testing tool
    2000s jregress Clean-room Java implementation at Sun Microsystems
    2010 Oracle acquires Sun jregress continues in use internally
    2020 rtt1 Private Rust implementation, learning project
    2026 reg-rs Open-sourced fork under MIT license

    The best test is the one that catches the change nobody expected. Regression testing has been doing that for decades—now with better tools.

    Part 7 of the Throwback Thursday series. View all parts | Next: Part 8 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 707 words4 min readAbstract

    ML Frontier #03: Structure Beats Scale — Knowledge Graphs and Domain-Specific Superintelligence

    Third ML Frontier episode. What if scaling AI didn’t mean bigger models, but better structure? A line of research from Princeton proposes an alternative trajectory: Domain-Specific Superintelligence built on Knowledge Graphs.

    The Premise: Structure Over Scale

    The dominant AI trajectory is clear: make models bigger, train on more data, throw more compute at the problem. It works, but it’s expensive, opaque, and increasingly difficult to verify.

    Princeton’s JHA Lab proposes a fundamentally different path. Instead of one giant general model, build smaller expert models grounded in structured knowledge—specifically, Knowledge Graphs. The result: Domain-Specific Superintelligence (DSS).

    Knowledge Graphs as Training Engines

    A Knowledge Graph (KG) is a structured representation of facts and relationships—nodes connected by labeled edges. In traditional AI pipelines, KGs serve as memory or lookup tables. The key insight here is that a KG can serve a much deeper role.

    Step 1 — Supervised Fine-Tuning (SFT). Use the graph to generate reasoning tasks. Paths through the graph become structured training problems. The model learns to follow real domain relationships, not just pattern-match on surface text. This is grounded learning—every training example traces back to verified structure.

    Step 2 — Reinforcement Learning with KG Rewards. This is the breakthrough. Every reasoning path in the graph becomes a verifiable reward signal. Valid multi-hop paths are rewarded; invalid reasoning is penalized. The graph itself is the reward model.

    The Implicit Reward Model

    Traditional RL for language models requires a separate reward model—often a black box trained on human preferences. The KG approach eliminates that dependency.

    Because the graph encodes real relationships, the reward signal is transparent and verifiable. There’s no black-box scoring. You can trace exactly why a reasoning path was rewarded or penalized. This is what the authors call an implicit reward model: the structure of knowledge itself provides the training signal.
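    A toy illustration of that claim (not the papers' implementation): treat the KG as a set of labeled edges, and score a reasoning path by whether every hop is a real edge. The reward is fully traceable by construction.

```python
# Toy knowledge graph: (head, relation, tail) triples.
KG = {
    ("aspirin", "inhibits", "COX-1"),
    ("COX-1", "produces", "thromboxane"),
    ("thromboxane", "promotes", "clotting"),
}

def path_reward(path):
    """Reward 1.0 iff every hop in the reasoning path is a KG edge."""
    return 1.0 if all(hop in KG for hop in path) else 0.0

valid = [("aspirin", "inhibits", "COX-1"), ("COX-1", "produces", "thromboxane")]
invalid = [("aspirin", "produces", "clotting")]  # no such edge in the graph
print(path_reward(valid), path_reward(invalid))  # 1.0 0.0
```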

    Zero-Shot Scaling Through Composition

    Train on simple paths, generalize to complex multi-hop reasoning. This is compositional generalization—the model learns reasoning primitives from short KG paths, then composes them into longer chains at inference time without having seen those specific chains during training.

    The result is zero-shot scaling: stronger reasoning without a larger model. Structure replaces scale.

    The Full Stack

    The research describes a concrete pipeline:

    Step Component Role
    1 Build KG (GraphMERT) Reliable knowledge graph construction and distillation
    2 Generate tasks (SFT) KG paths become structured training examples
    3 Train with KG rewards (RL) Graph validates reasoning, provides reward signal

    Why This Matters

    Three practical implications:

    Verifiable outputs. Every reasoning step maps to a KG path. You can audit why the model produced a particular answer—something large general models can’t offer.

    Domain accuracy. Expert models grounded in domain-specific KGs should outperform general models on specialized tasks, with fewer parameters.

    Smaller compute footprint. If structure can substitute for scale, the cost curve of AI changes fundamentally. Not every problem needs a trillion-parameter model.

    A Different Trajectory

    This isn’t a minor optimization. It’s a different thesis about how AI should be built:

    Current Trajectory Alternative Trajectory
    Bigger models Better structure
    General-purpose Domain-specific
    Black-box rewards Graph-derived rewards
    Brute-force pretraining Compositional reasoning
    Scale compute Scale knowledge

    Whether this pans out at production scale remains to be seen. But the research direction is compelling: less brute force, more structure.

    Papers

    Date Paper Link
    Jul 2025 Bottom-up Domain-Specific Superintelligence arXiv 2507.13966
    Oct 2025 GraphMERT: Reliable Knowledge Graph Distillation arXiv 2510.09580
    Jan 2026 Knowledge Graphs are Implicit Reward Models arXiv 2601.15160
    Mar 2026 An Alternative Trajectory for Generative AI arXiv 2603.14147

    Structure over scale. Follow for more ML Frontier episodes exploring research at the edge.

    Part 3 of the Machine Learning Frontier series. View all parts | Next: Part 4 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1192 words6 min readAbstract

    Embedded (1/?): BMP280 Driver — From Prototype to Patent Proof-of-Concept

    A colleague had a patent application that needed empirical data: pressure measurements from multiple sensors at two physical locations, with enough redundancy to establish statistical confidence. Off-the-shelf solutions weren’t flexible enough, so we built the whole stack in Rust—from the sensor driver up through data collection, analysis, and plotting.

    Resource Link
    Source GitHub
    Docs Multi-sensor guide
    References Datasheets, patent, hardware
    Comments Discord

    The Problem

    We needed high-resolution barometric pressure data from dozens of sensors split across two physical locations. Each location needed multiple sensors—not just for coverage, but because averaging across several readings reduces noise and gives more trustworthy measurements. A single BMP280 reading has enough jitter that you want redundancy.
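    The redundancy argument is just the standard error of the mean: averaging N independent readings cuts noise by roughly √N. A quick numeric check with synthetic jitter (the noise figure is illustrative, not a BMP280 spec):

```python
import random
import statistics

random.seed(0)
TRUE_P = 101325.0   # Pa, standard atmosphere
NOISE = 12.0        # per-reading jitter in Pa (illustrative)

def reading():
    return random.gauss(TRUE_P, NOISE)

single_err = [abs(reading() - TRUE_P) for _ in range(1000)]
avg_err = [abs(statistics.mean(reading() for _ in range(16)) - TRUE_P)
           for _ in range(1000)]

# A 16-reading average should be about sqrt(16) = 4x less noisy.
ratio = statistics.mean(single_err) / statistics.mean(avg_err)
print(round(ratio, 1))
```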

    The existing BMP280 driver crate worked fine for one sensor at the default address. We needed it to support both I2C addresses so we could put two sensors on each bus—and then use multiplexers to scale that to many.


    Why Fork the Driver?

    The original bmp280-ehal crate is a platform-agnostic, no_std Rust driver built on embedded-hal traits. It runs on anything from bare-metal microcontrollers to Raspberry Pi. But it had a limitation: it assumed the default I2C address (0x76) and had no way to talk to a second sensor at the alternate address (0x77).

    I needed the driver to address both 0x76 and 0x77—two sensors per bus instead of one. That change alone doubled capacity, and multiplexers would scale it from there. So I forked the driver and made three changes:

    Commit 1: Refactor constructor (0b66fda)

    Simplified the API by removing the implicit address tracking. The old constructor tied the driver to a single address at creation time. The refactored version makes read_calibration() public and removes the stored address, laying the groundwork for multi-address support.

    Commit 2: Per-sensor calibration (d393bd6)

    The structural change that enables multi-sensor support. Instead of flat calibration fields on the driver struct, this commit introduces:

    • TempComp and PressureComp structs for temperature and pressure compensation data
    • Sensors container holding Option<TempComp> and Option<PressureComp> for both addresses (0x76 and 0x77)
    • Every reading method now takes an explicit address parameter

    Each BMP280 ships with unique factory calibration coefficients baked into its NVM. The driver reads 24 bytes of calibration data per sensor and stores it independently. This matters because the compensation polynomial uses 12 coefficients—get them wrong and your pressure reading is meaningless.
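    For a sense of what those coefficients do, here is the temperature branch of the compensation math, paraphrased from the floating-point formulas in the Bosch datasheet (three of the twelve coefficients; the pressure branch consumes the other nine plus the same t_fine term). The numbers below are the datasheet's own worked example:

```python
def compensate_temp(adc_T, dig_T1, dig_T2, dig_T3):
    """BMP280 temperature compensation (double-precision variant
    from the Bosch datasheet). Returns degrees Celsius."""
    var1 = (adc_T / 16384.0 - dig_T1 / 1024.0) * dig_T2
    var2 = ((adc_T / 131072.0 - dig_T1 / 8192.0) ** 2) * dig_T3
    t_fine = var1 + var2
    return t_fine / 5120.0

# Datasheet worked example: one sensor's factory coefficients
# and a raw ADC value give 25.08 degC.
t = compensate_temp(519888, dig_T1=27504, dig_T2=26435, dig_T3=-1000)
print(round(t, 2))  # 25.08
```

    Swap in another sensor's coefficients and the same raw ADC value yields a different temperature, which is exactly why calibration must be stored per sensor.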

    Commit 3: Documentation and multi-sensor guide (21f16dc)

    Updated the README and added comprehensive documentation in docs/multiple-sensor-support.md covering:

    • Two sensors on one bus (differential pressure)
    • I2C multiplexer topologies (TCA9548A, PCA9546A, TCA9543A)
    • Cascading mux trees for large sensor arrays
    • I2C extenders for long-distance distribution

    The Hardware Setup

    Doubling capacity with dual addresses

    The BMP280 supports two I2C addresses (0x76 and 0x77), selected by the SDO pin. The old driver only tracked one. The new driver stores calibration for both, doubling capacity in every topology:

    Topology Old driver New driver
    Single I2C bus 1 2
    8-channel mux 8 16
    8 muxes × 8 channels 64 128

    I2C multiplexer

    Multiplexer arrays

    For the patent prototype, we used TCA9548A 8-channel I2C multiplexers. Each mux channel is an electrically isolated I2C bus, so sensors on different channels can share the same address:

                                      ┌─── ch 0 ─── BMP280 (0x76) + BMP280 (0x77)
      Pi ── I2C ── TCA9548A ──────────┼─── ch 1 ─── BMP280 (0x76) + BMP280 (0x77)
                   (addr 0x70)        ├─── ch 2 ─── BMP280 (0x76) + BMP280 (0x77)
                                      └─── ...      (up to 16 sensors per mux)
    

    A Raspberry Pi at each location ran a Rust application on Linux that cycled through mux channels, read calibrated pressure/temperature data from every sensor, and logged the raw values. A separate PC-based Rust application pulled the log files for analysis, producing plots and spreadsheets.


    BMP581 sensor

    What We Learned: BMP280 → BMP581

    The BMP280 worked for the initial proof-of-concept, but we hit its limits. The sensor’s resolution wasn’t granular enough for the pressure differentials we needed to measure. The BMP581—Bosch’s newer generation—offers significantly better resolution and noise characteristics.

    We also changed the architecture. Instead of running I2C extenders to reach distant sensor locations from a single controller, we gave each location its own Raspberry Pi with its own mux and BMP581 sensor array. The Pi boards communicate over gigabit LAN, which is simpler, more reliable, and eliminates the signal integrity issues that come with long I2C runs.

    Location A:  Pi ── I2C ── TCA9548A ── BMP581s ──┐
                                                    ├── Gigabit LAN
    Location B:  Pi ── I2C ── TCA9548A ── BMP581s ──┘
                                                    │
                                            Analysis PC (Rust)
    

    Using the Driver

    Basic usage with a single sensor:

    use bmp280_ehal::BMP280;
    
    let mut bmp = BMP280::new(i2c)?;
    let temp = bmp.temp(0x76);     // Celsius
    let pres = bmp.pressure(0x76); // Pascals
    

    Two sensors on one bus:

    let mut bmp = BMP280::new(i2c)?;
    bmp.read_calibration(0x77);
    
    let delta_p = bmp.pressure(0x76) - bmp.pressure(0x77);
    

    With a multiplexer:

    // Select a TCA9548A channel: write a one-hot channel mask to the
    // mux's control register (mux at I2C address 0x70).
    fn select_channel<I2C, E>(i2c: &mut I2C, ch: u8) -> Result<(), E>
    where I2C: embedded_hal::blocking::i2c::Write<Error = E>
    {
        i2c.write(0x70, &[1 << ch])
    }
    

    See the multi-sensor documentation for shared-bus patterns and cascaded mux topologies.

    Fun Fact: Some of the Raspberry Pi boards, with their I2C multiplexers and BMP581 sensor arrays, were submerged in diving-bell-like enclosures, tethered by low-power and LAN cables.


    References

    Reference Link
    BMP280 datasheet Bosch Sensortec (PDF)
    BMP581 product page Bosch Sensortec
    MIKROE Pressure 21 Click mikroe.com
    I2C Multiplexer (TCA9548A) SparkFun I2C Mux
    US Patent 12,188,836 B1 Google Patents
    Multi-sensor docs GitHub

    Part 1 of the Embedded series. View all parts | Next: Part 2 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 584 words3 min readAbstract

    AI Tools (1/?): XSkill — A Memory Layer for Multimodal Agents

    Most AI agents can use tools. Far fewer can remember how to use them better next time. XSkill addresses that gap with a structured memory layer that accumulates know-how from past runs—without retraining the model.

    Resource Link
    Paper arXiv 2603.12056
    Project XSkill Project Page
    Code GitHub (MIT)
    Video XSkill: Memory Layer
    Comments Discord

    The Problem: Isolated Episodes

    Multimodal agents solve complex visual and tool-heavy tasks, but each run starts from scratch. An agent might figure out a multi-step workflow for extracting color data from an image—only to lose that knowledge when the next task begins. Useful lessons evaporate between sessions.

    Two Kinds of Memory

    XSkill introduces a dual-memory architecture that separates strategic knowledge from tactical knowledge:

    Skills are structured Markdown documents containing workflows and tool templates for a class of tasks. A skill says: here is the overall approach for this kind of problem.

    Experiences are smaller tactical lessons with triggering conditions, recommended actions, and failure notes. An experience says: when this specific pattern appears, use this tactic instead of guessing.

    That split matters. Ablation analysis shows that removing either type hurts performance—skills alone aren’t enough, and experiences alone aren’t enough.

    Two-Phase Framework

    The framework operates in a loop:

    Phase 1 — Accumulation. After completing rollout batches, the agent reviews past trajectories through visually-grounded summarization, cross-rollout critique, and hierarchical consolidation. This produces skill documents and experience items stored in persistent banks.

    Phase 2 — Inference. Given a new task, the agent decomposes it, retrieves relevant skills and experiences via semantic search, adapts them to the current visual context, and injects them into the system prompt.

    The key claim: agents improve through memory accumulation and retrieval, not parameter updates. No fine-tuning required.
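    The retrieve-and-inject half of that loop can be sketched with a trivial lexical scorer standing in for the paper's semantic search (all names and documents here are illustrative):

```python
import re

# Toy memory bank: a "skill" is a short workflow document.
SKILLS = {
    "color-extraction": "Locate the text, crop the region, sample pixels via code.",
    "web-search": "Decompose the query, search, cross-check two sources.",
}

def tokens(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def retrieve(task, bank, k=1):
    """Rank memory entries by word overlap with the task - a stand-in
    for embedding-based semantic search."""
    score = lambda doc: len(tokens(task) & tokens(doc))
    return sorted(bank, key=lambda name: -score(bank[name]))[:k]

def build_prompt(task):
    """Inject the best-matching skill into the system prompt."""
    name = retrieve(task, SKILLS)[0]
    return f"Relevant skill [{name}]: {SKILLS[name]}\n\nTask: {task}"

task = "sample pixels to find the region color behind the text"
prompt = build_prompt(task)
print(prompt)
```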

    Results

    Evaluated across five benchmarks spanning visual tool use (VisualToolBench, TIR-Bench), multimodal search (MMSearch-Plus, MMBrowseComp), and comprehensive agent tasks (AgentVista):

    Backbone Avg@4 Pass@4
    Gemini-3-Flash + XSkill 40.34 58.95
    Gemini-2.5-Pro + XSkill 28.63 46.38
    o4-mini + XSkill 23.72 39.07
    GPT-4o-mini + XSkill 23.19 38.90

    Average gains of 2.6 to 6.7 points over baselines (Agent Workflow Memory, Dynamic CheatSheet, Agent-KB). Performance consistently improves as rollout count increases from 1 to 4.

    Practical impact: syntax errors drop from 20.3% to 11.4% with skills, and tool name errors fall from 2.85% to 0.32%.

    Concrete Example

    A visual task requires identifying the color of a region behind specific text in an image. Without memory, the agent guesses. With XSkill, it retrieves a structured workflow: locate the text, isolate the region, sample pixels via code interpreter, and infer the color from actual data. Code interpreter usage increases from 66.6% to 77.0% on VisualToolBench—the agent learns to measure instead of guess.

    Why This Matters

    XSkill sits at the intersection of agents, tools, multimodal reasoning, and continual improvement. The practical takeaway isn’t just that memory helps—it’s that different kinds of memory help in different ways. Strategic workflows and situational tactics serve complementary roles.

    Not a bigger model. A smarter memory layer.


    References

    Reference Link
    XSkill paper arXiv 2603.12056
    Project page xskill-agent.github.io
    GitHub repo (MIT) XSkill-Agent/XSkill

    Part 1 of the AI Tools series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 823 words5 min readAbstract

    ML Frontier #02: In-Context Reinforcement Learning

    Second ML Frontier episode. This one covers In-Context Reinforcement Learning—how transformers learn decision policies from trajectory examples in the prompt, without weight updates.

    Resource Link
    Papers 5 papers covered
    Video ML Frontier 2: ICRL
    Related Saw (2/?): agentrail-rs — practical ICRL application
    Comments Discord

    What is In-Context Reinforcement Learning?

    The model observes sequences of states, actions, and rewards stored in the prompt. Instead of updating weights through gradient descent, the transformer approximates a policy from those trajectory examples.

    Think of it like learning to cook by reading a journal of recipes that worked—and ones that didn’t—with notes on what went wrong.

    Why Does This Matter?

    AI agents often lose procedural knowledge when context is truncated between sessions. They know the goal but forget which API to call, which flags to use, which client library to reference, or how to validate output.

    The traditional approach—writing instructions in markdown files—isn’t reliable. Agents ignore rules even when they’re present. ICRL offers a different path: instead of telling the agent what to do, show it what worked and what didn’t, with reward signals attached.

    By embedding successful execution traces in the prompt, agents can reuse proven approaches instead of improvising from scratch.
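    Mechanically, "show it what worked" means serializing scored trajectories into the prompt. A sketch (the task, format, and reward scheme are illustrative):

```python
# One trajectory = the (state, action, reward) tuples from a past run.
trajectories = [
    {"task": "convert audio", "steps": [
        ("wav file present", "run `ffmpeg -i in.wav out.mp3`", +1),
    ]},
    {"task": "convert audio", "steps": [
        ("wav file present", "guess a codec flag", -1),
    ]},
]

def render_examples(trajs):
    """Serialize scored trajectories into a prompt block the agent
    can condition on - its in-context 'training data'."""
    lines = []
    for t in trajs:
        for state, action, reward in t["steps"]:
            tag = "WORKED" if reward > 0 else "FAILED"
            lines.append(f"[{tag}] state: {state} -> action: {action}")
    return "\n".join(lines)

examples = render_examples(trajectories)
print(examples)
```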

    Research Evidence

    Decision Transformer (Chen et al., 2021)

    Paper: arXiv 2106.01345

    In brief: The paper that started it all. Instead of training an RL agent with value functions and policy gradients, just frame the problem as sequence prediction. Feed the transformer trajectories of (return-to-go, state, action) and let it predict the next action conditioned on the desired return. The transformer learns a policy by modeling sequences—no Bellman equations needed.

    Why it matters: Reframed RL as something transformers already do well: sequence modeling.
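    The return-to-go conditioning is the only unusual ingredient, and it is just a reversed cumulative sum of rewards. A minimal sketch of building the (return-to-go, state, action) stream from a toy trajectory (states and actions are placeholders):

```python
def returns_to_go(rewards):
    """Suffix sums: rtg[t] = sum of rewards from step t onward."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return rtg[::-1]

# Toy trajectory.
states  = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]
rewards = [1.0, 0.0, 2.0]

rtg = returns_to_go(rewards)
sequence = [tok for t in range(len(states))
            for tok in (("rtg", rtg[t]), ("state", states[t]), ("action", actions[t]))]
print(rtg)  # [3.0, 2.0, 2.0]
```

    At inference time, the desired return is set high and the model predicts the actions a high-return trajectory would contain.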

    Transformers Learn TD Methods (Wang et al., ICLR 2025)

    Paper: arXiv 2405.13861

    In brief: This paper shows that transformers don’t just pattern-match on trajectories—they actually approximate temporal-difference (TD) learning algorithms during the forward pass. The model internally implements something resembling TD updates, without being explicitly trained to do so.

    Why it matters: Transformers aren’t just memorizing trajectories. They’re learning the underlying RL algorithm in-context.

    OmniRL (2025)

    Paper: arXiv 2502.02869

    In brief: OmniRL proposes a transformer architecture that emulates actor-critic RL in-context, improving decision quality across multiple tasks. Rather than specializing in one environment, the model generalizes its in-context RL capabilities across diverse settings.

    Why it matters: ICRL isn’t limited to one task—it scales across environments.

    Reflexion (Shinn et al., NeurIPS 2023)

    Paper: arXiv 2303.11366

    In brief: Reflexion takes a different angle: instead of embedding raw trajectories, the agent generates verbal reflections on its failures and successes. These self-critiques are stored and injected into future prompts. The agent learns from its own written analysis of what went wrong.

    Why it matters: Shows that trajectory-based learning doesn’t require structured (state, action, reward) tuples—natural language reflections work too.

    Voyager (Wang et al., 2023)

    Paper: arXiv 2305.16291

    In brief: An open-ended Minecraft agent that builds a skill library from successful code executions. When Voyager solves a task, it stores the working code as a reusable skill. Future tasks can retrieve and compose these skills. The agent explores, learns, and accumulates capabilities without any weight updates.

    Why it matters: Demonstrates ICRL at scale—an agent that gets better over time by accumulating proven solutions.

    Papers

    Date Paper Link
    Jun 2021 Decision Transformer: RL via Sequence Modeling arXiv 2106.01345
    Mar 2023 Reflexion: Language Agents with Verbal Reinforcement Learning arXiv 2303.11366
    May 2023 Voyager: Open-Ended Embodied Agent with LLMs arXiv 2305.16291
    May 2024 Transformers Learn TD Methods for In-Context RL arXiv 2405.13861
    Feb 2025 OmniRL: In-Context RL Across Multiple Tasks arXiv 2502.02869

    Practical Application: agentrail-rs

    This isn’t just theory. I’m building agentrail-rs to apply ICRL to AI coding agents used for non-coding tasks—TTS generation, video compositing, file manipulation. The tool records trajectories (state, action, result, reward) and injects successful examples into future agent prompts. Early days, but the research says this should work.

    See Saw (2/?): agentrail-rs for more on the engineering side.

    Key Takeaways

    Concept One-liner
    ICRL Learn RL policies from trajectory examples in the prompt
    No Weight Updates The transformer adapts during inference, not training
    TD in the Forward Pass Transformers approximate RL algorithms internally
    Verbal Reflection Natural language self-critique works as trajectory data
    Skill Libraries Accumulate proven solutions for reuse across sessions

    In-Context RL turns agents from amnesiacs into learners. Follow for more ML Frontier episodes exploring research at the edge.

    Part 2 of the Machine Learning Frontier series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1876 words10 min readAbstract

    Saw (2/?): reg-rs, avoid-compaction, and agentrail-rs

    Welcome back to the Sharpen the Saw series, where I maintain existing tools, vibe-code new ones, and try new approaches to development workflows. Three tools, one pattern: each one hit a ceiling that required rethinking how it stores and shares information. This week covers reg-rs migrating from binary to text-based test definitions, avoid-compaction structuring multi-session AI agent workflows, and agentrail-rs adding reinforcement learning from an agent’s own success history.

    reg-rs: Three Improvements to Regression Testing

    reg-rs captures command output as golden baselines and flags regressions on re-run. This round of sharpening addressed three friction points: clunky commands, opaque binary storage, and noisy output.

    Shell Aliases

    The full command syntax (reg-rs run -p my_test -v) gets old fast. Shell aliases in source-rg.sh cut it to 4 characters with tab completion in zsh and bash:

    Alias Action Example
    rnrg Run tests rnrg my_test -v
    adrg Create test adrg my_test 'echo hi'
    lsrg List tests lsrg
    shrg Show details shrg my_test -vv
    uprg Rebase baselines uprg my_test
    rsrg Reset results rsrg my_test
    rmrg Remove test rmrg old_test
    mgrg Migrate .tdb to .rgt mgrg
    strg Status dashboard strg

    Muscle memory builds fast. hlrg prints the full cheat sheet with examples.

    Git-Friendly .rgt Format

    The legacy .tdb format stored tests in SQLite binary files. git diff showed noise, merge conflicts were unresolvable, and new developers needed setup scripts. Regression tests are documentation—they define what your CLI actually does—so hiding them in binary blobs defeated the purpose.

    The new .rgt format splits each test across git-tracked text files:

    File Contents Tracked?
    test.rgt TOML spec (command, timeout, preprocessing) Yes
    test.out Expected stdout baseline Yes
    test.err Expected stderr (only if non-empty) Yes
    test.tdb Runtime cache No (gitignored)

    A test definition reads like documentation:

    command = "myapp --version"
    timeout = 10
    exit_code = 0
    desc = "Version string format check"
    preprocess = "jq --sort-keys"
    diff_mode = "json"
    

    reg-rs create now writes .rgt directly—no intermediate .tdb step. Existing tests migrate with mgrg. PR reviewers see exactly what changed and why, git clone inherits every test, and merge conflicts resolve with standard tools.

    Output Verbosity Controls

    Previously, running tests dumped SQL debug info and full diffs regardless of context. Now output scales to what you need:

    Flag Output
    (none) Summary line: 3 passed, 1 failed (of 4 total)
    -v + failure details (diff counts per test)
    -vv + full diff output
    -q Nothing—exit code only

    Exit codes are now meaningful too: 0 for all pass, 1 for regressions detected, 2 for errors. This makes reg-rs usable in CI pipelines where you check $? rather than parse output.

    Sharpen the Saw — Habit 7 from Stephen Covey’s The 7 Habits of Highly Effective People is about preserving and enhancing your greatest asset: yourself and your tools. In software, that means taking time to fix accumulated friction, update dependencies, and learn new frameworks—even when shipping features feels more urgent. The payoff compounds: every hour spent sharpening saves many more down the line.

    avoid-compaction: Structured Multi-Session Agent Workflows

    avoid-compaction solves a problem anyone using AI coding agents hits eventually: context death. Long conversations get automatically compacted—the system summarizes older messages to make room for new ones, losing decisions, constraints, and procedural knowledge along the way.

    The Insight

    Rather than fighting compaction with longer context windows, avoid-compaction embraces frequent restarts as a feature. Each restart gives the agent a full, fresh context window. The trick is making handoffs explicit and structured so nothing is lost between sessions.

    The Saga/Step Model

    Work is organized into sagas (projects) composed of steps (focused units of work):

    .avoid-compaction/
      saga.toml                    # name, status, current step
      plan.md                      # evolving project plan
      planned-steps.md             # upcoming steps preview
      steps/001-add-routes/
        step.toml                  # status, description, context files
        prompt.md                  # what the agent was told to do
        summary.md                 # what the agent actually did
      steps/002-add-tests/
        ...
    

    Each session follows the same loop:

    1. New Claude session starts with fresh context
    2. Agent runs next to see the current step’s prompt and context
    3. Agent does the work
    4. Agent runs complete with a summary and next-step definition
    5. User restarts Claude
    6. Next session picks up exactly where the last left off

    Why This Matters

    The difference is reliability. Without structured handoffs, session 4 of a complex feature often forgets constraints from session 1. The agent improvises, makes contradictory decisions, or redoes work. With avoid-compaction:

    • Every session starts with full context for its specific task
    • Summaries accumulate so later sessions can reference earlier decisions
    • The plan evolves as work reveals new insights
    • Nothing is lost to compaction—it’s all in the filesystem

    Current Improvements

    The tool is going through a refactoring sequence to meet code quality standards:

    1. Merging small command modules into larger, cohesive modules
    2. Extracting shared display logic into reusable formatters
    3. Workspace conversion to split concerns across crates
    4. Session crate extraction for reusable JSONL handling

    Each step is low-to-medium risk, guided by the principle that smaller modules with clear responsibilities are easier to test, review, and extend.

    agentrail-rs: Learning from Success

    agentrail-rs is the evolution of avoid-compaction, adding a critical capability: In-Context Reinforcement Learning (ICRL). Where avoid-compaction structures handoffs, agentrail-rs teaches agents from their own history.

    The 75% Problem

    I use AI coding agents for more than coding—TTS audio generation, video compositing, file manipulation, and other multi-step production tasks. In practice, agents succeed about 75% of the time on these workflows. The failures aren’t random—they’re procedural: the agent forgets which API to call, which flags to use, which client library to reference, or how to validate output.

    The traditional approach—writing instructions in markdown files like AGENTS.md or CLAUDE.md—isn’t reliable. Even when rules, instructions, and prohibitions are present, agents often ignore them. Claude, when called out, will say “You’re right, I should have done that”—and a few moments later make the same kind of mistake. Bigger prompts and more examples hit diminishing returns. The agent needs to learn from reward-based examples—both good and bad—delivered in-context, not static documentation it may or may not follow. That’s the core idea behind ICRL: show the agent what worked, what didn’t, and let the rewards guide its next attempt.

    How ICRL Works

    After each step, agentrail-rs records a trajectory:

    state → action → result → reward
    

    Successful trajectories are stored at .agentrail/trajectories/{task_type}/run_NNN.json. When the agent hits the same task type in a new session, the CLI retrieves the top N successful trajectories and injects them into the prompt: “Here’s how you succeeded at this before.”

    The agent reads its own success patterns and self-corrects—no weight updates, no fine-tuning, no training pipeline. Just examples from its own history, delivered in context.
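The retrieve-and-inject step can be sketched in a few lines of Rust. Field names, the example commands, and the prompt wording below are assumptions for illustration, not agentrail-rs's actual API:

```rust
// Illustrative sketch of ICRL retrieval: pick the top-n trajectories by
// reward and render them as a prompt preamble. Not the real agentrail-rs code.
#[derive(Clone)]
struct Trajectory {
    state: String,
    action: String,
    result: String,
    reward: f64,
}

fn inject_examples(mut trajectories: Vec<Trajectory>, n: usize) -> String {
    // Highest reward first (rewards are assumed finite, so unwrap is safe).
    trajectories.sort_by(|a, b| b.reward.partial_cmp(&a.reward).unwrap());
    trajectories
        .iter()
        .take(n)
        .map(|t| {
            format!(
                "Previously, given `{}`, you ran `{}` and got `{}` (reward {}).",
                t.state, t.action, t.result, t.reward
            )
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let trajs = vec![
        Trajectory { state: "narrate post".into(), action: "tts --voice en".into(), result: "ok".into(), reward: 1.0 },
        Trajectory { state: "narrate post".into(), action: "tts".into(), result: "garbled audio".into(), reward: 0.0 },
    ];
    // Only the proven-working approach reaches the next session's prompt.
    let preamble = inject_examples(trajs, 1);
    assert!(preamble.contains("tts --voice en"));
}
```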

    Four Step Types

    agentrail-rs distinguishes what needs an agent from what doesn’t:

    Step Type Who Executes Example
    Meta Agent Prepare handoff packets with success examples
    Production Agent Execute semantic work using prepared context
    Deterministic Machine Run TTS generation, video composition (no agent needed)
    Validation Machine Check outputs, record rewards for ICRL

    Deterministic steps can’t fail due to agent forgetfulness—they’re hard-specified. Validation steps create the reward signals that make ICRL work.

    Architecture

    The project is structured as a five-crate Cargo workspace:

    Crate Purpose Status
    agentrail-core Domain model, trajectories, handoff packets Complete
    agentrail-store Persistence (saga, step, trajectory, session) Complete
    agentrail-cli CLI commands Stub
    agentrail-exec Deterministic job executors Stub
    agentrail-validate Output validators Stub

    The core and store crates are done. The next phase wires up the CLI, then deterministic execution, then the full ICRL retrieval and injection loop.

    The Expected Payoff

    Once the trajectory system is live (I just started vibe-coding it today), agents working on repetitive task types should see reliability climb from ~75% toward deterministic. Each success makes the next attempt more likely to succeed, without any manual intervention. The goal is a self-improving loop: agents learn their own procedures through experience.

    Three Problems, Three Approaches

    These projects aren’t related by a common architecture or shared abstraction. They’re related because each one solves a different productivity problem I keep hitting:

    • reg-rs catches regressions that slip in whenever a feature is added or a fix applied—the kind of silent breakage that unit tests don’t cover because they test behavior, not actual output.
    • avoid-compaction is a direct reaction to Claude Code auto-compacting multiple times per day, with noticeable performance degradation after each compaction. Structured restarts with explicit handoffs beat a slowly decaying context window.
    • agentrail-rs tackles the opposite problem: not forgetting, but improvising. LLMs are probabilistic, and Claude keeps trying new (failing) approaches to routine tasks instead of sticking with the proven-working ones it has used and documented before. ICRL feeds successful trajectories back into context so the agent repeats what works.

    Different problems, different solutions, same goal: spend less time fighting tools and more time building.

    References

    Resource Link
    reg-rs github.com/sw-cli-tools/reg-rs
    avoid-compaction github.com/softwarewrighter/avoid-compaction
    agentrail-rs github.com/sw-vibe-coding/agentrail-rs
    Decision Transformer Chen et al., 2021 — framing RL as sequence modeling
    Transformers Learn TD Methods Wang et al., ICLR 2025 — transformers simulate temporal-difference learning in-context
    OmniRL 2025 — transformer architecture emulating actor-critic RL in-context
    Reflexion Shinn et al., NeurIPS 2023 — verbal self-reflection for agent improvement
    Voyager Wang et al., 2023 — open-ended learning agent with skill library
    “Sharpen the Saw” The 7 Habits of Highly Effective People (Stephen Covey)

    Habit 7: Sharpen the Saw. Spend less time fighting tools, more time building.

    Part 2 of the Sharpen the Saw Sundays series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 2797 words · 14 min read · Abstract

    Rabbit-hole (1/?): Poor Man's Rust-to-Unsupported-ISA Translator

    You want to write Rust for a CPU that rustc doesn’t know about. There’s no LLVM backend, no target triple, no supported tier. You could write a compiler backend from scratch—or you could cheat.

    This post traces the rabbit hole from Rust source code down to registers on the COR24, a 24-bit RISC CPU that exists on real FPGA hardware. The trick: target a CPU that rustc does support (MSP430, a 16-bit TI microcontroller), then translate the assembly output.

    Resource Link
    Live Demo COR24 Assembly Emulator
    Source GitHub
    MakerLisp makerlisp.com (COR24 creators)
    Disclaimer Proof of concept only
    Comments Discord

    The Problem

    COR24 is a 24-bit RISC soft CPU designed by MakerLisp for FPGAs. It has 3 general-purpose registers, a 24-bit address space, and runs at 101 MHz on inexpensive Lattice FPGAs. It’s MIT-licensed, well-documented, and has real hardware you can buy.

    But LLVM doesn’t have a COR24 backend. Neither does GCC. Writing one from scratch is a serious engineering project—far more effort than a hobby target justifies. We need another way in.


    The Trick: Borrow a Target

    rustc supports MSP430, a 16-bit TI microcontroller, via LLVM’s MSP430 backend. It’s a nightly-only target (msp430-none-elf), but it works. The key insight: MSP430’s instruction set is close enough to COR24 that a translator can bridge the gap.

    The full pipeline:

    Rust Source (.rs)
         ↓  rustc --target msp430-none-elf --emit asm
    MSP430 Assembly (.msp430.s)
         ↓  msp430-to-cor24 translator
    COR24 Assembly (.cor24.s)
         ↓  COR24 assembler
    Machine Code → COR24 Emulator (or real FPGA hardware)
    

    No custom compiler. No modified LLVM. Just Rust’s nightly toolchain and a ~1,800-line translator written in Rust. This is a proof-of-concept—an educational demo, not a production tool for real COR24 hardware.

    Disclaimer

    This is a proof-of-concept and educational demo. The MSP430-to-COR24 translator is not intended for production use on real COR24 hardware. It demonstrates that the approach is feasible, not that it’s complete or reliable. If you’re building something real on COR24, use the native assembler and toolchain from MakerLisp.


    Level 1: The Compiler Optimizes Away Your Code

    Let’s start simple. Three numbers, one add:

    #![no_std]
    
    const RESULT_ADDR: u16 = 0x0100;
    
    #[inline(never)]
    #[no_mangle]
    pub fn demo_add() -> u16 {
        let a: u16 = 100;
        let b: u16 = 200;
        let c: u16 = 42;
        a + b + c  // Returns 342
    }
    
    #[no_mangle]
    pub unsafe fn start() -> ! {
        let result = demo_add();
        core::ptr::write_volatile(RESULT_ADDR as *mut u16, result);
        loop {}
    }
    

    Compile to MSP430 and the rabbit hole opens immediately. Here’s what rustc emits for demo_add:

    demo_add:
        mov     #342, r12
        ret
    

    Two instructions. LLVM constant-folded 100 + 200 + 42 into 342 at compile time. The addition doesn’t exist in the output—the compiler proved the answer is always the same and replaced the computation with a constant load.

    The translator converts this to COR24:

    demo_add:
        la      r0, 0x000156    ; load 342 (24-bit)
        jmp     (r1)            ; return via r1
    

    Run it in the emulator (scripts/demo-add.sh):

    Executed 3 instructions
    CPU halted (self-branch detected)
    
    === Registers ===
      r0:  0x000156  (     342)
    

    Two instructions in demo_add, three total to reach halt. The “add” demo that doesn’t add.


    Level 1.5: More Variables Than Registers

    What happens when Rust needs more live variables than COR24 has registers? MSP430 has 12 general-purpose registers. COR24 has 3. The translator has to spill the extras to the stack.

    The accumulate function keeps 5 values alive simultaneously:

    #[inline(never)]
    #[no_mangle]
    pub unsafe fn accumulate(seed: u16) -> u16 {
        let a = seed + 1;
        let b = a + seed;
        let c = b + a;
        let d = c + b;
        let e = d + c;
        let result = a ^ b ^ c ^ d ^ e;
        mem_write(RESULT_ADDR, result as u8);
        uart_putc(a);
        uart_putc(b);
        uart_putc(c);
        uart_putc(d);
        uart_putc(e);
        loop {}
    }
    

    The MSP430 assembly uses registers r6 through r10—five registers that don’t exist on COR24:

    accumulate:
        push    r6
        push    r7
        push    r8
        push    r9
        push    r10             ; save 5 callee-saved registers
        mov     r12, r10        ; seed
        mov     r10, r6
        inc     r6              ; a = seed + 1
        add     r6, r10         ; b = a + seed
        mov     r10, r9
        add     r6, r9          ; c = b + a
        ...
    

    The translator maps these to frame-pointer-relative stack slots, each 3 bytes (one COR24 word). Where MSP430 writes mov r10, r6, COR24 must load from one spill slot, operate, and store to another:

    accumulate:
        sw      r0, 30(fp)     ; spill seed (r10 → offset 30)
        lw      r0, 6(fp)      ; save spill slot for r6
        push    r0
        ...
        lw      r0, 18(fp)     ; load r10 (seed)
        sw      r0, 6(fp)      ; copy to r6 slot
        lw      r0, 6(fp)      ; load r6
        add     r0, 1          ; a = seed + 1
        sw      r0, 6(fp)      ; store r6 back
        ...
        la      r0, 0xFF0000   ; RESULT_ADDR
        ; call mmio_write
        push    r1
        la      r2, mmio_write
        jal     r1, (r2)       ; jal saves return addr in r1
        pop     r1
    

    It’s verbose—the COR24 output is much longer than the MSP430 input. But it’s correct. The emulator confirms the computation with 148 instructions and the XOR result stored to memory at 0x0100.


    Level 2: The Compiler Writes Your Destructor

    Rust’s Drop trait guarantees cleanup when a value goes out of scope. Does that work on a CPU with no OS, no allocator, no runtime?

    pub struct Guard { addr: u16 }
    
    impl Guard {
        #[inline(never)]
        #[no_mangle]
        pub fn guard_new(addr: u16) -> Guard {
            unsafe { mem_write(addr, 1); }  // mark: alive
            Guard { addr }
        }
    }
    
    impl Drop for Guard {
        #[inline(never)]
        fn drop(&mut self) {
            unsafe { mem_write(self.addr, 0); }  // mark: gone
        }
    }
    
    #[no_mangle]
    pub unsafe fn start() -> ! {
        {
            let _g = Guard::guard_new(STATUS_ADDR);
            // STATUS_ADDR = 1 (guard is alive)
        }
        // STATUS_ADDR = 0 (compiler called drop here)
    
        mem_write(STATUS_ADDR, 0xFF);  // proof we continued
        loop {}
    }
    

    Look at the MSP430 assembly for start—the compiler inserted the drop call:

    start:
        sub     #2, r1              ; allocate stack space
        mov     #256, r12           ; STATUS_ADDR
        call    #guard_new          ; create guard → writes 1
        mov     #256, 0(r1)         ; store Guard on stack
        mov     r1, r12             ; pass &Guard to drop
        call    #<Guard::drop>      ; compiler-inserted! → writes 0
        mov     #256, r12
        mov     #255, r13
        call    #mem_write          ; writes 0xFF
    .LBB4_1:
        jmp     .LBB4_1             ; halt
    

    The call #<Guard::drop> on line 6 is the compiler honoring the Drop contract. You didn’t write that call—rustc did. Memory at STATUS_ADDR goes: 0 → 1 → 0 → 0xFF, proving the destructor ran at the right moment.

    The translated COR24 assembly preserves this structure—each call becomes a jal (jump-and-link), which saves the return address in r1:

    start:
        sub     sp, 3               ; allocate stack space
        la      r0, 0x000100        ; STATUS_ADDR
        ; call guard_new
        push    r1
        la      r2, guard_new
        jal     r1, (r2)            ; create guard → writes 1
        pop     r1
        ...
        ; call <Guard::drop>        ; compiler-inserted!
        push    r1
        la      r2, <Guard::drop>
        jal     r1, (r2)            ; → writes 0
        pop     r1
    

    RAII works on bare metal, on an architecture the Rust compiler has never heard of.


    Level 3: Interrupts via asm! Passthrough

    Here’s where it gets interesting. COR24’s interrupt mechanism uses hardware registers that MSP430 doesn’t have:

    • iv: Interrupt vector—CPU jumps here when an interrupt fires
    • ir: Interrupt return—saved PC to return to after the ISR
    • jmp (ir): Return from interrupt

    There’s no MSP430 equivalent—the LLVM backend has no concept of these registers. But Rust’s asm! macro combined with the translator’s passthrough mechanism can handle it.

    The demo_echo_v2 example splits the problem: application logic in pure Rust, interrupt plumbing in asm! passthrough:

    #![feature(asm_experimental_arch)]
    
    /// Application logic --- pure Rust, compiled normally
    #[inline(never)]
    #[no_mangle]
    pub fn to_upper(ch: u16) -> u16 {
        if ch >= 0x61 && ch <= 0x7A {
            ch & 0xDF  // clear bit 5
        } else {
            ch
        }
    }
    
    #[inline(never)]
    #[no_mangle]
    pub unsafe fn handle_rx() {
        let ch = mmio_read(UART_DATA);
        if ch == 0x21 {               // '!'
            mmio_write(HALT_FLAG, 1);
        } else {
            uart_putc(to_upper(ch));
        }
    }
    

    The ISR wrapper uses asm! with a @cor24: prefix that the translator passes through verbatim:

    #[no_mangle]
    pub unsafe fn isr_handler() {
        // Save COR24 state (asm! --- no Rust equivalent)
        core::arch::asm!(
            "; @cor24: push r0",
            "; @cor24: push r1",
            "; @cor24: push r2",
            "; @cor24: mov r2, c",      // save condition flag
            "; @cor24: push r2",
        );
    
        handle_rx();  // ← pure Rust, compiled through the pipeline
    
        // Restore state and return from interrupt
        core::arch::asm!(
            "; @cor24: pop r2",
            "; @cor24: clu z, r2",      // restore condition flag
            "; @cor24: pop r2",
            "; @cor24: pop r1",
            "; @cor24: pop r0",
            "; @cor24: jmp (ir)",       // return from interrupt
            options(noreturn)
        );
    }
    

    The "; @cor24: ..." lines look like MSP430 comments (so rustc ignores them), but the translator recognizes the prefix and emits them as real COR24 instructions. In the final COR24 assembly:

    isr_handler:
        push r0
        push r1
        push r2
        mov r2, c                     ; save condition flag
        push r2
        ; call handle_rx              ← compiled Rust
        push    r1
        la      r2, handle_rx
        jal     r1, (r2)              ; jal saves return addr in r1
        pop     r1
        pop r2
        clu z, r2                     ; restore condition flag
        pop r2
        pop r1
        pop r0
        jmp (ir)                      ← hardware interrupt return
    

    The start function sets up the interrupt vector and enables UART reception:

    core::arch::asm!(
        "; @cor24: la r0, isr_handler",
        "; @cor24: mov iv, r0",         // iv = interrupt vector register
        "; @cor24: lc r0, 1",
        "; @cor24: la r1, 0xFF0010",    // UART interrupt enable register
        "; @cor24: sb r0, 0(r1)",       // enable UART RX interrupt
    );
    

    Type a letter, the hardware fires an interrupt, the ISR saves registers, calls handle_rx (compiled Rust), converts to uppercase, echoes it via UART, restores registers, and returns. The boundary between Rust and hardware is exactly where you’d expect it.
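The passthrough convention itself is simple enough to sketch. A minimal illustration of the prefix check, with assumed names (the real translator does far more per line):

```rust
// Sketch of the @cor24 passthrough filter. Lines that rustc sees as MSP430
// comments are recognized by this prefix and emitted verbatim as COR24
// instructions; everything else goes through normal translation.
const PASSTHROUGH_PREFIX: &str = "; @cor24: ";

/// Some(payload) if this MSP430 line is a passthrough directive.
fn passthrough(line: &str) -> Option<&str> {
    line.trim_start().strip_prefix(PASSTHROUGH_PREFIX)
}

fn main() {
    assert_eq!(passthrough("    ; @cor24: jmp (ir)"), Some("jmp (ir)"));
    assert_eq!(passthrough("    mov r12, r10"), None); // normal line: translate
}
```

Because the payload bypasses translation entirely, anything COR24-specific (iv, ir, condition-flag saves) can be expressed even though MSP430 has no equivalent.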


    What the Translator Actually Does

    The msp430-to-cor24 translator (~1,800 lines of Rust) handles the mechanical differences:

    Concern MSP430 COR24 Translation
    Word size 16-bit 24-bit Stack slots: 2 bytes → 3 bytes
    Registers r12-r14 (args) r0-r2 (args) Direct mapping
    Spilled regs r4-r11 (MSP430) None (3 GPRs only) Frame-pointer relative loads/stores
    I/O addresses 16-bit (0xFF00) 24-bit (0xFF0000) Address remapping
    Call convention call #func / ret jal r1, (r2) / jmp (r1) Uses COR24’s jump-and-link
    Tail calls call + ret pattern jmp (r2) Direct jump, no link
    Entry point Section order Reset vector at addr 0 la r0, start + jmp (r0)
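Two of these rewrites are mechanical enough to sketch. The offset scaling and I/O remapping below are my reading of the table, not the translator's actual code, and the function names are assumptions:

```rust
// MSP430 stack slots are 2 bytes; COR24 words are 3 bytes, so a
// frame-pointer-relative offset counted in MSP430 slots scales by 3/2.
fn remap_frame_offset(msp430_offset: u32) -> u32 {
    (msp430_offset / 2) * 3
}

// The 16-bit I/O page 0xFFxx maps onto the 24-bit page 0xFF00xx.
// Illustrative only: the real mapping lives inside the translator.
fn remap_io_addr(msp430_addr: u16) -> u32 {
    0xFF0000 | u32::from(msp430_addr & 0x00FF)
}

fn main() {
    assert_eq!(remap_frame_offset(4), 6);       // 2nd slot: 4(fp) -> 6(fp)
    assert_eq!(remap_io_addr(0xFF00), 0xFF0000);
}
```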

    COR24’s calling convention centers on the jal (jump-and-link) instruction. jal r1, (r2) jumps to the address in r2 and saves the return address in r1. The callee returns with jmp (r1). Since the translator re-uses r1 for the return address, it saves and restores r1 around each call with push r1 / pop r1. COR24’s native C compiler convention uses a standard prologue (push fp; push r2; push r1; mov fp,sp) and passes arguments on the stack—the translator doesn’t follow that full protocol, but it does use the same jal/jmp (r1) mechanism for call and return.

    The entry point handling is worth noting. rustc emits functions in alphabetical section order, so the panic handler often lands at address 0. Every demo uses a #[no_mangle] pub unsafe fn start() -> ! as its entry point—a convention I chose for this project. The translator looks for a start label and emits a reset vector prologue (la r0, start + jmp (r0) at address 0), mimicking how real microcontrollers boot. This isn’t a Rust or MSP430 convention; it’s a project-level rule that keeps the translator simple and every demo consistent.


    Try It Yourself

    # Prerequisites
    rustup toolchain install nightly
    rustup target add msp430-none-elf --toolchain nightly
    
    # Clone and build
    git clone https://github.com/sw-embed/cor24-rs.git
    cd cor24-rs/rust-to-cor24
    
    # Run the add demo (full pipeline)
    bash scripts/demo-add.sh
    
    # Run the UART hello demo
    bash scripts/demo-uart-hello.sh
    
    # Or compile any demo project
    cargo run --bin msp430-to-cor24 -- --compile demos/demo_drop
    

    Each demo script traces the full pipeline: Rust source → MSP430 assembly → COR24 assembly → emulator output with register dumps.


    Key Takeaways

    1. You don’t need a compiler backend to target a new architecture. If a similar-enough target exists, a translator can bridge the gap.

    2. The Rust compiler is surprisingly good at bare-metal code. Constant folding, dead code elimination, and Drop all work correctly even when the output gets re-targeted to an architecture LLVM has never seen.

    3. asm! passthrough is the escape hatch. Hardware-specific operations (interrupt setup, condition flag save/restore) bypass the translation layer entirely using comment-prefix conventions.

    4. The pipeline is auditable. Every intermediate artifact (.msp430.s, .cor24.s, register dumps) is human-readable. You can trace any behavior from source to silicon.


    The rabbit hole goes deeper. Next time: what happens when the 16-bit intermediate can’t express a 24-bit value.

    Part 1 of the Down the Rabbit-Hole series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 736 words · 4 min read · Abstract

    ML Frontier #01: Neural Collapse

    First Machine Learning Mondays post. ML Frontier explores cutting-edge research in machine learning.

    Resource Link
    Papers 5 papers covered
    Video ML Frontier #01: Neural Collapse
    Comments Discord

    What is Neural Collapse?

    During the final phase of training, deep network representations converge to a specific geometric pattern: class means become equidistant, forming a symmetric simplex (an equiangular tight frame) in feature space.

    Think of it like points arranging themselves at equal distances on a sphere.

    Why Does This Happen?

    When networks are trained past zero training error, representations continue simplifying. The network finds the most symmetric way to separate classes, forming equal angles between all class centers.

    This isn’t random—it’s mathematically optimal.
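That optimal arrangement has a standard closed form in the neural-collapse literature (going back to the original Papyan, Han & Donoho 2020 analysis): the matrix of centered class means converges, up to rotation and scaling, to a simplex equiangular tight frame:

```latex
M \;\propto\; \sqrt{\tfrac{K}{K-1}}\; P \left( I_K - \tfrac{1}{K}\,\mathbf{1}_K \mathbf{1}_K^{\top} \right),
\qquad
\cos\angle(\mu_i, \mu_j) = -\tfrac{1}{K-1} \quad (i \neq j)
```

where $K$ is the number of classes and $P$ is a partial orthogonal matrix. Every pair of distinct class means meets at the same cosine, $-\tfrac{1}{K-1}$, the most spread-out configuration $K$ points can achieve.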

    2024-2025 Breakthroughs

    Recent research proves neural collapse is globally optimal in deep transformers and ResNets with regularization. As depth increases, the network provably converges to this collapsed geometry.

    This changes how we think about deep learning. Collapse explains why overparameterized networks generalize well. It also guides continual learning, where progressive collapse prevents catastrophic forgetting.

    Papers

    Date Paper Link
    Sep 2024 Beyond Unconstrained Features: Neural Collapse for Shallow Networks arXiv 2409.01832
    Oct 2024 Wide Neural Networks with Weight Decay Provably Exhibit Neural Collapse arXiv 2410.04887
    Jan 2025 Neural Collapse Beyond the Unconstrained Features Model arXiv 2501.19104
    May 2025 Rethinking Continual Learning with Progressive Neural Collapse arXiv 2505.24254
    May 2025 Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers arXiv 2505.15239

    Paper Summaries

    Beyond Unconstrained Features (Sep 2024)

    Paper: Hong & Ling

    In brief: Most neural collapse theory assumes you can put class markers anywhere you want—like sticking Post-it notes anywhere on a wall. But real shallow networks have limits. This paper shows neural collapse still emerges even in small networks with real data constraints, not just idealized deep networks.

    Why it matters: Neural collapse isn’t just a “big model” phenomenon—it happens in smaller, practical architectures too.

    Wide Networks with Weight Decay (Oct 2024)

    Paper: Jacot, Súkeník, Wang & Mondelli

    In brief: If you train a wide neural network with weight decay (a common regularization trick), this paper proves the network will exhibit neural collapse. It’s the first proof that collapse emerges from end-to-end training of an actual network, not just from the idealized unconstrained-features model.

    Why it matters: Weight decay isn’t just preventing overfitting—it’s actively pushing the network toward optimal geometry.

    Beyond the Unconstrained Features Model (Jan 2025)

    Paper: arXiv 2501.19104

    In brief: The “unconstrained features model” assumes networks can place representations anywhere. Real networks have architectural constraints. This paper extends neural collapse theory to realistic settings where the architecture limits what’s possible.

    Why it matters: The theory holds up in real-world conditions, not just toy examples.

    Progressive Neural Collapse for Continual Learning (May 2025)

    Paper: arXiv 2505.24254

    In brief: When you teach a network new things, it often forgets old things (catastrophic forgetting). This paper uses neural collapse to fix that: by carefully managing how class representations “collapse” over time, the network can keep learning new tasks without losing old knowledge.

    Why it matters: Neural collapse isn’t just a curiosity—it’s a tool for building better learning systems.

    Globally Optimal in Transformers and ResNets (May 2025)

    Paper: Súkeník et al.

    In brief: Imagine you have a box of crayons and need to organize them so each color is as far from every other color as possible. This paper proves that deep neural networks automatically find the best possible arrangement—not just a good one, but the mathematically perfect one. And this happens in both transformers (like GPT) and ResNets (like image classifiers).

    Why it matters: It’s not a coincidence that networks learn this way. It’s provably optimal.

    Key Takeaways

    Concept One-liner
    Neural Collapse Class representations converge to symmetric simplex geometry
    Why It Happens Training past zero error simplifies representations maximally
    Optimal Geometry Provably the best configuration in deep networks
    Practical Impact Explains generalization and enables continual learning

    Neural collapse reveals the hidden geometry of learning. Follow for more ML Frontier episodes exploring cutting-edge research.

    Part 1 of the Machine Learning Frontier series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 2775 words14 min readAbstract

    Saw (1/?): pjmai-rs, Rig, and langchain-rust

    Three tools, one theme: sharpening the foundation. This week covers pjmai-rs bug fixes and new features, plus a first look at two Rust frameworks for building LLM-powered applications—Rig and langchain-rust.

    Resource Link
    Repos sw-cli-tools/pjmai-rs, try-rig, try-langchain-rust
    Video Explainer
    References Links and Resources
    Comments Discord

    pjmai-rs: Fixing the Foundation

    Before adding AI features to any tool, you need a solid foundation. Since the last update, pjmai-rs received critical fixes and practical new features.

    The Rust 2024 Edition Bug

    Upgrading to the Rust 2024 edition silently broke project removal. The root cause: IndexMap::remove() is equivalent to swap_remove(), which fills the vacated slot with the last entry instead of preserving insertion order; the order-preserving variant is shift_remove(). The fix:

    // Rust 2024 broke this
    - projects.remove(name)
    
    // shift_remove maintains expected behavior
    + projects.shift_remove(name)
    

    A one-line fix, but the kind that silently corrupts data if you miss it. The 2024 edition migration guide mentions this change, but it’s easy to overlook in a large codebase.
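The order sensitivity is easy to demonstrate with the standard library's Vec, which offers the same pair of removal strategies as IndexMap (a sketch by analogy, since IndexMap is a third-party crate): Vec::swap_remove fills the gap with the last element, like IndexMap's swap_remove, while Vec::remove shifts later elements left, like shift_remove.

```rust
fn main() {
    // O(1) removal: the last element jumps into the vacated slot,
    // so insertion order is silently disturbed.
    let mut fast: Vec<&str> = vec!["webapp", "api", "docs", "notes"];
    fast.swap_remove(1);
    assert_eq!(fast, ["webapp", "notes", "docs"]);

    // O(n) removal: later elements shift left, order preserved --
    // the behavior shift_remove gives you on IndexMap.
    let mut ordered: Vec<&str> = vec!["webapp", "api", "docs", "notes"];
    ordered.remove(1);
    assert_eq!(ordered, ["webapp", "docs", "notes"]);
}
```

If any code iterates the map expecting registration order, only the second behavior is safe.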

    Shell Integration Improvements

    • Help flags: All aliases (chpj, hypj, stpj, etc.) now properly pass --help through
    • After-help messages: Every subcommand shows examples and related commands
    • Version matching: --version output now matches Cargo.toml
    • Argument validation: Better error messages for invalid flag combinations

    New Capabilities

    Feature Command What It Does
    Stack navigation stpj Push/pop project context with visibility
    History tracking hypj Revisit recently-visited projects by number
    Fuzzy completion chpj <TAB> Prefix > segment > substring, sorted by recency
    Environment config evpj Per-project env vars with auto-detection (Python, Node, Rust)
    Bulk operations rmpj --all Batch management with confirmation
    Subdirectory nav chpj proj src/ Tab-complete into subdirs

    These features are covered in detail in the Personal Software post.

    Sharpen the Saw — Habit 7 from Stephen Covey’s The 7 Habits of Highly Effective People is about preserving and enhancing your greatest asset: yourself and your tools. In software, that means taking time to fix accumulated friction, update dependencies, and learn new frameworks—even when shipping features feels more urgent. The payoff compounds: every hour spent sharpening saves many more down the line.

    Rig: Type-Safe AI Agents in Rust

    Rig (rig-core 0.32) is a Rust library for building LLM applications with a unified API across providers. I built try-rig to explore it hands-on with Ollama running locally—no cloud API keys needed.

    A Simple Agent

    The builder pattern makes agent construction readable:

    use rig::providers::ollama;
    use rig::client::Nothing;
    use rig::completion::Prompt;
    
    let client = ollama::Client::new(Nothing)?;
    
    let agent = client
        .agent("llama3.2")
        .preamble("You are a helpful assistant. Be concise.")
        .build();
    
    let response = agent.prompt("What is Rust?").await?;
    

    Swap ollama::Client for openai::Client or anthropic::Client and the rest stays the same.

    Tool-Equipped Agents

    Where Rig gets interesting is tools. Define a tool by implementing the Tool trait with typed args:

    #[derive(Deserialize, Serialize)]
    pub struct Calculator;
    
    impl Tool for Calculator {
        const NAME: &'static str = "calculator";
        type Error = CalcError;
        type Args = CalcArgs;
        type Output = f64;
    
        async fn definition(&self, _prompt: String) -> ToolDefinition {
            ToolDefinition {
                name: "calculator".to_string(),
                description: "Perform arithmetic: add, subtract, multiply, divide".to_string(),
                parameters: json!({ /* JSON Schema */ }),
            }
        }
    
        async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
            match args.operation.as_str() {
                "add" => Ok(args.x + args.y),
                "multiply" => Ok(args.x * args.y),
                // ...
            }
        }
    }
    

    Then chain tools onto the agent builder:

    let agent = client
        .agent(model)
        .preamble("Use tools for math, weather, files, date/time, and text.")
        .tool(Calculator)
        .tool(WeatherLookup)
        .tool(FileSearch)
        .tool(DateTime)
        .tool(StringTool)
        .build();
    

    The compiler verifies all tool types at build time. No runtime surprises from mismatched schemas.

    RAG with Embeddings

    Rig has built-in vector store support. The RAG agent in try-rig uses nomic-embed-text via Ollama for fully local embeddings:

    let embedding_model = client.embedding_model_with_ndims("nomic-embed-text", 768);
    
    let embeddings = EmbeddingsBuilder::new(embedding_model.clone())
        .documents(knowledge_entries)?
        .build()
        .await?;
    
    let vector_store = InMemoryVectorStore::from_documents(embeddings);
    let index = vector_store.index(embedding_model);
    
    let rag_agent = client
        .agent(model)
        .preamble("Use the provided context to answer accurately.")
        .dynamic_context(2, index)  // inject top 2 results
        .build();
    

    Multi-Agent Orchestration

    Rig agents can be used as tools for other agents. The try-rig demo builds a math specialist and a weather specialist, then hands both to an orchestrator:

    let calc_agent = client.agent(model)
        .preamble("You are a math specialist.")
        .name("math_agent")
        .description("Arithmetic: add, subtract, multiply, divide.")
        .tool(Calculator)
        .build();
    
    let weather_agent = client.agent(model)
        .preamble("You are a weather specialist.")
        .name("weather_agent")
        .tool(WeatherLookup)
        .build();
    
    let orchestrator = client.agent(model)
        .preamble("Route questions to math_agent or weather_agent.")
        .tool(calc_agent)
        .tool(weather_agent)
        .build();
    

    The orchestrator decides which specialist to call based on the question. Agents as tools—composable all the way down.

    Typed Extraction

    Rig can also extract structured data from unstructured text using schemars:

    #[derive(Debug, Deserialize, Serialize, JsonSchema)]
    pub struct ContactInfo {
        pub name: Option<String>,
        pub email: Option<String>,
        pub phone: Option<String>,
    }
    
    let extractor = client
        .extractor::<ContactInfo>(model)
        .preamble("Extract contact information from text.")
        .build();
    
    let contact = extractor.extract("Call Jane at 555-1234 or jane@example.com").await?;
    

    The output is a proper Rust struct, not a string you have to parse.

    The try-rig CLI

    All of these patterns are runnable from try-rig:

    try-rig ask "What is Rust?"           # Simple agent
    try-rig tools "What is 42 * 17?"      # Tool calling
    try-rig rag "Explain Rust ownership"   # RAG with embeddings
    try-rig multi "Weather in Tokyo?"      # Multi-agent routing
    try-rig extract "Call Jane at 555-1234"# Typed extraction
    try-rig stream "Explain TCP/IP"        # Streaming response
    

    One-fifth the memory of an equivalent Python stack, zero Python dependencies, and the compiler catches your mistakes before runtime.

    langchain-rust: Chain Abstractions for Rust

    langchain-rust (v4.6.0) brings LangChain’s composable chain architecture to Rust. Where Rig focuses on type-safe agents, langchain-rust focuses on chain orchestration. The try-langchain-rust repo has 13 runnable examples across the full feature set.

    LLM Chains and Prompt Templates

    The chain builder composes prompts and LLMs into reusable pipelines:

    use langchain_rust::{
        chain::{Chain, LLMChainBuilder},
        fmt_message, fmt_template, message_formatter,
        prompt::HumanMessagePromptTemplate,
        prompt_args, schemas::messages::Message, template_fstring,
    };
    
    let prompt = message_formatter![
        fmt_message!(Message::new_system_message("You are a concise technical writer.")),
        fmt_template!(HumanMessagePromptTemplate::new(template_fstring!(
            "Explain {topic} in 2-3 sentences.", "topic"
        )))
    ];
    
    let chain = LLMChainBuilder::new()
        .prompt(prompt)
        .llm(llm)
        .build()?;
    
    let result = chain.invoke(prompt_args! { "topic" => "ownership in Rust" }).await?;
    

    Ollama works the same way—swap the LLM and everything else stays identical:

    let ollama = Ollama::default().with_model("llama3.2");
    // use ollama in place of llm above
    

    Sequential Chains

    Pipe one chain’s output into the next. This example generates a story concept, then a title, then an opening line:

    let concept_chain = LLMChainBuilder::new()
        .prompt(/* "Create a concept about " */)
        .llm(llm.clone())
        .output_key("concept")
        .build()?;
    
    let title_chain = LLMChainBuilder::new()
        .prompt(/* "Suggest a title for " */)
        .llm(llm.clone())
        .output_key("title")
        .build()?;
    
    let chain = sequential_chain!(concept_chain, title_chain, opening_chain);
    
    let output = chain.execute(prompt_args! { "topic" => "a robot that learns to paint" }).await?;
    println!("Title: {}", output["title"]);
    

    Conversational Memory

    Multi-turn dialogue with automatic context retention:

    let chain = ConversationalChainBuilder::new()
        .llm(llm)
        .memory(SimpleMemory::new().into())
        .build()?;
    
    chain.invoke(prompt_args! { "input" => "My name is Alice and I'm learning Rust." }).await?;
    // Turn 2: chain remembers Alice and Rust
    chain.invoke(prompt_args! { "input" => "What's my name?" }).await?;
    

    RAG with Vector Store

    The conversational retriever chain combines memory, vector search, and LLM generation. The try-langchain-rust demo uses SQLite for the vector store:

    let store = StoreBuilder::new()
        .embedder(OpenAiEmbedder::default())
        .connection_url("sqlite::memory:")
        .table("documents")
        .vector_dimensions(1536)
        .build().await?;
    
    store.initialize().await?;
    add_documents!(store, &documents).await?;
    
    let chain = ConversationalRetrieverChainBuilder::new()
        .llm(llm)
        .rephrase_question(true)
        .memory(SimpleMemory::new().into())
        .retriever(Retriever::new(store, 3))
        .prompt(prompt)
        .build()?;
    

    Multi-turn RAG conversations work out of the box—the chain rephrases follow-up questions using conversation history before searching the vector store.

    Agents and Semantic Routing

    Agents select tools autonomously. The demo uses a CommandExecutor tool:

    let agent = ConversationalAgentBuilder::new()
        .tools(&[Arc::new(CommandExecutor::default())])
        .build(llm)?;
    
    let executor = AgentExecutor::from_agent(agent)
        .with_memory(SimpleMemory::new().into());
    
    executor.invoke(prompt_args! { "input" => "List the files in the current directory" }).await?;
    

    Semantic routing dispatches queries by meaning—define example utterances for each route and the router classifies new inputs:

    let coding_route = Router::new("coding", &[
        "How do I write a function in Rust?",
        "Explain generics in programming",
    ]);
    let devops_route = Router::new("devops", &[
        "Set up a CI/CD pipeline",
        "Configure Docker containers",
    ]);
    
    let router = RouteLayerBuilder::default()
        .embedder(OpenAiEmbedder::default())
        .add_route(coding_route)
        .add_route(devops_route)
        .threshold(0.80)
        .build().await?;
    
    let route = router.call("Explain the borrow checker").await?;
    // → "coding"
    

    The try-langchain-rust Examples

    All 13 examples are runnable from try-langchain-rust; a selection:

    cargo run --example llm_chain         # Prompt templates + LLM chain
    cargo run --example conversational    # Multi-turn memory
    cargo run --example ollama            # Local LLM (no API key)
    cargo run --example streaming         # Token-by-token output
    cargo run --example vector_store      # SQLite similarity search
    cargo run --example doc_loader        # Text/CSV loading + splitting
    cargo run --example qa_chain          # Q&A over documents
    cargo run --example rag_chat          # Conversational RAG
    cargo run --example agent             # Agent with tools
    cargo run --example sequential        # Chained pipelines
    cargo run --example semantic_router   # Route by meaning
    

    Rig vs. langchain-rust

    Dimension Rig langchain-rust
    Focus Agent construction Chain orchestration
    Type safety Strong (Tool trait, typed extraction) Moderate (macro-based prompt building)
    RAG In-memory vector store, embeddings SQLite/Postgres/Qdrant + document loaders
    Multi-agent Agents as tools (composable) Agent executor with tool selection
    Memory Manual history management Built-in SimpleMemory, auto context
    Chains Single agent pipelines Sequential chains, conversational retriever
    Maturity v0.32, active development v4.6.0, stable API
    Local LLM Ollama native Ollama supported
    Best for Type-safe agents, tool calling Multi-step pipelines, RAG, document ingestion

    They’re complementary more than competing. A project could use Rig for the agent layer and langchain-rust for document ingestion and retrieval.

    What’s Next for pjmai-rs

    The Phase 4 roadmap for pjmai-rs includes AI integration:

    • AI context injection: ctpj --for-agent already outputs project metadata as JSON for AI prompts
    • Restricted PATH mode: Sandboxed environments for autonomous agents
    • AI-assisted discovery: Let agents find and register projects automatically

    The question isn’t whether pjmai-rs will use Rig or langchain-rust—it’s which patterns from each framework make sense for a CLI tool that helps AI agents navigate codebases.

    References

    Resource Link
    Rig Framework rig.rs
    Rig Docs docs.rs/rig-core
    langchain-rust crates.io/crates/langchain-rust
    langchain-rust Source github.com/Abraxas-365/langchain-rust
    Rust 2024 Edition Guide doc.rust-lang.org/edition-guide
    “Sharpen the Saw” The 7 Habits of Highly Effective People (Stephen Covey)
    pjmai-rs Background TBT: PJMAI-RS
    pjmai-rs Features Navigation History and Fuzzy Completion

    Habit 7: Sharpen the Saw. Fix the foundation first, then build higher.

    Part 1 of the Sharpen the Saw Sundays series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 594 words3 min readAbstract

    pjmai-rs: Navigation History and Fuzzy Completion

    Since the TBT post on pjmai-rs, development has continued. This post covers the new features that make project navigation even faster.

    Resource Link
    Repo sw-cli-tools/pjmai-rs
    Background TBT: PJMAI-RS
    Comments Discord

    The new hypj command (or pjmai history) shows where you’ve been:

    hypj
    # Output:
    # 1. webapp      ~/code/webapp
    # 2. api         ~/code/api
    # 3. config      ~/code/config
    # 4. docs        ~/code/docs
    

    Jump directly to a history entry:

    hypj 3    # Jump to config (entry 3)
    

    This is faster than remembering project names when you’re bouncing between several repos.

    Stack Management

    The push/pop workflow now has explicit stack visibility:

    stpj              # Show current stack
    stpj clear        # Clear the stack (with confirmation)
    

    When you use chpj (direct navigation) instead of push/pop, the stack is automatically cleared. This prevents confusion when mixing navigation styles.

    The popj command now shows context:

    popj
    # Output: Returning to webapp (1 remaining)
    

    Subdirectory Navigation

    Navigate directly into subdirectories with tab completion:

    chpj myproject<TAB>           # Complete project name
    chpj myproject <TAB>          # Complete subdirs: src, tests, docs
    chpj myproject src/<TAB>      # Complete nested: lib, bin
    chpj myproject src/lib<ENTER> # cd to ~/code/myproject/src/lib
    

    Both space and slash syntax work:

    chpj myproject src lib        # Space-separated
    chpj myproject src/lib        # Slash-separated
    

    Helpful error messages:

    chpj myproject nonexistent
    # Error: subdirectory 'nonexistent' not found in project 'myproject'
    
    chpj myproject README.md
    # Error: 'README.md' is a file, not a directory
    

    Smarter Fuzzy Completion

    Tab completion now uses tiered matching:

    1. Prefix matches first: web matches webapp, webapi
    2. Segment matches second: After - boundaries, so api matches my-api
    3. Substring matches last: app finds webapp

    Within each tier, results are sorted by most recently used. The project you switched to five minutes ago appears before one you haven’t touched in weeks.
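A sketch of the tiered matching described above. The function name and tier encoding are illustrative, not pjmai-rs's actual code; a lower tier number means higher priority.

```rust
// Classify a candidate against a query: prefix > segment > substring.
// Returns None when the candidate doesn't match at all.
fn match_tier(query: &str, candidate: &str) -> Option<u8> {
    let (q, c) = (query.to_lowercase(), candidate.to_lowercase());
    if c.starts_with(q.as_str()) {
        Some(0) // tier 1: prefix match
    } else if c.split('-').any(|seg| seg.starts_with(q.as_str())) {
        Some(1) // tier 2: segment match on '-' boundaries
    } else if c.contains(q.as_str()) {
        Some(2) // tier 3: substring match
    } else {
        None
    }
}

fn main() {
    assert_eq!(match_tier("web", "webapp"), Some(0)); // prefix
    assert_eq!(match_tier("api", "my-api"), Some(1)); // segment after '-'
    assert_eq!(match_tier("app", "webapp"), Some(2)); // substring
    assert_eq!(match_tier("xyz", "webapp"), None);
}
```

A completion list would sort primarily by this tier, then by most-recently-used within each tier.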

    Bulk Operations

    Two new flags for batch management:

    rmpj --all        # Remove all projects (with confirmation)
    scpj --reset      # Clear registry and re-scan (fresh start)
    

    The scan command also handles nickname collisions better now. Instead of numeric suffixes (webapp2), it uses owner-prefixed names (acme-webapp) based on the git remote.

    New Commands Summary

    Command Alias Description
    pjmai history [N] hypj Show or jump to navigation history
    pjmai stack show stpj Show the project stack
    pjmai stack clear stpj clear Clear the stack
    pjmai remove --all rmpj --all Remove all projects
    pjmai scan --reset scpj --reset Fresh re-scan

    What’s Next

    The focus continues on making project context switching invisible. Upcoming work:

    • Nono integration: Sandboxing untrusted projects
    • AI agent restricted mode: Curated PATH for autonomous agents
    • Multi-machine sync: Share project registry across machines

    The best developer tools are the ones you stop noticing.

    Part 7 of the Personal Software series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 996 words5 min readAbstract

    rank-wav: Ranking Audio Files by Acoustic Quality

    You’ve generated 50 variations of a synthesized sound. Or you’ve downloaded a sample pack with hundreds of one-shots. Now what? Listening to each one is tedious. You need a way to rank them—fast.

    Resource Link
    Video rank-wav Demo
    Video
    Original Repo sw-cli-tools/rank-wav-rs
    Current Dev Repo sw-music-tools/rank-wav-rs
    Motivation Why I Built This
    Related Music Generation Tools
    Comments Discord

    What is rank-wav?

    rank-wav is a Rust CLI that scans directories for WAV files and ranks them by acoustic features correlated with perceived sound quality. It extracts features like RMS energy, spectral centroid, and bandwidth, then computes two scores:

    • Pleasing: Favors warm, smooth sounds (low brightness, moderate energy)
    • Best: Favors clear, present sounds (balanced spectrum, strong signal)
    $ rank-wav ./samples --sort pleasing
    
    +---+------------------------+--------+--------+----------+-----------+----------+--------+
    | # | File                   | RMS    | ZCR    | Centroid | Bandwidth | Pleasing | Best   |
    +---+------------------------+--------+--------+----------+-----------+----------+--------+
    | 1 | motif-warm.wav         | 0.0271 | 0.0190 | 763      | 1079      | 0.812    | 0.641  |
    | 2 | motif-balanced.wav     | 0.0647 | 0.0480 | 1502     | 1515      | 0.487    | 0.844  |
    | 3 | motif-bright.wav       | 0.0361 | 0.0530 | 1782     | 1469      | 0.362    | 0.611  |
    +---+------------------------+--------+--------+----------+-----------+----------+--------+
    

    The Features

    Basic Metrics

    Feature What It Measures Quality Correlation
    RMS Signal strength/loudness Present vs weak
    ZCR Zero-crossing rate Noisiness
    Centroid Spectral center of mass Brightness
    Bandwidth Spectral spread Complexity

    Extended Metrics

    With the -e flag, rank-wav also computes:

    Feature What It Measures Quality Correlation
    Rolloff Frequency below which 85% of energy lies High-frequency content
    Flatness How noise-like vs tonal (0-1) Tonal quality
    Crest Peak to RMS ratio (dB) Dynamic range

    How Scoring Works

    The “Pleasing” Score

    The pleasing score favors sounds that are warm and easy to listen to:

    • Lower spectral centroid (less harsh, less bright)
    • Lower spectral bandwidth (less complex, more focused)
    • Lower zero-crossing rate (less noisy)
    • Moderate RMS (present but not aggressive)

    This is useful for: background music, ambient sounds, relaxation audio.

    The “Best” Score

    The best score favors sounds that are clear and impactful:

    • Strong RMS (present, not weak)
    • Moderate spectral centroid (balanced brightness)
    • Moderate bandwidth (neither thin nor muddy)
    • Low zero-crossing rate (clean signal)

    This is useful for: sound design, music production, sample selection.
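To make the two scores concrete, here is an illustrative combination over batch-normalized features in [0, 1]. The weights and formulas are assumptions made for this sketch; rank-wav's actual scoring may differ.

```rust
// Hypothetical feature struct: each field is already normalized to [0, 1]
// relative to the batch.
struct Features { rms: f64, zcr: f64, centroid: f64, bandwidth: f64 }

// "Pleasing": reward low brightness, low complexity, low noise,
// and RMS near the middle of the batch.
fn pleasing(f: &Features) -> f64 {
    let moderate_rms = 1.0 - (f.rms - 0.5).abs() * 2.0;
    ((1.0 - f.centroid) + (1.0 - f.bandwidth) + (1.0 - f.zcr) + moderate_rms) / 4.0
}

// "Best": reward strong RMS, balanced brightness/bandwidth, low noise.
fn best(f: &Features) -> f64 {
    let balanced = |x: f64| 1.0 - (x - 0.5).abs() * 2.0;
    (f.rms + balanced(f.centroid) + balanced(f.bandwidth) + (1.0 - f.zcr)) / 4.0
}

fn main() {
    let warm = Features { rms: 0.4, zcr: 0.1, centroid: 0.2, bandwidth: 0.2 };
    let harsh = Features { rms: 0.6, zcr: 0.7, centroid: 0.9, bandwidth: 0.8 };
    assert!(pleasing(&warm) > pleasing(&harsh)); // warm wins on pleasing
    assert!(best(&warm) > best(&harsh));         // harsh+noisy loses on both
}
```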

    Use Cases

    Procedural Audio Triage

    You’ve generated 100 variations of a procedural sound. Instead of listening to all of them:

    rank-wav ./generated -r --sort best | head -20
    

    Listen to the top 20. If none work, adjust your synthesis parameters and try again.

    Sample Library Organization

    A 500-sample library is overwhelming. Rank by pleasing score to find the smoothest, warmest options first:

    rank-wav ./samples -r --sort pleasing --json > ranked.json
    

    Then use the JSON to build a playlist of just the top tier.

    A/B Testing Synthesis Parameters

    Compare two batches of outputs:

    rank-wav ./batch-a -r --sort best
    rank-wav ./batch-b -r --sort best
    

    Which batch has higher average scores? That tells you which parameter set produces better results.

    Motivation

    I am automating video production and want unique music intros/outros for most videos, without manually specifying synthesis inputs or reviewing every output. Instead, an AI agent generates candidate WAV files from an idea or description I provide, then picks the best ones for the preview I review before uploading.

    Technical Implementation

    Pure Rust

    rank-wav uses only Rust crates with no C dependencies:

    • hound for WAV parsing (8/16/24/32-bit int, 32-bit float)
    • rustfft for FFT-based spectral analysis
    • clap for CLI parsing
    • tabled for formatted output

    No system library dependencies—clone, build, and run.

    Windowed FFT

    To compute spectral features, rank-wav:

    1. Extracts the center segment of the audio (up to 16384 samples)
    2. Applies a Hann window to reduce spectral leakage
    3. Computes the FFT
    4. Calculates centroid, bandwidth, rolloff, and flatness from the magnitude spectrum
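The steps above can be sketched with a naive O(n²) DFT standing in for rustfft (fine for illustration; rank-wav uses a real FFT). The Hann window and centroid formulas are the standard ones.

```rust
use std::f64::consts::PI;

// Steps 2-4 of the pipeline: Hann window, magnitude spectrum,
// magnitude-weighted mean frequency (the spectral centroid).
fn spectral_centroid(samples: &[f64], sample_rate: f64) -> f64 {
    let n = samples.len();
    // Step 2: Hann window to reduce spectral leakage.
    let windowed: Vec<f64> = samples
        .iter()
        .enumerate()
        .map(|(i, s)| s * 0.5 * (1.0 - (2.0 * PI * i as f64 / n as f64).cos()))
        .collect();
    // Step 3: naive DFT over the positive-frequency bins.
    let (mut num, mut den) = (0.0, 0.0);
    for k in 0..n / 2 {
        let (mut re, mut im) = (0.0, 0.0);
        for (i, w) in windowed.iter().enumerate() {
            let phase = -2.0 * PI * k as f64 * i as f64 / n as f64;
            re += w * phase.cos();
            im += w * phase.sin();
        }
        let mag = (re * re + im * im).sqrt();
        // Step 4: accumulate the magnitude-weighted frequency.
        num += (k as f64 * sample_rate / n as f64) * mag;
        den += mag;
    }
    num / den
}

fn main() {
    // A 1 kHz sine sampled at 16 kHz: the centroid should sit near 1 kHz.
    let sr = 16_000.0;
    let samples: Vec<f64> = (0..1024)
        .map(|i| (2.0 * PI * 1000.0 * i as f64 / sr).sin())
        .collect();
    let c = spectral_centroid(&samples, sr);
    assert!((c - 1000.0).abs() < 100.0);
}
```

Bandwidth, rolloff, and flatness are computed from the same magnitude spectrum with different summations.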

    Batch Normalization

    All features are normalized relative to the current batch. This means:

    • Scores are meaningful within a comparison set
    • No need for absolute calibration
    • Rankings work regardless of overall loudness

    The trade-off: scores from different runs aren’t directly comparable.
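A sketch of batch-relative min-max normalization, assuming that is the scheme in use (rank-wav's exact normalization may differ):

```rust
// Map each feature value to [0, 1] relative to the batch's min and max.
fn normalize(values: &[f64]) -> Vec<f64> {
    let min = values.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = values.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    values
        .iter()
        .map(|v| if range > 0.0 { (v - min) / range } else { 0.5 })
        .collect()
}

fn main() {
    // Spectral centroids from the example table earlier in the post.
    let centroids = [763.0, 1502.0, 1782.0];
    let n = normalize(&centroids);
    assert_eq!(n[0], 0.0); // batch minimum
    assert_eq!(n[2], 1.0); // batch maximum
    // Rankings survive any overall scaling of the batch (e.g. loudness).
    let scaled: Vec<f64> = centroids.iter().map(|c| c * 10.0).collect();
    assert_eq!(normalize(&scaled), n);
}
```

This is why no absolute calibration is needed, and also why scores from different runs aren't comparable: the reference frame is the batch itself.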

    Installation

    git clone https://github.com/sw-music-tools/rank-wav-rs
    cd rank-wav-rs
    cargo install --path .
    

    The binary installs as rank-wav.

    Example Workflow

    # Scan recursively, sort by best, output JSON
    rank-wav ./my-samples -r -e --sort best --json > results.json
    
    # Quick table of top pleasing sounds
    rank-wav ./my-samples -r --sort pleasing
    
    # Check a single directory (non-recursive)
    rank-wav ./one-shots
    

    Why Not Just Listen?

    You should still listen—but to the top candidates, not all of them. rank-wav is a filter that surfaces the most promising files based on acoustic characteristics. It’s not a replacement for your ears; it’s a tool to make your ears more productive.

    For generating the WAV files that rank-wav ranks:

    Project Description
    midi-cli-rs CLI for MIDI file manipulation and synthesis
    music-pipe-rs Pipeline for AI-driven music generation

    When you have too many sounds to listen to, let the math do the first pass.

    Part 6 of the Personal Software series. View all parts | Next: Part 7 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1918 words10 min readAbstract

    TBT (6/?): PJMAI-RS - A Shell That Knows Your Projects

    Every developer has faced this: you remember the repo name, but not the full path. You start typing cd ~/github/ and then tab-complete your way through three levels of directories, or worse, open a file browser. For a task you do dozens of times a day, that friction adds up.

    What is PJMAI-RS?

    PJMAI-RS (Project Manager AI - Rust) is a CLI tool that maintains a registry of your project directories. You give each project a short nickname, then switch to it instantly:

    chpj blog     # jump to ~/github/softwarewrighter/blog-planning
    chpj api      # jump to ~/work/services/customer-api
    chpj notes    # open ~/Documents/notes.md in your editor
    

    No more remembering paths. No more tab-completion marathons. Just type the nickname.

    The Shell Integration Problem

    Here’s the fundamental challenge: CLI tools run as subprocesses. A subprocess cannot change the parent shell’s working directory. When your Rust binary calls chdir(), it changes its own directory, then exits—leaving your shell exactly where it started.

    Most tools solve this with wrapper functions that eval output or source scripts. PJMAI-RS uses a cleaner approach: exit code signaling.

    Exit Code Meaning Shell Action
    0 Normal output Print to console
    2 Directory path Execute cd <path>
    3 File path Execute source <path>
    4 Error Display error message
    5 Script Execute eval <output>

    A minimal shell wrapper checks the exit code and takes the appropriate action. The Rust binary stays focused on logic; the shell handles environment manipulation.
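From the binary's side, the protocol is just a mapping from an action to an exit code. A sketch with illustrative type names (not pjmai-rs's actual source):

```rust
// One variant per row of the exit-code table above.
enum ShellAction {
    Output,     // exit 0: print to console
    ChangeDir,  // exit 2: wrapper runs `cd <path>`
    SourceFile, // exit 3: wrapper runs `source <path>`
    Error,      // exit 4: wrapper displays the error
    EvalScript, // exit 5: wrapper evals the output
}

fn exit_code(action: ShellAction) -> i32 {
    match action {
        ShellAction::Output => 0,
        ShellAction::ChangeDir => 2,
        ShellAction::SourceFile => 3,
        ShellAction::Error => 4,
        ShellAction::EvalScript => 5,
    }
}

fn main() {
    // The real binary prints its payload (e.g. the target path), then
    // calls std::process::exit(exit_code(action)) for the wrapper to read.
    assert_eq!(exit_code(ShellAction::Output), 0);
    assert_eq!(exit_code(ShellAction::ChangeDir), 2);
    assert_eq!(exit_code(ShellAction::SourceFile), 3);
    assert_eq!(exit_code(ShellAction::Error), 4);
    assert_eq!(exit_code(ShellAction::EvalScript), 5);
}
```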

    Quick Switching Features

    Fuzzy Matching

    PJMAI-RS finds projects using cascading match strategies:

    1. Exact match: blog matches blog
    2. Prefix match: bl matches blog
    3. Substring match: log matches blog
    4. Case-insensitive: BLOG matches blog

    Typos and partial names usually work. If multiple projects match, it lists them.

    Stack Navigation

    Sometimes you need to check something in another project, then return:

    chpj webapp       # working on the webapp
    pspj api          # push webapp to stack, switch to api
    # ... check something ...
    popj              # pop back to webapp
    

    The stack handles nested pushes. You can dive three projects deep and pop back through each one.
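The stack itself can be sketched as a plain LIFO of project names (illustrative, not pjmai-rs's actual state format):

```rust
// Push saves the current project; pop restores the most recent one and
// reports how many remain, matching popj's "(N remaining)" output.
struct ProjectStack {
    stack: Vec<String>,
    current: String,
}

impl ProjectStack {
    fn push_to(&mut self, next: &str) {
        let prev = std::mem::replace(&mut self.current, next.to_string());
        self.stack.push(prev);
    }
    fn pop(&mut self) -> Option<usize> {
        self.current = self.stack.pop()?;
        Some(self.stack.len())
    }
}

fn main() {
    let mut s = ProjectStack { stack: vec![], current: "webapp".into() };
    s.push_to("api");    // one level deep
    s.push_to("config"); // two levels deep
    assert_eq!(s.current, "config");
    assert_eq!(s.pop(), Some(1)); // back to api, 1 remaining
    assert_eq!(s.current, "api");
    assert_eq!(s.pop(), Some(0)); // back to webapp
    assert_eq!(s.current, "webapp");
    assert_eq!(s.pop(), None);    // nothing left to pop
}
```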

    Per-Project Environments

    Each project can define its own environment:

    evpj api NODE_ENV=development    # set env var for this project
    evpj api PATH_PREPEND=./bin      # add to PATH when entering
    

    Or create a .pjmai.sh file in the project root:

    # .pjmai.sh
    export DATABASE_URL="postgres://localhost/dev"
    source .venv/bin/activate
    

    PJMAI-RS uses hash-based approval: the first time it sees a .pjmai.sh, it asks for permission and records the hash. Future visits source it automatically. If the file changes, it asks again.

    This prevents drive-by script execution while enabling seamless per-project setup.
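The approval logic reduces to comparing a recorded hash against the file's current hash. A sketch using std's DefaultHasher as a stand-in for whatever hash pjmai-rs actually records:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn content_hash(contents: &str) -> u64 {
    let mut h = DefaultHasher::new();
    contents.hash(&mut h);
    h.finish()
}

// Approved only if a hash was recorded and it matches the current file.
fn is_approved(contents: &str, recorded: Option<u64>) -> bool {
    recorded == Some(content_hash(contents))
}

fn main() {
    let script = "export DATABASE_URL=postgres://localhost/dev";
    // First visit: nothing recorded, so ask the user.
    assert!(!is_approved(script, None));
    // User approves; the hash is recorded for future visits.
    let recorded = Some(content_hash(script));
    assert!(is_approved(script, recorded));
    // The file changed: hash mismatch, so ask again.
    assert!(!is_approved("curl example.com/x.sh | sh", recorded));
}
```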

    Auto-Detection

    PJMAI-RS detects common development environments:

    Environment Detection Action
    Python venv .venv/, venv/ directories Activate virtual environment
    Node.js package.json + .nvmrc Switch Node version
    Rust Cargo.toml Set up cargo environment
    direnv .envrc Respect direnv configuration

    When you chpj to a Python project, it activates the venv. Jump to a Node project, it switches to the right Node version. No manual setup.

    Repository Scanning

    Don’t want to add projects one by one? Scan for them:

    scpj ~/github     # find all git repos
    

    PJMAI-RS walks the directory tree, finds git repositories, parses remote URLs to suggest groups (by GitHub org, for example), and generates unique nicknames. Collisions get suffixes.

    A single command can populate your entire project list.
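The core of the scan is a recursive walk that stops at the first `.git` it finds. A sketch using only std (pjmai-rs additionally parses remotes to suggest groups and nicknames):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Collect every directory that contains a `.git` entry; once a repo is
// found, don't descend into it looking for nested repos.
fn find_git_repos(root: &Path, out: &mut Vec<PathBuf>) {
    if root.join(".git").exists() {
        out.push(root.to_path_buf());
        return;
    }
    if let Ok(entries) = fs::read_dir(root) {
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                find_git_repos(&path, out);
            }
        }
    }
}

fn main() {
    // Build a tiny tree in the temp dir: one repo, one plain directory.
    let base = std::env::temp_dir().join("scan-demo");
    let _ = fs::remove_dir_all(&base);
    fs::create_dir_all(base.join("webapp/.git")).unwrap();
    fs::create_dir_all(base.join("scratch")).unwrap();

    let mut repos = Vec::new();
    find_git_repos(&base, &mut repos);
    assert_eq!(repos, vec![base.join("webapp")]); // only the repo is found

    let _ = fs::remove_dir_all(&base);
}
```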

    Motivation

    One of my development systems holds 200 repositories, and each system I use has a different set. I was spending too much time finding and navigating with cd, du, fd, and ls, so I turned my private pjm1 project into this public pjmai-rs project. The goals: let AI agents use the tool to navigate related projects, use AI to analyze project status, spare agents from explicitly activating Python virtual environments, and restrict agents via nono.

    AI Agent Support

    The design explicitly supports AI coding agents:

    • --json flag: Every command outputs machine-readable JSON
    • Context export: pjmai context generates project metadata optimized for system prompts
    • Structured errors: Errors include suggestions the agent can act on

    When an AI agent needs to know “what project am I in?” or “what build commands are available?”, it can query PJMAI-RS directly.

    Project Groups

    Projects are automatically grouped by directory structure. If you have multiple repos under ~/github/sw-cli-tools/, they form a group:

    $ shgp
    Group: sw-cli-tools
    Path: ~/github/sw-cli-tools
    Projects: 4
    
    $ shgp --all
    Group: sw-cli-tools
    Path: ~/github/sw-cli-tools
    Projects: 4
    
      umap2                ~/github/sw-cli-tools/umap
      favicon2             ~/github/sw-cli-tools/favicon
    > pjmai-rs             ~/github/sw-cli-tools/pjmai-rs
      sw-cli2              ~/github/sw-cli-tools/sw-cli
    

    The > marks the current project. Groups are inferred from git remote URLs during scanning, so projects from the same GitHub org cluster together.

    Shell Aliases

    After running pjmai setup, you get short aliases:

    Alias Command Purpose
    adpj add Add a project
    chpj change Switch to a project
    lspj list List all projects
    rmpj remove Remove a project
    shpj show Show project details
    mvpj rename Rename a project
    pspj push Push and switch
    popj pop Pop and return
    prpj prompt Current project for shell prompt
    scpj scan Scan for repositories
    evpj env Set environment config
    ctpj context Export context for AI
    srcpj - Source .pjmai.sh manually
    hlpj aliases Show all aliases

    Group aliases:

    Alias Command Purpose
    lsgp group list List all groups
    shgp group show Show current/named group
    prgp group prompt Current group for shell prompt

    The pattern: two-letter operation + pj for projects, + gp for groups.

    Why Rust?

    Years ago, while learning Rust, I created a private project called pjm1 to explore how a subprocess could signal directory changes to its parent shell. PJMAI-RS is a fork of that project, created for this blog post with additional features.

    Rust brings practical benefits:

    • Speed: Instant startup, fast scanning
    • Distribution: Single binary, no runtime dependencies
    • Shell completions: Generated at compile time for Bash, Zsh, Fish, PowerShell
    • Learning: A good vehicle for understanding systems programming concepts

    The Throwback

    The core idea—giving projects nicknames and switching fast—isn’t new. I first used something like this around 2000, based on shell scripts by Russ Tremain (vspms). Those scripts worked well. Over the years I built my own variations: first as shell functions, then a shell script, then pjm1 in Rust, now PJMAI-RS.

    What’s changed isn’t the concept but the context. Modern tooling (clap for arg parsing, serde for serialization, proper exit code signaling) makes the Rust implementation clean. AI agent support makes it relevant to how development workflows are evolving.

    Getting Started

    Install with cargo:

    cargo install pjmai
    

    Or clone and build:

    git clone https://github.com/softwarewrighter/sw-cli-tools
    cd sw-cli-tools/pjmai-rs
    cargo install --path .
    

    Then configure your shell:

    pjmai setup >> ~/.bashrc   # or ~/.zshrc
    source ~/.bashrc
    

    Add your first project:

    adpj myproject ~/path/to/project
    chpj myproject
    

    Current Status

    PJMAI-RS is at version 0.4, completing phase three (full environment configuration).

    What’s Next

    Phase 4: Sandboxing

    The next major feature is sandboxing for untrusted projects. Three integration paths are planned:

    Nono Integration

    nono-rs is an anti-sudo tool that intercepts and blocks privileged commands. When you’re reviewing untrusted code or letting an AI agent run commands, you don’t want accidental (or malicious) sudo rm -rf /.

    [[project]]
    name = "untrusted-code"
    [project.sandbox]
    use_nono = true
    nono_mode = "deny"  # deny, log, or prompt
    

    When you switch to a nono-protected project:

    $ chpj untrusted-code
    🔒 Nono active: sudo commands will be blocked
    

    The agent can run cargo build, git status, ls—but sudo gets intercepted.

    AI Agent Restricted Mode

    For AI coding agents, PJMAI-RS will support restricted PATH configurations:

    $ chpj myproject --agent
    🔒 AI Agent mode: restricted PATH active
    Allowed: cargo, git, ls, find, grep, pjmai
    Blocked: rm, sudo, curl, wget, ssh
    

    The agent gets a curated set of safe commands. Everything else is blocked at the shell level.
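One way such a restricted PATH could work (a sketch of one possible mechanism, not the planned implementation): build a directory of symlinks to only the allowed commands and make it the sole PATH entry.

```python
import shutil
import tempfile
from pathlib import Path

def build_restricted_path(allowed: list[str]) -> str:
    """Create a directory of symlinks to allowed commands, usable as the only PATH entry."""
    sandbox = Path(tempfile.mkdtemp(prefix="pjmai-agent-"))
    for cmd in allowed:
        full = shutil.which(cmd)
        if full:                     # skip anything not installed on this machine
            (sandbox / cmd).symlink_to(full)
    return str(sandbox)

# Hypothetical usage:
# os.environ["PATH"] = build_restricted_path(
#     ["cargo", "git", "ls", "find", "grep", "pjmai"])
```

Anything not symlinked simply fails command lookup, so `rm`, `curl`, and friends disappear from the agent's world.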

    Container Integration

    For full isolation, projects can be configured to run inside containers:

    [[project]]
    name = "isolated-dev"
    [project.container]
    type = "docker"  # or podman, lima
    image = "rust:1.75-slim"
    enter_on_switch = true
    

    Switching to the project drops you into the container automatically.

    Phase 5: Multi-Machine Sync

    Share your project registry across machines:

    • Sync via git repository
    • Import/export configurations
    • Handle path differences between machines (home directory mappings)

    References

    Resource Link
    PJMAI-RS Repository github.com/softwarewrighter/sw-cli-tools
    nono-rs (Anti-Sudo) docs.rs/crate/nono-rs
    Clap (CLI Parser) clap.rs
    Shell Exit Codes Exit Status (Wikipedia)

    Historical Context

    Era Resource Link
    1980s BSD SPMS 4.3BSD SPMS README
    1980s CMU SEI SCM Support Materials for Software Configuration Management
    2013 vspms github.com/rustt/vspms

    Note: The chpj-style commands were informal add-ons shared between developers, not part of the official SPMS distribution. Documentation from that era is hard to find online.


    Sometimes the best tools are the ones that remove friction from things you do constantly. Switching projects is one of those things.

    Part 6 of the Throwback Thursday series. View all parts | Next: Part 7 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 3559 words18 min readAbstract

    Five ML Concepts - #30: The Journey So Far

    30 episodes. 145 machine learning concepts.

    Resource Link
    Full Series Five ML Concepts Episodes 1-29
    Video Five ML Concepts #30
    Papers Index Complete Concept Index
    Comments Discord

    The Journey So Far

    For the past thirty episodes, we’ve explored 145 machine learning concepts in under 30 seconds each.

    From backpropagation to scaling laws. From dropout to distribution shift. From RAG to reward hacking.

    We covered:

    • Foundations — the building blocks of neural networks and learning algorithms
    • Failure modes — how models break, overfit, forget, and hallucinate
    • Deployment realities — what happens when models meet production
    • Alignment challenges — ensuring models do what we actually want

    What’s Changing

    Machine learning is evolving rapidly. The foundational primitives are now well-established—the concepts we covered form a stable vocabulary.

    But new research is reshaping how we apply these primitives:

    • Memory and retrieval architectures
    • Reasoning and planning systems
    • Sparsity and efficiency at scale
    • Robustness and generalization
    • Alignment and safety

    30 seconds per concept was a good start. But some ideas deserve more depth.

    What’s Next: Frontier ML Thinking

    Starting soon: Frontier ML Thinking.

    One concept. Two minutes. Deeper implications.

    We’ll explore the cutting edge—ideas from papers published in the last 12 months that build on the foundations we’ve covered.

    If You’re New Here

    Start with Five ML Concepts Episodes 1–29. Each episode covers five concepts in five minutes total. The full series provides a foundation in modern machine learning vocabulary.

    If You’ve Been Here the Whole Time

    You’re ready for the frontier.


    Why the Papers Look “Old”

    When I tabulated the papers behind the 145 concepts in this series, something looked odd: almost none of the cited papers were from the last two years.

    This is not a mistake—it’s a feature of how ML knowledge evolves.

    Seminal papers don’t keep getting re-written

    Most concepts in this series are primitives: backpropagation, transformers, RAG, dropout, calibration, and so on. Each primitive has an origin paper that introduced it. Once the primitive exists, later research focuses on:

    • Scaling it
    • Combining it with other ideas
    • Benchmarking it
    • Making it more efficient
    • Making it safer

    That kind of work produces new papers, but not new “origin papers.”

    What this reveals about the field

    The core intellectual breakthroughs of modern ML largely occurred between 2016 and 2022. The frontier has since shifted from inventing new primitives to:

    • Memory and retrieval systems
    • Continual learning
    • Agent architectures
    • Tool use and planning
    • Sparsity and efficiency at scale
    • Alignment and safety

    That’s exactly what Frontier ML Thinking will explore: ideas from papers published in the last 12 months that build on these foundations.


    Complete Concept Index

    All 145 concepts organized chronologically by seminal paper year.

    Pre-1990

    1950s

    1958

    Concept Links (Post, Video, Paper)
    Perceptron Post 5 | Video 5 | (1958) The Perceptron

    1960s

     
    None

    1970s

     
    None

    1980s

    1986

    Concept Links (Post, Video, Paper)
    Backpropagation Post 1 | Video 1 | (1986) Learning representations by back-propagating errors
    RNN Post 11 | Video 11 | (1986) Learning representations

    1989

    Concept Links (Post, Video, Paper)
    Universal Approximation Post 13 | Video 13 | (1989) Approximation by Superpositions

    1990s

    1995

    Concept Links (Post, Video, Paper)
    Cross-Validation Post 7 | Video 7 | (1995) A Study of Cross-Validation

    1997

    Concept Links (Post, Video, Paper)
    LSTM Post 22 | Video 22 | (1997) Long Short-Term Memory

    1998

    Concept Links (Post, Video, Paper)
    Early Stopping Post 13 | Video 13 | (1998) Early Stopping - But When?

    2000s

    2000

    Concept Links (Post, Video, Paper)
    Ensembling Post 18 | Video 18 | (2000) Ensemble Methods

    2002

    Concept Links (Post, Video, Paper)
    Cold Start Problems Post 14 | Video 14 | (2002) Addressing Cold Start

    2003

    Concept Links (Post, Video, Paper)
    Perplexity Post 15 | Video 15 | (2003) A Neural Probabilistic Language Model

    2006

    Concept Links (Post, Video, Paper)
    Autoencoders Post 19 | Video 19 | (2006) Reducing Dimensionality
    ROC / AUC Post 14 | Video 14 | (2006) An Introduction to ROC Analysis

    2007

    Concept Links (Post, Video, Paper)
    Precision vs Recall Post 12 | Video 12 | (2007) The Truth of the F-Measure

    2009

    Concept Links (Post, Video, Paper)
    A/B Testing Models Post 16 | Video 16 | (2009) Controlled Experiments
    Bias-Variance Tradeoff Post 8 | Video 8 | (2009) Elements of Statistical Learning
    Correlation vs Causation Post 19 | Video 19 | (2009) Causality
    Covariate Shift Post 19 | Video 19 | (2009) Dataset Shift in ML
    Curriculum Learning Post 19 | Video 19 | (2009) Curriculum Learning
    Curse of Dimensionality Post 15 | Video 15 | (2009) Elements of Statistical Learning
    Distribution Shift Post 11 | Video 11 | (2009) Dataset Shift in ML
    Why ML Is Fragile Post 18 | Video 18 | (2009) Distribution Shift
    Why More Data Beats Better Models Post 22 | Video 22 | (2009) Unreasonable Effectiveness of Data

    2010s

    2010

    Concept Links (Post, Video, Paper)
    Transfer Learning Post 4 | Video 4 | (2010) A Survey on Transfer Learning
    Weight Initialization Post 15 | Video 15 | (2010) Understanding Difficulty of Training

    2011

    Concept Links (Post, Video, Paper)
    Spurious Correlations Post 14 | Video 14 | (2011) Unbiased Look at Dataset Bias

    2012

    Concept Links (Post, Video, Paper)
    CNN Post 10 | Video 10 | (2012) ImageNet Classification with Deep CNNs
    Data Leakage Post 24 | Video 24 | (2012) Leakage in Data Mining

    2013

    Concept Links (Post, Video, Paper)
    Adversarial Examples Post 25 | Video 25 | (2013) Intriguing properties of neural networks
    Embedding Post 1 | Video 1 | (2013) Word2Vec
    Gradient Clipping Post 14 | Video 14 | (2013) Difficulty of Training RNNs
    Latent Space Post 5 | Video 5 | (2013) Auto-Encoding Variational Bayes
    Representation Learning Post 25 | Video 25 | (2013) Representation Learning: A Review
    VAEs Post 20 | Video 20 | (2013) Auto-Encoding Variational Bayes

    2014

    Concept Links (Post, Video, Paper)
    Adam Post 4 | Video 4 | (2014) Adam: Stochastic Optimization
    Attention Post 2 | Video 2 | (2014) Neural Machine Translation
    Dropout Post 9 | Video 9 | (2014) Dropout: Prevent Overfitting
    Encoder-Decoder Post 10 | Video 10 | (2014) Sequence to Sequence Learning
    GRU Post 21 | Video 21 | (2014) Gated Recurrent Neural Networks
    Memory-Augmented Networks Post 27 | Video 27 | (2014) Neural Turing Machines
    Mode Collapse Post 24 | Video 24 | (2014) Generative Adversarial Nets
    Overfitting Post 3 | Video 3 | (2014) Dropout
    Regularization Post 6 | Video 6 | (2014) Dropout
    Temperature Post 2 | Video 2 | (2014) Properties of Neural MT

    2015

    Concept Links (Post, Video, Paper)
    Batch Normalization Post 16 | Video 16 | (2015) Batch Normalization
    Distillation Post 10 | Video 10 | (2015) Distilling Knowledge
    Label Smoothing Post 25 | Video 25 | (2015) Rethinking Inception
    Learning Rate Post 2 | Video 2 | (2015) Cyclical Learning Rates
    Tokenization Post 3 | Video 3 | (2015) Subword Units

    2016

    Concept Links (Post, Video, Paper)
    Activation Functions Post 4 | Video 4 | (2016) Deep Learning Book
    Benchmark Leakage Post 17 | Video 17 | (2016) Rethinking Inception
    Checkpointing Post 13 | Video 13 | (2016) Sublinear Memory Cost
    Epoch Post 18 | Video 18 | (2016) Deep Learning Book
    Gradient Descent Post 2 | Video 2 | (2016) Overview of Gradient Descent
    Inference Post 9 | Video 9 | (2016) Deep Learning Book
    Learning Rate Schedules Post 23 | Video 23 | (2016) SGDR: Warm Restarts
    Loss Surface Sharpness Post 23 | Video 23 | (2016) Large-Batch Training
    Reward Hacking Post 24 | Video 24 | (2016) Concrete Problems in AI Safety
    Softmax Post 11 | Video 11 | (2016) Deep Learning Book
    Train/Validation/Test Split Post 16 | Video 16 | (2016) Deep Learning Book

    2017

    Concept Links (Post, Video, Paper)
    Batch Size Post 12 | Video 12 | (2017) Large-Batch Training
    Calibration Post 13 | Video 13 | (2017) On Calibration
    Catastrophic Forgetting Post 15 | Video 15 | (2017) Overcoming Catastrophic Forgetting
    Conditional Computation Post 28 | Video 28 | (2017) Sparsely-Gated MoE
    Context Window Post 7 | Video 7 | (2017) Attention Is All You Need
    Elastic Weight Consolidation Post 27 | Video 27 | (2017) Overcoming Catastrophic Forgetting (EWC)
    Gradient Noise Post 20 | Video 20 | (2017) SGD as Approximate Bayesian Inference
    Loss Function Post 3 | Video 3 | (2017) Survey of Loss Functions
    Miscalibration Post 25 | Video 25 | (2017) On Calibration
    Mixed Precision Post 8 | Video 8 | (2017) Mixed Precision Training
    MoE Post 11 | Video 11 | (2017) Sparsely-Gated MoE
    OOD Inputs Post 12 | Video 12 | (2017) Detecting Misclassified Examples
    Optimization vs Generalization Post 16 | Video 16 | (2017) Rethinking Generalization
    Overconfidence Post 16 | Video 16 | (2017) On Calibration
    Parameter Routing Post 27 | Video 27 | (2017) Sparsely-Gated MoE
    Positional Encoding Post 6 | Video 6 | (2017) Attention Is All You Need
    Self-Attention Post 7 | Video 7 | (2017) Attention Is All You Need
    Sparse Activation Post 28 | Video 28 | (2017) Sparsely-Gated MoE
    Transformer Post 1 | Video 1 | (2017) Attention Is All You Need
    Uncertainty Estimation Post 20 | Video 20 | (2017) Uncertainties in Bayesian DL
    Warmup Post 24 | Video 24 | (2017) Accurate Large Minibatch SGD
    Why Interpretability Is Hard Post 20 | Video 20 | (2017) Rigorous Science of Interpretability

    2018

    Concept Links (Post, Video, Paper)
    BERT Post 6 | Video 6 | (2018) BERT: Pre-training
    Concept Drift vs Data Drift Post 17 | Video 17 | (2018) Learning under Concept Drift
    Inductive Bias Post 12 | Video 12 | (2018) Relational Inductive Biases
    Loss Landscapes Post 14 | Video 14 | (2018) Visualizing Loss Landscape
    Pre-training Post 5 | Video 5 | (2018) BERT

    2019

    Concept Links (Post, Video, Paper)
    Data Augmentation Post 26 | Video 26 | (2019) Survey on Data Augmentation
    Double Descent Post 25 | Video 25 | (2019) Deep Double Descent
    GPT Post 7 | Video 7 | (2019) Language Models are Unsupervised Multitask Learners
    Inference Parallelism Post 28 | Video 28 | (2019) Megatron-LM
    Lottery Ticket Hypothesis Post 28 | Video 28 | (2019) The Lottery Ticket Hypothesis
    Manifold Hypothesis Post 26 | Video 26 | (2019) Intro to VAEs
    Monitoring & Drift Detection Post 15 | Video 15 | (2019) Detecting Dataset Shift
    Replay Buffers Post 27 | Video 27 | (2019) Experience Replay
    Weight Decay Post 17 | Video 17 | (2019) Decoupled Weight Decay

    2020s

    2020

    Concept Links (Post, Video, Paper)
    Diffusion Models Post 8 | Video 8 | (2020) Denoising Diffusion
    Few-shot Learning Post 10 | Video 10 | (2020) Language Models are Few-Shot Learners
    Fine-tuning Post 3 | Video 3 | (2020) Survey on Transfer Learning
    ICL (In-Context Learning) Post 5 | Video 5 | (2020) Language Models are Few-Shot Learners
    Neural Collapse Post 29 | Video 29 | (2020) Prevalence of Neural Collapse
    Preference Learning Post 18 | Video 18 | (2020) Learning to Summarize
    Prompting Post 6 | Video 6 | (2020) Language Models are Few-Shot Learners
    RAG Post 10 | Video 10 | (2020) Retrieval-Augmented Generation
    Scaling Laws Post 17 | Video 17 | (2020) Scaling Laws for Neural Language Models
    Self-Training Instability Post 29 | Video 29 | (2020) Understanding Self-Training
    Shortcut Learning Post 13 | Video 13 | (2020) Shortcut Learning in DNNs

    2021

    Concept Links (Post, Video, Paper)
    Failure Analysis Post 19 | Video 19 | (2021) Practical ML for CV
    Human-in-the-Loop Systems Post 20 | Video 20 | (2021) Human-in-the-Loop ML
    Latency vs Throughput Post 12 | Video 12 | (2021) Efficient Large-Scale Training
    LoRA Post 3 | Video 3 | (2021) LoRA: Low-Rank Adaptation
    Mechanistic Interpretability Post 29 | Video 29 | (2021) Transformer Circuits
    Quantization Post 9 | Video 9 | (2021) Survey of Quantization Methods
    RoPE Post 6 | Video 6 | (2021) RoFormer
    SAM Post 29 | Video 29 | (2021) Sharpness-Aware Minimization
    VLM Post 4 | Video 4 | (2021) CLIP

    2022

    Concept Links (Post, Video, Paper)
    Chain of Thought Post 11 | Video 11 | (2022) Chain-of-Thought Prompting
    Compute Optimality Hypothesis Post 28 | Video 28 | (2022) Chinchilla
    Constitutional AI Post 26 | Video 26 | (2022) Constitutional AI
    Cost vs Quality Tradeoffs Post 18 | Video 18 | (2022) Efficient Transformers
    Emergent Behavior Post 23 | Video 23 | (2022) Emergent Abilities
    Flash Attention Post 9 | Video 9 | (2022) FlashAttention
    Goodhart’s Law Post 26 | Video 26 | (2022) Goodhart’s Law and ML
    Grokking Post 29 | Video 29 | (2022) Grokking
    KV Cache Post 8 | Video 8 | (2022) Fast Transformer Decoding
    RLHF Post 9 | Video 9 | (2022) Training with Human Feedback
    Shadow Deployment Post 17 | Video 17 | (2022) Reliable ML
    Speculative Decoding Post 5 | Video 5 | (2022) Fast Inference via Speculative Decoding
    Superposition Post 4 | Video 4 | (2022) Toy Models of Superposition

    2023

    Concept Links (Post, Video, Paper)
    DPO Post 2 | Video 2 | (2023) Direct Preference Optimization
    GQA Post 7 | Video 7 | (2023) GQA: Training Generalized Multi-Query
    Hallucination Post 1 | Video 1 | (2023) Survey of Hallucination
    Jailbreaks Post 21 | Video 21 | (2023) Jailbroken
    Mamba Post 1 | Video 1 | (2023) Mamba: Linear-Time Sequence Modeling
    Model Editing Post 27 | Video 27 | (2023) Editing LLMs
    Model Steerability Post 22 | Video 22 | (2023) Controllable Generation
    Planning vs Prediction Post 21 | Video 21 | (2023) AI/ML Gap
    Prompt Injection Post 21 | Video 21 | (2023) Prompt Injection Attack
    RSFT Post 22 | Video 22 | (2023) Scaling Mathematical Reasoning
    Tool Use Post 23 | Video 23 | (2023) Toolformer

    2024

    Concept Links (Post, Video, Paper)
    MLA Post 8 | Video 8 | (2024) DeepSeek-V2

    2025 and Beyond

    Since 2024, no widely-adopted new fundamental ML concepts have emerged. Research has shifted from inventing primitives to composing, scaling, and applying them. Papers from 2025–2026 will be covered in our new series: Frontier ML Thinking—one concept, two minutes, deeper implications.


    Thank you for following along. The journey continues.

    Part 30 of the Five ML Concepts series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 462 words3 min readAbstract

    Five ML Concepts - #29

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #29
    Comments Discord

    References

    Concept Reference
    Neural Collapse Prevalence of Neural Collapse (Papyan et al. 2020)
    Grokking Grokking: Generalization Beyond Overfitting (Power et al. 2022)
    SAM Sharpness-Aware Minimization (Foret et al. 2021)
    Mechanistic Interpretability Transformer Circuits (Anthropic 2021)
    Self-Training Instability Understanding Self-Training (Wei et al. 2020)

    Today’s Five

    1. Neural Collapse

    In overparameterized networks trained to zero loss, class representations converge late in training to a symmetric, maximally separated structure. The last-layer features and classifiers align into a simplex equiangular tight frame.

    This geometric phenomenon appears universally across architectures.

    Like students settling into evenly spaced seats by the end of class.

    2. Grokking

    In some tasks, especially small algorithmic ones, models memorize quickly but only later suddenly generalize. The jump from memorization to understanding can happen long after training loss reaches zero.

    Weight decay and longer training appear necessary for this phase transition.

    Like cramming facts for an exam, then later realizing you truly understand.

    3. SAM (Sharpness-Aware Minimization)

    Instead of minimizing loss at a single point, SAM minimizes loss under small weight perturbations, finding flatter regions. Flatter minima tend to generalize better than sharp ones.

    The optimizer seeks robustness to parameter noise.

    Like choosing a wide hilltop instead of balancing on a sharp peak.
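A minimal numpy sketch of the two-step SAM update on a toy quadratic loss (simplified full-batch version, not the original implementation):

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One simplified SAM step. loss_grad(w) returns the gradient dL/dw."""
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # step toward the worst nearby point
    g_sharp = loss_grad(w + eps)                 # gradient at the perturbed weights
    return w - lr * g_sharp                      # descend using the sharpness-aware gradient

# Toy quadratic loss L(w) = ||w||^2 / 2, whose gradient is w itself.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = sam_step(w, lambda v: v)
```

The perturbation `eps` ascends the loss locally, so the final descent direction accounts for how the loss behaves in a neighborhood, not just at a point.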

    4. Mechanistic Interpretability

    Researchers analyze activations and internal circuits to understand how specific computations are implemented inside models. The goal is reverse-engineering neural networks into understandable components.

    This reveals attention heads, induction heads, and other interpretable patterns.

    Like mapping the wiring of an unknown machine to see how it works.

    5. Self-Training Instability

    When models train on their own generated data, feedback loops can amplify small errors over time. Each iteration compounds mistakes, causing distributional drift.

    Careful filtering and external grounding help mitigate this.

    Like copying a copy repeatedly until the meaning drifts.

    Quick Reference

    Concept One-liner
    Neural Collapse Late-stage geometric convergence of class representations
    Grokking Sudden generalization after prolonged memorization
    SAM Optimizing for flat loss regions under perturbations
    Mechanistic Interpretability Analyzing internal circuits of neural networks
    Self-Training Instability Feedback loops that amplify errors in self-generated data

    Short, accurate ML explainers. Follow for more.

    Part 29 of the Five ML Concepts series. View all parts | Next: Part 30 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 448 words3 min readAbstract

    Five ML Concepts - #28

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #28
    Comments Discord

    References

    Concept Reference
    Lottery Ticket Hypothesis The Lottery Ticket Hypothesis (Frankle & Carbin 2019)
    Sparse Activation Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
    Conditional Computation Sparsely-Gated MoE + Switch Transformers
    Inference Parallelism Megatron-LM (Shoeybi et al. 2019)
    Compute Optimality Chinchilla Scaling Laws (Hoffmann et al. 2022)

    Today’s Five

    1. Lottery Ticket Hypothesis

    Large neural networks contain smaller subnetworks that, when trained from the right initialization, achieve similar performance. These “winning tickets” exist before training begins.

    The key insight: you can find and train just the winning subnetwork.

    Like finding a winning lottery ticket hidden among many losing ones.
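A one-shot magnitude-pruning sketch of extracting a "ticket" (simplified; the paper prunes iteratively and retrains the masked subnetwork):

```python
import numpy as np

def winning_ticket_mask(trained_w, sparsity=0.75):
    """Keep the largest-magnitude trained weights; prune the rest (one-shot sketch)."""
    k = int(trained_w.size * sparsity)                  # number of weights to prune
    threshold = np.sort(np.abs(trained_w).ravel())[k]   # smallest surviving magnitude
    return (np.abs(trained_w) >= threshold).astype(trained_w.dtype)

rng = np.random.default_rng(0)
w_init = rng.normal(size=(4, 4))                         # the original initialization
w_trained = w_init * rng.uniform(0.5, 2.0, size=(4, 4))  # stand-in for real training
mask = winning_ticket_mask(w_trained)
ticket = w_init * mask        # rewind surviving weights to their initial values
```

The rewind step is the key move: the mask comes from the trained weights, but the surviving weights restart from initialization.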

    2. Sparse Activation

    Only a subset of neurons activate for each input, even in models with many parameters. This allows large capacity without using everything at once.

    Mixture-of-experts architectures explicitly design for this pattern.

    Like a library where only relevant books light up for each query.

    3. Conditional Computation

    The model dynamically activates only certain components depending on the input. Different inputs route to different experts or pathways.

    This improves efficiency and scalability without proportional compute increase.

    Like routing patients to the right specialist instead of seeing every doctor.
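A toy sketch of top-k gating, the routing mechanism that produces both sparse activation and conditional computation in MoE layers (simplified; real routers add load balancing):

```python
import numpy as np

def top_k_route(gate_logits, k=2):
    """Send an input to its k highest-scoring experts, softmaxing over only those k."""
    top = np.argsort(gate_logits)[-k:]       # indices of the k best experts
    weights = np.exp(gate_logits[top])
    return top, weights / weights.sum()

# 8 experts in the layer, but each input activates only 2 of them.
logits = np.array([0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2])
experts, weights = top_k_route(logits)
```

Only the selected experts run their forward pass, so compute stays roughly constant as the expert count grows.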

    4. Inference Parallelism

    Model execution can be split across multiple devices to reduce latency or increase throughput. Tensor parallelism splits individual layers across devices; pipeline parallelism splits the model into sequential stages.

    Essential for serving large models in production.

    Like dividing a puzzle so multiple people work on it simultaneously.

    5. Compute Optimality Hypothesis

    Empirical scaling laws suggest performance improves when model size, data, and compute are balanced. Adding only one resource may not yield optimal gains.

    Chinchilla showed many models were undertrained relative to their size.

    Like baking a cake where proportions matter more than just adding extra ingredients.

    Quick Reference

    Concept One-liner
    Lottery Ticket Hypothesis Small winning subnetworks hidden in large models
    Sparse Activation Using only part of a model per input
    Conditional Computation Dynamically routing inputs for efficiency
    Inference Parallelism Distributing inference across devices
    Compute Optimality Balancing model size, data, and compute

    Short, accurate ML explainers. Follow for more.

    Part 28 of the Five ML Concepts series. View all parts | Next: Part 29 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 899 words5 min readAbstract

    How AI Learns Part 7: Designing a Continuous Learning Agent

    A robust continuous learning agent contains:

    • Core model (rarely updated)
    • Adapters (modular skills)
    • External memory (facts)
    • Context manager (Recursive Language Model (RLM)-style)
    • Logging & evaluation loop
    Resource Link
    Related RLM | Engram | Sleepy Coder
    Comments Discord

    The Layered Architecture

    Four-layer architecture showing Context Manager, External Memory, Adapters, and Core Weights with feedback and evaluation loops
    Continuous learning is layered coordination.

    Layer by Layer

    Layer 4: Core Weights (Bottom)

    The foundation. Trained once, changed rarely.

    Aspect Details
    Contains General reasoning, language, base knowledge
    Update frequency Months or never
    Update method Full fine-tune or major consolidation
    Risk of change High (forgetting, capability shifts)

    Rule: Don’t touch this unless you have a very good reason.

    Layer 3: Adapters (Parameter-Efficient Fine-Tuning (PEFT) / Low-Rank Adaptation (LoRA))

    Modular skills that plug into the base.

    Aspect Details
    Contains Task-specific capabilities
    Update frequency Weekly to monthly
    Update method Lightweight PEFT training
    Risk of change Medium (isolated, but validate)

    Rule: Train adapters for validated, recurring patterns. Version them. Enable rollback.

    Layer 2: External Memory

    Facts, experiences, and retrieved knowledge.

    Aspect Details
    Contains Documents, logs, structured data
    Update frequency Continuous
    Update method Database writes
    Risk of change Low (doesn’t affect weights)

    Rule: Store experiences here first. Memory is cheap and safe.

    Layer 1: Context Manager (Top)

    The RLM-style interface that rebuilds focus each step.

    Aspect Details
    Contains Current context, retrieved data, active state
    Update frequency Per call
    Update method Reconstruction from memory + query
    Risk of change None (ephemeral)

    Rule: Don’t drag context forward. Rebuild it.

    The Feedback Loop

    Logging

    Capture everything the agent does:

    • Prompts received
    • Actions taken
    • Tool calls made
    • Errors encountered
    • User signals

    This is your training data.

    Evaluation

    Before any update reaches production:

    Check Purpose
    Retention tests Did old skills degrade?
    Forward transfer Did new skills improve?
    Regression suite Known failure cases
    Safety checks Harmful outputs?

    Without evaluation, you’re updating blind.

    Deployment

    Updates should be:

    • Modular: Can isolate and rollback
    • Versioned: Know what changed when
    • Staged: Test before full rollout
    • Monitored: Track post-deployment metrics

    The Error Flow

    Where do errors go?

    Error occurs
        ↓
    Log it (immediate)
        ↓
    Store in memory (same day)
        ↓
    Pattern emerges over multiple occurrences
        ↓
    Train adapter update (weekly/monthly)
        ↓
    Validate update (before deployment)
        ↓
    Deploy with rollback capability
    

    Errors feed into memory first. Only validated, recurring improvements reach adapters. Core weights almost never change.
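A minimal sketch of the memory-first part of this flow, with hypothetical names and thresholds: errors are logged immediately, but only recurring patterns graduate to the adapter-training queue.

```python
from collections import Counter

class ErrorMemory:
    """Sketch: errors land in memory immediately; only recurring
    patterns become candidates for a (validated) adapter update."""
    def __init__(self, threshold=3):
        self.counts = Counter()
        self.threshold = threshold

    def log(self, pattern: str) -> None:
        self.counts[pattern] += 1            # immediate write to external memory

    def adapter_candidates(self) -> list[str]:
        return [p for p, n in self.counts.items() if n >= self.threshold]

mem = ErrorMemory(threshold=3)
for e in ["timeout", "timeout", "bad-sql", "timeout"]:
    mem.log(e)
```

Validation and staged deployment then gate anything the candidate list produces before it touches adapters.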

    What This Architecture Achieves

    Problem Solution
    Catastrophic forgetting Core weights frozen; adapters isolated
    Context rot RLM rebuilds focus each step
    Hallucination Memory grounds responses
    Slow adaptation Memory updates continuously
    Unsafe changes Evaluation before deployment

    Design Principles

    1. Separate Storage from Reasoning

    Facts belong in memory. Reasoning belongs in weights. Don’t blur them.

    2. Separate Speed from Permanence

    Fast learning (memory) is temporary. Slow learning (weights) is permanent. Match the update speed to the desired permanence.

    3. Evaluate Before Consolidating

    Every update to adapters or weights must be validated. Regressions are silent killers.

    4. Enable Rollback

    Version everything. If an update causes problems, you must be able to undo it.

    5. Log Everything

    You cannot improve what you cannot measure. Structured logging is the foundation of continuous learning.

    The Big Picture

    AI does not learn in one place.

    It learns in layers:

    • Permanent (weights)
    • Modular (adapters)
    • External (memory)
    • Temporary (context)

    Continuous learning is not constant weight updates.

    It is careful coordination across time scales.

    Continuous learning systems don’t constantly retrain. They carefully consolidate what works.

    References

    Concept Paper
    LoRA LoRA: Low-Rank Adaptation (Hu et al. 2021)
    RAG Retrieval-Augmented Generation (Lewis et al. 2020)
    RLM Recursive Language Models (Zhou et al. 2024)
    Share Shared LoRA Subspaces (2025)
    Engram Engram: Conditional Memory (DeepSeek 2025)

    Series Summary

    Part Key Insight
    1. Time Scales Learning happens at different layers and speeds
    2. Forgetting vs Rot Different failures need different fixes
    3. Weight-Based Change the brain carefully
    4. Memory-Based Store facts outside the brain
    5. Context & RLM Rebuild focus instead of dragging baggage
    6. Continuous Learning Learn in memory, consolidate in weights
    7. Full Architecture Layered coordination enables safe improvement

    Continuous learning is layered coordination.

    Part 7 of the How AI Learns series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 424 words · 3 min read

    Five ML Concepts - #27

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #27
    Comments Discord

    References

    Concept Reference
    Elastic Weight Consolidation Overcoming catastrophic forgetting (Kirkpatrick et al. 2017)
    Replay Buffers Experience Replay for Continual Learning (Rolnick et al. 2019)
    Parameter Routing Sparsely-Gated Mixture-of-Experts (Shazeer et al. 2017)
    Memory-Augmented Networks Neural Turing Machines (Graves et al. 2014)
    Model Editing Editing Large Language Models (Yao et al. 2023)

    Today’s Five

    1. Elastic Weight Consolidation

    Adding a penalty that discourages changing parameters important to previous tasks. Importance is estimated using Fisher information from prior training.

    This helps models learn new tasks without catastrophic forgetting.

    Like protecting well-worn neural pathways while building new ones.
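    A minimal NumPy sketch of the penalty. The Fisher values, reference weights, and strength `lam` below are toy numbers chosen for illustration, not values from the paper:

```python
import numpy as np

def ewc_loss(task_loss, theta, theta_old, fisher, lam=1.0):
    """New-task loss plus a quadratic penalty that is stiff exactly
    where the Fisher information says the old task cared most."""
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)
    return task_loss + penalty

theta_old = np.array([1.0, 1.0])      # optimum found on the old task
fisher = np.array([10.0, 0.1])        # parameter 0 mattered; parameter 1 didn't

# Moving the important parameter is penalized 100x more than the other.
move_important = ewc_loss(0.0, np.array([2.0, 1.0]), theta_old, fisher)
move_unimportant = ewc_loss(0.0, np.array([1.0, 2.0]), theta_old, fisher)
print(move_important, move_unimportant)  # 5.0 0.05
```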

    2. Replay Buffers

    Storing examples from earlier tasks and mixing them into new training. Past data is replayed alongside current examples during optimization.

    This reinforces previous knowledge while learning new data.

    Like reviewing old flashcards while studying new material.
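    A toy replay buffer sketched in Python; the capacity and mix ratio are illustrative:

```python
import random

class ReplayBuffer:
    """Fixed-size store of past examples, mixed into new training batches."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, example):
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Full: overwrite a random slot so the buffer stays bounded.
            self.items[random.randrange(self.capacity)] = example

    def mixed_batch(self, new_examples, replay_fraction=0.5):
        k = min(int(len(new_examples) * replay_fraction), len(self.items))
        return new_examples + random.sample(self.items, k)

buf = ReplayBuffer(capacity=100)
for x in range(50):
    buf.add(("old", x))           # examples kept from an earlier task

batch = buf.mixed_batch([("new", i) for i in range(8)])
# 8 new examples plus 4 replayed old ones train together.
```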

    3. Parameter Routing

    Activating different subsets of model parameters depending on the task or input. Mixture-of-experts and conditional computation route inputs to specialized weights.

    Enables specialization without fully separate models.

    Like having different experts handle different questions.
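    A minimal top-1 routing sketch; the two toy "experts" and the gate weights are invented for illustration:

```python
import numpy as np

def route(x, gate_w, experts):
    """Score every expert, then run only the top-1 winner."""
    scores = gate_w @ x                  # one gating score per expert
    winner = int(np.argmax(scores))
    return experts[winner](x), winner

experts = [lambda x: x * 2.0,            # toy "expert" 0
           lambda x: x * -1.0]           # toy "expert" 1
gate_w = np.array([[1.0, 0.0],           # expert 0 keys on feature 0
                   [0.0, 1.0]])          # expert 1 keys on feature 1

out, which = route(np.array([3.0, 0.5]), gate_w, experts)
# which == 0: only expert 0's parameters were touched for this input.
```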

    4. Memory-Augmented Networks

    Adding external memory modules that neural networks can read from and write to. The model learns to store and retrieve information during inference.

    Extends beyond purely weight-based memory to explicit storage.

    Like giving a calculator access to a notepad.
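    A soft content-based read, sketched with NumPy; the two memory slots are toy values:

```python
import numpy as np

def memory_read(query, keys, values):
    """Soft read: weight every slot by its similarity to the query."""
    scores = keys @ query
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over slots
    return weights @ values

keys = np.array([[1.0, 0.0],     # address of slot 0
                 [0.0, 1.0]])    # address of slot 1
values = np.array([[10.0],       # contents of slot 0
                   [20.0]])      # contents of slot 1

read = memory_read(np.array([5.0, 0.0]), keys, values)
# read is ~10: the query matched slot 0's address almost exclusively.
```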

    5. Model Editing

    Targeted weight updates to modify specific behaviors without full retraining. Locate and adjust the parameters responsible for particular facts or behaviors.

    Allows fast corrections and knowledge updates post-training.

    Like editing a specific entry in an encyclopedia instead of rewriting the whole book.

    Quick Reference

    Concept One-liner
    Elastic Weight Consolidation Protecting important parameters during new learning
    Replay Buffers Mixing past examples to prevent forgetting
    Parameter Routing Activating task-specific parameter subsets
    Memory-Augmented Networks External memory modules for neural networks
    Model Editing Targeted weight updates without full retraining

    Short, accurate ML explainers. Follow for more.

    Part 27 of the Five ML Concepts series. View all parts | Next: Part 28 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 696 words · 4 min read

    How AI Learns Part 6: Toward Continuous Learning

    Continuous learning aims to:

    • Learn new skills
    • Retain old skills
    • Avoid retraining from scratch
    • Avoid catastrophic forgetting
    Resource Link
    Related Sleepy Coder Part 1 | Sleepy Coder Part 2
    Comments Discord

    The Continuous Learning Loop

    Flow diagram showing Agent to Logs to Evaluate to Cluster to Train to Validate to Deploy cycle, with Memory branch
    Periodic consolidation, not constant updates.

    The Core Tradeoff

    Goal Description
    Plasticity Learn new things quickly
    Stability Retain old things reliably

    You cannot maximize both simultaneously. The art is in the balance.

    Approaches to Continuous Learning

    1. Replay-Based Methods

    Keep (or synthesize) some old data. Periodically retrain on old + new.

    How it works:

    • Store representative examples from each task
    • Mix old data into new training batches
    • Periodically consolidate

    Recent work: FOREVER adapts replay timing using “model-centric time” (based on optimizer update magnitude) rather than fixed training steps.

    Pros Cons
    Strong retention Storage costs
    Conceptually simple Privacy concerns
    Well-understood Data governance complexity

    2. Replay-Free Regularization

    Constrain weight updates to avoid interference, without storing old data.

    Efficient Lifelong Learning Algorithm (ELLA) (Jan 2026): Regularizes updates using subspace de-correlation. Reduces interference while allowing transfer.

    Share (Feb 2026): Maintains a single evolving shared low-rank subspace. Integrates new tasks without storing many adapters.

    Pros Cons
    No replay needed Still active research
    Privacy-friendly Evaluation complexity
    Constant memory Subtle failure modes

    3. Modular Adapters

    Keep base model frozen. Train task-specific adapters. Merge or switch as needed.

    Evolution:

    1. Low-Rank Adaptation (LoRA): Individual adapters per task
    2. Shared LoRA spaces: Adapters share subspace
    3. Adapter banks: Library of skills to compose
    Pros Cons
    Modular, versioned Adapter proliferation
    Low forgetting risk Routing complexity
    Easy rollback Composition challenges
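    The frozen-base-plus-adapter pattern can be sketched in NumPy; the dimensions and rank are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # model width 8, adapter rank 2

W = rng.normal(size=(d, d))        # frozen base weight
A = rng.normal(size=(r, d)) * 0.1  # trainable down-projection
B = np.zeros((d, r))               # trainable up-projection, starts at zero

def forward(x, adapter_on=True):
    delta = B @ A if adapter_on else 0.0   # low-rank task-specific delta
    return (W + delta) @ x

x = rng.normal(size=d)
# With B = 0 the adapter is a no-op, so "rollback" is just dropping (A, B).
assert np.allclose(forward(x, True), forward(x, False))
```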

    4. Memory-First Learning

    Store experiences in external memory. Only consolidate to weights what’s proven stable.

    Pattern:

    • New information → Memory (fast)
    • Validated patterns → Adapters (slow)
    • Fundamental capabilities → Weights (rare)

    This separates the speed of learning from the permanence of changes.
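    The tiered pattern can be sketched as a tiny routing policy; the thresholds are invented for illustration, not taken from any system described here:

```python
def route_update(seen_count, validated):
    """Decide where a new piece of learning should land.
    Thresholds are illustrative placeholders."""
    if validated and seen_count >= 10:
        return "adapter"   # recurring, validated pattern: slow path
    return "memory"        # everything else stays in fast storage

assert route_update(seen_count=1, validated=False) == "memory"
assert route_update(seen_count=25, validated=True) == "adapter"
```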

    The Practical Loop

    A working continuous learning system:

    1. Run agent (with Recursive Language Model (RLM) context management)
    2. Collect traces: prompts, tool calls, outcomes, failures
    3. Score outcomes: tests, static analysis, user signals
    4. Cluster recurring failure patterns
    5. Train lightweight updates (LoRA/adapters)
    6. Validate retention (did old skills degrade?)
    7. Deploy modular update (with rollback capability)
    

    This is not real-time learning. It’s periodic consolidation.

    Human analogy: Sleep. Process experiences, consolidate important patterns, prune noise.

    Time Scales of Update

    Frequency What Changes Method
    Every query Nothing (inference only) -
    Per session Memory Retrieval-Augmented Generation (RAG)/Engram
    Daily Adapters (maybe) Lightweight Parameter-Efficient Fine-Tuning (PEFT)
    Weekly Validated adapters Reviewed updates
    Monthly Core weights Major consolidation

    Most systems should:

    • Update memory frequently
    • Update adapters occasionally
    • Update core weights rarely

    Evaluation Is Critical

    Continuous learning without continuous evaluation is dangerous.

    Required:

    • Retention tests (what got worse?)
    • Forward transfer tests (what got better?)
    • Regression detection
    • Rollback capability

    Without these, you’re flying blind.

    References

    Concept Paper
    ELLA Subspace Learning for Lifelong ML (2024)
    Share Shared LoRA Subspaces (2025)
    FOREVER Model-Centric Replay (2024)
    EWC Overcoming Catastrophic Forgetting (Kirkpatrick et al. 2017)

    Coming Next

    In Part 7, we’ll put it all together: designing a practical continuous learning agent with layered architecture, logging, feedback loops, and safety.


    Learn often in memory. Consolidate carefully in weights.

    Part 6 of the How AI Learns series. View all parts | Next: Part 7 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 429 words · 3 min read

    Five ML Concepts - #26

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #26
    Comments Discord

    References

    Concept Reference
    Data Augmentation A survey on Image Data Augmentation (Shorten & Khoshgoftaar 2019)
    Caching Strategies Systems engineering practice (no canonical paper)
    Constitutional AI Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022)
    Goodhart’s Law Goodhart’s Law and Machine Learning (Sevilla et al. 2022)
    Manifold Hypothesis An Introduction to Variational Autoencoders (Kingma & Welling 2019)

    Today’s Five

    1. Data Augmentation

    Creating additional training examples using label-preserving transformations. Rotate, flip, crop, or color-shift images without changing what they represent.

    Effectively increases dataset size and improves generalization.

    Like practicing piano pieces at different tempos to build flexibility.
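    A sketch with NumPy, using a 3x3 array as a stand-in image:

```python
import numpy as np

img = np.arange(9).reshape(3, 3)   # stand-in for an image

augmented = [
    img,
    np.fliplr(img),                # horizontal flip
    np.rot90(img),                 # 90-degree rotation
    np.flipud(img),                # vertical flip
]
# One labeled example became four; none of the transforms change the label.
```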

    2. Caching Strategies

    Storing previous computation results to reduce repeated work and latency. Cache embeddings, KV states, or frequently requested outputs.

    Essential for production inference at scale.

    Like keeping frequently used books on your desk instead of the library.

    3. Constitutional AI

    Training models to follow explicit written principles alongside other alignment methods. The constitution provides clear rules for behavior.

    Models critique and revise their own outputs against these principles.

    Like giving someone written house rules instead of vague instructions.

    4. Goodhart’s Law

    When a measure becomes a target, it can stop being a good measure. Optimizing for a proxy metric can diverge from the true objective.

    A core challenge in reward modeling and evaluation design.

    Like studying only for the test instead of learning the subject.

    5. Manifold Hypothesis

    The idea that real-world data lies on lower-dimensional structures within high-dimensional space. Images of faces don’t fill all possible pixel combinations.

    This structure is what representation learning exploits.

    Like faces varying along a few key features instead of every pixel independently.

    Quick Reference

    Concept One-liner
    Data Augmentation Expanding training data with transformations
    Caching Strategies Reducing latency by reusing computation
    Constitutional AI Training models to follow explicit principles
    Goodhart’s Law Optimizing metrics distorts objectives
    Manifold Hypothesis Data lies on lower-dimensional structures

    Short, accurate ML explainers. Follow for more.

    Part 26 of the Five ML Concepts series. View all parts | Next: Part 27 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 685 words · 4 min read

    music-pipe-rs: Web Demo and Multi-Instrument Arrangements

    Since the initial music-pipe-rs post, the project has grown. There’s now a web demo with playable examples, a new seq stage for explicit note sequences, and multi-instrument arrangements that work in GarageBand.

    Resource Link
    Video YouTube
    Live Demo music-pipe-rs Samples
    Source GitHub
    Previous Unix Pipelines for MIDI
    Comments Discord

    Web Demo

    The live demo showcases pre-built examples with playable audio:

    Tab Style Description
    Bach Toccata (Organ) Classical Multi-voice church organ with octave doubling and pedal bass
    Bach Toccata (8-bit) Chiptune Gyruss-inspired arcade version with square wave
    Bach-esque Algorithmic Procedurally generated baroque-style background music
    Baroque Chamber Ensemble Six-channel piece with strings, harpsichord, and recorder

    Each tab shows the pipeline script alongside playable audio. See exactly what commands produce each result.

    The seq Stage

    The new seq stage allows explicit note sequences instead of algorithmic generation:

    seed | seq "C4/4 D4/4 E4/4 F4/4 G4/2" | to-midi --out scale.mid
    

    Notation: NOTE/DURATION where duration is in beats. Combine with other stages:

    seed | seq "D5/4 C#5/8 R/4 B4/4" | transpose --semitones 5 | humanize | to-midi --out melody.mid
    

    The R represents rests. This enables transcribing existing melodies or composing precise phrases.

    Multi-Instrument Arrangements

    The Baroque chamber piece demonstrates six-channel composition:

    {
        seed 42 | seq "..." --ch 0 --patch 48;  # Strings melody
        seed 42 | seq "..." --ch 1 --patch 6;   # Harpsichord
        seed 42 | seq "..." --ch 2 --patch 74;  # Recorder
        # ... additional voices
    } | humanize | to-midi --out baroque.mid
    

    Each instrument gets its own channel and General MIDI patch. The same seed ensures timing coherence across parts.

    GarageBand Integration

    Import the MIDI files directly into GarageBand:

    1. Generate arrangement: ./examples/trio-demo.sh
    2. Open GarageBand, create new project
    3. Drag the .mid file into the workspace
    4. GarageBand creates tracks for each channel
    5. Assign software instruments to taste

    The demo includes a jazz trio arrangement:

    • Piano: Bluesy melody with chords and swing
    • Bass: Walking bass line with acoustic bass patch
    • Drums: Hi-hat, snare, kick with dynamic variation

    All generated from pipeline scripts.

    Inspiration

    This project was inspired by research into generative music tools and techniques:

    References

    Topic Link
    Analog Synthesizers Code Self Study
    Drum Synthesis JavaScript Drum Synthesis
    Generative Music Code Self Study
    Music Projects Software and Hardware
    FOSS Music Tools Open Source Music Production
    Eurorack Programming Patch.Init() Tutorial
    Opusmodus Algorithmic Composition in Lisp

    The key insight from Opusmodus: algorithmic composition isn’t random music—it’s programmable composition. Motif transformation, rule systems, deterministic generation. music-pipe-rs brings these ideas to Unix pipes.

    What’s Next

    The pipeline architecture makes extension natural:

    • More generators: Markov chains, L-systems, cellular automata
    • More transforms: Inversion, retrograde, quantization
    • Live mode: Real-time MIDI output with clock sync

    Each new capability is just another stage in the pipeline.


    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Part 5 of the Personal Software series. View all parts | Next: Part 6 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 905 words · 5 min read

    Lucy 20%: Upgrading My Home AI Cluster

    Lucy is getting an upgrade. I’m adding an X99 motherboard with an RTX 3090 to expand my AI cluster from 10% to 20% brain power.

    Resource Link
    Video Lucy 20% Upgrade
    Previous Lucy 10%
    Comments Discord

    New Hardware: Queenbee

    The cluster uses bee-themed naming. The new node is called queenbee:

    Component Specification
    Motherboard X99
    CPU Intel Xeon E5-2660 v4 (28 threads)
    RAM 64 GB DDR4 ECC
    GPU RTX 3090 (24GB VRAM)
    Storage 1TB NVMe SSD + 4TB HDD

    New AI Capabilities

    With queenbee online, Lucy gains several new abilities:

    Capability Model What It Does
    Voice Cloning VoxCPM High-quality text-to-speech with voice cloning
    Text-to-Image FLUX schnell Fast image generation from text prompts
    Text-to-Video Wan 2.2 Generate video clips from text descriptions
    Image-to-Video SVD Animate still images into video

    The Active Cluster

    Currently active for AI workloads:

    Node Role GPU
    hive MuseTalk lip-sync 2x P40 (48GB total)
    queenbee Generative AI workloads RTX 3090 (24GB)

    Together, they handle the full pipeline: generate images, animate them to video, add lip-synced speech, and produce the final output. See the full apiary inventory below.

    Why Local AI?

    Running AI locally means:

    • Privacy - Data never leaves my network
    • No API costs - Unlimited generations after hardware investment
    • Customization - Full control over models and parameters
    • Learning - Deep understanding of how these systems work

    The 24GB of VRAM on the 3090 opens up models that wouldn’t fit on smaller cards. FLUX schnell produces high-quality images in seconds. VoxCPM creates natural-sounding speech that can clone voices from short audio samples.

    Bee-Themed Host Names

    The full apiary (current and planned nodes):

    Host System CPU Cores RAM GPU
    apiary HPE DL360 G10 1x Xeon Gold 5118 12C/24T 188G -
    bees HPE DL360 G9 2x E5-2650 v4 24C/48T 128G -
    brood HPE DL380 G9 2x E5-2680 v4 28C/56T 64G 2x P100-16G
    colony Supermicro 6028U 2x E5-2680 v3 24C/48T TBD 2x K80-24G
    drones HPE DL380 G9 2x E5-2620 v4 16C/32T 256G -
    hive HPE DL380 G9 2x E5-2698 v3 32C/64T 128G 2x P40-24G
    honeycomb HPE DL180 G9 1x E5-2609 v4 8C/8T TBD -
    queenbee X99 1x E5-2660 v4 14C/28T 64G RTX 3090-24G
    swarm HPE DL380 G9 2x E5-2698 v3 32C/64T 374G 2x P100-12G
    workers HPE DL560 G8 4x E5-4617 v1 TBD 640G TBD

    Notes: Some nodes pending upgrade or configuration. Workers may upgrade to 4x E5-4657L v2 (48C/96T). Honeycomb needs unbricking. K80 GPUs are old and difficult to configure (limited CUDA version support); they will be replaced with M40 GPUs.

    Power and Control

    Remote management is essential for a home datacenter. The HPE servers include iLO (Integrated Lights-Out) for out-of-band access to BIOS, diagnostics, monitoring, and power control—even when the OS is down.

    Category Technology Purpose
    Remote Management HPE iLO BIOS access, diagnostics, monitoring, power control
    IP KVM JetKVM, Sipeed KVM Console access for non-HPE servers (planned)
    Power Monitoring Kill-A-Watt, clones Per-outlet power consumption tracking
    Smart Outlets Home Assistant + Zigbee Remote power control, scheduling, automation
    Additional Circuits Bluetti LFP power stations Extra capacity to run more servers, remote control via BT/WiFi/Zigbee

    The combination of iLO and smart outlets means I can remotely power-cycle any server, access its console, and monitor power draw—all from my phone or Home Assistant dashboard. The Bluetti stations primarily provide additional circuits so I can run more servers simultaneously—home electrical limits are a real constraint. More LFP power stations will be needed to power Lucy at 100%.

    Networking

    Each server has 3 or more NICs, segmented by purpose:

    Speed Purpose Switch
    1G iLO/KVM management 1G switch
    2.5G SSH, SCP, Chrome Remote Desktop 2x 2.5G switches
    10G fiber Server-to-server data transfer (large models) 10G switch

    The 10G backbone is essential for moving multi-gigabyte model files between nodes. Loading a 70B parameter model over 1G would take forever—10G fiber makes it practical. The 2.5G network handles interactive work and smaller transfers (using USB NICs where needed), while the 1G management network stays isolated for out-of-band access.

    Additional networking notes:

    • WiFi 7 for wireless connectivity
    • Managed switches with VLANs planned for better network segmentation
    • Linux network bonding experiments to increase aggregate transfer rates
    • Sneaker net - most servers have hot-swap SAS SSDs and hard drives, so physically moving drives between nodes is sometimes the fastest option for very large transfers

    What’s Next

    The 20% milestone is just a step. Future upgrades could include:

    • Additional GPU nodes for parallel processing
    • Larger language models for local inference
    • Real-time video generation pipelines
    • Integration with more specialized models

    The bee hive keeps growing.


    Building AI infrastructure one node at a time.

    Part 3 of the General Technology series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 636 words · 4 min read

    How AI Learns Part 5: Context Engineering & Recursive Reasoning

    Large context windows are not a complete solution.

    As context grows:

    • Attention dilutes
    • Errors compound
    • Reasoning quality degrades
    Resource Link
    Related RLM | ICL Revisited
    Comments Discord

    The Context Problem

    Transformers have finite attention. With limited attention heads and capacity, the model cannot attend equally to everything. As tokens accumulate:

    • Earlier instructions lose influence
    • Patterns average toward generic responses
    • Multi-step reasoning fails

    This is context rot—not forgetting weights, but losing signal in noise.

    In-Context Learning (ICL)

    The model adapts temporarily via examples in the prompt.

    Aspect ICL
    Updates weights? No
    Persists across sessions? No
    Speed Instant
    Mechanism Activations, not gradients

    ICL is powerful but ephemeral. It’s working memory, not learning.

    Limitation: As context grows, ICL examples compete with other content for attention.

    Recursive Language Models (RLM)

    Circular flow diagram showing LLM connected to Tools, Memory, Context, and Evaluation in a recursive loop
    Rebuild context each step instead of dragging it forward.

    RLMs decompose reasoning into multiple passes. Instead of dragging entire context forward:

    1. Query relevant memory
    2. Retrieve what’s needed now
    3. Execute tools
    4. Evaluate results
    5. Reconstruct focused context
    6. Repeat

    This treats context as a dynamic environment, not a static blob.

    Why RLM Works

    Traditional approach:

    [System prompt + 50k tokens of history + query]
    

    RLM approach:

    [System prompt + retrieved relevant context + current query]
    

    Each reasoning step starts fresh with focused attention.

    Context Engineering Techniques

    Technique How It Helps
    Summarization Compress old context, preserve essentials
    Chunking Process in segments, aggregate results
    Retrieval Pull relevant content, not everything
    Tool offloading Store state externally, query on demand
    Structured prompts Clear sections, explicit priorities

    Tool Use as Context Management

    Tools aren’t just for actions—they’re for state management.

    Instead of keeping everything in context:

    • Store in files, databases, or structured formats
    • Query when needed
    • Return focused results

    This converts unbounded context into bounded queries.

    The Agent Loop

    Modern agents combine these ideas:

    while not done:
        # 1. Assess current state
        relevant = retrieve_from_memory(current_task)
    
        # 2. Build focused context
        context = [system_prompt, relevant, current_task]
    
        # 3. Reason
        action = llm(context)
    
        # 4. Execute
        result = execute_tool(action)
    
        # 5. Update memory
        memory.store(result)
    
        # 6. Evaluate
        if goal_achieved(result):
            done = True
    

    Each iteration rebuilds context. No rot accumulation.

    Test-Time Adaptation

    A related technique: temporarily update weights during inference.

    Aspect Test-Time Learning
    Updates weights? Yes, lightly (LoRA)
    Persists? No (rolled back)
    Purpose Adapt to input distribution

    This sits between ICL (no updates) and fine-tuning (permanent updates).

    Key Insight

    Context is not a static buffer. It’s a dynamic workspace.

    Systems that treat context as “append everything” will rot. Systems that actively manage context stay coherent.

    References

    Concept Paper
    RLM Recursive Language Models (Zhou et al. 2024)
    ICL What Can Transformers Learn In-Context? (Garg et al. 2022)
    Test-Time Training TTT for Language Models (2024)
    Chain-of-Thought Chain-of-Thought Prompting (Wei et al. 2022)

    Coming Next

    In Part 6, we’ll connect all of this to continuous learning: replay methods, subspace regularization, adapter evolution, and consolidation loops.


    Rebuild focus instead of dragging baggage.

    Part 5 of the How AI Learns series. View all parts | Next: Part 6 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 411 words · 3 min read

    Five ML Concepts - #25

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #25
    Comments Discord

    References

    Concept Reference
    Label Smoothing Rethinking the Inception Architecture (Szegedy et al. 2015)
    Miscalibration On Calibration of Modern Neural Networks (Guo et al. 2017)
    Representation Learning Representation Learning: A Review (Bengio et al. 2013)
    Adversarial Examples Intriguing properties of neural networks (Szegedy et al. 2013)
    Double Descent Deep Double Descent (Nakkiran et al. 2019)

    Today’s Five

    1. Label Smoothing

    Replacing hard one-hot labels with softened target distributions during training. Instead of 100% confidence in one class, distribute small probability to other classes.

    Reduces overconfidence and can improve generalization.

    Like allowing small uncertainty instead of absolute certainty.
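    A sketch of the smoothing itself; with 4 classes and eps=0.1, the true class keeps 0.925 and every class gets a 0.025 floor:

```python
def smooth_labels(num_classes, true_class, eps=0.1):
    """Move eps of the probability mass off the true class,
    spreading it evenly over all classes."""
    target = [eps / num_classes] * num_classes
    target[true_class] += 1.0 - eps
    return target

smoothed = smooth_labels(4, 2)
# The target still sums to 1, but no class is asserted with certainty.
```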

    2. Miscalibration

    When predicted confidence does not match observed accuracy. A model that says “90% confident” should be right 90% of the time.

    Modern neural networks tend to be overconfident. Temperature scaling can help.

    Like a forecast that sounds certain but is often wrong.
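    Temperature scaling, sketched as a softmax with a divisor; the logits are toy values:

```python
import math

def softmax(logits, temperature=1.0):
    z = [v / temperature for v in logits]
    m = max(z)                              # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.0, 0.0]
print(max(softmax(logits)))        # ~0.94: overconfident
print(max(softmax(logits, 2.0)))   # ~0.74: softened by temperature T = 2
```

    Dividing logits by a fitted temperature leaves the argmax (the prediction) unchanged while pulling the confidence toward the observed accuracy.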

    3. Representation Learning

    Learning useful internal features automatically from raw data. Instead of hand-crafting features, the model discovers what matters.

    The foundation of deep learning’s success across domains.

    Like detecting edges before recognizing full objects.

    4. Adversarial Examples

    Inputs modified to cause incorrect predictions. Small, often imperceptible changes can flip model outputs.

    A security concern and a window into model vulnerabilities.

    Like subtle changes that fool a system without obvious differences.

    5. Double Descent

    Test error that decreases, increases, then decreases again as model capacity grows. The classical bias-variance tradeoff captures only the first part.

    Modern overparameterized models operate in the second descent regime.

    Like getting worse before getting better—twice.

    Quick Reference

    Concept One-liner
    Label Smoothing Softening targets to reduce overconfidence
    Miscalibration Confidence not matching accuracy
    Representation Learning Automatically learning useful features
    Adversarial Examples Inputs crafted to cause errors
    Double Descent Test error decreasing twice with model size

    Short, accurate ML explainers. Follow for more.

    Part 25 of the Five ML Concepts series. View all parts | Next: Part 26 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 632 words · 4 min read

    How AI Learns Part 4: Memory-Based Learning

    Modern AI systems increasingly rely on external memory.

    This shifts “learning” away from parameters.

    Resource Link
    Related Engram | Engram Revisited | Multi-hop RAG
    Comments Discord

    The Memory Paradigm

    Diagram showing brain (model) connected to notebook (memory) with RAG, CAG, and Engram types
    Store facts outside the brain.

    Why External Memory?

    Most “learning new facts” should not modify weights.

    Weights are for generalization. They encode reasoning patterns, language structure, and capability.

    Memory is for storage. It holds specific facts, documents, and experiences.

    If you store everything in weights:

    • You create interference
    • You risk forgetting
    • You must retrain

    If you store facts in memory:

    • No forgetting
    • Fast updates
    • Survives model upgrades

    Retrieval-Augmented Generation (RAG)

    Documents are embedded into vectors. At query time:

    1. Embed the query
    2. Search the vector database
    3. Retrieve relevant documents
    4. Inject into prompt
    5. Generate grounded response

    The model does not need to remember facts internally. It retrieves them on demand.
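    The five steps can be sketched end to end with a toy bag-of-words "embedding" standing in for a real encoder and vector database:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the capital of australia is canberra",
        "lora adapts models with low-rank updates"]
index = [(doc, embed(doc)) for doc in docs]            # 1. embed documents

query = "what is the capital of australia"
qv = embed(query)                                      # 2. embed the query
best = max(index, key=lambda pair: cosine(qv, pair[1]))[0]  # 3. retrieve
prompt = f"Context: {best}\n\nQuestion: {query}"       # 4. inject into prompt
# 5. generate: the LLM answers grounded in the retrieved document.
```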

    RAG Benefits

    Benefit Description
    No forgetting External storage, not weights
    Persistent Survives restarts and model changes
    Scalable Add documents without retraining
    Verifiable Can cite sources

    RAG Challenges

    • Retrieval precision (wrong docs = bad answers)
    • Latency (search takes time)
    • Index maintenance
    • Chunk boundaries

    Cache-Augmented Generation (CAG)

    Instead of retrieving from vector DB, cache previous context or KV states.

    Use cases:

    • Repeated knowledge tasks
    • Multi-turn conversations
    • Pre-computed context windows

    Benefits over RAG:

    • Often faster (no embedding + search)
    • More deterministic
    • Good for structured repeated workflows

    Trade-offs:

    • Less flexible
    • Cache management complexity

    Engram-Style Memory

    Recent proposals (e.g., DeepSeek research) introduce conditional memory modules with direct indexing.

    Instead of scanning long context or searching vectors:

    • Memory slots indexed directly
    • O(1) lookup instead of O(n) attention
    • Separates static knowledge from dynamic reasoning

    The goal: Constant-time memory access that doesn’t scale with context length.

    This changes the compute story:

    • Don’t waste attention on “known facts”
    • Reserve compute for reasoning
    • Avoid context rot
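    The direct-indexing idea can be caricatured with a plain dictionary; the slot count and hashing scheme here are invented for illustration and say nothing about the actual Engram design:

```python
SLOTS = 1024
memory = {}

def remember(key, value):
    memory[hash(key) % SLOTS] = value     # direct slot indexing

def recall(key):
    return memory.get(hash(key) % SLOTS)  # constant-time lookup, regardless
                                          # of how much has been stored

remember("capital_of_australia", "Canberra")
print(recall("capital_of_australia"))     # Canberra
```

    The point of the caricature: retrieval cost stays flat as the store grows, instead of scaling with context length the way attention does.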

    Model Editing

    A related technique: surgically patch specific facts without full fine-tuning.

    Example: The model says “The capital of Australia is Sydney.” You edit the specific association to “Canberra” without retraining.

    Pros:

    • Targeted fixes
    • Fast

    Cons:

    • Side effects possible
    • Consistency not guaranteed

    The Key Distinction

    Aspect Weight Learning Memory Learning
    Location Parameters External storage
    Persistence Model lifetime Storage lifetime
    Forgetting risk High None
    Update speed Slow (training) Fast (database)
    Survives model change? No Yes

    When to Use What

    Situation Approach
    Need new reasoning capability Weight-based (fine-tune)
    Need to know new facts Memory-based (RAG)
    Need domain expertise Weight-based (LoRA)
    Need to cite sources Memory-based (RAG)
    Frequently changing data Memory-based (RAG/CAG)

    References

    Concept Paper
    RAG Retrieval-Augmented Generation (Lewis et al. 2020)
    Engram Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025)
    REALM REALM: Retrieval-Augmented Pre-Training (Guu et al. 2020)
    Model Editing Editing Factual Knowledge (De Cao et al. 2021)

    Coming Next

    In Part 5, we’ll examine context engineering and recursive reasoning: ICL, RLM, and techniques that prevent context rot during inference.


    The brain stays stable. The notebook grows.

    Part 4 of the How AI Learns series. View all parts | Next: Part 5 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 431 words · 3 min read

    Five ML Concepts - #24

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #24
    Comments Discord

    References

    Concept Reference
    Warmup Accurate, Large Minibatch SGD (Goyal et al. 2017)
    Data Leakage Leakage in Data Mining (Kaufman et al. 2012)
    Mode Collapse Generative Adversarial Nets (Goodfellow et al. 2014)
    Blue/Green Deployment MLOps best practice (no canonical paper)
    Reward Hacking Concrete Problems in AI Safety (Amodei et al. 2016)

    Today’s Five

    1. Warmup

    Gradually increasing the learning rate at the start of training as part of a learning rate schedule. This helps stabilize early training when gradients can be noisy.

    Warmup is especially important for large batch training.

    Like stretching before a sprint instead of starting at full speed.
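A warmup schedule is one line of arithmetic. A minimal sketch (linear ramp; real schedules usually decay after warmup, e.g. with cosine annealing):

```python
def lr_with_warmup(step: int, base_lr: float = 1e-3, warmup_steps: int = 1000) -> float:
    """Linearly ramp the learning rate from ~0 up to base_lr, then hold."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

# Early steps take tiny updates while gradients are noisy;
# by step 1000 the optimizer runs at full speed.
```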

    2. Data Leakage

    When information unavailable at deployment accidentally influences model training. This creates artificially high validation scores that don’t reflect real-world performance.

    Common sources include future data, preprocessing on full dataset, or duplicate samples.

    Like memorizing test answers instead of learning the material.

    3. Mode Collapse

    When a generative model produces limited output diversity. The generator learns to produce only a few outputs that fool the discriminator.

    A major challenge in GAN training that various architectures attempt to address.

    Like a musician who only plays one song no matter the request.

    4. Blue/Green Deployment

    Maintaining two production environments and switching traffic between them. One serves live traffic while the other is updated and tested.

    Enables instant rollback if problems occur.

    Like having a backup stage ready so the show never stops.

    5. Reward Hacking

    When agents exploit reward functions in unintended ways. The agent optimizes the reward signal rather than the intended objective.

    A key challenge in reinforcement learning and AI alignment.

    Like gaming the grading rubric instead of learning the material.

    Quick Reference

    Concept One-liner
    Warmup Gradually increasing learning rate at start
    Data Leakage Training on unavailable deployment info
    Mode Collapse Limited generative output variety
    Blue/Green Deployment Switching between parallel environments
    Reward Hacking Exploiting reward function flaws

    Short, accurate ML explainers. Follow for more.

    Part 24 of the Five ML Concepts series. View all parts | Next: Part 25 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1236 words · 7 min read

    TBT (5/?): IBM 1130 System Emulator - Experience 1960s Computing

    The IBM 1130, introduced in 1965, was a 16-bit minicomputer that brought computing to universities and small businesses. This browser-based system emulator recreates the complete experience: console panel with authentic indicator lights, keypunch, printer, and assembly programming.

    Status: Work in progress. Core features functional, enhancements planned.

    The System

    This isn’t just an assembly emulator—it’s a full system visualization:

    Component What It Does
    Console Panel Authentic indicator lights, toggle switches, speed control
    Assembler Game Write and execute IBM 1130 code with real-time visualization
    Keypunch IBM 029 text cards and 1442 object deck visualization
    Printer IBM 1131 console printer with greenbar paper

    Console Panel

    The console panel recreates the physical operator interface with all indicator light groups documented in IBM’s Functional Characteristics manual.

    Register Display (6 rows × 16 positions)

    Row Register Bits Shown Purpose
    1 IAR 15 Instruction Address Register (program counter)
    2 SAR 15 Storage Address Register (memory access)
    3 SBR 16 Storage Buffer Register (data word)
    4 AFR 16 Arithmetic Factor Register (operand)
    5 ACC 16 Accumulator (main arithmetic register)
    6 EXT 16 Extension (double-precision, multiply/divide)

    Right-Side Indicators

    Beyond the register displays, the console shows:

    • Operation Register (5 bits) - Binary op-code of current instruction
    • Format/Tag Indicators - Long instruction format, index register selection
    • Cycle Control (T0-T7) - Internal timing pulses for debugging
    • Status Lights - Wait, Run, Fetch, Execute, Indirect Address

    Control Panel Lights

    Light Purpose
    DISK UNLOCK Safe to swap 2315 disk cartridge
    FILE READY Disk drive up to speed
    FORMS CHECK Printer out of paper
    RUN CPU executing instructions
    PARITY Memory parity error
    FREEZE Fatal hardware error

    Operator Controls

    • 16-bit toggle switches for manual data entry
    • 7-position speed knob - Single Step, SMC, INT RUN, RUN, SI, DISP, LOAD
    • Lamp test to verify all indicators function
    • Emergency stop button

    Assembler Game

    Learn the IBM 1130 instruction set interactively:

    • Complete instruction set - LD, STO, LDX, STX, A, S, AND, OR, SLA, SRA, BSC, BSI, WAIT
    • Memory-mapped index registers - XR1-3 at addresses 1, 2, 3 (historically accurate)
    • Step-by-step execution with change highlighting
    • Interactive examples covering arithmetic, indexing, shifts
    • Progressive challenges with validation

    Keypunch

    The keypunch simulation supports two card types:

    IBM 029 Text Cards

    • Hollerith encoding - Standard character-to-punch mapping
    • Visual card display - Watch holes appear as you type
    • Multi-card decks - Manage multiple cards
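For the curious, the standard Hollerith letter scheme combines a zone punch (row 12, 11, or 0) with a digit punch. A rough sketch of that mapping for digits and letters (special characters omitted; `hollerith` is an illustrative helper, not this emulator's code):

```python
def hollerith(ch: str):
    """Rows punched for one character: zone 12 (A-I), 11 (J-R), 0 (S-Z);
    digits punch their own row; space punches nothing."""
    if ch.isdigit():
        return [int(ch)]
    for zone, letters in ((12, "ABCDEFGHI"), (11, "JKLMNOPQR"), (0, "STUVWXYZ")):
        if ch in letters:
            # S-Z start at digit 2 because 0-1 is a separate combination
            digit = letters.index(ch) + (2 if zone == 0 else 1)
            return [zone, digit]
    return []  # space or unsupported character

# 'A' is a 12-zone plus a 1 punch; '5' is a single punch in row 5.
```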

    IBM 1130 Object Deck (1442 Output)

    • Binary card visualization - Machine code punch patterns
    • Object deck format - Matches authentic assembler output
    • No character printing - Pure binary data cards

    The IBM 029 Keypunch produced human-readable text cards. For binary object decks (compiled programs), the IBM 1442 Card Read-Punch would create cards with arbitrary punch patterns that don’t map to characters.

    Printer

    The IBM 1131 Console Printer simulation:

    • Greenbar paper rendering - Authentic line printer output
    • Typewriter-style characters - Period-appropriate appearance
    • Console output - System messages and program output

    Technology

    Component Choice
    Language Rust
    Target WebAssembly
    UI Framework Yew
    Build Tool Trunk
    Hosting GitHub Pages

    Planned Enhancements

    This is a work in progress. Planned features include:

    • Additional challenges (10 total)
    • Code save/load functionality
    • URL sharing of programs
    • Breakpoints and memory watches
    • Keyboard shortcuts
    • Full 1442 Card Read-Punch integration

    IBM Documentation References

    Document Description
    GA26-5881 Functional Characteristics - Console panel details
    GA26-5717 Operating Procedures - Operator instructions
    GA26-5914 Physical Planning - System dimensions
    Bitsavers Collection Complete IBM 1130 documentation archive

    Project Goals

    This is an early proof of concept for components that could be extended into a more realistic system of devices capable of actually running programs. The modular architecture allows each peripheral (console, keypunch, printer) to be developed and refined independently.

    A key goal is educational challenges that teach assembly language step by step. The assembler game provides progressive exercises that build understanding from basic load/store operations through arithmetic, indexing, and control flow.

    Historical Significance

    The IBM 1130 was the first computer for many programmers in the late 1960s and 1970s. Its clean architecture and accessible price point (~$32,000) made it ideal for education.

    A Transitional Technology

    The IBM 1130 arrived after mechanical calculators and vacuum tube computers, but before dense integrated circuits and microprocessors. This was a unique moment in computing history when machines were complex enough to be powerful, yet simple enough to be fully understood by one person.

    The system shipped with complete schematics and diagnostic listings. A field engineer could use an oscilloscope to probe the pins on every transistor. The “integrated circuit” of the era was a small can with a 4×4 pin grid containing just two transistors, mounted on a pluggable card connected via a wire-wrapped backplane. When something failed, you could see it, touch it, and replace it.

    Non-Volatile Core Memory

    One remarkable feature: magnetic core memory was non-volatile. You could stop the system, power down overnight, come back in the morning, power up, and start your program exactly where it left off—without reloading from cards, tape, or disk.

    Each bit was stored as the magnetic polarity of a tiny ferrite ring. No electricity required to maintain state. This made the 1130 remarkably resilient and practical for environments where power wasn’t guaranteed.

    Notable fact: The Forth programming language was developed on the IBM 1130 by Charles Moore in the late 1960s.

    Personal Experience

    In the late 1970s, I worked as an IBM Customer Engineer maintaining a large number of IBM 1130 and 1800 systems used primarily by IBM manufacturing facilities in Kingston, Poughkeepsie, and East Fishkill, New York.

    Field service on these machines was hands-on in ways that seem almost unimaginable today. I would often hand-assemble code on paper, converting mnemonics to binary, then enter machine code via the console toggle switches to create a small program. That program’s job? To punch another program onto a card.

    I could then insert that punched card into a diagnostic deck to loop on an error condition while I used an oscilloscope and logic schematics to diagnose a failing circuit card. The blinking lights weren’t decoration—they were essential debugging tools that showed exactly what the CPU was doing at each moment.

    This emulator recreates that experience: the same indicator lights, the same toggle switches, the same intimate connection between human and machine that made these systems so memorable to work with.


    Experience 1960s computing in your browser. Work in progress.

    Part 5 of the Throwback Thursday series. View all parts | Next: Part 6 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 654 words · 4 min read

    How AI Learns Part 3: Weight-Based Learning

    Weight-based learning modifies the neural network itself.

    It is slow. It is powerful. It is dangerous.

    The Weight-Based Methods

    Diagram showing LoRA adapters, distillation flow, and alignment pipeline
    Weight-based learning modifies the brain itself.

    Pretraining

    This creates the base model.

    It encodes language structure, reasoning patterns, and general world knowledge. The process:

    • Trains on terabytes of text
    • Uses self-supervised learning (predict next token)
    • Runs for weeks or months
    • Costs millions of dollars

    This learning is rarely repeated for cost reasons. The result is a foundation that everything else builds upon.

    Fine-Tuning

    Fine-tuning adapts models for specific tasks.

    Standard Fine-Tuning

    Adjust some or all weights using task-specific data.

    Pros:

    • Can significantly change behavior
    • Works with small datasets

    Cons:

    • Risk of catastrophic forgetting
    • Expensive if you modify all weights
    • Hard to undo

    Supervised Fine-Tuning (SFT)

    Train on instruction → response pairs.

    This teaches the model to:

    • Follow directions
    • Produce helpful outputs
    • Maintain conversation structure

    Risk: Can reduce other capabilities if data is narrow.

    Preference Optimization

    Instead of “correct answers,” train from comparisons: preferred vs rejected responses.

    Method Description
    Reinforcement Learning from Human Feedback (RLHF) Reward model + reinforcement learning
    Direct Preference Optimization (DPO) Simpler alternative to RLHF
    RLAIF AI-generated preferences

    Pros: Strong style/safety/helpfulness steering

    Cons: Can drift (“over-align”), may conflict with domain competence

    Parameter-Efficient Fine-Tuning (PEFT)

    Instead of changing all weights, inject small trainable modules.

    LoRA (Low-Rank Adaptation)

    Insert small low-rank matrices into transformer layers. Only train these matrices.

    Benefits:

    • Faster training: Fewer parameters to update
    • Modular: Can swap adapters
    • Version control: Different adapters for different tasks
    • Lower forgetting risk: Base weights frozen
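The LoRA forward pass is small enough to show whole. A minimal sketch with toy dimensions, in plain Python for clarity (the alpha/r scaling follows the LoRA paper; `lora_forward` and the tiny matrices are illustrative):

```python
def matvec(M, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=1.0, r=1):
    """y = W x + (alpha / r) * B (A x): frozen W, trainable low-rank A and B."""
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank adapter path (rank r)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# d=2, r=1: A is 1x2 and B is 2x1, so only 4 numbers train instead of 4 weights of W.
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity here)
A = [[0.5, 0.5]]              # down-projection, trainable
B = [[1.0], [0.0]]            # up-projection, trainable (initialized to 0 in practice)
y = lora_forward(W, A, B, [2.0, 4.0])
```

Swapping adapters is just swapping `A` and `B`; with `B` initialized to zero, the model starts out behaving exactly like the frozen base.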

    Other PEFT Methods

    • Prompt tuning: Learn soft prompts
    • Prefix tuning: Prepend learned vectors
    • Adapters: Small bottleneck layers
    • IA³: Learned vectors that scale activations

    Shared LoRA Subspaces

    Multiple tasks share adapter subspaces to reduce interference.

    Recent work (ELLA, Share) maintains evolving shared low-rank subspaces that:

    • Reduce interference between tasks
    • Enable continual learning
    • Keep memory constant

    Distillation

    Train a smaller model using a larger model as teacher.

    Aspect Teacher Student
    Size Large Small
    Cost High inference Low inference
    Knowledge Full Compressed

    Distillation benefits:

    • Speeds up inference
    • Often improves consistency
    • Can reduce hallucination
    • Makes deployment cheaper

    This is not runtime learning—it’s offline structural learning.
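The standard distillation objective is cross-entropy against the teacher's temperature-softened distribution. A minimal sketch (pure Python; `distill_loss` is an illustrative name, and real pipelines usually mix this with a hard-label loss term):

```python
from math import exp, log

def softmax_T(logits, T=1.0):
    """Temperature-softened softmax; higher T exposes the teacher's
    relative preferences among wrong answers ('dark knowledge')."""
    exps = [exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    p_teacher = softmax_T(teacher_logits, T)
    q_student = softmax_T(student_logits, T)
    return -sum(p * log(q) for p, q in zip(p_teacher, q_student))

# A student that matches the teacher incurs lower loss than one that doesn't.
teacher = [3.0, 1.0, 0.2]
loss_match = distill_loss([3.0, 1.0, 0.2], teacher)
loss_off = distill_loss([0.2, 1.0, 3.0], teacher)
```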

    The Alignment Pipeline

    Modern models typically go through:

    1. Pretraining → General competence
    2. SFT → Follow instructions
    3. RLHF/DPO → Align with preferences
    4. Safety fine-tuning → Reduce harmful outputs

    Each step modifies weights. Each step risks forgetting previous capabilities.

    Key Insight

    Fine-tuning changes the brain. RAG changes the notes on the desk.

    Weight-based learning is the core capability layer. It’s slow to change, expensive to update, and risky to modify—but it forms the stable foundation that everything else builds upon.

    References

    Concept Paper
    LoRA LoRA: Low-Rank Adaptation (Hu et al. 2021)
    RLHF Training LMs with Human Feedback (Ouyang et al. 2022)
    DPO Direct Preference Optimization (Rafailov et al. 2023)
    Distillation Distilling Knowledge in Neural Networks (Hinton et al. 2015)
    Adapters Parameter-Efficient Transfer Learning (Houlsby et al. 2019)

    Coming Next

    In Part 4, we’ll explore memory-based learning: RAG, CAG, Engram, and other techniques that learn without touching weights.


    Change the brain carefully.

    Part 3 of the How AI Learns series. View all parts | Next: Part 4 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 445 words · 3 min read

    Five ML Concepts - #23

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #23
    Comments Discord

    References

    Concept Reference
    Emergent Behavior Emergent Abilities of Large Language Models (Wei et al. 2022)
    Tool Use Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al. 2023)
    Loss Surface Sharpness On Large-Batch Training for Deep Learning (Keskar et al. 2016)
    Learning Rate Schedules SGDR: Stochastic Gradient Descent with Warm Restarts (Loshchilov & Hutter 2016)
    Canary Deployment MLOps best practice (no canonical paper)

    Today’s Five

    1. Emergent Behavior

    Some capabilities appear only when models reach sufficient scale. These behaviors were not directly programmed but arise from learned representations.

    Emergence is a key phenomenon in large language models.

    Like a child learning words and then suddenly understanding full sentences.

    2. Tool Use

    Modern AI systems can generate structured commands to call external tools. These include search engines, calculators, or code interpreters.

    This extends model capabilities beyond internal knowledge.

    Like asking a librarian to look something up instead of guessing.

    3. Loss Surface Sharpness

    Sharp minima are sensitive to small weight changes. Flatter minima tend to be more robust and often generalize better.

    Training methods that find flatter regions can improve test performance.

    Like standing on a plateau instead of balancing on a narrow peak.

    4. Learning Rate Schedules

    Instead of keeping the learning rate constant, training often starts high and gradually reduces it. Schedules like step decay or cosine annealing improve convergence.

    Warm restarts can help escape local minima.

    Like running fast at first, then slowing down to finish precisely.
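The cosine annealing variant mentioned above is a single formula. A minimal sketch (`cosine_anneal` is an illustrative name; frameworks ship equivalents):

```python
from math import cos, pi

def cosine_anneal(step, total_steps, lr_max=1e-3, lr_min=1e-5):
    """Cosine schedule: start near lr_max, glide smoothly down to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * progress))

# Fast at the start, slow and precise at the finish; warm restarts
# simply reset `step` to 0 partway through training.
```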

    5. Canary Deployment

    A new model version is rolled out to a small percentage of users first. If problems appear, rollout stops before affecting everyone.

    Essential MLOps practice for safe production updates.

    Like tasting food before serving it to all your guests.

    Quick Reference

    Concept One-liner
    Emergent Behavior Capabilities appearing at sufficient scale
    Tool Use AI calling external tools
    Loss Surface Sharpness Flatter minima generalize better
    Learning Rate Schedules Adjusting learning rate during training
    Canary Deployment Gradually rolling out new models safely

    Short, accurate ML explainers. Follow for more.

    Part 23 of the Five ML Concepts series. View all parts | Next: Part 24 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 646 words · 4 min read

    How AI Learns Part 2: Catastrophic Forgetting vs Context Rot

    There are two fundamentally different failure modes in modern AI systems.

    They are often confused. They should not be.

    Resource Link
    Related Sleepy Coder: Routing Prevents Forgetting | RLM
    Comments Discord

    The Two Failures

    Split diagram showing catastrophic forgetting (weight interference) vs context rot (attention dilution)
    Two different failure modes require two different solutions.

    Catastrophic Forgetting (Weight-Space Failure)

    When you fine-tune a model on new tasks, performance on older tasks may degrade.

    This happens because gradient descent updates overlap in parameter space. The model does not “know” which weights correspond to which task. It optimizes globally.

    Example: Fine-tune a model on medical text. Its ability to write code degrades. The new learning overwrote old capabilities.

    Why It Happens

    Neural networks store knowledge distributed across many weights. When you update those weights for a new task, you modify the same parameters that encoded earlier tasks. The old knowledge gets overwritten.

    This is the stability vs plasticity tradeoff:

    • Plasticity: Learn new things quickly
    • Stability: Retain old things reliably

    You cannot maximize both simultaneously.

    Solutions

    Method How It Helps
    Replay Train on old + new data
    Subspace regularization Constrain weight updates to avoid interference
    Shared Low-Rank Adaptation (LoRA) spaces Modular updates that don’t overwrite base weights
    Freezing base weights Keep foundation stable, train adapters only
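Replay, the first row of the table, is the simplest to sketch: mix earlier-task samples into every fine-tuning batch so gradient updates keep rehearsing old capabilities. (The function and data here are illustrative, not a specific training framework.)

```python
import random

def replay_batch(old_data, new_data, batch_size=8, replay_frac=0.5):
    """Build one training batch that is part new-task data, part replayed
    earlier-task data, so updates don't drift entirely toward the new task."""
    n_old = int(batch_size * replay_frac)
    batch = random.sample(old_data, n_old) + random.sample(new_data, batch_size - n_old)
    random.shuffle(batch)
    return batch

old = [("code", i) for i in range(100)]      # earlier-task examples
new = [("medical", i) for i in range(100)]   # new fine-tuning examples
batch = replay_batch(old, new)
# Half of every batch rehearses coding, so medical fine-tuning
# can't silently overwrite that capability.
```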

    Context Rot (Inference-Time Failure)

    Context rot is not weight damage.

    It happens when:

    • Prompts grow too large
    • Earlier instructions get diluted
    • Attention spreads thin
    • The model begins averaging patterns instead of reasoning

    Example: A 50,000 token conversation. The original system prompt is still there, but the model stops following it. Earlier context gets “forgotten” even though it’s technically present.

    Why It Happens

    Transformer attention is finite. With limited attention heads and capacity, the model cannot attend equally to everything. As context grows, earlier tokens receive less attention weight.

    This creates:

    • Instruction drift: Original instructions lose influence
    • Pattern averaging: The model reverts to generic responses
    • Lost coherence: Multi-step reasoning fails

    Solutions

    Method How It Helps
    Retrieval-based context Pull relevant passages, not everything
    Recursive Language Models (RLM) Rebuild context each step
    Summarization Compress old context
    Memory indexing Constant-time lookup instead of linear attention
    Structured tool calls Offload state to external systems
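Combining two rows of that table (summarization plus keeping instructions pinned) fits in a few lines. A sketch under stated assumptions: `build_context` and `summarize` are hypothetical helpers, and real systems summarize with a model call rather than a placeholder string:

```python
def build_context(system_prompt, turns, max_turns=6, summarize=None):
    """Keep the system prompt pinned, compress overflow into a summary,
    and keep only the most recent turns verbatim."""
    if len(turns) <= max_turns:
        return [system_prompt] + turns
    old, recent = turns[:-max_turns], turns[-max_turns:]
    summary = summarize(old) if summarize else f"[summary of {len(old)} earlier turns]"
    return [system_prompt, summary] + recent

turns = [f"turn {i}" for i in range(20)]
ctx = build_context("You are a careful assistant.", turns)
# The system prompt stays first and undiluted no matter how long
# the conversation grows; only the middle gets compressed.
```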

    The Critical Distinction

    Aspect Catastrophic Forgetting Context Rot
    Where Weights Prompt window
    When During training During inference
    Persists? Permanently Session only
    Analogy Brain damage Working memory overload

    Why This Matters

    If you confuse these failure modes, you apply the wrong fix.

    • Forgetting problem? Don’t add more context. Fix your training.
    • Context rot problem? Don’t retrain. Fix your context management.

    Many “AI agents that forget” discussions conflate both. Modern systems need solutions for both simultaneously.

    References

    Concept Paper
    Catastrophic Forgetting Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017)
    Continual Learning Survey A Comprehensive Survey of Continual Learning (Wang et al. 2023)
    ELLA ELLA: Subspace Learning for Lifelong Machine Learning (2024)
    Share Share: Shared LoRA Subspaces for Continual Learning (2025)
    RLM Recursive Language Models (Zhou et al. 2024)

    Coming Next

    In Part 3, we’ll examine weight-based learning in detail: pretraining, fine-tuning, LoRA, alignment methods, and distillation.


    Different failures need different fixes.

    Part 2 of the How AI Learns series. View all parts | Next: Part 3 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 477 words · 3 min read

    Five ML Concepts - #22

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #22
    Comments Discord

    References

    Concept Reference
    RSFT Scaling Relationship on Learning Mathematical Reasoning (Yuan et al. 2023)
    Model Steerability Controllable Generation from Pre-trained Language Models (Zhang et al. 2023)
    LSTM Long Short-Term Memory (Hochreiter & Schmidhuber 1997)
    More Data Beats Better Models The Unreasonable Effectiveness of Data (Halevy et al. 2009)
    System Reliability vs Quality MLOps best practice (no canonical paper)

    Today’s Five

    1. RSFT (Rejection Sampling Fine-Tuning)

    A method where many model outputs are generated, weaker ones are filtered out, and the best samples are used for further fine-tuning. It improves output quality without full reinforcement learning.

    The model learns from its own best attempts.

    Like practicing many attempts and studying only your best ones.
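The generate-filter-train loop can be sketched directly. Here the sampler and scorer are toy stand-ins (`generate` guesses, `score` is a verifier); a real pipeline would sample from the model and score with a reward model or checker:

```python
import random

def rsft_dataset(prompts, generate, score, n_samples=8, keep_frac=0.25):
    """Rejection sampling: draw many candidates per prompt, keep only the
    best-scoring ones for the next round of supervised fine-tuning."""
    kept = []
    for p in prompts:
        candidates = [generate(p) for _ in range(n_samples)]
        candidates.sort(key=score, reverse=True)
        n_keep = max(1, int(n_samples * keep_frac))
        kept.extend((p, c) for c in candidates[:n_keep])
    return kept  # fine-tune on these (prompt, best-response) pairs

random.seed(0)
pairs = rsft_dataset(
    prompts=["2+2="],
    generate=lambda p: str(random.randint(3, 5)),   # toy sampler
    score=lambda ans: 1.0 if ans == "4" else 0.0,   # toy verifier
)
```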

    2. Model Steerability

    The ability to adjust a model’s behavior through prompts, parameters, or control mechanisms. This allows flexible behavior without retraining.

    Steerable models can adapt to different tasks or styles at inference time.

    Like steering a car instead of letting it move in a fixed direction.

    3. LSTM (Long Short-Term Memory)

    A recurrent neural network architecture with gates that regulate memory flow. It was designed to mitigate vanishing gradient problems in sequence modeling.

    LSTMs decide what to remember and what to forget at each time step.

    Like a notebook where you choose what to keep and what to forget.

    4. Why More Data Beats Better Models

    In many cases, adding high-quality data improves performance more than small architecture improvements. Data scale often matters as much as model design.

    This is sometimes called “the unreasonable effectiveness of data.”

    Like practicing with many real conversations instead of perfecting one grammar rule.

    5. System Reliability vs Model Quality

    A slightly less accurate model that runs reliably can outperform a fragile but slightly better one. Engineers balance uptime, latency, and stability against pure accuracy.

    Production systems need both correctness and dependability.

    Like choosing a reliable car over a faster one that breaks down often.

    Quick Reference

    Concept One-liner
    RSFT Fine-tuning on filtered best outputs
    Model Steerability Adjusting behavior at inference time
    LSTM Gated memory for sequence modeling
    More Data Beats Better Models Data scale trumps architecture tweaks
    System Reliability vs Quality Balancing accuracy with uptime

    Short, accurate ML explainers. Follow for more.

    Part 22 of the Five ML Concepts series. View all parts | Next: Part 23 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1398 words · 7 min read

    Many-Eyes Learning: Intrinsic Rewards and Diversity

    In Part 1, we demonstrated that multiple scouts dramatically improve learning in sparse-reward environments. Five scouts achieved 60% success where a single scout achieved 0%.

    This post explores how scouts explore: intrinsic rewards that drive novelty-seeking behavior, and what happens when you mix different exploration strategies.

    Recap: The Many-Eyes Architecture

    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │   Scout 1   │  │   Scout 2   │  │   Scout N   │
    │ (strategy A)│  │ (strategy B)│  │ (strategy N)│
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           v                v                v
    ┌─────────────────────────────────────────────────┐
    │              Experience Buffer                   │
    └─────────────────────────────────────────────────┘
                           │
                           v
    ┌─────────────────────────────────────────────────┐
    │               Shared Learner                     │
    └─────────────────────────────────────────────────┘
    

    Scouts are information gatherers, not independent learners. They explore with different strategies, pool their discoveries, and a shared learner benefits from the combined experience.

    New Scout Strategies

    CuriousScout: Count-Based Novelty

    IRPO formalizes intrinsic rewards as the mechanism that drives scout exploration. CuriousScout implements count-based curiosity:

    from collections import defaultdict
    from math import sqrt
    
    class CuriousScout(Scout):
        def __init__(self, bonus_scale: float = 1.0):
            super().__init__()
            self.state_counts = defaultdict(int)
            self.bonus_scale = bonus_scale
    
        def intrinsic_reward(self, state):
            count = self.state_counts[state]           # visits so far
            self.state_counts[state] += 1              # record this visit
            return self.bonus_scale / sqrt(count + 1)  # first visit: full bonus
    

    How it works:

    • Track how many times each state has been visited
    • Reward = bonus_scale / √(count + 1)
    • Novel states get high rewards; familiar states get diminishing returns

    The intuition: A curious scout is drawn to unexplored territory. The first visit to a state is exciting (reward = 1.0). The fourth visit is mundane (reward = 0.5). This creates natural pressure to explore widely.

    OptimisticScout: Optimism Under Uncertainty

    A different philosophy: assume unknown states are valuable until proven otherwise.

    class OptimisticScout(Scout):
        def __init__(self, optimism: float = 10.0):
            super().__init__()
            self.optimism = optimism
    
        def initial_q_value(self):
            return self.optimism  # Optimistic prior instead of 0
    

    How it works:

    • Initialize all Q-values to a high value (e.g., 10.0)
    • The agent is “optimistic” about unvisited state-action pairs
    • As it explores and receives actual rewards, Q-values decay toward reality

    The intuition: If you’ve never tried something, assume it might be great. This drives exploration without explicit novelty bonuses.

    Strategy Comparison

    Strategy Mechanism Best For
    Random Uniform random actions Baseline, maximum coverage
    Epsilon-Greedy Random with probability ε, greedy otherwise Balancing exploit/explore
    CuriousScout Novelty bonus for unvisited states Systematic coverage
    OptimisticScout High initial Q-values Early exploration pressure

    The Diversity Experiment

    Does mixing strategies help, or is it enough to have multiple scouts with the same good strategy?

    Setup

    • 7x7 sparse grid, 100 training episodes
    • All configurations use exactly 5 scouts (fair comparison)
    • 5 random seeds for statistical significance

    Configurations

    1. Homogeneous Random: 5 identical random scouts
    2. Homogeneous Epsilon: 5 identical epsilon-greedy scouts (ε=0.2)
    3. Diverse Mix: Random + 2 epsilon-greedy (ε=0.1, 0.3) + CuriousScout + OptimisticScout

    Results

    Configuration Success Rate
    Random baseline 7%
    Homogeneous random 20%
    Homogeneous epsilon 40%
    Diverse mix 40%

    Analysis

    Finding: Strategy quality matters more than diversity in simple environments.

    • Epsilon-greedy (homogeneous or mixed) outperforms pure random
    • Diverse mix performs the same as homogeneous epsilon-greedy
    • Having 5 good scouts beats having 5 diverse but weaker scouts

    Why doesn’t diversity help here?

    In a simple 7x7 grid, the exploration problem is primarily about coverage, not strategy complementarity. Five epsilon-greedy scouts with different random seeds already explore different regions due to stochastic action selection.

    Diversity likely provides more benefit in:

    • Complex environments with multiple local optima
    • Tasks requiring different behavioral modes
    • Environments with deceptive reward structures

    Web Visualization

    The web visualization demonstrates Many-Eyes Learning with real-time parallel scout movement. (The upcoming video walks through this demo—the post focuses on the underlying mechanism.)

    Many-Eyes Web Visualization

    How It Works

    The web version uses Q-learning with a shared Q-table (simpler than DQN for clarity). All scouts contribute to the same Q-table—the core “many eyes” concept: more explorers = faster Q-value convergence.

    Scout Role Epsilon Behavior
    Random Baseline 1.0 (constant) Always random, never follows policy
    Scouts 1-N Learning Agents 0.5-0.8 → 0.01 Epsilon-greedy with decay
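The shared-table update is standard Q-learning applied to one dict that every scout writes into. A minimal sketch (names are mine; the demo's Rust/WASM code will differ in detail):

```python
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step against the single table all scouts share."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# Every scout writes into the same dict, so one scout's discovery
# immediately improves the policy every other scout follows.
Q = {}
actions = ["up", "down", "left", "right"]
q_update(Q, (0, 0), "right", 1.0, (0, 1), actions)
```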

    Exploration Modes

    The UI provides a dropdown to select different exploration strategies:

    Mode Heatmap Diversity Learning Performance
    Shared Policy Low (identical paths) Best (lowest avg steps)
    Diverse Paths High (distinct paths) Worse (biases override optimal)
    High Exploration High Worst (never fully exploits)
    Boltzmann Moderate Moderate

    The Diversity vs Performance Trade-off

    There’s a fundamental trade-off between visual diversity and learning performance:

    • Shared Policy wins on performance: The “many eyes” benefit comes from diverse exploration during learning (finding the goal faster). But once Q-values converge, all scouts should follow the same optimal policy.

    • Diverse Paths sacrifices performance for visuals: Scout-specific directional biases (Scout 1 prefers right, Scout 2 prefers down) create visually interesting heatmaps but suboptimal behavior.

    • High Exploration never converges: Fixed 50% random actions means scouts never fully exploit the learned policy.

    Key insight: For best learning, use Shared Policy. Use other modes to visualize how different exploration strategies affect the learning process, but expect higher average steps.

    Learning Phases

    Phase Episodes Avg Steps Behavior
    Random 1-5 ~70 All scouts exploring randomly
    Early Learning 5-15 40-60 Policy starts forming
    Convergence 15-30 15-25 Clear optimal path emerges
    Stable 30+ 12-18 Near-optimal with random scout noise

    Why “Average Steps to Goal”?

    Success rate is coarse-grained—with 5 scouts, only 6 values are possible (0%, 20%, 40%, 60%, 80%, 100%). After ~10 episodes, all scouts typically reach the goal. Average steps shows continued policy refinement, dropping from ~70 (random) to ~8 (optimal).

    Running the Visualization

    ./scripts/serve.sh   # Open http://localhost:3200
    
    • Yew/WASM frontend with FastAPI backend
    • Speed control from 1x to 100x
    • Replay mode to step through recorded training

    What’s Next

    Potential future directions:

    Direction Why It Matters
    Larger environments Test scaling to 15x15, 25x25 grids
    Scout communication Real-time sharing vs passive pooling
    Adaptive intrinsic rewards Learn the reward function (closer to full IRPO)
    Multi-goal environments Multiple sparse rewards to discover

    Key Takeaways

    1. Intrinsic rewards drive exploration. CuriousScout and OptimisticScout implement different philosophies: novelty bonuses vs optimistic initialization.

    2. Strategy quality > diversity in simple environments. Five good scouts beat five diverse but weaker scouts.

    3. Diversity during learning, convergence after. The “many eyes” benefit comes from diverse exploration during learning. Once Q-values converge, all scouts should follow the same optimal policy.

    4. Shared Q-table enables collective learning. All scouts contribute to one Q-table—more explorers means faster convergence.

    5. Visual diversity costs performance. Modes like “Diverse Paths” create interesting heatmaps but suboptimal behavior. Use “Shared Policy” for best learning results.

    References

    Concept Paper
    IRPO Intrinsic Reward Policy Optimization (Cho & Tran 2026)
    Reagent Reasoning Reward Models for Agents (Fan et al. 2026)
    ICM Curiosity-driven Exploration (Pathak et al. 2017)

    Diverse exploration, convergent execution. Many eyes see more, but the best path is the one they all agree on.

    Part 6 of the Machine Learning series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 597 words3 min readAbstract

    How AI Learns Part 1: The Many Meanings of Learning

    When people say, “AI learned something,” they usually mean one of at least four very different things.

    Large Language Models (LLMs) do not learn in one single way. They learn at different time scales, in different locations, and with very different consequences. To understand modern AI systems—especially agents—we need to separate these layers.

    Resource Link
    Related ICL Revisited | RLM | Engram
    Comments Discord

    Four Time Scales of Learning

    Concentric rings showing four time scales of learning: core weights, adapters, external memory, and prompt/context
    Learning happens at different layers with different persistence and speed.

    1. Pretraining (Years)

    This is the foundation.

    The model trains on massive datasets using gradient descent. The result is a set of weights—billions of parameters—encoding statistical structure of language and knowledge.

    This learning:

    • Is slow and expensive
    • Persists across restarts
    • Cannot easily be reversed
    • Is vulnerable to interference if modified later

    Think of this as long-term biological memory.

    2. Fine-Tuning (Days to Weeks)

    Fine-tuning modifies the weights further, but with narrower data.

    This includes:

    • Instruction tuning (following directions)
    • Alignment methods (Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO))
    • Domain adaptation
    • Parameter-efficient methods like Low-Rank Adaptation (LoRA)

    This is still weight-based learning.

    It persists across restarts. It risks catastrophic forgetting. It modifies the brain itself.

    3. Memory-Based Learning (Seconds to Minutes)

    This is where many modern systems shift.

    Instead of changing weights, they store information externally:

    • RAG (Retrieval-Augmented Generation)
    • CAG (Cache-Augmented Generation)
    • Vector databases
    • Engram-style memory modules

    The model retrieves relevant memory per query.

    The brain stays stable. The notebook grows.

    This learning:

    • Persists across restarts
    • Survives model upgrades
    • Does not cause forgetting
    • Is fast

    4. In-Context Learning (Milliseconds)

    This is temporary reasoning scaffolding.

    Information exists only in the prompt window.

    It:

    • Does not update weights
    • Does not persist across sessions
    • Is powerful but fragile
    • Suffers from context rot

    This is working memory.

    Why This Matters

    Most discussions collapse all of this into “the model learned.”

    But:

    • Updating weights risks forgetting
    • Updating memory does not
    • Updating prompts does not persist
    • Updating adapters can be modular and reversible

    Continuous learning systems must coordinate all four.

    Persistence Comparison

    Mechanism Persists Across Chat? Persists Across Restart? Persists Across Model Change?
    Pretraining Yes Yes No
    Fine-tune Yes Yes No
    LoRA Yes Yes Usually
    Distillation Yes Yes No
    ICL No No No
    RAG Yes Yes Yes
    Engram Yes Yes Yes
    CAG Yes Yes Yes

    That last column is subtle but powerful for agents.

    References

    Concept Paper
    LoRA LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
    RAG Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020)
    ICL What Can Transformers Learn In-Context? (Garg et al. 2022)
    Engram Engram: Conditional Memory via Scalable Lookup (DeepSeek 2025)
    DPO Direct Preference Optimization (Rafailov et al. 2023)

    Coming Next

    In Part 2, we’ll examine the two fundamental failure modes that arise from confusing these layers: catastrophic forgetting and context rot.


    Learning happens in layers of permanence.

    Part 1 of the How AI Learns series. View all parts | Next: Part 2 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1155 words6 min readAbstract

    music-pipe-rs: Unix Pipelines for MIDI Composition

    After building midi-cli-rs for quick mood-based generation, I wanted something more surgical. What if music generation worked like Unix commands—small tools connected by pipes?

    Resource Link
    Code music-pipe-rs
    Related midi-cli-rs
    Next Web Demo and Multi-Instrument
    Comments Discord

    The Unix Philosophy for Music

    Most generative music tools are monolithic. You get one application with a closed workflow. If you want to inspect intermediate results, you can’t. If you want to swap one transformation for another, you rebuild everything.

    Unix solved this decades ago: small tools that do one thing well, connected by pipes. Each tool reads from stdin, writes to stdout. You can inspect any point in the pipeline with head, filter with grep, transform with jq.

    music-pipe-rs applies this philosophy to MIDI composition.

    A Pipeline in Action

    seed 12345 | motif --notes 16 --bpm 120 | humanize | to-midi --out melody.mid
    

    Four stages:

    1. seed establishes the random seed for the entire pipeline
    2. motif generates a melodic pattern (using the pipeline seed)
    3. humanize adds timing and velocity variation (using the same seed)
    4. to-midi converts the event stream to a standard .mid file

    The output plays in any DAW.

    Seed-First Architecture

    The seed stage goes at the head of the pipeline:

    # Explicit seed for reproducibility
    seed 12345 | motif --notes 16 | humanize | to-midi --out melody.mid
    
    # Auto-generated seed (printed to stderr)
    seed | motif --notes 16 | humanize | to-midi --out melody.mid
    # stderr: seed: 1708732845
    

    All downstream stages read the seed from the event stream. No --seed arguments scattered across the pipeline. One seed, set once, used everywhere.

    This means:

    • Same seed = identical output across all random stages
    • Different seed = different composition with same structure
    • Reproducibility is trivial: just save the seed number

    JSONL: The Intermediate Format

    Between stages, events flow as JSONL (JSON Lines). Each line is a complete event:

    {"type":"Seed","seed":12345}
    {"type":"NoteOn","t":0,"ch":0,"key":60,"vel":80}
    {"type":"NoteOff","t":480,"ch":0,"key":60}
    

    This format is human-readable and tool-friendly:

    # See the first 10 events
    seed 42 | motif --notes 8 | head -10
    
    # Count how many NoteOn events
    seed 42 | motif --notes 16 | grep NoteOn | wc -l
    
    # Pretty-print with jq
    seed 42 | motif --notes 4 | jq .
    

    No binary formats to decode. No proprietary protocols. Just text.
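Plain JSONL also means new stages can be written in any language. As an illustration (this is a hypothetical stage, not one shipped with music-pipe-rs), a transpose filter in Python needs only a few lines: shift the `key` of note events, pass everything else through untouched.

```python
import json
import sys

def transpose_stage(lines, semitones=2):
    """Shift NoteOn/NoteOff keys; pass other events (e.g. Seed) through unchanged."""
    for line in lines:
        ev = json.loads(line)
        if ev.get("type") in ("NoteOn", "NoteOff"):
            ev["key"] += semitones
        yield json.dumps(ev)

if __name__ == "__main__":
    for out in transpose_stage(sys.stdin):
        print(out)
```

Dropped into a pipeline it behaves like any built-in stage: it reads JSONL on stdin and writes JSONL on stdout.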

    Visualization with viz

    The viz stage prints a sparkline to stderr while passing events through:

    seed 12345 | motif --notes 16 | viz | humanize | to-midi --out melody.mid
    

    Output on stderr:

    ▃▅▇▅▃▁▂▄▆▇▆▄▂▁▃▅
    

    For more detail, use piano roll mode:

    seed 12345 | motif --notes 16 | viz --roll
    
     G6 │···█············│
    F#6 │·····█··········│
     F6 │····█···········│
     G5 │·██·········█···│
     F5 │···········█····│
     E5 │·········██···█·│
     C5 │█·····███····█·█│
    

    The visualization goes to stderr; the JSONL events continue to stdout. You can inspect the music without breaking the pipeline.
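A sparkline like this is straightforward to produce: scale each pitch into the range of the melody and map it onto eight block characters. An illustrative sketch (not the tool's actual rendering code):

```python
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(keys):
    """Map MIDI key numbers to 8 block heights, scaled to the melody's range."""
    lo, hi = min(keys), max(keys)
    span = max(hi - lo, 1)  # avoid division by zero for a one-note melody
    return "".join(BLOCKS[(k - lo) * (len(BLOCKS) - 1) // span] for k in keys)

print(sparkline([60, 64, 67, 64, 60]))  # ▁▅█▅▁
```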

    Available Stages

    Stage Type Description
    seed Start Establish random seed for pipeline
    motif Generate Create melodic patterns
    euclid Generate Euclidean rhythm generation
    transpose Transform Shift notes by semitones
    scale Transform Constrain notes to a scale
    humanize Transform Add timing/velocity variation
    viz Inspect Print sparkline visualization
    to-midi Output Convert to .mid file

    Each stage is a separate binary. Mix and match as needed.

    Euclidean Rhythms

    The euclid stage generates Euclidean rhythms—mathematically optimal distributions of hits across steps:

    # 3 hits distributed across 8 steps (Cuban tresillo)
    seed | euclid --pulses 3 --steps 8 --note 36 | to-midi --out kick.mid
    
    # 4-on-the-floor kick pattern
    seed | euclid --pulses 4 --steps 16 --note 36 | to-midi --out four-floor.mid
    

    These patterns appear in music worldwide because they “feel right”—the spacing is as even as possible.
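"As even as possible" has a neat constructive form. The sketch below uses the Bresenham-style formulation, which produces one rotation of Bjorklund's classic algorithm (the actual euclid stage may implement or rotate it differently):

```python
def euclidean(pulses, steps):
    """Evenly distribute `pulses` hits over `steps` slots (Bresenham-style)."""
    return [(i * pulses) % steps < pulses for i in range(steps)]

def pattern(pulses, steps):
    return "".join("x" if hit else "." for hit in euclidean(pulses, steps))

print(pattern(3, 8))   # x..x..x.  -> the Cuban tresillo (3+3+2)
print(pattern(4, 16))  # x...x...x...x...  -> four-on-the-floor
```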

    Scale Locking

    The scale stage constrains notes to a musical scale:

    seed 42 | motif --notes 16 | scale --root C --mode minor | to-midi --out c-minor.mid
    

    No wrong notes. Every pitch fits the harmonic context.
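Under the hood, scale locking is nearest-degree snapping. A simplified Python sketch (the Rust implementation may differ in details like tie-breaking and octave wrap-around):

```python
# Semitone offsets from the root for the natural minor scale
MINOR = [0, 2, 3, 5, 7, 8, 10]

def snap_to_scale(key, root=0, scale=MINOR):
    """Snap a MIDI key to the nearest scale degree (ties resolve downward)."""
    degree = (key - root) % 12
    nearest = min(scale, key=lambda d: (abs(d - degree), d))
    return key - degree + nearest

print(snap_to_scale(64))  # 63: E4 snaps down to Eb4 in C minor
```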

    Layering Streams

    Generate drum and melody separately, then combine:

    {
        seed 100 | euclid --pulses 4 --steps 16 --note 36 --ch 9;
        seed 100 | motif --notes 16 | scale --root C --mode pentatonic;
    } | to-midi --out layered.mid
    

    Channel 9 is General MIDI drums. Same seed ensures coherence between parts. Both streams merge into a single MIDI file.

    Why Not Just Use midi-cli-rs?

    Different tools for different needs:

    Tool Strength Use Case
    midi-cli-rs Quick mood presets “Give me 5 seconds of jazz”
    music-pipe-rs Compositional control “Generate a motif, constrain to scale, add swing”

    midi-cli-rs is high-level: pick a mood, get music. music-pipe-rs is low-level: build compositions from primitive operations.

    Both are useful. Both work with AI coding agents.

    The Personal Software Pattern

    This continues the theme: build small tools that compose well. Don’t try to solve everything in one application. Let Unix handle orchestration.

    The best part? Standard tools still work. head, grep, jq, wc—all participate in the pipeline. No special music knowledge required to inspect the data.


    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Part 4 of the Personal Software series. View all parts | Next: Part 5 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 452 words3 min readAbstract

    Five ML Concepts - #21

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #21
    Video

    Comments Discord

    References

    Concept Reference
    Prompt Injection Prompt Injection attack against LLM-integrated Applications (Liu et al. 2023)
    Jailbreaks Jailbroken: How Does LLM Safety Training Fail? (Wei et al. 2023)
    GRU Empirical Evaluation of Gated Recurrent Neural Networks (Chung et al. 2014)
    Planning vs Prediction Between accurate prediction and poor decision making (Zaffalon et al. 2023)
    Production Rollbacks MLOps best practice (no canonical paper)

    Today’s Five

    1. Prompt Injection

    Malicious instructions embedded in user input that override intended system behavior. An attacker crafts text that tricks an AI into ignoring its original instructions.

    This is a major security concern for LLM-integrated applications.

    Like slipping a forged instruction into a trusted document.

    2. Jailbreaks

    Techniques that attempt to bypass safety constraints in AI systems. These attacks exploit gaps between a model’s capabilities and its safety training.

    Safety training can fail due to competing objectives or mismatched generalization.

    Like convincing a guard to bend the rules.

    3. GRU (Gated Recurrent Unit)

    A recurrent neural network unit with gates that control memory flow. GRUs decide what information to keep and what to discard at each time step.

    Simpler than LSTM but designed for similar sequence modeling tasks.

    Like a notepad where you decide what to keep and what to erase.

    4. Planning vs Prediction

    Prediction forecasts likely outcomes. Planning evaluates actions across possible futures. Accurate predictions don’t guarantee good decisions—you also need to model how actions affect outcomes.

    This is a key gap in many AI/ML systems.

    Like knowing it will rain versus deciding whether to bring an umbrella.

    5. Production Rollbacks

    Reverting to a previous stable model version after deployment issues. When a new model causes problems in production, rolling back quickly minimizes impact.

    Essential MLOps practice for maintaining system reliability.

    Like reloading a saved game state when something breaks.

    Quick Reference

    Concept One-liner
    Prompt Injection Malicious instructions overriding AI behavior
    Jailbreaks Bypassing safety constraints
    GRU Gated memory for sequence modeling
    Planning vs Prediction Action evaluation vs forecasting
    Production Rollbacks Reverting to stable model versions

    Short, accurate ML explainers. Follow for more.

    Part 21 of the Five ML Concepts series. View all parts | Next: Part 22 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1337 words7 min readAbstract

    midi-cli-rs: Extending with Custom Mood Packs

    Personal Software doesn’t stop at “it works.” It evolves. After building midi-cli-rs for AI agents to generate music, I wanted more moods—without recompiling Rust every time.

    The solution: a plugin system that lets anyone create custom mood packs using simple TOML files.

    Resource Link
    Examples Listen to Samples
    Wiki Plugin Documentation
    Video Plugin Moods Explainer
    Video

    Code midi-cli-rs
    Comments Discord

    The Problem with Built-in Only

    The original midi-cli-rs shipped with a handful of mood presets: suspense, eerie, upbeat, calm, ambient, jazz. Useful, but limited. What if you want synthwave? Chillout? Something faster or in a different key?

    Hardcoding every possible mood isn’t practical. And asking users to modify Rust source code isn’t friendly.

    Three Levels of Extensibility

      Level What You Get What You Change Skill Required
    Built-in Moods 9 curated generators Nothing—use as-is None
    Plugin Moods Parameter variations TOML config files Text editing
    Custom Generators New musical patterns Rust source code Programming (future)

    This post covers Plugin Moods—the middle tier. You can preset combinations of tempo, key, and intensity, but you’re still using the built-in generators’ musical logic. Want a “smooth-jazz” preset (slower, mellower)? Plugin mood. Want bebop or Latin jazz with different chord progressions? That requires a custom generator.

    Custom generators (writing new Rust code) will be covered in a future post when the plugin editor ships.

    The Plugin Architecture

    Custom moods live in ~/.midi-cli-rs/moods/ as TOML files. Each file is a “mood pack” that can define multiple moods. The CLI discovers them automatically.

    Here’s how it works:

    ~/.midi-cli-rs/
    └── moods/
        ├── electronic.toml    # Your synthwave, techno, etc.
        ├── cinematic.toml     # Epic, tension, wonder
        └── seasonal.toml      # Holiday themes
    

    Creating a Mood Pack

    A plugin mood pack has two parts: pack metadata and mood definitions.

    [pack]
    name = "electronic"
    version = "1.0.0"
    author = "Your Name"
    description = "Electronic music styles"
    
    [[moods]]
    name = "synthwave"
    base_mood = "upbeat"
    default_tempo = 118
    default_key = "Am"
    default_intensity = 65
    description = "80s synthwave vibes"
    tags = ["electronic", "retro"]
    
    [[moods]]
    name = "chillout"
    base_mood = "ambient"
    default_tempo = 85
    default_key = "Em"
    default_intensity = 40
    description = "Relaxed electronic"
    

    Each mood delegates to a built-in generator (base_mood) but overrides specific parameters. You get the musical logic of the built-in mood with your customizations applied.

    Available Base Moods

    Your custom moods can extend any of the nine built-in generators:

    Base Mood Character
    suspense Tense, building
    eerie Dark, unsettling
    upbeat Energetic, positive
    calm Peaceful, slow
    ambient Atmospheric, textural
    jazz Swing, improvisation
    chiptune 8-bit, retro gaming
    orchestral Classical instruments
    show Broadway, theatrical

    Configuration Options

    Each mood definition supports these overrides:

    Field Description Example
    name CLI name (required) "synthwave"
    base_mood Built-in to extend (required) "upbeat"
    default_tempo BPM 118
    default_key Musical key "Am", "C", "Eb"
    default_intensity 0-100 energy level 65
    description Human-readable description "80s vibes"
    tags Discovery tags ["electronic", "retro"]

    How Seeds Create Variation

    Seeds aren’t random—they’re deterministic variation selectors. The same mood + same seed always produces identical output. But different seeds create observable musical differences across multiple dimensions:

    Parameter Variation Range
    Tempo ±15% from base
    Layer inclusion Which instruments appear
    Melodic contour 16 different phrase shapes
    Note density 0.6x to 1.4x
    Rest probability 0% to 35% silence
    Phrase length 3-8 notes
    Velocity -15 to +15 offset

    The system uses hash-based mixing with unique salts for each parameter. This means adjacent seeds (42 vs 43) produce completely different outputs—no gradual transitions between seeds.

    When you combine plugin moods with seed variation, you get a matrix: your custom tempo/key/intensity settings applied across different seed-driven variations of the underlying generator’s patterns.
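Hash-based mixing is what makes this work without any hidden state: each parameter hashes the seed together with its own salt, so one seed deterministically fans out into many independent values. An illustrative Python sketch (the actual salts and hash function are internal to midi-cli-rs):

```python
import hashlib

def param_from_seed(seed, salt, lo, hi):
    """Derive a deterministic parameter in [lo, hi) from (seed, salt)."""
    digest = hashlib.sha256(f"{seed}:{salt}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return lo + u * (hi - lo)

# Same seed + same salt -> same value, every run.
# Different salts -> independent parameters from one seed.
tempo_scale = param_from_seed(42, "tempo", 0.85, 1.15)
density     = param_from_seed(42, "density", 0.6, 1.4)
```

Because the hash scrambles its input, seed 43 shares nothing with seed 42, which is why adjacent seeds sound unrelated.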

    Using Custom Moods

    Once your TOML file is in place, the mood appears automatically:

    # List all moods (built-in + plugins)
    midi-cli-rs moods
    
    # Generate with your custom mood
    midi-cli-rs preset -m synthwave -d 5 -s 42 -o output.wav
    

    The seed system still works—same mood + same seed = identical output.

    Example: Electronic Pack

    Here’s a complete pack with four electronic moods:

    [pack]
    name = "electronic"
    version = "1.0.0"
    description = "Electronic music styles"
    
    [[moods]]
    name = "synthwave"
    base_mood = "upbeat"
    default_tempo = 118
    default_key = "Am"
    default_intensity = 65
    
    [[moods]]
    name = "chillout"
    base_mood = "ambient"
    default_tempo = 85
    default_key = "Em"
    default_intensity = 40
    
    [[moods]]
    name = "techno"
    base_mood = "upbeat"
    default_tempo = 130
    default_key = "Dm"
    default_intensity = 85
    
    [[moods]]
    name = "8bit"
    base_mood = "chiptune"
    default_tempo = 140
    default_key = "C"
    default_intensity = 70
    

    Drop this in ~/.midi-cli-rs/moods/electronic.toml and you have four new moods.

    What’s Next

    This plugin system handles mood variations—different tempos, keys, and intensities applied to existing generators. A future update will add a plugin editor for creating entirely new musical patterns without writing Rust.

    For now, the delegation model covers most use cases: want faster jazz? Darker ambient? Major-key suspense? Create a TOML file and you’re done.

    The Personal Software Pattern

    This follows the Personal Software philosophy: start with something that works, then extend it as needs emerge. The plugin system wasn’t in the original design. It grew from actual use—wanting more moods without recompiling.

    Good personal software leaves room to grow.


    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Part 3 of the Personal Software series. View all parts | Next: Part 4 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 461 words3 min readAbstract

    Five ML Concepts - #20

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #20
    Video

    Comments Discord

    References

    Concept Reference
    VAEs Auto-Encoding Variational Bayes (Kingma & Welling 2013)
    Uncertainty Estimation What Uncertainties Do We Need in Bayesian Deep Learning? (Kendall & Gal 2017)
    Interpretability Towards A Rigorous Science of Interpretable Machine Learning (Doshi-Velez & Kim 2017)
    Gradient Noise Stochastic Gradient Descent as Approximate Bayesian Inference (Mandt et al. 2017)
    Human-in-the-Loop Human-in-the-Loop Machine Learning (Monarch 2021)

    Today’s Five

    1. Variational Autoencoders (VAEs)

    VAEs are probabilistic autoencoders that learn a structured latent distribution. By sampling from that distribution, they can generate new examples similar to the training data.

    The key innovation is regularizing the latent space to be smooth and continuous.

    Like learning not just to summarize books, but to create new ones in a similar style.

    2. Uncertainty Estimation

    Models can estimate how confident they should be in predictions. Some uncertainty comes from noisy data (aleatoric), and some from limited knowledge (epistemic).

    Knowing when a model is uncertain enables safer decision-making.

    Like a weather forecast giving seventy percent chance of rain instead of a simple yes or no.

    3. Why Interpretability Is Hard

    Neural networks represent information across many interacting parameters. No single component cleanly maps to a human concept.

    Distributed representations enable powerful learning but resist simple explanations.

    Like trying to explain a dream by pointing to individual neurons.

    4. Gradient Noise

    When training with mini-batches, gradients vary from step to step. A little noise can help exploration, but too much can slow convergence.

    Batch size, learning rate, and gradient clipping all influence this noise level.

    Like getting slightly different directions each time you ask for help.

    5. Human-in-the-Loop Systems

    Humans review, supervise, or override model decisions in critical workflows. This improves safety and accountability in high-stakes applications.

    The approach combines model efficiency with human judgment where it matters most.

    Like a pilot monitoring autopilot and stepping in when necessary.

    Quick Reference

    Concept One-liner
    VAEs Generative models with structured latent spaces
    Uncertainty Estimation Know when you don’t know
    Interpretability Distributed representations resist explanation
    Gradient Noise Mini-batch variation in training
    Human-in-the-Loop Human oversight for critical decisions

    Short, accurate ML explainers. Follow for more.

    Part 20 of the Five ML Concepts series. View all parts | Next: Part 21 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 648 words4 min readAbstract

    In-Context Learning Revisited: From Mystery to Engineering

    It was 2020 when GPT-3 shocked everyone. It could learn from examples in the query—without updating its weights. We called it In-Context Learning. But was it magic, or was it doing something deeper?

    Resource Link
    Video ICL Revisited
    Video

    Papers 4 References
    Comments Discord

    Phase 1: The Empirical Discovery (2020)

    The GPT-3 paper showed that large models could perform few-shot learning. Give them examples, and they generalize. No gradient updates. No retraining. Just forward passes.

    The surprising part was that scaling alone seemed to unlock it.

    Paper: Language Models are Few-Shot Learners

    ELI5: Show a big language model a few examples of a task in your prompt, and it figures out how to do the task—without any retraining. Nobody told it to do this. It just emerged when models got big enough.

    Main idea: Scale unlocks emergent capabilities. ICL was discovered, not designed.

    Phase 2: Mechanistic Explanations (2022)

    By 2022, researchers began probing the internal mechanisms. Several papers proposed that transformers implement implicit meta-learning. The model appears to learn during inference by performing gradient-descent-like operations internally.

    Paper: What Explains In-Context Learning in Transformers?

    ELI5: When you give a transformer examples, its attention layers do something that looks like fitting a simple linear model to those examples—on the fly, during the forward pass. It’s not memorizing; it’s computing a mini-solution.

    Main idea: ICL works because attention can simulate linear regression internally.

    Paper: Transformers Learn In-Context by Gradient Descent

    ELI5: The transformer’s forward pass is secretly doing something similar to training. The attention mechanism acts like one step of gradient descent over the examples you provided. Learning happens inside inference.

    Main idea: ICL is implicit gradient descent—learning hidden inside prediction.
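As a toy illustration of that claim (my sketch, not code from either paper): one explicit gradient-descent step over in-context (x, y) demonstrations already improves prediction on a held-out query, and this is the kind of computation the papers argue a linear-attention layer can implement in its forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(8, 2))   # 8 in-context demonstrations
y = X @ w_true

# One gradient-descent step on the demonstrations, starting from w = 0
w = np.zeros(2)
lr = 0.05
grad = -2 * X.T @ (y - X @ w) / len(X)
w = w - lr * grad

# Predict a "query" point, as attention is argued to do implicitly
x_query = rng.normal(size=2)
pred = x_query @ w
```

Even this single step moves the demonstration loss down from its starting value; the mechanistic papers show attention weights can encode exactly such an update.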

    Phase 3: Engineering the Effect

    Once researchers understood that ordering and structure affect ICL, prompt design became less of an art and more of an optimization problem. The quality and arrangement of demonstrations directly shape performance.

    ICL became tunable. Researchers could now deliberately improve it rather than just observe it.

    Phase 4: Interactive ICL (2026)

    Recent work pushes this further. Models are trained to predict natural language critiques and feedback. If a model can predict what a teacher would say, it can internalize that signal. External correction becomes an internal capability.

    Paper: Improving Interactive In-Context Learning from Natural Language Feedback

    ELI5: Train a model to guess what feedback a human would give. Now the model has internalized the “teacher” and can improve itself without needing the actual teacher present. Self-correction without weight updates.

    Main idea: Models can learn to learn from feedback, making ICL interactive and self-improving.

    Beyond Language

    Newer work applies ICL to neuroscience discovery, showing that the mechanism is not limited to text tasks. It becomes a flexible reasoning substrate across domains. That’s when you know a concept has matured.

    The Arc

    Phase Era Key Insight
    Discovery 2020 Emerges from scale
    Explanation 2022 Implicit gradient descent
    Engineering 2023-24 Prompt design as optimization
    Self-improvement 2026 Learning from feedback

    The Deeper Insight

    In-Context Learning started as an emergent surprise. Now it’s becoming an engineered learning substrate inside transformers.

    It was not magic. It was meta-learning hiding in plain sight.

    References

    Paper Link
    Language Models are Few-Shot Learners (GPT-3) arXiv:2005.14165
    What Explains In-Context Learning in Transformers? arXiv:2202.12837
    Transformers Learn In-Context by Gradient Descent arXiv:2212.07677
    Improving Interactive ICL from Natural Language Feedback arXiv:2602.16066

    ICL started as “whoa, it works.” Now we understand “why it works.” Next: engineering it deliberately.

    Part 5 of the Machine Learning series. View all parts | Next: Part 6 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 456 words3 min readAbstract

    Five ML Concepts - #19

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #19
    Video
    Comments Discord

    References

    Concept Reference
    Autoencoders Reducing the Dimensionality of Data with Neural Networks (Hinton & Salakhutdinov 2006)
    Correlation vs Causation Causality (Pearl 2009)
    Curriculum Learning Curriculum Learning (Bengio et al. 2009)
    Failure Analysis Practical Machine Learning for Computer Vision (Lakshmanan et al. 2021)
    Covariate Shift Dataset Shift in Machine Learning (Quinonero-Candela et al. 2009)

    Today’s Five

    1. Autoencoders

    Autoencoders are neural networks trained to compress inputs into a smaller representation and reconstruct them. The bottleneck forces the model to capture essential structure.

    This learned compression is useful for dimensionality reduction, denoising, and feature learning.

    Like summarizing a book into key points and then rebuilding the story from that summary.

    2. Correlation vs Causation

    Two variables can move together without one causing the other. Models typically learn correlations present in data, not true cause-and-effect relationships.

    This matters because interventions based on correlation alone may not produce intended effects.

    Like noticing umbrella sales rise with rain—umbrellas don’t cause rain.

    3. Curriculum Learning

    Training starts with easier examples and gradually introduces harder ones. This can improve stability and learning speed in some settings.

    The approach mirrors how humans learn complex subjects incrementally.

    Like teaching math by starting with addition before moving to calculus.
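A toy sketch of the idea in Python, using sentence length as a stand-in difficulty score (a real curriculum would use a task-specific measure):

```python
# A toy curriculum: order training examples from easy to hard before feeding
# them to the trainer. "Difficulty" here is just word count, a stand-in for
# whatever difficulty score your task actually defines.
examples = [
    ("the transformer attends over all positions simultaneously", "hard"),
    ("cats sleep", "easy"),
    ("gradient descent minimizes loss", "medium"),
]

curriculum = sorted(examples, key=lambda ex: len(ex[0].split()))
print([label for _, label in curriculum])  # ['easy', 'medium', 'hard']
```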

    4. Failure Analysis

    Failure analysis groups model errors into categories to understand where performance breaks down. This helps target improvements instead of guessing.

    Systematic error analysis often reveals actionable patterns invisible in aggregate metrics.

    Like a teacher reviewing which types of questions students miss most often.

    5. Covariate Shift

    Covariate shift occurs when the input distribution changes between training and deployment, while the task itself remains the same. The model may underperform because it sees unfamiliar inputs.

    Monitoring input distributions helps detect this shift early.

    Like training a driver in sunny weather and testing them in snow.
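The monitoring idea can be sketched in a few lines of Python. The numbers and the 3-sigma threshold here are illustrative, not a production recipe:

```python
import statistics

def drift_score(train_values, live_values):
    """Standardized distance between the live mean and the training mean.

    A score well above ~3 suggests the input distribution has shifted.
    """
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]   # inputs seen during training
live = [14.8, 15.2, 15.0, 14.9, 15.1, 15.3]  # inputs seen in production

print(drift_score(train, live) > 3.0)  # True: large shift detected
```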

    Quick Reference

    Concept One-liner
    Autoencoders Compress and reconstruct to learn structure
    Correlation vs Causation Co-occurrence isn’t cause
    Curriculum Learning Start easy, progress to hard
    Failure Analysis Categorize errors to guide fixes
    Covariate Shift New inputs, same task

    Short, accurate ML explainers. Follow for more.

    Part 19 of the Five ML Concepts series. View all parts | Next: Part 20 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 2244 words · 12 min read · Abstract

    JSON et al: A Deep Dive into Data Serialization Formats

    JSON is everywhere. APIs. Logs. Databases. Configuration files. But it’s not alone. A whole ecosystem of formats exists—each optimizing for different tradeoffs.

    This post expands on the JSON et al short, providing technical depth on each format: when it was created, where it’s specified, and what problems it solves.


    The Tradeoff Triangle

    Before diving in, understand the fundamental constraint. Data formats balance three competing goals:

    Goal Description
    Human Readability Can a developer read and edit it directly?
    Compactness How many bytes does it take to represent data?
    Query Performance How fast can you access specific fields?

    You usually only get two. JSON optimizes readability. Protobuf optimizes compactness. JSONB optimizes query performance. No format wins everywhere.


    JSON: The Ubiquitous Baseline

    Created: 2001 (discovered/formalized by Douglas Crockford) Specification: ECMA-404 (2013), RFC 8259 (2017) File Extension: .json

    JSON (JavaScript Object Notation) emerged from JavaScript’s object literal syntax but became language-agnostic. Crockford didn’t invent it—he “discovered” it already existing in JavaScript and formalized the specification.

    Technical Details

    • Encoding: UTF-8 text (UTF-16/32 allowed but rare)
    • Data Types: Objects {}, arrays [], strings, numbers, booleans, null
    • Schema: None required
    • Comments: Not allowed in strict JSON

    Strengths

    • Universal parser support (every language has one)
    • Human readable without tools
    • Web-native (JavaScript parses it natively)
    • Simple specification (fits on a business card)

    Weaknesses

    • Verbose (field names repeated for every object)
    • No binary data type (must base64-encode)
    • No comments (frustrating for config files)
    • Parsing overhead (tokenization + string decoding every time)

    ELI5

    Like typing a long email instead of sending a terse text. Every message spells everything out—clear, but verbose.

    When to Use

    REST APIs, configuration (when comments aren’t needed), data interchange between systems, anywhere human readability matters more than efficiency.
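A quick illustration with Python's standard json module, showing the lossless round-trip and the verbosity of repeated field names:

```python
import json

record = {"name": "Alice", "score": 95, "active": True}

text = json.dumps(record)             # serialize to text
print(text)                           # {"name": "Alice", "score": 95, "active": true}
assert json.loads(text) == record     # round-trips losslessly

# Verbosity in action: field names are repeated for every object.
records = [{"name": "Alice", "score": 95}, {"name": "Bob", "score": 87}]
print(json.dumps(records).count('"name"'))  # 2, the key is spelled out per object
```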


    JSONL / NDJSON: Streaming JSON

    Created: ~2013 (formalized) Specification: JSON Lines, NDJSON File Extension: .jsonl, .ndjson

    JSONL (JSON Lines) and NDJSON (Newline-Delimited JSON) are the same concept: one valid JSON object per line, separated by newlines.

    Technical Details

    {"name": "Alice", "score": 95}
    {"name": "Bob", "score": 87}
    {"name": "Carol", "score": 92}
    

    No wrapping array. Each line is independently parseable.

    Strengths

    • Streaming: Process line-by-line without loading entire file
    • Append-only: Add records without rewriting the file
    • Parallel processing: Split by line, distribute to workers
    • Fault-tolerant: One corrupt line doesn’t invalidate the file

    Weaknesses

    • Not valid JSON (can’t parse with standard JSON parser)
    • Still text-based (same verbosity as JSON)
    • No random access by index

    ELI5

    Like removing one comma per line to save some typing. Each line is self-contained, so you can grab and process them one at a time.

    When to Use

    Log files, big data pipelines (Spark, Pandas), ML datasets, event streams, anywhere you need to process records incrementally.
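The line-by-line processing model looks like this in Python, using an in-memory io.StringIO in place of a real log file:

```python
import io
import json

# Three independent records, one JSON object per line (JSONL / NDJSON).
jsonl = io.StringIO(
    '{"name": "Alice", "score": 95}\n'
    '{"name": "Bob", "score": 87}\n'
    '{"name": "Carol", "score": 92}\n'
)

# Stream line by line: no wrapping array, nothing loaded all at once.
total = 0
for line in jsonl:
    record = json.loads(line)  # each line is a complete JSON document
    total += record["score"]

print(total)  # 274
```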


    JSONB: Binary JSON for Databases

    Created: 2014 (PostgreSQL 9.4) Specification: Implementation-specific (no universal standard) Storage: Database column type

    JSONB isn’t a file format—it’s a database storage optimization. PostgreSQL’s JSONB differs from MongoDB’s BSON, which differs from other implementations.

    PostgreSQL JSONB Details

    • Parsed once: Text converted to binary on INSERT
    • Keys sorted: Deterministic ordering for indexing
    • Duplicates removed: Last value wins
    • Offset table: O(log n) field lookup instead of O(n) text scanning

    MongoDB BSON

    Specification: bsonspec.org

    BSON (Binary JSON) is MongoDB’s serialization format. Unlike PostgreSQL’s JSONB, BSON is a standalone binary format:

    • Type-prefixed values
    • Supports additional types (Date, Binary, ObjectId)
    • Length-prefixed for fast skipping
    • Often comparable in size to JSON, sometimes larger (length prefixes trade bytes for fast traversal)

    Strengths

    • Fast queries without re-parsing
    • Indexable (GIN indexes on JSONB in PostgreSQL)
    • Type coercion at storage time

    Weaknesses

    • Not portable (implementation-specific)
    • Not human-readable
    • INSERT overhead (parsing cost upfront)

    ELI5

    Instead of cooking from scratch every time, you heat a pre-made meal. The prep work happens once (on INSERT), so serving (queries) is fast.

    When to Use

    Database storage where you query into JSON structures. PostgreSQL JSONB + GIN indexes enable fast @> containment queries.


    Protocol Buffers: Google’s Schema-First Format

    Created: 2001 (internal Google), 2008 (open-sourced) Specification: developers.google.com/protocol-buffers File Extension: .proto (schema), binary wire format

    Protocol Buffers (Protobuf) is Google’s language-neutral, schema-required serialization format. It powers gRPC.

    Technical Details

    Schema definition:

    message Sensor {
      int32 temperature = 1;
      int32 humidity = 2;
    }
    

    Wire format uses field numbers, not names:

    Field 1: 72
    Field 2: 40
    

    Key Features

    • Varint encoding: Small integers use fewer bytes
    • Field numbers: Enable backward compatibility
    • Code generation: .proto → language-specific classes
    • No self-description: Receiver must know schema
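Varint encoding is simple enough to sketch in a few lines of Python. This is an illustrative reimplementation of the wire-format rule, not the official protobuf library:

```python
def encode_varint(n: int) -> bytes:
    """Protobuf-style varint: 7 payload bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set, more bytes coming
        else:
            out.append(byte)         # high bit clear, this is the last byte
            return bytes(out)

print(encode_varint(72).hex())   # 48   - small ints fit in one byte
print(encode_varint(300).hex())  # ac02 - two bytes for larger values
```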

    Strengths

    • Extremely compact (3-10x smaller than JSON typically)
    • Fast serialization/deserialization
    • Strong versioning semantics
    • gRPC integration

    Weaknesses

    • Requires schema agreement
    • Not human-readable
    • Tooling required for debugging
    • Schema evolution has rules

    ELI5

    Everyone agrees upfront what “field 1” means. You don’t waste space spelling out “temperature”—you just send the number 1 and the value. Both sides know the code.

    When to Use

    Microservices (gRPC), internal APIs, anywhere bandwidth and latency matter more than debuggability.


    ASN.1: The Telecom Veteran

    Created: 1984 (ITU-T X.208) Specification: ITU-T X.680-X.683 Encoding Rules: BER, DER, PER, XER, and more

    ASN.1 (Abstract Syntax Notation One) predates all modern formats. It defines both schema and encoding, with multiple encoding rules for different use cases.

    Encoding Rules Comparison

    Rule Use Case
    BER (Basic Encoding Rules) Flexible, general purpose
    DER (Distinguished Encoding Rules) Deterministic, for cryptography
    PER (Packed Encoding Rules) Most compact, for bandwidth-constrained
    XER (XML Encoding Rules) XML-based, for interop

    Where You See ASN.1

    • X.509 certificates (SSL/TLS certs are DER-encoded ASN.1)
    • LDAP (directory services)
    • SNMP (network management)
    • Telecom protocols (SS7, GSM, LTE)

    Strengths

    • Bit-level precision
    • Proven over 40 years
    • Multiple encoding options
    • Formal verification possible

    Weaknesses

    • Complex specification
    • Steep learning curve
    • Tooling can be expensive
    • Security vulnerabilities in parsers (historically)

    ELI5

    Same idea as Protobuf—everyone agrees upfront what each field number means. ASN.1 just got there 20 years earlier and handles even more edge cases.

    When to Use

    You probably won’t choose ASN.1 for new projects. You’ll encounter it in cryptography, certificates, and legacy telecom systems.


    YAML: Human-Friendly Configuration

    Created: 2001 (Clark Evans, Ingy döt Net, Oren Ben-Kiki) Specification: yaml.org/spec/1.2.2 File Extension: .yaml, .yml

    YAML (YAML Ain’t Markup Language) prioritizes human readability. It’s a superset of JSON—any valid JSON is valid YAML.

    Technical Details

    # Comments allowed!
    server:
      host: localhost
      port: 8080
      features:
        - auth
        - logging
    

    Key Features

    • Indentation-based: Whitespace matters
    • Comments: # for single-line
    • Anchors/aliases: &name and *name for references
    • Multiple documents: --- separator

    Strengths

    • Highly readable
    • Comments supported
    • Multi-line strings without escaping
    • Complex data structures

    Weaknesses

    • “Norway problem”: NO parses as boolean false
    • Whitespace sensitivity causes errors
    • Multiple ways to express same data
    • Security concerns (arbitrary code execution in some parsers)

    ELI5

    Optimized for clarity, not bandwidth. YAML is for humans editing config files—not for machines exchanging data over networks.

    When to Use

    Configuration files (Kubernetes, Docker Compose, CI/CD), anywhere humans edit data directly and comments help.


    TOML: Minimal Configuration

    Created: 2013 (Tom Preston-Werner) Specification: toml.io File Extension: .toml

    TOML (Tom’s Obvious Minimal Language) emerged as a reaction to YAML’s complexity. It’s used by Rust (Cargo.toml), Python (pyproject.toml), and others.

    Technical Details

    [server]
    host = "localhost"
    port = 8080
    
    [server.features]
    auth = true
    logging = true
    

    Key Features

    • Explicit typing: Dates, times, arrays have clear syntax
    • Sections: [section] and [section.subsection]
    • No anchors: Intentionally simpler than YAML
    • Deterministic: Same data = same representation

    Strengths

    • Easy to read and write
    • Unambiguous parsing
    • Clear error messages
    • Growing ecosystem support

    Weaknesses

    • Less expressive than YAML
    • Nested structures can be verbose
    • Smaller ecosystem than JSON/YAML

    ELI5

    Same goal as YAML—clarity for humans, not bandwidth for machines—but with stricter rules so you make fewer mistakes.

    When to Use

    Configuration files where YAML’s complexity isn’t needed. Rust projects (mandatory). Python packaging (pyproject.toml).


    TOON: Token-Optimized for LLMs

    Created: October 2025 (toon-format organization) Specification: github.com/toon-format/toon (v3.0) File Extension: .toon Media Type: text/toon (provisional)

    TOON (Token Oriented Object Notation) is the newest format in this list, designed specifically for LLM input. It’s a lossless representation of JSON that minimizes tokens.

    Technical Details

    TOON combines YAML-style indentation for nested objects with CSV-like tabular layouts for uniform arrays:

    users[2]{name,age}:
    Alice,25
    Bob,30
    

    Equivalent JSON:

    {"users": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]}
    

    Key Features

    • Header-based: Field names declared once, values follow
    • Token savings: Typically ~40% fewer tokens than equivalent JSON
    • Lossless: Round-trips to JSON perfectly
    • UTF-8 always: No encoding ambiguity

    Performance

    Metric JSON TOON
    Accuracy 69.7% 73.9%
    Efficiency (acc/1K tokens) 15.3 26.9

    Strengths

    • Significant token savings at scale
    • Better context window utilization
    • Lower API costs for LLM applications
    • Human-readable (unlike binary formats)

    Weaknesses

    • New format (October 2025)
    • Limited tooling compared to JSON
    • Requires conversion layer for existing systems
    • Not yet widely adopted

    ELI5

    Like having one header row for each column in a table instead of repeating the column name for every single row. You declare field names once, then just list the values.

    When to Use

    LLM prompts with structured data, RAG applications, anywhere token efficiency matters. Especially useful for large datasets with uniform object arrays.
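To make the header-based layout concrete, here is a toy Python parser for exactly the uniform-array example above. It is illustrative only, not a conforming TOON implementation:

```python
import json
import re

def parse_uniform_array(toon: str) -> dict:
    """Parse just the tabular TOON form shown above: name[count]{fields}: rows.

    A teaching toy for one shape of document, not a full TOON parser.
    """
    lines = toon.strip().splitlines()
    m = re.match(r"(\w+)\[(\d+)\]\{([^}]*)\}:", lines[0])
    key, count, fields = m.group(1), int(m.group(2)), m.group(3).split(",")
    rows = []
    for line in lines[1 : 1 + count]:
        values = [int(v) if v.isdigit() else v for v in line.split(",")]
        rows.append(dict(zip(fields, values)))
    return {key: rows}

toon = "users[2]{name,age}:\nAlice,25\nBob,30"
print(json.dumps(parse_uniform_array(toon)))
# {"users": [{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]}
```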

    Implementations

    • TypeScript: Reference implementation
    • Python: toons (Rust-based, fast)
    • Go, Rust, .NET: Available via toon-format org

    Alternatives Not in the Video

    MessagePack

    Created: 2008 (Sadayuki Furuhashi) Specification: msgpack.org

    Binary JSON without schema. Type-prefixed values, efficient numeric encoding.

    Use when: You want JSON semantics but smaller/faster.

    CBOR

    Created: 2013 (IETF) Specification: RFC 8949

    Concise Binary Object Representation. Designed for constrained environments (IoT).

    Use when: Resource-constrained devices, need a standard binary format.

    Apache Avro

    Created: 2009 (Apache, Doug Cutting) Specification: avro.apache.org

    Schema-based, row-oriented binary format. Schema embedded or stored separately. Strong schema evolution support.

    Use when: Big data pipelines (Hadoop, Kafka), schema evolution is critical.

    Apache Parquet

    Created: 2013 (Twitter + Cloudera) Specification: parquet.apache.org

    Columnar storage format. Not for serialization—for analytics storage.

    Use when: Large-scale analytics, data warehousing, Spark/Pandas workflows.

    Cap’n Proto

    Created: 2013 (Kenton Varda, ex-Protobuf author) Specification: capnproto.org

    Zero-copy serialization. The serialized form is the in-memory form.

    Use when: Extreme performance requirements, inter-process communication.

    FlatBuffers

    Created: 2014 (Google) Specification: google.github.io/flatbuffers

    Zero-copy like Cap’n Proto but with better tooling. Used in games, mobile.

    Use when: Games, mobile apps, anywhere memory allocation matters.


    Quick Reference

    Format Year Schema Binary Human-Readable Best For
    JSON 2001 No No Yes APIs, interchange
    JSONL 2013 No No Yes Logs, streaming
    JSONB 2014 No Yes No Database queries
    Protobuf 2008 Yes Yes No Microservices
    ASN.1 1984 Yes Yes No Crypto, telecom
    YAML 2001 No No Yes Config files
    TOML 2013 No No Yes Simple config
    TOON 2025 No No Yes LLM prompts
    MessagePack 2008 No Yes No Fast JSON
    CBOR 2013 Optional Yes No IoT
    Avro 2009 Yes Yes No Big data

    Key Takeaways

    1. No “best” format exists. Each optimizes for different constraints.

    2. Text formats favor humans. JSON, YAML, TOML prioritize readability over efficiency.

    3. Binary formats favor machines. Protobuf, MessagePack, CBOR prioritize compactness and speed.

    4. Schema formats favor correctness. Protobuf, Avro, ASN.1 catch errors at compile time.

    5. The tradeoff triangle is real. Readability, compactness, query performance—pick two.

    The question isn’t “which format wins?” The question is: what problem are you solving?




    Data formats are design decisions. Choose based on your constraints, not trends.

    Part 2 of the General Technology series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 449 words · 3 min read · Abstract

    Five ML Concepts - #18

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #18
    Comments Discord

    References

    Concept Reference
    Preference Learning Learning to summarize from human feedback (Stiennon et al. 2020)
    Ensembling Ensemble Methods in Machine Learning (Dietterich 2000)
    ML Fragility Distribution Shift (Quinonero-Candela et al. 2009)
    Epoch Deep Learning (Goodfellow et al. 2016), Chapter 8
    Cost vs Quality Efficient Transformers: A Survey (Tay et al. 2022)

    Today’s Five

    1. Preference Learning

    Instead of learning from fixed labels, models are trained from comparisons between outputs. This helps align model behavior with human judgments.

    The approach works well when absolute quality is hard to define but relative preferences are easier to express.

    Like learning to cook by asking which dish tastes better.

    2. Ensembling

    Ensembling combines predictions from multiple models. Different models make different errors, and combining them can improve robustness.

    Common strategies include voting, averaging, and stacking models together.

    Like asking several experts and averaging their opinions.
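Averaging and voting are one-liners. A minimal Python sketch with made-up scores from three hypothetical models:

```python
import statistics

# Three "models" scoring the same input for class "spam"; each errs differently.
model_scores = [0.72, 0.65, 0.70]  # hypothetical per-model probabilities

averaged = statistics.mean(model_scores)              # soft ensemble
vote = sum(s > 0.5 for s in model_scores) >= 2        # majority vote at 0.5

print(round(averaged, 2))  # 0.69
print(vote)                # True
```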

    3. Why ML Is Fragile

    Models rely on statistical patterns learned from data. When those patterns shift, performance can degrade quickly.

    This fragility emerges because models optimize for training distributions, not arbitrary future scenarios.

    Like a spell checker that works on common words but struggles with unusual ones.

    4. Epoch

    An epoch is one complete pass through the training dataset. Multiple epochs allow the model to refine its weights over repeated passes.

    Training typically continues for many epochs until validation performance stops improving.

    Like reading a textbook from beginning to end more than once.
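The structure in code, with a toy dataset and the weight update left as a comment:

```python
# Skeleton of epoch-based training (toy data, no real model).
dataset = [([1.0], 1), ([2.0], 0), ([3.0], 1)]  # (features, label) pairs
num_epochs = 3
steps = 0

for epoch in range(num_epochs):      # one epoch = one full pass over the data
    for features, label in dataset:  # every example is seen once per epoch
        steps += 1                   # a real loop would update weights here

print(steps)  # 9: 3 examples x 3 epochs
```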

    5. Cost vs Quality Tradeoffs

    Increasing model size or compute often improves performance, but also increases cost and latency. Engineers balance quality against budget and responsiveness.

    Production systems often use smaller, faster models rather than the largest available.

    Like choosing between a luxury car and an economy car depending on your needs.

    Quick Reference

    Concept One-liner
    Preference Learning Train from comparisons, not labels
    Ensembling Combine models for robustness
    ML Fragility Statistical models break on distribution shift
    Epoch One pass through training data
    Cost vs Quality Bigger isn’t always better in production

    Short, accurate ML explainers. Follow for more.

    Part 18 of the Five ML Concepts series. View all parts | Next: Part 19 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1043 words · 6 min read · Abstract

    midi-cli-rs: Music Generation for AI Coding Agents

    AI coding agents can write code, generate images, and produce text. But what about music? When I needed background audio for explainer videos, I wanted a tool that AI agents could use directly—no music theory required.

    Resource Link
    Video midi-cli-rs Explainer
    Video
    Examples Listen to Samples
    Code midi-cli-rs
    Comments Discord

    The Problem

    Generating music programmatically is hard. Traditional approaches require understanding music theory, MIDI specifications, instrument mappings, and audio synthesis. That’s a lot to ask of an AI agent that just needs a 5-second intro.

    I wanted something simpler: a CLI tool where an agent could say “give me 5 seconds of suspenseful music” and get a usable WAV file.

    The Solution: Mood Presets

    midi-cli-rs solves this with mood presets—curated musical generators that produce complete compositions from a single command:

    # Generate a 5-second suspenseful intro
    midi-cli-rs preset --mood suspense --duration 5 -o intro.wav
    
    # Upbeat outro with specific key
    midi-cli-rs preset -m upbeat -d 7 --key C --seed 42 -o outro.wav
    

    Six moods are available:

    Mood Character
    suspense Low drones, tremolo strings, tension
    eerie Sparse tones, diminished harmony
    upbeat Rhythmic chords, energetic
    calm Warm pads, gentle arpeggios
    ambient Textural drones, pentatonic bells
    jazz Walking bass, brushed drums, piano trio

    Each mood generates multi-layer compositions with appropriate instruments, rhythms, and harmonies. The --seed parameter ensures reproducibility—same seed, same output. Different seeds produce meaningful variations in melody contour, rhythm patterns, and instrument choices.

    Melodic Variation

    The presets don’t just randomize notes—they use a contour-based variation system. Changing the seed produces melodies that follow different shapes (ascending, descending, arch, wave) while staying musically coherent. This means you can generate multiple versions of a mood and pick the one that fits best.

    How It Works

    The tool generates MIDI programmatically, then renders to WAV using FluidSynth:

    Mood Preset → MIDI Generation → FluidSynth → WAV Output
    

    MIDI generation uses the midly crate to create standard MIDI files. Each preset generates multiple tracks with different instruments, note patterns, and dynamics.

    Audio rendering calls FluidSynth as a subprocess with a SoundFont (instrument samples). This avoids LGPL licensing complications—subprocess execution doesn’t trigger copyleft.

    Note-Level Control

    When presets aren’t enough, you can specify exact notes:

    # Note format: PITCH:DURATION:VELOCITY[@OFFSET]
    midi-cli-rs generate \
        --notes "C4:0.5:80@0,E4:0.5:80@0.5,G4:0.5:80@1,C5:1:90@1.5" \
        -i piano -t 120 -o arpeggio.wav
    

    Or use JSON for complex multi-track arrangements:

    echo '{"tempo":90,"instrument":"piano","notes":[
      {"pitch":"C4","duration":0.5,"velocity":80,"offset":0},
      {"pitch":"E4","duration":0.5,"velocity":80,"offset":0.5},
      {"pitch":"G4","duration":1,"velocity":90,"offset":1}
    ]}' | midi-cli-rs generate --json -o output.wav
    

    Web UI

    For interactive composition, there’s a browser-based interface:

    midi-cli-rs serve  # Starts on http://127.0.0.1:3105
    

    The Presets tab lets you adjust mood, key, duration, intensity, and tempo with immediate audio preview. Click the clock button to generate a time-based seed for unique but reproducible results.

    The Melodies tab provides note-by-note composition with keyboard shortcuts:

    • a-g for note pitch
    • [ / ] to adjust duration
    • + / - to change octave
    • Tab to navigate between notes

    For AI Agents

    The CLI is designed for AI agent usage:

    1. Simple commands: One line generates complete audio
    2. Reproducible: Seed values ensure consistent output
    3. Self-documenting: --help includes agent-specific instructions
    4. Composable: Generate tracks separately, combine with ffmpeg
    # AI agent workflow
    midi-cli-rs preset -m suspense -d 5 --seed 1 -o intro.wav
    midi-cli-rs preset -m upbeat -d 10 --seed 2 -o main.wav
    ffmpeg -i intro.wav -i main.wav -filter_complex concat=n=2:v=0:a=1 final.wav
    

    SoundFont Quality Matters

    The quality of generated audio depends heavily on the SoundFont used. SoundFonts are collections of audio samples for each instrument—a tiny SoundFont with compressed samples will sound thin and artificial, while a larger one with high-quality recordings produces professional results.

    SoundFont Size Quality License
    TimGM6mb ~6MB Basic GPL v2
    GeneralUser GS ~30MB Good Permissive
    FluidR3_GM ~140MB Very Good MIT
    MuseScore_General ~200MB Excellent MIT

    For anything beyond quick prototypes, use a quality SoundFont. The difference is dramatic—the same MIDI file can sound like a toy keyboard or a real instrument depending on the samples.

    The tool auto-detects SoundFonts in common locations (~/.soundfonts/, /opt/homebrew/share/soundfonts/, etc.), or specify one explicitly with --soundfont.

    Technical Details

    Built with Rust 2024 edition using permissively licensed dependencies:

    Crate Purpose
    midly MIDI file generation
    clap CLI argument parsing
    serde JSON serialization
    rand Randomization for presets
    axum Web server (for serve command)

    FluidSynth is called as a subprocess for WAV rendering, keeping the main codebase MIT-licensed.

    Try It

    Listen to sample outputs, or build locally:

    git clone https://github.com/softwarewrighter/midi-cli-rs.git
    cd midi-cli-rs
    cargo build --release
    ./target/release/midi-cli-rs preset -m jazz -d 5 -o jazz.wav
    

    Requires FluidSynth for WAV output (brew install fluid-synth on macOS).


    Disclaimer

    You are responsible for how you use generated audio. Ensure you have the appropriate rights and permissions for any commercial or public use. This tool generates MIDI data algorithmically—how you render and distribute the final audio is your responsibility.

    Be aware that algorithmic composition can inadvertently produce sequences similar to existing copyrighted works. Whether you use this tool, AI generation, or compose by hand, you must verify that your output doesn’t infringe on existing copyrights before public release or commercial use. Protect yourself legally.

    Part 2 of the Personal Software series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 477 words · 3 min read · Abstract

    Five ML Concepts - #17

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #17
    Comments Discord

    References

    Concept Reference
    Benchmark Leakage Rethinking the Inception Architecture for Computer Vision (Szegedy et al. 2016)
    Concept/Data Drift Learning under Concept Drift: A Review (Lu et al. 2018)
    Weight Decay Decoupled Weight Decay Regularization (Loshchilov & Hutter 2019)
    Scaling Laws Scaling Laws for Neural Language Models (Kaplan et al. 2020)
    Shadow Deployment Reliable Machine Learning (Cathy Chen et al. 2022)

    Today’s Five

    1. Benchmark Leakage

    When benchmark or test data influences training, tuning, or model selection, evaluation results become unreliable. This inflates reported performance beyond real-world capability.

    Strict separation between development and evaluation data is essential for honest assessment.

    Like practicing with the exact questions that will appear on the final exam.

    2. Concept Drift vs Data Drift

    Data drift occurs when input distributions change. Concept drift occurs when the relationship between inputs and outputs changes. Both can degrade model performance over time.

    Data drift: customers buy different products. Concept drift: what “good” means has changed.

    Like customers buying different products versus products changing what they mean.

    3. Weight Decay

    A regularization method that penalizes large weights, often implemented as L2 regularization. This encourages simpler models that generalize better.

    Weight decay adds a term proportional to the squared magnitude of weights to the loss function.

    Like encouraging shorter, simpler answers instead of overly complicated ones.
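A single update step with decoupled weight decay (the AdamW-style form from the referenced Loshchilov & Hutter paper), sketched for one scalar weight with illustrative numbers:

```python
# One gradient step with decoupled weight decay, for a single scalar weight.
# All numbers here are illustrative.
w = 2.0       # current weight
grad = 0.5    # gradient of the task loss w.r.t. w
lr = 0.1      # learning rate
decay = 0.01  # weight decay coefficient

# Without decay the step would be w - lr * grad; the extra term
# shrinks the weight toward zero a little on every update.
w = w - lr * grad - lr * decay * w
print(round(w, 3))  # 1.948
```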

    4. Scaling Laws

    Empirical relationships showing how performance tends to improve as model size, data, or compute increase. These relationships follow predictable power-law curves.

    Scaling laws help predict resource requirements for target performance levels.

    Like noticing that adding horsepower often increases a car’s speed, but with diminishing returns.

    5. Shadow Deployment

    Running a new model in parallel with production without affecting live user decisions. The shadow model processes real traffic but its outputs are only logged, not served.

    This allows safe evaluation before full deployment.

    Like a new chef preparing the same dishes in the back kitchen before serving customers.
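The serving pattern in miniature, with hypothetical stand-in models: the shadow's output is logged for comparison but never returned to the user:

```python
import logging

logging.basicConfig(level=logging.INFO)

def production_model(x):  # hypothetical stand-in for the live model
    return "approve" if x > 0.5 else "deny"

def shadow_model(x):      # hypothetical candidate being evaluated
    return "approve" if x > 0.4 else "deny"

def handle_request(x):
    decision = production_model(x)  # this is what the user sees
    shadow = shadow_model(x)        # shadow output is logged, never served
    logging.info("input=%s prod=%s shadow=%s agree=%s",
                 x, decision, shadow, decision == shadow)
    return decision

print(handle_request(0.45))  # deny: production decides; the disagreement is only logged
```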

    Quick Reference

    Concept One-liner
    Benchmark Leakage Test data contaminating training/selection
    Concept vs Data Drift Changed relationships vs changed inputs
    Weight Decay L2 penalty discourages large weights
    Scaling Laws Performance scales predictably with resources
    Shadow Deployment Test safely alongside production

    Short, accurate ML explainers. Follow for more.

    Part 17 of the Five ML Concepts series. View all parts | Next: Part 18 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1074 words · 6 min read · Abstract

    TBT (4/?): ToonTalk - Teaching Robots to Program

    I first discovered ToonTalk during the Windows XP era—probably around 2003 or 2004. It was unlike anything I’d seen: a programming environment disguised as a video game where you trained robots by showing them what to do. The concept stuck with me for two decades.

    Resource Link
    Video ToonTalk in Rust
    Video
    tt-rs Demo Live Demo
    tt-rs Repo tt-rs
    Comments Discord

    What is ToonTalk?

    ToonTalk is a visual programming environment created by Ken Kahn in 1995. The “Toon” stands for cartoon—every abstract programming concept is mapped to a concrete, animated metaphor:

    Concept ToonTalk Metaphor
    Variables Boxes with numbered holes
    Values Numbers, text, images in boxes
    Comparison Scales that tip when values differ
    Functions Robots that watch and learn
    Message passing Birds that carry items to nests
    Garbage collection Trucks that haul away unused items

    The design was influenced by games like The Legend of Zelda and Robot Odyssey—the kind of games that made you think while you played.

    Programming by Demonstration

    The core idea is radical: you don’t write code, you show a robot what to do.

    1. Create a robot and put it in “training mode”
    2. Perform actions while the robot watches (move items, compare values, etc.)
    3. The robot records your actions as a program
    4. Give the robot a box matching the training pattern—it executes the learned behavior

    This is programming by demonstration. The robot generalizes from your example, matching patterns and applying transformations. It’s the same conceptual model as teaching a child: “Watch what I do, then you try.”

    Three Generations

    ToonTalk has existed in three forms:

    Version Era Technology
    Original ToonTalk 1995-2009 C++, 3D desktop application
    ToonTalk Reborn 2014-2017 JavaScript/jQuery web app
    tt-rs 2025-2026 Rust/WebAssembly/Yew

    The original was a full 3D world—cities, houses, helicopters, even bombs for debugging. Ken Kahn later created ToonTalk Reborn, a simplified JavaScript version that runs in browsers.

    Why I Built tt-rs

    When I rediscovered ToonTalk Reborn a few years ago, I wanted to experiment with the concepts myself. But diving into a large jQuery codebase wasn’t appealing. So I did what any reasonable person would do: I vibe coded my own version in Rust.

    tt-rs is a modern reimplementation using:

    • Rust for core logic
    • WebAssembly for browser execution
    • Yew for reactive UI
    • SVG/CSS for graphics and animations

    It’s not a port—it’s a fresh implementation inspired by the same ideas. Building it myself lets me understand the concepts deeply and experiment with variations.

    Three Learning Levels

    The demo introduces concepts progressively through three levels:

    Level Concepts Widgets
    tt1 Basics Numbers, boxes, scales, wand, vacuum
    tt2 Messaging Birds and nests for communication
    tt3 Automation Sensors (time, random) + robots

    Level one covers the fundamentals: numbers with arithmetic, boxes as containers, scales for comparison, and tools for copying and removing. Level two adds asynchronous messaging—birds carry items to their paired nests. Level three brings sensors that produce values and robots that automate actions.

    Current Features

    The live demo includes:

    Widgets:

    • Numbers: Rational arithmetic with +, -, *, / operators
    • Boxes: Configurable containers with 0-9 holes (resize with keyboard)
    • Text: Basic text display
    • Scales: Visual comparison that tips when values differ
    • Robot: Training mode, action recording, execution
    • Bird/Nest: Message passing with pairing and delivery
    • Sensors: Time (milliseconds) and random number generation

    Tools:

    • Wand: Copy any widget
    • Vacuum: Remove widgets
    • Magnifier: Inspect nest message queues and robot actions

    Interactions:

    • Drag-and-drop with visual feedback
    • Box joining (drop box on edge of another)
    • Box splitting (drop box on a number)
    • Contextual help panel with level-specific content
    • Puzzle system with animated “Show Me” demos

    Robot Training

    The core feature is programming by demonstration:

    1. Click robot to enter training mode (yellow glow indicates “I’m watching”)
    2. Perform actions while the robot records (arithmetic, copy, remove, move to box)
    3. Click robot again to stop training
    4. Click robot to replay—it executes the recorded sequence

    The tutorials demonstrate this workflow step by step. In the “Train Robot” tutorial, you teach a robot to move a number into a box. In “Robot Sensors,” you train a robot to generate random numbers, apply modulo, and send results to a nest via a bird.

    Interactive Tutorials

    Each tutorial has two parts:

    1. Show Me: Watch an animated demonstration where a cursor walks through the solution
    2. Practice: Try it yourself with the same widgets

    The tutorials cover:

    • Fill a box with numbers
    • Add numbers together
    • Copy widgets with the wand
    • Send messages with birds and nests
    • Train your first robot
    • Combine robots with sensors

    What’s Next

    The immediate priorities:

    1. Pattern matching - Robot generalizes from specific values to “any number”
    2. Watched execution - See robot work step-by-step with animated cursor
    3. Persistence - Save and load workspaces

    Long term, I’d like to add the 3D elements from the original—the cities, the houses, the helicopter view. But that’s a much larger project.

    The Enduring Appeal

    What makes ToonTalk fascinating isn’t just the visual metaphors—it’s the computational model. Under the hood, ToonTalk implements concurrent constraint logic programming. The robots are essentially guarded Horn clauses. The birds and nests implement the actor model.

    Heavy concepts, but you don’t need to know any of that to use it. You just train robots by example. The abstraction is complete.

    That’s why it stuck with me for twenty years. Good abstractions are rare. When you find one, it’s worth understanding deeply.

    References

    Resource Link
    ToonTalk Website toontalk.com
    ToonTalk on Wikipedia Wikipedia
    ToonTalk Reborn (JS) github.com/ToonTalk/ToonTalk
    ToonTalk Reborn Demo toontalk.github.io/ToonTalk
    ToonTalk Reborn Wiki Wiki
    Ken Kahn’s Page Ken Kahn
    Original Paper (1995) ERIC - ToonTalk: An Animated Programming Environment
    Ken Kahn’s Research Academia.edu

    Some ideas are worth rediscovering. ToonTalk is one of them.

    Part 4 of the Throwback Thursday series. View all parts | Next: Part 5 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 473 words3 min readAbstract

    Five ML Concepts - #16

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #16
    Comments Discord

    References

    Concept Reference
    Train/Val/Test Split Deep Learning (Goodfellow et al. 2016), Chapter 5
    Overconfidence On Calibration of Modern Neural Networks (Guo et al. 2017)
    Batch Normalization Batch Normalization: Accelerating Deep Network Training (Ioffe & Szegedy 2015)
    Optimization vs Generalization Understanding Deep Learning Requires Rethinking Generalization (Zhang et al. 2017)
    A/B Testing Controlled Experiments on the Web (Kohavi et al. 2009)

    Today’s Five

    1. Train / Validation / Test Split

    Data is divided into training, validation, and test sets. Training learns patterns, validation tunes hyperparameters, test evaluates final performance.

    Never use test data for any decisions during development—it should only be touched once.

    Like practicing on homework, checking with practice tests, then taking the real exam.
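The three-way split can be sketched in a few lines of Python. The `split_dataset` helper and its fractions are illustrative, not from the post — the point is that the test slice is carved off once and set aside:

```python
import random

def split_dataset(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then carve off validation and test sets.

    The test slice is set aside immediately and should only be
    touched for the final evaluation. (Illustrative helper.)
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))  # 70 / 15 / 15
```

Fixing the shuffle seed makes the split reproducible across runs, so "touch the test set once" is enforceable.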

    2. Overconfidence

    Models can assign very high probabilities to incorrect predictions. This is often related to poor calibration and can be dangerous in high-stakes applications.

    Temperature scaling and other calibration methods can help align confidence with accuracy.

    Like a student who is absolutely certain of a wrong answer.

    3. Batch Normalization

    Normalizes layer activations during training to improve stability and convergence. Each mini-batch’s activations are normalized to have zero mean and unit variance.

    This reduces internal covariate shift and often allows higher learning rates.

    Like keeping everyone on a similar pace during training so no one runs too far ahead.
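A minimal sketch of the core transform, assuming a 1-D batch of activations for one feature (the real layer also learns a per-feature scale gamma and shift beta, omitted here):

```python
def batch_norm(batch, eps=1e-5):
    """Normalize a mini-batch of activations to zero mean, unit variance.

    eps guards against division by zero for constant batches.
    (Toy 1-D version for illustration.)
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]

normed = batch_norm([1.0, 2.0, 3.0, 4.0])
```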

    4. Optimization vs Generalization

    Training loss can decrease while test performance does not improve. Good optimization does not guarantee good generalization.

    A model can perfectly fit training data while failing on new examples—this is overfitting.

    Like memorizing last year’s exam instead of understanding the subject.

    5. A/B Testing Models

    Comparing two model versions using controlled live traffic experiments. Users are randomly assigned to see predictions from model A or model B.

    Statistical analysis determines which model performs better on real-world metrics.

    Like taste-testing two recipes with real customers to see which works better.

    Quick Reference

    Concept One-liner
    Train/Val/Test Separate data for learning, tuning, and evaluation
    Overconfidence High probability on wrong predictions
    Batch Normalization Normalize activations for stable training
    Optimization vs Generalization Low train loss ≠ good test performance
    A/B Testing Compare models with live experiments

    Short, accurate ML explainers. Follow for more.

    Part 16 of the Five ML Concepts series. View all parts | Next: Part 17 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 801 words5 min readAbstract

    Multi-Hop Reasoning (2/2): The Distribution Trap

    In Part 1, a tiny 135M model achieved 75% accuracy on multi-hop reasoning. This time we scale up to 360M—and discover that RSFT on easy examples makes performance worse.

    Resource Link
    Paper KG-Guided RAG (arXiv)
    Code multi-hop-reasoning
    ELI5 eli5.md
    Demo Live Demo
    Explainer Coming soon
    Comments Discord

    Scaling Up: SmolLM-360M

    Part 1 used the 135M model. For better reasoning traces and demo quality, we trained the 360M variant:

    Model Parameters Platform
    SmolLM-135M-Instruct 135M MLX (macOS)
    SmolLM-360M-Instruct 360M MLX + Unsloth (cross-platform)

    The 360M model produces more coherent traces and is used by the live inference demo.

    The Distribution Trap

    Here’s what happened when we trained RSFT on the “easy” training data:

    Phase Training Data Accuracy Notes
    Base (none) 0% No format compliance
    SFT (500 iters) Easy (1-3 hop) 37% Learns TRACE + ANSWER format
    RSFT Easy (1-3 hop) 27% Worse than SFT!

    RSFT on easy examples performed worse than the SFT baseline.

    Why?

    The training examples (1-3 hops) don’t match the evaluation distribution (4-5 hops). The model learns shortcuts that work on easy problems but fail on hard ones.

    Training Distribution Eval Distribution Result
    Easy (1-3 hop) Hard (4-5 hop) 27% (worse)
    Hard (4-5 hop) Hard (4-5 hop) 75% (Part 1 result)

    The rejection sampling “winners” from easy examples teach strategies that don’t generalize.

    The Key Finding

    Rejection sampling must match your target distribution.

    This is counterintuitive. You might expect that training on more examples (even easy ones) would help. Instead:

    • Easy winners use shortcuts (fewer reasoning steps)
    • Hard eval requires full chain reasoning
    • Model learns the wrong patterns

    The fix: train RSFT on eval.jsonl (hard examples), not train.jsonl (easy examples).

    Demo Improvements

    The demo now includes four interactive tabs:

    Tab Feature
    Training Animated SFT→RSFT visualization with KG scoring
    Inference Pre-recorded inference examples
    Try It Live inference with 360M model
    Distribution Interactive visualization of the key finding

    Try It: Live Inference

    Ask DevOps troubleshooting questions and watch the model reason:

    Question: What causes TLSHandshakeError?
    
    TRACE: TLSHandshakeError is caused by ClockSkew,
    and ClockSkew leads to CertificateExpired,
    and CertificateExpired is fixed by RenewCert...
    ANSWER: B
    

    The knowledge graph scores the reasoning path during training, but at inference the model reasons independently.

    Cross-Platform Support

    The pipeline now runs on both platforms:

    Platform Framework Command
    macOS (Apple Silicon) MLX make train-360m
    Linux (NVIDIA CUDA) Unsloth make train-360m-unsloth

    Unsloth provides 2x faster training with 60% less memory on NVIDIA GPUs.

    Current Status

    Component Status
    SFT training (360M) Complete
    RSFT (wrong distribution) Complete (27%)
    RSFT (correct distribution) Next step
    Live demo with Try It Complete
    Cross-platform support Complete

    Next Steps

    Priority Task Expected Result
    High Retrain RSFT on eval.jsonl 75%+ accuracy
    Medium Update demo to use corrected model Better live inference
    Medium Curriculum learning (easy→hard) Smoother training
    Low Larger models (1B+) Higher ceiling

    The corrected RSFT training:

    # Train on the hard examples (eval.jsonl), not train.jsonl
    python3 -m core.rsft \
      --examples data/eval.jsonl \
      --kg data/kg.json \
      --sft-adapter data/runs/run_360m/models/sft \
      --output data/runs/run_360m/models/rsft_eval \
      --model HuggingFaceTB/SmolLM-360M-Instruct \
      --k-samples 8 \
      --max-examples 50
    

    Lessons Learned

    1. Distribution Matching is Non-Negotiable

    This isn’t a minor optimization—it’s the difference between 27% and 75% accuracy. Wrong distribution = wrong winners = wrong model.

    2. Easy Examples Can Hurt

    More training data isn’t always better. Easy examples teach shortcuts that fail on hard problems.

    3. Verify Your Pipeline

    We trained a full RSFT model before realizing the distribution mismatch. Always check that training data matches eval distribution.

    4. The Fix is Simple

    Once identified, the fix is one flag change: --examples data/eval.jsonl instead of train.jsonl.

    Resources


    Training distribution matters. Easy examples teach easy shortcuts.

    Part 2 of the Multi-Hop Reasoning series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 780 words4 min readAbstract

    Towards Continuous LLM Learning (2): Routing Prevents Forgetting

    In Part 1, naive LoRA fine-tuning caused catastrophic forgetting. Now we’re implementing the Share algorithm properly—and we’re about 60% of the way to verifying the paper’s claims.

    Resource Link
    Code sleepy-coder
    Part 1 When Fine-Tuning Fails
    ELI5 eli5.md
    Share Paper arXiv:2602.06043
    Comments Discord

    Paper Claims vs Implementation Status

    We’re systematically verifying the claims from the Share and UWSH papers:

    Paper Claim Infrastructure Demonstrated?
    Shared basis via SVD Complete Yes
    ~100x parameter reduction Complete (76x) Yes
    Task routing beats averaging Tested (Exp 1b) Partial
    Prevents catastrophic forgetting Tested (Exp 1b) Partial
    Sequential learning Not tested No
    UWSH subspace stability Not tested No

    Overall: ~60% complete. Infrastructure is solid. Routing tested. Sequential learning remains.

    What We Built

    The full Share algorithm implementation:

    • Phase 1: SVD-based subspace extraction from 51 LoRA adapters (60% variance threshold)
    • Phase 2: Coefficient-only training with frozen basis (83K params vs 1.6M full LoRA)
    • Phase 3: Basis merging and updates
    • Routing: Error pattern classification for coefficient selection

    Bug Fixes That Unlocked Progress

    Two critical bugs blocked proper Phase 2 training:

    Bug 1: Zero-Gradient Saddle Point

    Both coefficient matrices initialized to zero:

    eps_beta = 0, eps_alpha = 0
    → delta_W = 0 @ 0 = 0
    → zero gradients, no learning
    

    Fix: Dual small-random initialization.
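The saddle point is easy to see with a scalar stand-in for the coefficient matrices (a hypothetical sketch, not the sleepy-coder code): if delta_w = b * a, the gradient with respect to each factor is the other factor, so an all-zero start yields all-zero gradients and training never moves.

```python
def grads(b, a, upstream=1.0):
    """Gradients of delta_w = b * a w.r.t. both factors (scalar sketch).

    d(delta_w)/db = a and d(delta_w)/da = b: with both factors at
    zero, both gradients are zero -- a saddle point.
    """
    return upstream * a, upstream * b

gb, ga = grads(0.0, 0.0)      # the buggy all-zero init: no learning signal
gb2, ga2 = grads(0.02, 0.01)  # dual small-random init: both get gradients
```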

    Bug 2: Half-Parameter Training

    LoRA-style initialization only trained one coefficient set:

    Before: 112/224 parameters getting gradients
    After:  224/224 parameters getting gradients
    

    Fix: Both coefficient matrices need random initialization.

    Experiment 1b: Routing Works

    With gradient-trained v4 coefficients and proper routing:

    Strategy Pass Rate BC RH TB Regressions
    Baseline (no LoRA) 46.7% 70% 40% 30%
    Averaged 50.0% 70% 40% 40% 1
    Routed 50.0% 70% 50% 30% 0

    Result handling improved 40% → 50%. Zero regressions. This is the first positive transfer from Share coefficients.

    The Forgetting Heatmap

    We applied each coefficient individually to all 30 koans:

    Koan       BL  mut_bc dbl_mt ret_lr mis_cl mis_hs mis_or opt_ok res_me ROUTED
    bc_001-009 P   P      P      P      P      P      P      P      P      P
    bc_003,5,10.   .      .      .      .      .      .      .      .      .
    rh_002     .   .     +GAIN   .      .     +GAIN  +GAIN  +GAIN  +GAIN  +GAIN
    rh_008     P  -LOST  -LOST  -LOST  -LOST  -LOST  -LOST  -LOST  -LOST   P
    tb_005     P   P      P      P      P     -LOST   P      P      P      P
    

    Key finding: rh_008 regresses under every coefficient applied globally. But routing saves it by falling back to the base model when no pattern matches.

    This is exactly what the Share paper predicts: task-specific coefficients improve targeted patterns without interfering with unrelated ones.

    What the Papers Claim vs What We’ve Verified

    Verified

    1. Shared basis via SVD — We extract principal components from 51 adapters. Works.

    2. 76x parameter reduction — 83K coefficient parameters vs 1.6M full LoRA. Verified.

    3. Routing prevents forgetting — Zero regressions with routed inference. The fragile rh_008 koan survives because it falls back to base model.

    4. Positive transfer possible — Result handling improved 40% → 50% with routed coefficients.

    Not Yet Verified

    1. Sequential learning — The core continual learning claim. Train task 1 → eval → train task 2 → eval (verify task 1 still passes). This is next.

    2. UWSH subspace stability — Do different adapter subsets converge to similar subspaces? Grassmann distance measurement needed.

    Next Experiments

    Priority Experiment Target
    High Sequential learning curve No degradation on prior tasks
    High Fix k_alpha=32 (paper recommends) Match paper exactly
    Medium UWSH verification >70% subspace overlap
    Medium Add rank update vectors Full algorithm

    The Architecture

    Day:   Agent attempts to fix Rust errors
           ↓
           Successes and failures logged
           ↓
    Night: Train coefficients (frozen basis)
           ↓
           83K params per task
           ↓
    Eval:  Route to appropriate coefficients
           ↓
           Pattern-matched inference
           ↓
    (repeat)
    

    The key insight: train small, route smart. The shared basis captures common structure. Per-task coefficients specialize without interference.

    Resources


    60% of the way to verifying the papers. Sequential learning is next.

    Part 2 of the Towards Continuous LLM Learning series. View all parts

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 475 words3 min readAbstract

    Five ML Concepts - #15

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #15
    Comments Discord

    References

    Concept Reference
    Perplexity A Neural Probabilistic Language Model (Bengio et al. 2003)
    Catastrophic Forgetting Overcoming Catastrophic Forgetting in Neural Networks (Kirkpatrick et al. 2017)
    Weight Initialization Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010)
    Curse of Dimensionality The Elements of Statistical Learning (Hastie et al. 2009), Chapter 2
    Monitoring & Drift Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift (Rabanser et al. 2019)

    Today’s Five

    1. Perplexity

    A metric for language models that reflects how well the model predicts the next token. Lower perplexity means better predictive performance.

    Perplexity is the exponentiated average negative log-likelihood per token.

    Like a test where lower scores mean you found the answers easier to guess.
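That definition fits in a few lines, assuming you have the model's probability for each correct token (`perplexity` is an illustrative helper):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always assigns probability 0.25 to the correct token
# has perplexity 4: as "surprised" as a uniform 4-way guess.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])  # -> 4.0
```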

    2. Catastrophic Forgetting

    When training on new tasks causes a model to lose performance on previously learned tasks. This is a key challenge in continual learning.

    Techniques like elastic weight consolidation help preserve important weights.

    Like learning a new phone number and forgetting the old one.

    3. Weight Initialization

    The starting values of model weights influence how well training progresses. Poor initialization can cause vanishing or exploding gradients.

    Xavier and He initialization are common strategies for setting initial weights appropriately.

    Like starting a race from a good position instead of stuck in a ditch.

    4. Curse of Dimensionality

    In high-dimensional spaces, data becomes sparse and distances behave differently, making learning harder. Points that seem close in low dimensions can be far apart in high dimensions.

    Feature selection and dimensionality reduction help mitigate this effect.

    Like searching for a friend in a city versus across the entire universe.

    5. Monitoring & Drift Detection

    Continuously tracking model performance and detecting shifts in input data distributions. Production models can degrade silently without proper monitoring.

    Automated alerts help catch problems before they affect users.

    Like a weather station alerting you when conditions change.

    Quick Reference

    Concept One-liner
    Perplexity How surprised the model is by the data
    Catastrophic Forgetting New learning erases old knowledge
    Weight Initialization Starting values affect training stability
    Curse of Dimensionality High dimensions make data sparse
    Monitoring & Drift Track performance and data changes

    Short, accurate ML explainers. Follow for more.

    Part 15 of the Five ML Concepts series. View all parts | Next: Part 16 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 453 words3 min readAbstract

    Five ML Concepts - #14

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #14
    Comments Discord

    References

    Concept Reference
    ROC/AUC An Introduction to ROC Analysis (Fawcett 2006)
    Spurious Correlations Unbiased Look at Dataset Bias (Torralba & Efros 2011)
    Gradient Clipping On the Difficulty of Training Recurrent Neural Networks (Pascanu et al. 2013)
    Loss Landscapes Visualizing the Loss Landscape of Neural Nets (Li et al. 2018)
    Cold Start Addressing Cold Start in Recommender Systems (Schein et al. 2002)

    Today’s Five

    1. ROC / AUC

    ROC curves plot true positive rate against false positive rate across all classification thresholds. AUC (Area Under the Curve) summarizes overall ranking performance in a single number.

    AUC of 0.5 means random guessing; 1.0 means perfect ranking.

    Like judging a student by considering every possible passing grade cutoff.
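AUC also has a direct probabilistic reading: the chance that a randomly chosen positive outranks a randomly chosen negative. A small O(P·N) pairwise sketch (illustrative, not from any library; ties count half):

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

score = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # perfectly ranked -> 1.0
```

Production code uses an O(n log n) rank-based formula, but the pairwise version makes the "ranking performance" interpretation explicit.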

    2. Spurious Correlations

    Coincidental patterns in training data that don’t reflect true relationships. Models that rely on them can fail when the coincidence disappears.

    Dataset curation and diverse evaluation help detect spurious features.

    Like assuming umbrellas cause rain because you always see them together.

    3. Gradient Clipping

    Limiting the size of gradients during backpropagation. This helps prevent exploding gradients and unstable training, especially in recurrent networks.

    Clipping can be by value or by global norm.

    Like putting a speed limit on a car so it doesn’t lose control.
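Clipping by global norm can be sketched as follows (a toy version over a flat list of gradient values; real frameworks apply the same rescaling across all parameter tensors at once):

```python
def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients when their combined L2 norm exceeds max_norm.

    Direction is preserved; only the magnitude is capped.
    """
    total = sum(g * g for g in grads) ** 0.5
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 -> norm 1
```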

    4. Loss Landscapes

    How model error changes across different parameter settings. Training is like navigating this surface toward regions of lower loss.

    Flat minima may generalize better than sharp ones.

    Like hiking through mountains searching for the lowest valley, feeling the slope beneath your feet.

    5. Cold Start Problems

    Difficulty predicting for new users or items with no history. Without prior data, personalization becomes difficult.

    Solutions include content-based features, popularity fallbacks, or asking initial questions.

    Like a librarian trying to recommend books to someone who just walked in.

    Quick Reference

    Concept One-liner
    ROC / AUC Classifier performance across thresholds
    Spurious Correlations Coincidental patterns that don’t generalize
    Gradient Clipping Limit gradient size for stability
    Loss Landscapes Error surface over parameter space
    Cold Start No history for new users/items

    Short, accurate ML explainers. Follow for more.

    Part 14 of the Five ML Concepts series. View all parts | Next: Part 15 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 453 words3 min readAbstract

    Five ML Concepts - #13

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #13
    Comments Discord

    References

    Concept Reference
    Calibration On Calibration of Modern Neural Networks (Guo et al. 2017)
    Shortcut Learning Shortcut Learning in Deep Neural Networks (Geirhos et al. 2020)
    Early Stopping Early Stopping - But When? (Prechelt 1998)
    Universal Approximation Approximation by Superpositions of a Sigmoidal Function (Cybenko 1989)
    Checkpointing Training Deep Nets with Sublinear Memory Cost (Chen et al. 2016)

    Today’s Five

    1. Calibration

    How well a model’s predicted probabilities match real-world outcomes. If a model predicts 70% confidence many times, it should be correct about 70% of those cases.

    Well-calibrated models enable better decision-making under uncertainty.

    Like a weather forecast that predicts rain 30% of the time and is right about 30% of those forecasts.
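One common way to quantify this is expected calibration error: bin predictions by confidence and average the gap between confidence and accuracy. A minimal sketch with equal-width bins (`expected_calibration_error` is an illustrative helper):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Perfectly calibrated toy data: 70% confidence, correct 7 of 10 times.
ece = expected_calibration_error([0.7] * 10, [1] * 7 + [0] * 3)  # -> 0.0
```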

    2. Shortcut Learning

    When models rely on superficial patterns instead of meaningful features. For example, identifying cows by detecting grass and failing when cows appear indoors.

    Shortcuts can inflate benchmark scores while masking poor real-world performance.

    Like passing a test by memorizing answer positions instead of learning the material.

    3. Early Stopping

    Training is stopped when validation performance stops improving. This helps prevent overfitting by halting before the model memorizes training data.

    Patience hyperparameters control how long to wait before stopping.

    Like knowing when to stop practicing before you start reinforcing mistakes.
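The patience rule can be sketched as a check on how stale the best validation loss is (an illustrative helper, not from any framework):

```python
def should_stop(val_losses, patience=3):
    """Stop when the best validation loss is `patience` epochs old."""
    if not val_losses:
        return False
    best_epoch = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return len(val_losses) - 1 - best_epoch >= patience

# Best loss was at epoch 2; three epochs without improvement -> stop.
history = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73]
stop = should_stop(history, patience=3)  # -> True
```

In practice you would also keep the checkpoint from the best epoch, not the last one.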

    4. Universal Approximation

    The theorem stating that neural networks can approximate any continuous function, given enough capacity. In practice, finding the right weights through optimization is the challenge.

    The theorem guarantees existence, not learnability.

    Like having enough Lego blocks to build almost any shape—assembly is still hard.

    5. Checkpointing

    Saving the model’s state during training. This allows recovery from interruptions and comparison across training stages.

    Checkpoints also enable selecting the best model rather than just the final one.

    Like saving your game progress so you can reload if something goes wrong.

    Quick Reference

    Concept One-liner
    Calibration Predicted probabilities match outcomes
    Shortcut Learning Exploiting spurious patterns
    Early Stopping Stop when validation plateaus
    Universal Approximation NNs can approximate any function
    Checkpointing Save model state during training

    Short, accurate ML explainers. Follow for more.

    Part 13 of the Five ML Concepts series. View all parts | Next: Part 14 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 493 words3 min readAbstract

    Five ML Concepts - #12

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #12
    Comments Discord

    References

    Concept Reference
    Precision/Recall The Truth of the F-Measure (Sasaki 2007)
    OOD Detection A Baseline for Detecting Misclassified and Out-of-Distribution Examples (Hendrycks & Gimpel 2017)
    Batch Size On Large-Batch Training for Deep Learning (Keskar et al. 2017)
    Inductive Bias Relational Inductive Biases, Deep Learning, and Graph Networks (Battaglia et al. 2018)
    Latency/Throughput Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al. 2021)

    Today’s Five

    1. Precision vs Recall

    Precision measures how often positive predictions are correct. Recall measures how many actual positives are successfully found. Improving one often reduces the other.

    The tradeoff depends on your application: spam filters favor precision, medical screening favors recall.

    Like a search party: you can find everyone but raise false alarms, or be very certain and miss some people.
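Both metrics follow directly from confusion-matrix counts (a toy sketch; the counts are hypothetical):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 8 true positives, 2 false alarms, 4 missed positives:
p, r = precision_recall(tp=8, fp=2, fn=4)  # p = 0.8, r = 8/12
```

Note the tradeoff in the counts themselves: lowering the decision threshold converts some of the 4 misses into hits, but typically at the cost of more false alarms.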

    2. OOD Inputs (Out-of-Distribution)

    Data that differs significantly from what the model saw during training. Models may fail silently or produce confident but wrong answers.

    Detecting OOD inputs is an active research area for safer AI deployment.

    Like asking a chef trained only in Italian food to make sushi.

    3. Batch Size

    The number of training examples processed before updating model weights. Larger batches can be more efficient computationally, but may generalize worse.

    Finding the right batch size involves balancing speed, memory, and model quality.

    Like grading tests one at a time or waiting to grade a full stack.

    4. Inductive Bias

    The assumptions built into a model that guide how it learns from data. Without inductive bias, models cannot generalize beyond training examples.

    CNNs assume spatial locality; transformers assume tokens can attend to any position.

    Like expecting nearby houses to have similar prices before looking at the data.

    5. Latency vs Throughput

    Latency is how long a single request takes. Throughput is how many requests can be handled per second. Optimizing one often comes at the expense of the other.

    Batching improves throughput but increases latency for individual requests.

    Like a restaurant serving one table quickly or many tables at once.

    Quick Reference

    Concept One-liner
    Precision vs Recall Correct positives vs finding all positives
    OOD Inputs Data unlike training distribution
    Batch Size Examples per weight update
    Inductive Bias Built-in learning assumptions
    Latency vs Throughput Speed per request vs total capacity

    Short, accurate ML explainers. Follow for more.

    Part 12 of the Five ML Concepts series. View all parts | Next: Part 13 →

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1053 words6 min readAbstract

    Neural-Net-RS: An Educational Neural Network Platform

    I wanted a neural network implementation where every step is visible—no framework magic hiding the math. Something I could use to teach the fundamentals, with a CLI for quick experiments and a web UI for visual demonstrations. Claude Code built it.

    This is Personal Software for education: a complete neural network training platform with multiple interfaces, all from a single Rust codebase.

    Resource Link
    Repo neural-net-rs
    Video Neural-Net-RS Explainer
    Comments Discord

    Why Build Your Own Neural Network?

    Frameworks like PyTorch and TensorFlow are production-ready, but they hide the fundamentals. When teaching or learning, you want to see:

    • How weights and biases actually change during training
    • Why XOR needs a hidden layer when AND doesn’t
    • What backpropagation really computes

    Neural-Net-RS exposes all of this. No autograd magic—every gradient is computed explicitly. No tensor abstractions—just matrices with clear row-major storage.

    What Got Built

    A modular Rust workspace with multiple interfaces to the same core:

    neural-net-rs/
    ├── matrix/              # Linear algebra foundation
    ├── neural-network/      # Core ML implementation
    ├── neural-net-cli/      # Command-line interface
    ├── neural-net-server/   # REST API with SSE streaming
    └── neural-net-wasm/     # WebAssembly for browser
    

    One codebase, three ways to interact:

    • CLI: Train from terminal with progress bars
    • Web UI: Visual training with real-time loss charts
    • WASM: Run entirely in browser, no server needed

    The Classic Problems

    The platform includes 8 built-in examples that teach ML concepts progressively:

    Problem Architecture Key Concept
    AND, OR 2→2→1 Linear separability
    XOR 2→3→1 Why hidden layers matter
    Parity3 3→6→1 Scaling non-linearity
    Quadrant 2→8→4 Multi-class classification
    Adder2 4→8→3 Learning arithmetic
    Iris 4→8→3 Real-world dataset
    Pattern3x3 9→6→4 Visual pattern recognition

    The XOR Problem

    XOR is the canonical neural network problem. AND and OR are linearly separable—a single line can divide the outputs. XOR isn’t. You need a hidden layer.

    AND: (0,0)→0, (0,1)→0, (1,0)→0, (1,1)→1  ← One line separates
    XOR: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0  ← No line works
    

    Watch XOR training and you see why neural networks are powerful: they learn to create intermediate representations that make non-linear problems separable.
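The separability claim can be checked mechanically. Here is a small brute-force sketch (illustrative only, not part of the repo): it searches a coarse grid of weights and bias for a single linear threshold unit, finding one for AND and none for XOR.

```rust
// Can one linear threshold unit (w1*x1 + w2*x2 + b > 0) reproduce the
// target truth table? Brute-force a coarse weight grid to find out.
fn separable(targets: [u8; 4]) -> bool {
    let inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)];
    // Weights and bias from -2.0 to 2.0 in steps of 0.1.
    let grid: Vec<f64> = (-20..=20).map(|i| i as f64 / 10.0).collect();
    for &w1 in &grid {
        for &w2 in &grid {
            for &b in &grid {
                let ok = inputs.iter().zip(targets.iter()).all(|(&(x1, x2), &t)| {
                    let fired = w1 * x1 + w2 * x2 + b > 0.0;
                    (fired as u8) == t
                });
                if ok {
                    return true;
                }
            }
        }
    }
    false
}
```

Running `separable([0, 0, 0, 1])` (AND) succeeds, while `separable([0, 1, 1, 0])` (XOR) exhausts the grid without a match.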

    Implementation Details

    Feed-Forward with Backpropagation

    pub struct Network {
        pub layers: Vec<usize>,      // [input, hidden..., output]
        pub weights: Vec<Matrix>,    // Learned connections
        pub biases: Vec<Matrix>,     // Per-neuron offsets
        pub learning_rate: f64,      // Training step size
    }
    

    Forward pass: Each layer computes activation(weights × input + bias)

    Backward pass: Gradients flow backward using the chain rule, updating weights to reduce error.

    The sigmoid activation function maps any input to (0, 1):

    σ(x) = 1 / (1 + e^(-x))
    
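The activation translates directly into code. A sketch (not the crate's exact source) including the derivative form the backward pass relies on:

```rust
// Sigmoid activation: maps any input to (0, 1).
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

// Backprop uses the derivative expressed via the output y = sigmoid(x):
// sigma'(x) = y * (1 - y), so no extra exp() is needed.
fn sigmoid_prime(y: f64) -> f64 {
    y * (1.0 - y)
}
```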

    Custom Matrix Library

    Educational clarity over maximum performance:

    pub struct Matrix {
        rows: usize,
        cols: usize,
        data: Vec<f64>,  // Row-major storage
    }
    

    Operations: dot product, transpose, element-wise multiply, map. Everything visible, nothing hidden.
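Row-major storage makes the index arithmetic explicit. A sketch of element access under that layout (the method name here is illustrative, not necessarily the repo's API):

```rust
// Row-major Matrix access: element (r, c) lives at offset r * cols + c.
pub struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f64>, // row-major storage
}

impl Matrix {
    pub fn get(&self, r: usize, c: usize) -> f64 {
        assert!(r < self.rows && c < self.cols);
        self.data[r * self.cols + c]
    }
}
```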

    Checkpoint System

    Training can be interrupted and resumed:

    # Train for 5000 epochs, save checkpoint
    neural-net-cli train xor --epochs 5000 --checkpoint model.json
    
    # Resume from checkpoint
    neural-net-cli train xor --epochs 10000 --resume model.json
    

    Checkpoints include version metadata to prevent loading incompatible models.
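An illustrative version guard for checkpoint loading (the struct fields and version constant here are assumptions, not the repo's exact schema):

```rust
// Refuse incompatible checkpoints instead of mis-loading them.
struct Checkpoint {
    version: u32,
    epoch: usize,
    weights: Vec<Vec<f64>>,
}

const CHECKPOINT_VERSION: u32 = 1;

fn load_checkpoint(cp: Checkpoint) -> Result<Checkpoint, String> {
    if cp.version != CHECKPOINT_VERSION {
        return Err(format!(
            "incompatible checkpoint version {} (expected {})",
            cp.version, CHECKPOINT_VERSION
        ));
    }
    Ok(cp)
}
```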

    CLI Usage

    # List available examples
    neural-net-cli examples
    
    # Train XOR with progress bar
    neural-net-cli train xor --epochs 10000 --learning-rate 0.5
    
    # Predict with trained model
    neural-net-cli predict model.json --input "0,1"
    
    # Run web UI
    neural-net-cli serve --port 8080
    

    The CLI uses indicatif for real-time progress bars:

    Training XOR [=========>   ] 7500/10000 (75%) Loss: 0.0023
    

    Web Interface

    The server embeds all assets at compile time—one binary serves everything:

    • Training panel: Select problem, set hyperparameters, watch loss decrease
    • Network visualization: See layer structure and connection strengths
    • Prediction panel: Test the trained model interactively
    • Loss chart: Real-time plotting via Server-Sent Events

    Two training modes:

    • Local (WASM): Runs entirely in browser
    • Remote (API): Server-side with streaming progress

    Technology Choices

    Component Purpose
    Rust Performance, safety, single-binary distribution
    Axum Lightweight async web framework
    wasm-bindgen Rust → WebAssembly compilation
    Indicatif Terminal progress bars
    Serde JSON serialization for checkpoints

    The WASM module is ~248KB after optimization.

    Test Coverage

    136+ tests across the workspace:

    • Matrix operations (unit tests)
    • Network training (integration tests)
    • CLI commands (integration tests)
    • Server endpoints (integration tests)
    • WASM bindings (unit tests)

    Zero clippy warnings. Reproducible results via seeded RNG.

    References

    Resource Link
    Backpropagation Learning representations by back-propagating errors (Rumelhart et al. 1986)
    Multi-Layer Perceptron Multilayer perceptron (Wikipedia)
    XOR Problem Perceptrons (Minsky & Papert 1969)
    Weight Initialization Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio 2010)
    Inspired by codemoonsxyz/neural-net-rs

    The Vibe Coding Process

    This project grew through iterative conversation with Claude Code:

    1. “Build a basic neural network in Rust with backpropagation”
    2. “Add a CLI with progress bars”
    3. “Add a web UI with real-time training visualization”
    4. “Compile to WASM so it runs in the browser”
    5. “Add checkpoint save/resume”
    6. “Include classic ML examples with educational documentation”

    Each request built on the previous. The AI handled architecture decisions, chose appropriate crates, and maintained test coverage throughout.


    When you want to understand how neural networks actually work, sometimes you need to see every weight update. That’s what this platform provides—education through transparency.

    Part 4 of the Machine Learning series. View all parts | Next: Part 5 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 919 words · 5 min read · Abstract

    Cat Finder: Personal Software via Vibe Coding

    I needed to find cat photos scattered across my system. Instead of searching the app store, signing up for a cloud service, or uploading my personal photos to someone else’s servers, I asked Claude Code to build me the tool I needed. An hour later, I had it.

    This is Personal Software—software that exists because you needed it, built the way you want it, running entirely under your control.

    Resource Link
    Repo cat-finder
    Video Cat Finder Explainer
    Comments Discord

    The Vibe Coding Approach

    Vibe Coding is about describing what you want and letting AI handle the implementation details. No boilerplate, no Stack Overflow rabbit holes, no fighting with build systems. You focus on the what, the AI handles the how.

    For Cat Finder, the conversation went something like:

    “I want a CLI tool that scans directories for images containing cats. Run locally, no cloud. Use YOLO for detection. Output just the file paths so I can pipe them to other commands.”

    Claude Code chose the tech stack (Rust, YOLOv8n, ONNX Runtime), handled the tensor math, figured out the COCO class IDs, and produced a working tool. I guided the direction; the AI wrote the code.

    Why Personal Software?

    The traditional options for “find cat photos” would be:

    1. Cloud service: Upload photos to Google/Apple/Amazon, let them scan everything, hope they respect your privacy
    2. Desktop app: Find something in an app store, hope it does what you want, deal with subscription fees or ads
    3. Write it yourself: Spend days learning YOLO integration, tensor formats, image preprocessing

    Personal Software offers a fourth path: describe what you need, get exactly that, own the result completely.

    Cat Finder runs entirely on my machine. No accounts, no uploads, no subscriptions, no ads. The code is mine to modify, extend, or share.

    What Got Built

    A Rust CLI tool using YOLOv8n (the nano variant) through ONNX Runtime:

    Directory Traversal → Image Preprocessing → YOLO Inference → Cat Detection → Output
    

    The Detection Pipeline

    1. Walk directories recursively, finding image files (jpg, png, gif, webp, etc.)
    2. Preprocess each image: resize to 640×640, normalize to 0.0-1.0, convert to NCHW tensor format
    3. Run inference through the YOLOv8n ONNX model
    4. Parse output for class ID 15 (cat in COCO ordering) above confidence threshold
    5. Print matching paths to stdout for easy piping to other tools
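The preprocessing step (2) hinges on the NCHW layout. A sketch with pure std types (the real tool uses the `ndarray` crate, but the indexing logic is the same):

```rust
// A 640x640 interleaved-RGB byte image becomes a channel-major f32 buffer:
// all red values first, then all green, then all blue (N=1 batch implied).
const SIZE: usize = 640;

fn nchw_index(c: usize, h: usize, w: usize) -> usize {
    c * SIZE * SIZE + h * SIZE + w
}

fn to_nchw(rgb: &[u8]) -> Vec<f32> {
    assert_eq!(rgb.len(), SIZE * SIZE * 3, "expected interleaved HWC bytes");
    let mut tensor = vec![0.0f32; 3 * SIZE * SIZE];
    for h in 0..SIZE {
        for w in 0..SIZE {
            for c in 0..3 {
                let byte = rgb[(h * SIZE + w) * 3 + c];
                tensor[nchw_index(c, h, w)] = byte as f32 / 255.0; // 0.0-1.0
            }
        }
    }
    tensor
}
```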

    Unix Philosophy

    # stdout: just paths (machine-parseable)
    # stderr: logging and progress
    
    cat-finder ~/Photos | xargs -I {} cp {} ~/CatPhotos/
    

    This separation enables composable Unix pipelines. The tool does one thing well and plays nicely with others.
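In code terms the contract is small. A sketch (function names are illustrative, not the tool's actual code):

```rust
// Matched paths are emitted bare so they pipe cleanly; progress lines
// carry human-readable text and go to stderr, staying out of pipes.
fn match_line(path: &str) -> String {
    path.to_string() // nothing but the path: machine-parseable
}

fn progress_line(scanned: usize, total: usize) -> String {
    format!("scanned {}/{} files", scanned, total)
}

fn report(path: &str, scanned: usize, total: usize) {
    println!("{}", match_line(path));               // stdout: data
    eprintln!("{}", progress_line(scanned, total)); // stderr: logging
}
```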

    Technology Stack

    Component Purpose
    Rust Memory-safe, high-performance core
    YOLOv8n Lightweight object detection (12MB model)
    ONNX Runtime Cross-platform inference engine
    clap CLI argument parsing
    ndarray Tensor operations
    walkdir Recursive directory traversal

    Total footprint: ~80MB (runtime + model + binary)

    I didn’t choose this stack—Claude Code did, based on the requirements. It made good choices.

    Usage

    # Basic usage
    cat-finder ~/Photos
    
    # Adjust confidence threshold
    cat-finder --confidence 0.5 ~/Photos
    
    # Verbose output with timestamps
    cat-finder -v -t ~/Photos
    
    # Copy all cat photos to a new folder
    cat-finder ~/Photos | xargs -I {} cp {} ~/CatAlbum/
    

    Honest About Limitations

    The README documents failure cases transparently:

    Image Type Detection Success
    Clear photographs High
    Artistic/stylized images Low
    Cats in clothing Low
    Small/partial cats Variable
    Low quality/blurry Variable

    Test results: 7 of 9 cat images detected (77.8% recall). Oil paintings and anthropomorphized cats confuse models trained on photographs. This is documented, not hidden.

    Bonus Features

    The project grew organically based on related needs:

    Duplicate Finder: A second binary for finding duplicate images using size-based filtering followed by SHA-256 checksums.

    find-duplicates ~/Photos
    
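The two-stage strategy can be sketched in a few lines. Here std's `DefaultHasher` stands in for SHA-256 (which would need an external crate), and in-memory byte slices stand in for files on disk:

```rust
// Stage 1: bucket by size (cheap). Stage 2: hash contents within a bucket.
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn find_duplicates(files: &[(String, Vec<u8>)]) -> Vec<(String, String)> {
    // Only files with equal sizes can possibly be duplicates.
    let mut by_size: HashMap<usize, Vec<&(String, Vec<u8>)>> = HashMap::new();
    for f in files {
        by_size.entry(f.1.len()).or_default().push(f);
    }
    let mut dups = Vec::new();
    for bucket in by_size.values() {
        let mut seen: HashMap<u64, &str> = HashMap::new();
        for (path, bytes) in bucket.iter().map(|f| (&f.0, &f.1)) {
            let mut h = DefaultHasher::new();
            bytes.hash(&mut h);
            let key = h.finish();
            match seen.get(&key) {
                Some(first) => dups.push((first.to_string(), path.clone())),
                None => {
                    seen.insert(key, path.as_str());
                }
            }
        }
    }
    dups
}
```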

    Web Demo: A Flask-based interface for visual feedback with real-time progress via Server-Sent Events.

    These emerged from “while you’re at it…” requests during development. Vibe coding makes feature additions nearly frictionless.

    Setup

    git clone https://github.com/sw-ml-study/cat-finder
    cd cat-finder
    ./scripts/setup.sh  # Downloads model, builds project
    ./cat-finder ~/Photos
    

    The Personal Software Philosophy

    Privacy-first: All processing happens locally. No cloud APIs, no external services, no data leaving your machine.

    Ownership: The code is yours. Modify it, extend it, share it, delete it.

    Fit-for-purpose: Built for exactly what you need, nothing more, nothing less.

    Transparency: Known limitations documented. No marketing spin.

    References

    Resource Link
    YOLOv8 Ultralytics YOLOv8 - State-of-the-art object detection
    ONNX Runtime ONNX Runtime - Cross-platform inference engine
    ort crate ort - Rust bindings for ONNX Runtime
    COCO Dataset COCO Classes - Class ID 15 = cat

    You don’t always need an app store or a cloud service. Sometimes you just need to describe what you want and let an AI build it for you. That’s vibe coding. That’s personal software.

    Part 1 of the Personal Software series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 508 words · 3 min read · Abstract

    Five ML Concepts - #11

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #11
    Comments Discord

    References

    Concept Reference
    RNN Learning representations by back-propagating errors (Rumelhart et al. 1986)
    Chain of Thought Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)
    Softmax Deep Learning (Goodfellow et al. 2016), Chapter 6
    MoE Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al. 2017)
    Distribution Shift Dataset Shift in Machine Learning (Quiñonero-Candela et al. 2009)

    Today’s Five

    1. RNN (Recurrent Neural Network)

    Networks designed for sequential data that maintain a hidden state carrying information across time steps. This makes them useful for language, time series, and audio.

    LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) are improved variants that better handle long-range dependencies.

    Like reading a story while keeping mental notes about characters and plot as you go.

    2. Chain of Thought

    A prompting technique that encourages step-by-step reasoning in language models. Instead of producing an answer immediately, the model generates intermediate steps.

    This can improve performance on math, logic, and multi-step problems.

    Like showing your work on a math test instead of just writing the final answer.

    3. Softmax

    Converts a vector of scores into a probability distribution where each output falls between zero and one, and all outputs sum to one. It is commonly used in classification models.

    Softmax makes raw scores easier to interpret as probabilities.

    Like turning test scores into percentages that add up to 100%.

    4. MoE (Mixture of Experts)

    Instead of one large network, the model contains many smaller expert networks with a routing mechanism that selects which experts process each input. This allows models to scale capacity while keeping computation efficient.

    Only a subset of experts activates for any given input.

    Like a hospital with specialists where a receptionist directs you to the right doctor.

    5. Distribution Shift

    Occurs when deployment data differs from training data, causing a model trained on one environment to perform poorly in another. Common causes include seasonal changes, user behavior shifts, or new populations.

    Monitoring for drift and retraining helps maintain performance.

    Like a weather model trained on summer data struggling to predict winter storms.

    Quick Reference

    Concept One-liner
    RNN Sequential processing with memory across time
    Chain of Thought Step-by-step reasoning in prompts
    Softmax Scores to normalized probabilities
    MoE Route inputs to specialized experts
    Distribution Shift Training vs deployment data mismatch

    Short, accurate ML explainers. Follow for more.

    Part 11 of the Five ML Concepts series. View all parts | Next: Part 12 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1000 words · 5 min read · Abstract

    RLM: Recursive Language Models for Massive Context

    What happens when your data won’t fit in a context window? RLM expands the workspace instead of cramming everything into limited memory. This post covers the MIT paper, my Rust implementation, and six video demonstrations.

    Resource Link
    Paper arXiv:2512.24601
    Code rlm-project
    Playlist RLM Implementations
    Comments Discord

    The Problem: Context Limits

    Large language models have a hard limit. They can only process so much text at once.

    Imagine a cookie jar that holds 100 cookies. What if you need to search through ten thousand? When you force too much in, the model forgets things—this is called context rot.

    Bigger models help, but the limit always exists. We need a different approach.

    The RLM Solution

    Recursive Language Models flip the problem. Instead of bigger jars, use better tools.

    The data stays in a context box. The model gets tools to peek inside:

    Tool Purpose
    slice Get a character range
    find Search for text
    regex Pattern matching
    count Count occurrences
    llm_query Ask a sub-LLM to analyze a chunk

    Small, focused, deliberate. The model thinks about what it needs, then asks for just that.
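A minimal sketch of what such a tool executor might look like (the enum and output strings here are illustrative, not the project's actual types):

```rust
// An L1-style executor over the out-of-context buffer: the model issues
// one small command at a time and receives a short textual result.
enum Cmd<'a> {
    Slice { start: usize, end: usize }, // get a character range
    Find { needle: &'a str },           // search for text
    Count { needle: &'a str },          // count occurrences
}

fn execute(ctx: &str, cmd: Cmd) -> String {
    match cmd {
        Cmd::Slice { start, end } => {
            let end = end.min(ctx.len());
            let start = start.min(end);
            ctx[start..end].to_string() // byte indices; fine for ASCII text
        }
        Cmd::Find { needle } => match ctx.find(needle) {
            Some(pos) => format!("found at byte {}", pos),
            None => "not found".to_string(),
        },
        Cmd::Count { needle } => ctx.matches(needle).count().to_string(),
    }
}
```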

    The Results

    From the MIT paper—on tasks that don’t fit in context:

    Approach Accuracy
    Standard prompting 0%
    RLM 87-91%

    Results hold across GPT-4, Claude, Llama, Mistral, and Gemini.

    My Implementation: Four Capability Levels

    I built a Rust implementation with four capability levels:

    Level Name Description
    L1 DSL Built-in commands (find, regex, count)
    L2 WASM LLM generates Rust → compiles to WebAssembly sandbox
    L3 CLI LLM generates Rust → compiles to native binary
    L4 LLM Recursive delegation to sub-LLMs

    Each level trades off safety for capability:

    • L1 is instant but limited to predefined operations
    • L2 runs custom code but in a sandboxed environment
    • L3 breaks free for large datasets that would timeout in WASM
    • L4 uses LLM reasoning for semantic analysis

    The Video Series

    Six videos demonstrate RLM in action:

    1. RLM Explained


    The foundational video. Covers the MIT paper, the cookie jar analogy, and benchmark results showing 0% → 91% accuracy improvement.

    Key insight: Expand the workspace, not the context.


    2. War and Peace Demo


    Can AI read all of War and Peace to find a hidden secret? The full text is 3.2 MB with 65,666 lines—way too big for any context window.

    RLM finds “the password to Prince Andrei’s secret vault” in just 2 iterations using only 3,000 tokens. That’s a token savings of nearly 100% compared to sending the full document.


    3. WASM Sandboxing


    What if your LLM could write custom analysis code on the fly? Level 2 demonstrates WebAssembly sandboxing.

    The LLM writes Rust code that compiles to WASM and runs in a secure sandbox. Demos include:

    • Error ranking in logs
    • Response time percentiles
    • Unique IP counting

    Trade-offs: ASCII only, 64MB memory limit, subset of Rust.


    4. Native CLI Binaries


    When 5,000 lines would timeout in WASM, Level 3 breaks free. Native Rust binaries process massive datasets with no limits.

    Four CLI demos:

    • Error ranking: Hash map counts error types
    • Unique IPs: Hash set finds distinct addresses
    • Percentiles: Sort and index for p50/p95/p99
    • Word frequency: Tokenize, filter stop words, count

    5. Detective Mystery Demo


    A murder at the manor. Seven suspects. Dozens of clues. Can an LLM solve it?

    Level 4 delegates reasoning to sub-LLMs. Instead of code execution, the model calls other models to:

    • Analyze witness statements
    • Compare alibis
    • Draw conclusions

    Watch as L4 examines each suspect and identifies the killer.


    6. Large Context Processing


    War and Peace is 3MB—far too large for any context window. This video shows Level 4 extracting noble family relationships from the entire novel.

    The process:

    1. L3 extracts relationship sentences (father, mother, son, daughter…)
    2. L4 analyzes filtered data with sub-LLMs
    3. Final output: structured family trees

    Three million characters → structured family trees in ~90 seconds.


    Architecture

    ┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
    │   Client    │────▶│  RLM Server     │────▶│  Root LLM   │
    │  /visualize │     │  (Rust/Axum)    │     │  (DeepSeek) │
    └─────────────┘     └────────┬────────┘     └─────────────┘
                                 │
                        ┌────────▼─────────┐
                        │ Command Executor │
                        │  slice, find,    │
                        │  regex, count,   │
                        │  llm_query...    │
                        └────────┬─────────┘
                                 │
                  ┌──────────────┼──────────────┐
                  ▼              ▼              ▼
            ┌──────────┐  ┌──────────┐  ┌──────────┐
            │  Ollama  │  │  Ollama  │  │  Ollama  │
            │ (local)  │  │ (remote) │  │ (other)  │
            └──────────┘  └──────────┘  └──────────┘
                  Sub-LM Pool (for llm_query)
    

    Quick Start

    cd rlm-orchestrator
    
    # Configure providers in config.toml
    export DEEPSEEK_API_KEY="your-key"
    
    # Run the server
    cargo run --bin rlm-server
    
    # Open visualizer
    open http://localhost:8080/visualize
    

    Think of it like this:

    • Old way: Dump everything on the table, then dig through the mess
    • RLM way: Use a scoop—grab just the cookies you need

    The key insight is simple: expand the workspace, not the context.


    When context windows aren’t enough, RLM gives your LLM tools to explore. Six videos, four capability levels, one insight: expand the workspace, not the context.

    Part 3 of the Machine Learning series. View all parts | Next: Part 4 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 504 words · 3 min read · Abstract

    Five ML Concepts - #10

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #10
    Comments Discord

    References

    Concept Reference
    CNN ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky et al. 2012)
    Encoder-Decoder Sequence to Sequence Learning with Neural Networks (Sutskever et al. 2014)
    RAG Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al. 2020)
    Few-shot Learning Language Models are Few-Shot Learners (Brown et al. 2020)
    Distillation Distilling the Knowledge in a Neural Network (Hinton et al. 2015)

    Today’s Five

    1. CNN (Convolutional Neural Network)

    Networks designed for image data that use small filters sliding across an image to detect edges, textures, and shapes. Early layers find simple patterns, while deeper layers recognize complex objects.

    CNNs are a foundation of modern computer vision.

    Like scanning a photo with a magnifying glass that learns to recognize patterns at different scales.

    2. Encoder-Decoder

    A model architecture with two parts: the encoder compresses input into a representation, and the decoder generates an output from that representation. This pattern is common in translation, summarization, and speech systems.

    The representation acts as a bottleneck that captures essential information.

    Like summarizing a book into notes, then writing a new version from those notes.

    3. RAG (Retrieval-Augmented Generation)

    Instead of relying only on learned parameters, the model retrieves relevant documents and uses them during generation. This helps ground responses in external information and can reduce hallucinations.

    RAG combines the strengths of retrieval systems and generative models.

    Like an open-book exam where you can look up facts instead of relying purely on memory.

    4. Few-shot Learning

    Adapting behavior from just a few examples provided directly in the prompt. Instead of retraining, the model infers the pattern and applies it to new inputs.

    Zero-shot learning relies only on instructions, without examples.

    Like learning a card game by watching a few hands before playing.

    5. Distillation

    Transferring knowledge from a large teacher model to a smaller student. The student learns to match the teacher’s outputs, not its internal weights.

    This produces models that are smaller and cheaper while retaining much of the original capability.

    Like an apprentice learning by imitating a master’s finished work, not by copying their brain.

    Quick Reference

    Concept One-liner
    CNN Sliding filters for hierarchical image features
    Encoder-Decoder Compress input, then generate output
    RAG Retrieve context before generating
    Few-shot Learning Learn from examples in the prompt
    Distillation Small student mimics large teacher

    Short, accurate ML explainers. Follow for more.

    Part 10 of the Five ML Concepts series. View all parts | Next: Part 11 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1638 words · 9 min read · Abstract

    TBT (3/?): Vector Graphics Games

    Before pixels, there were vectors. This Throwback Thursday explores the evolution of vector graphics gaming—from military radar displays to arcade classics—and my attempt to recreate them in Rust and WebAssembly.

    My First Vector Display: The IBM 2250

    IBM 2250 Graphics Display Unit with light pen, October 1969
    IBM 2250 at Brown University, 1969. Photo credit

    My first encounter with vector graphics was an IBM 2250 Graphics Display Unit—introduced in 1964, costing around $280,000 in period dollars. It connected to an IBM 1130 that acted as a graphics controller for an IBM S/370 mainframe where the graphical applications ran. At that price, nobody was playing games on it—Computer Aided Design was the killer app.

    The 2250’s specifications were impressive for its era:

    Specification Value
    Display 21-inch P39 phosphor CRT
    Resolution 1024 × 1024 addressable points
    Usable area 12” × 12” (square aspect)
    Refresh rate ~40 frames/second
    Input Light pen for direct interaction
    Vector drawing Hardware character generator optional

    The CRT drew lines by steering an electron beam directly—no pixel grid, no rasterization. Just pure geometry traced in phosphor glow. The green P39 phosphor had long persistence, reducing flicker but creating ghostly trails on moving objects.

    The light pen was revolutionary: you could point directly at displayed geometry and the system knew which vector you were touching. Interactive graphics in 1964.

    The Arcade Era

    Vector displays found their way into arcades, where they defined a visual style that’s still recognizable today:

    Game Year Innovation
    Lunar Lander 1979 Physics simulation, thrust/gravity
    Asteroids 1979 Wrap-around space, particle effects
    BattleZone 1980 Green wireframe 3D, first-person tanks
    Tempest 1981 Multi-colored vectors, pseudo-3D depth

    (Note: Pong (1972) was actually a raster game using discrete logic, but its simple geometry makes it a natural fit for vector recreation.)

    Each generation built on the last. White vectors on black screens gave way to green wireframes, then full color. The hardware pushed boundaries that feel primitive now but were revolutionary then.

    The Vectorcade Project

    Vectorcade recreates these mechanics using modern tools:

    • Rust for game logic and rendering
    • WebAssembly for browser deployment
    • wgpu for GPU-accelerated vector rendering
    • Yew for the web frontend

    Multi-Repo Architecture

    The project architecture emerged from a design session with ChatGPT, exploring how to structure a multi-agent development workflow. The result: a DAG of repositories, each with clear ownership boundaries:

    vectorcade-shared/      (Pure Rust API contracts)
        ↓
    vectorcade-fonts/       (Vector font styles)
        ↓
    vectorcade-games/       (Game logic: Pong, Asteroids, etc.)
        ↓
    vectorcade-render-wgpu/ (wgpu + lyon tessellation)
        ↓
    vectorcade-web-yew/     (Yew web shell)
    

    This DAG structure allows parallel development with assigned agent roles:

    Agent Repo Focus
    A vectorcade-shared Core API steward: minimal, stable, pure
    B vectorcade-fonts Font stylist: 3-5 distinct vector styles
    C vectorcade-games Game logic: Pong → Asteroids → Lunar Lander
    D vectorcade-render-wgpu Renderer: lyon tessellation → wgpu triangles
    E vectorcade-web-yew Integrator: UI, mobile controls, PWA

    Each agent works against stable interfaces—the DrawCmd display list and Game trait—so they don’t step on each other.

    The Display List Model

    Games don’t render directly. They emit draw commands that the renderer interprets:

    pub enum DrawCmd {
        Clear { color: Rgba },
        Line(Line2),
        Polyline { pts: Vec<[f32;2]>, closed: bool, stroke: Stroke },
        Text { pos: [f32;2], s: String, size_px: f32, color: Rgba },
        PushTransform(Transform2),
        PopTransform,
    }
    

    This keeps game logic portable. The same Asteroids code can render through wgpu on desktop, WebGPU in browsers, or even a software rasterizer.
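For instance, a game's render step just appends commands to the list (types abridged to the fields the post shows; this is a sketch, not repo code):

```rust
// Emitting a display list for a wireframe "ship" triangle.
#[derive(Clone, Copy)]
struct Rgba(f32, f32, f32, f32);

struct Stroke { width_px: f32, color: Rgba }

enum DrawCmd {
    Clear { color: Rgba },
    Polyline { pts: Vec<[f32; 2]>, closed: bool, stroke: Stroke },
}

fn render_ship(x: f32, y: f32, out: &mut Vec<DrawCmd>) {
    out.push(DrawCmd::Clear { color: Rgba(0.0, 0.0, 0.0, 1.0) });
    out.push(DrawCmd::Polyline {
        pts: vec![[x, y - 10.0], [x - 7.0, y + 10.0], [x + 7.0, y + 10.0]],
        closed: true, // closed loop traces the full hull
        stroke: Stroke { width_px: 1.5, color: Rgba(1.0, 1.0, 1.0, 1.0) },
    });
}
```

The renderer then walks this list and tessellates each polyline, never touching game state.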

    Vector Fonts

    Classic arcade games had distinctive lettering. Vectorcade includes multiple font styles to match:

    Style Look Games
    ATARI Boxy, utilitarian Asteroids, Lunar Lander
    CINEMATRONICS Thin, angular Star Castle
    MIDWAY Slightly rounded Defender
    VECTOR_SCANLINE Broken segments “Beam jitter” effect

    Each font is pure vector geometry—no bitmaps, no texture atlases.

    3D Projection

    BattleZone and Tempest need 3D-to-2D projection. Instead of a full 3D renderer, Vectorcade uses a “2.5D pipeline”:

    pub struct Camera3 {
        pub pos: [f32;3],
        pub yaw: f32,
        pub pitch: f32,
        pub fov_y_rad: f32,
    }
    
    pub fn project_polyline(cam: &Camera3, pts3: &[[f32;3]]) -> Vec<[f32;2]>;
    

    Games maintain 3D geometry; the core projects it to 2D lines. Depth-based brightness gives the classic “farther = dimmer” effect.
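The heart of that pipeline is a perspective divide. A hedged sketch for a camera at the origin looking down -z (yaw/pitch rotation omitted for brevity; not the repo's exact math):

```rust
// Project a single 3D point to normalized 2D device coordinates.
fn project_point(fov_y_rad: f32, aspect: f32, p: [f32; 3]) -> Option<[f32; 2]> {
    let [x, y, z] = p;
    if z >= 0.0 {
        return None; // behind (or at) the camera plane
    }
    let f = 1.0 / (fov_y_rad / 2.0).tan(); // focal scale from vertical FOV
    // Perspective divide: farther points (larger -z) shrink toward center,
    // the same depth value that drives "farther = dimmer" brightness.
    Some([f * x / (aspect * -z), f * y / -z])
}
```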

    Why Rust + WASM?

    The combination solves several problems:

    1. Performance: Games need consistent frame rates; Rust delivers
    2. Portability: Same code runs native and in browsers
    3. Safety: No dangling pointers in the game loop
    4. Modern tooling: Cargo, wasm-pack, Trunk make deployment straightforward

    The wgpu + lyon stack provides cross-platform GPU rendering with proper thick-line support (WebGL’s lineWidth is notoriously inconsistent).

    Current Status

    Component Status
    vectorcade-shared Functional
    vectorcade-fonts Functional
    vectorcade-games Playable (5 demos)
    vectorcade-render-wgpu Functional
    vectorcade-web-yew Functional

    The core architecture works. All five demos are playable in the browser. Polish and audio remain.

    The Demos

    The video showcases five demonstrations, progressing from static display to full gameplay:

    1. IBM 2250 Chessboard

    A static image rendered in the style of the original IBM 2250. The 2250 was mainly used for Computer Aided Design, but programmers did create games on it—this chessboard pays tribute to that era.

    2. Pong (Playable)

    A vector implementation of the classic. The original Pong (1972) wasn’t actually a vector game—it used discrete logic and a raster display—but some clones used vector hardware. This recreation captures the pure-geometry aesthetic.

    3. Asteroids (Playable)

    One of the most popular vector arcade games. Rotate, thrust, and shoot to survive. The ship and asteroids wrap around screen edges, creating the classic “infinite space” feel.

    4. BattleZone (Playable)

    Green wireframe 3D tanks. Drive through a battlefield, shooting enemies and dodging missiles. One of the first games with true 3D perspective—rendered entirely with vectors.

    5. Tempest (Playable)

    The pinnacle of vector arcade hardware. Move around the edge of geometric tubes, shooting enemies that climb up from the depths. Each level changes the tube shape and color scheme.

    Implementation

    Each game implements the same Game trait:

    pub trait Game {
        fn metadata(&self) -> GameMeta;
        fn reset(&mut self, ctx: &mut GameCtx);
        fn update(&mut self, ctx: &mut GameCtx, dt: f32);
        fn render(&mut self, ctx: &mut GameCtx, out: &mut Vec<DrawCmd>);
    }
    

    This makes games drop-in replaceable in the web shell—no renderer changes needed.

    TODO

    The demos are playable but not finished. Remaining work:

    • GPU rendering: Switch from Canvas 2D emulation to actual wgpu GPU rendering [Ed. Completed 2/13]
    • Music and sound effects: Authentic arcade audio
    • More aggressive opponents: AI improvements for challenge
    • Additional levels/difficulties: Progression and replay value
    • More animations: Explosions, transitions, effects


    Before pixels, there were vectors. Vectorcade brings them back—in Rust, for the browser, with phosphor glow optional.

    Credits

    Role Credit
    Director Mike Wright
    Research & Architecture ChatGPT
    vectorcade-shared Claude Code CLI agent
    vectorcade-fonts Claude Code CLI agent
    vectorcade-games Claude Code CLI agent
    vectorcade-render-wgpu Claude Code CLI agent
    vectorcade-web-yew Claude Code CLI agent
    Explainer Video Claude Code
    Blog Post Claude Code

    Timeline: First pass vibe coded in one day (February 12, 2026)

    • First commit: 11:08 AM PST
    • Last commit: 5:08 PM PST
    • Total commits: 52 across 4 repositories
    • WGPU support added February 13, 2026

    References

    IBM 2250 Photo: “HES IBM 2250 Console grlloyd Oct1969” by Gregory Lloyd, October 1969. Brown University Hypertext Editing System (HES) demonstration. Licensed under CC BY-SA 4.0. Used with attribution.

    Part 3 of the Throwback Thursday series. View all parts | Next: Part 4 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 786 words4 min readAbstract

    DyTopo: Dynamic Topology for Multi-Agent AI

    When multiple AI agents work together, how should they communicate? Fixed patterns fail at scale. DyTopo rebuilds the communication graph each round based on what agents need and what they can offer.

    Resource Link
    Video DyTopo
    Paper arXiv:2505.16128
    Code dytopo-rs
    Comments Discord

    The Problem: Fixed Topologies Don’t Scale

    Multi-agent systems need communication patterns. The obvious approaches have problems:

    Topology Problem
    All-to-all Context explosion—every agent reads every message
    Chain Bottlenecks—one slow agent blocks everyone
    Star Single point of failure at the hub

    As agent count grows, fixed topologies either explode in messages or create chokepoints.

    The DyTopo Solution: Dynamic Routing

    DyTopo (Dynamic Topology) solves this by reconstructing the communication graph each round. The key insight: agents know what they need and what they can offer.

    Each round, every agent emits:

    • Query: What information do I need?
    • Key: What can I contribute?

    The router computes semantic similarity between all keys and queries, then builds a sparse directed graph:

    score(sender → receiver) = cosine(sender.key, receiver.query)
    

    High-scoring pairs connect. Low-scoring pairs are ignored. The result: efficient, adaptive communication.
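A minimal sketch of the routing step in Python (the real router lives in `dytopo-router`; the function names and tiny vectors here are illustrative only):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def route(keys, queries, topk=2):
    # For each receiver, keep the top-K senders by key -> query similarity.
    edges = []
    for r, q in enumerate(queries):
        scored = [(cosine(keys[s], q), s) for s in range(len(keys)) if s != r]
        scored.sort(reverse=True)
        edges += [(s, r) for _, s in scored[:topk]]
    return edges
```

With `topk=1`, three agents produce exactly one inbound edge each, and the winning sender is whoever's key best matches the receiver's query.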

    How It Works

    Round N:
      1. Manager broadcasts goal
      2. Each agent produces:
         - Query (what I need)
         - Key (what I offer)
         - Draft (my current contribution)
      3. Router embeds keys and queries
      4. Similarity matrix → sparse graph (top-K per receiver)
      5. Messages flow along edges
      6. Trace written to JSONL
    

    The topology adapts every round. An agent working on parsing might connect to the syntax expert in round 1, then the error-handling expert in round 2.

    The Implementation: Rust, Zero Python

    dytopo-rs is a fully Rust implementation with no Python dependencies:

    Crate Purpose
    dytopo-core Shared types (AgentId, Topology, TraceEvent)
    dytopo-embed Text embedding (hash-based baseline, semantic planned)
    dytopo-router Sparse graph construction
    dytopo-agents Agent implementations
    dytopo-orchestrator Main execution loop
    dytopo-viz DOT export for visualization
    dytopo-cli Command-line interface

    Why Rust?

    1. Zero-cost abstractions for performance-critical embedding/routing
    2. Strong type system catches protocol mismatches at compile time
    3. No Python dependency for baseline demos
    4. Fearless concurrency for future parallelization

    Running the Demo

    cargo run -p dytopo-cli -- demo --rounds 3 --agents 5 --topk 2
    

    This produces:

    • Per-round topology printed to stdout
    • ./traces/trace_*.jsonl for machine-readable analysis
    • DOT files for graph visualization

    Current Status

    Milestone 0 is complete—the system runs end-to-end with stub agents and hash-based embeddings.

    Feature Status
    Core types and traits Done
    Hash embedder (deterministic) Done
    Top-K sparse routing Done
    Stub agents with templates Done
    Orchestrator loop Done
    JSONL tracing Done
    DOT visualization Done

    Planned

    • Semantic embeddings (fastembed/candle)
    • LLM-backed agents (Ollama integration)
    • Inbox summarization for long conversations
    • Evaluation harness comparing topologies

    Key Design Decisions

    Why Hash Embeddings First?

    The baseline uses deterministic hash-based embeddings:

    • Reproducible demos for debugging
    • No external dependencies to download
    • Validates the full pipeline before adding ML complexity

    Semantic embeddings are planned as drop-in replacements.
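A toy version of such a deterministic embedder (bucket counts via SHA-256; the dimension and whitespace tokenization here are illustrative, not `dytopo-embed`'s actual scheme):

```python
import hashlib

def hash_embed(text, dim=8):
    # Deterministic bag-of-words hash embedding: each token hashes to a
    # bucket. Same input -> same vector, with no model download.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.sha256(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec
```

Determinism is the point: traces from two runs of the same demo are byte-identical, which makes pipeline bugs easy to bisect.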

    Why Sparse Graphs?

    Each agent receives at most topk messages per round:

    • Prevents context explosion as agent count grows
    • Makes communication interpretable—you can trace why agents connected
    • Matches the paper’s approach

    Why JSONL Traces?

    Every event is logged to JSONL:

    • Append-only for streaming
    • Line-based for grep/filtering
    • Machine-parseable for analysis tools
    • Human-readable for debugging
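The format itself is trivial to produce and consume; a minimal sketch (the field names here are hypothetical, not the crate's actual trace schema):

```python
import json
import io

def write_event(stream, event):
    # JSONL: one JSON object per line -- append-only, grep-able, parseable.
    stream.write(json.dumps(event) + "\n")

buf = io.StringIO()
write_event(buf, {"round": 1, "edge": ["agent_2", "agent_0"], "score": 0.71})
write_event(buf, {"round": 1, "edge": ["agent_1", "agent_3"], "score": 0.66})
```

Each line round-trips independently, so `grep agent_2 trace.jsonl` and `json.loads` per line both just work.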

    Topology Comparison

    The system supports multiple topology strategies for comparison:

    Strategy Description Use Case
    dynamic DyTopo routing Adaptive, sparse
    fully_connected All-to-all Baseline comparison
    chain Sequential Pipeline tasks
    star Hub-and-spoke Centralized coordination

    What’s Next

    1. LLM Agent Support (Milestone 2)—Replace stubs with real reasoning
    2. Semantic Embeddings (Milestone 1)—Meaningful routing decisions
    3. Evaluation Harness (Milestone 4)—Quantify DyTopo advantages

    Resources


    Dynamic topology lets agents find the right collaborators each round. No context explosion. No bottlenecks. Just efficient, adaptive communication.

    Part 2 of the Machine Learning series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1216 words7 min readAbstract

    Towards Continuous LLM Learning (1): Sleepy Coder - When Fine-Tuning Fails

    What if your AI coding assistant could learn from its mistakes? Not just for one session, but across training cycles. We built exactly that—and fifty-one adapters later, we learned the mistake was trying to teach it at all.

    Resource Link
    Video Sleepy Coder
    Code sleepy-coder
    Share Paper arXiv:2602.06043
    UWSH Paper arXiv:2512.05117
    Part 2 Routing Prevents Forgetting
    Comments Discord

    The Dream: Day/Night Learning

    AI coding agents have a memory problem. They fix a bug today, then make the same mistake next week. Every session starts from the same frozen model. Nothing carries forward.

    The idea was elegant: build an agent that improves overnight.

    DAY CYCLE (Inference)
      Agent attempts to fix Rust compiler errors
      Successes and failures are logged
            ↓
    NIGHT CYCLE (Training)
      Fine-tune on failure patterns using LoRA
      Create specialized adapters
            ↓
    EVAL
      Test against benchmark
      Measure improvement
            ↓
    (repeat)
    

    During the day, the agent works and we log its failures—the error messages, the broken code, and the fixes that worked. Overnight, we fine-tune the model on those failures. Each morning, a new checkpoint should wake up a little better than before.

    We based this on two papers from the Johns Hopkins team (Kaushik, Vaidya, Chaudhari, Chellappa, Yuille):

    1. Share LoRA Subspaces (arXiv:2602.06043) — Learn a shared low-rank basis across tasks, then train only coefficients (76x fewer parameters per task)

    2. UWSH (arXiv:2512.05117) — The Universal Weight Subspace Hypothesis suggests neural networks converge to shared spectral subspaces

    The theory was sound. The implementation worked. The results were devastating.

    The System

    The Sleepy Coder agent runs in a Rust runtime, fixing compiler errors on 30 “koans” (small coding exercises) across 5 error families:

    • Borrow Checker: Ownership and lifetime errors
    • Type Bounds: Missing trait implementations
    • Result Handling: Option/Result conversions
    • Type Mismatches: Incompatible types
    • Missing Items: Undefined functions or modules

    The base model: Qwen2.5-Coder-1.5B-Instruct — small enough to train on a single GPU, capable enough to pass most koans without any fine-tuning.

    The Journey: From Hope to Reality

    Chapter 1: Naive LoRA

    First attempt: standard fine-tuning on failure patterns.

    Metric Before After
    Pass Rate 73.3% 60.0%
    Change -13.3%

    Catastrophic forgetting. The model learned the new patterns but forgot how to do everything else.

    Chapter 2: The Paper Chase

    We found the Share paper promising “continual learning without forgetting.” The UWSH paper provided theoretical backing: neural networks naturally converge to shared low-rank subspaces.

    Key insight from Share:

    Train ONLY the coefficients. Keep the basis FROZEN.

    This meant ~21,000 trainable parameters instead of ~1.6 million. A 76x reduction.
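A sketch of where the reduction comes from, assuming a frozen rank-r basis pair (U, V) per layer: the task update is a linear combination of basis components, so only the r coefficients train. For a d×d weight, full LoRA trains about 2·d·r numbers; coefficient-only trains r. Pure-Python loops for clarity, names illustrative:

```python
def delta_w(basis_U, basis_V, coeffs):
    # Share-style update with a FROZEN basis and trainable coefficients c:
    # delta_W[i][j] = sum_k c[k] * U[i][k] * V[k][j]
    rows, r = len(basis_U), len(coeffs)
    cols = len(basis_V[0])
    out = [[0.0] * cols for _ in range(rows)]
    for k in range(r):
        for i in range(rows):
            for j in range(cols):
                out[i][j] += coeffs[k] * basis_U[i][k] * basis_V[k][j]
    return out
```

Only `coeffs` receives gradients; `basis_U` and `basis_V` stay fixed across tasks.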

    Chapter 3: The Proper Implementation

    SVD: Singular Value Decomposition breaks a matrix into components that reveal its underlying structure. In Share, SVD finds the common “directions” that multiple LoRA adapters share—a compressed basis that captures what they have in common.

    We rebuilt everything:

    • Phase 1: Extract shared basis from 51 adapters via SVD
    • Phase 2: Train only coefficient vectors (frozen basis)
    • Phase 3: Merge and update basis periodically

    We trained 51 pattern-specific adapters. We followed the algorithm precisely.
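The Phase 1 idea can be sketched with NumPy: stack the adapters' update matrices, take a truncated SVD, and keep the leading left singular vectors as the shared basis that is then frozen. Sizes and names here are illustrative:

```python
import numpy as np

# Synthetic adapters that genuinely share a rank-4 structure.
rng = np.random.default_rng(0)
shared = rng.standard_normal((16, 4))
adapters = [shared @ rng.standard_normal((4, 16)) for _ in range(6)]

# Phase 1: stack updates and extract the shared left basis via SVD.
stacked = np.hstack(adapters)                  # (16, 6*16)
U, S, _ = np.linalg.svd(stacked, full_matrices=False)
basis = U[:, :4]                               # shared basis, frozen afterwards

# Each adapter is well-approximated inside the shared subspace.
proj = basis @ (basis.T @ adapters[0])
err = np.linalg.norm(proj - adapters[0]) / np.linalg.norm(adapters[0])
```

When the adapters really do share a subspace (as UWSH predicts), the reconstruction error after projecting onto the basis is tiny.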

    Chapter 4: The Stubborn Seven

    No matter what we tried, 7 tasks kept failing:

    Task The Problem
    bc_003 Mutable borrow while immutable exists
    bc_005 Double mutable borrow
    bc_010 Returning reference to local data
    tb_002 Missing Clone trait
    tb_007 Missing Hash trait
    tb_008 Missing Ord trait
    rh_004 Option to Result conversion

    These require deep understanding of Rust’s ownership system—something a 1.5B model can’t reliably learn.

    Chapter 5: The Final Score

    Approach Pass Rate vs Baseline Regressions
    Baseline (no training) 73.3% 0
    Naive LoRA 60.0% -13.3% Many
    Targeted LoRA (7 patterns) 63.3% -10% 4+
    Replay buffer 70.0% -3.3% 2
    Phase 2 coef-only (10K params) 66.7% -6.6% 2
    Share Full (Ph2+Ph3) 73.3% 0% 0

    The Share algorithm did exactly what it claimed: it prevented forgetting. But it couldn’t improve beyond baseline because there was nothing to improve.

    What Went Wrong

    1. The Model Already Knows

    The base model already passes 73% of patterns. Training on these patterns doesn’t add knowledge—it dilutes what’s there.

    2. Training Causes Forgetting

    Even training only on the 7 failure patterns (44 examples) caused 4 new regressions. The model’s knowledge is interconnected.

    3. Averaging Destroys Specialization

    The Share paper assumes task routing at inference—selecting the right coefficients for each task. We averaged coefficients, which negated any specialization.

    4. More Adapters Made It Worse

    Adapter Count Pass Rate
    6 adapters 73.3%
    51 adapters 70.0%

    More adapters meant more subspace dilution when averaging. The signal got lost in the noise.

    The Critical Insight

    LoRA fine-tuning cannot improve a capable base model for tasks it already handles reasonably well.

    The model’s knowledge is interconnected. Even 10,000 trainable parameters (0.0007% of the model) can break things. The baseline represents the ceiling, not the floor.

    What We Learned

    1. Read the room. If your base model passes 73%, maybe it doesn’t need fine-tuning. Maybe it needs better prompts.

    2. Negative results are results. 51 failed experiments taught us more than a successful one would have.

    3. Catastrophic forgetting is real. Small models especially can’t absorb new knowledge without losing old.

    4. Share prevents forgetting, not ignorance. The algorithm does what it claims—it just can’t create knowledge from nothing.

    5. Sometimes the answer is “don’t.” The best LoRA adapter for this task is no adapter.

    6. Task routing vs averaging matters. The Share paper assumes you select coefficients based on task type, not blend them together.

    7. AI coding agents cut corners. When implementing research papers, AI agents repeatedly stopped before completing all phases of the algorithm. I had to direct the agent to re-read the papers many times before it implemented them correctly.

    Paths Forward

    Since fine-tuning doesn’t work here, alternatives:

    Approach Tradeoff
    Prompt engineering No weight changes, limited by context
    Multi-turn repair Uses base model reasoning, slower
    Larger model (7B+) More capacity to absorb knowledge
    Task routing with Share Select coefficients, don’t average
    Model ensemble Multiple models, pick best output
    Accept baseline 73% may be good enough

    The Numbers

    Experiments run:        51 adapters, multiple algorithms
    Parameters trained:     From 10K to 1.6M per adapter
    Best achieved:          73.3% (matches baseline)
    Target:                 ≥76.7%
    Conclusion:             Target not achievable with LoRA
    

    Resources


    Sometimes the most valuable research shows what doesn’t work. Fifty-one adapters later, we know: let sleeping models lie.

    Part 1 of the Towards Continuous LLM Learning series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 475 words3 min readAbstract

    Five ML Concepts - #9

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #9
    Comments Discord

    References

    Concept Reference
    Dropout Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
    RLHF Training language models to follow instructions with human feedback (Ouyang et al. 2022)
    Inference Deep Learning (Goodfellow et al. 2016), Chapter 5
    Quantization A Survey of Quantization Methods for Efficient Neural Network Inference (Gholami et al. 2021)
    Flash Attention FlashAttention: Fast and Memory-Efficient Exact Attention (Dao et al. 2022)

    Today’s Five

    1. Dropout

    A regularization technique that randomly disables units during training. This encourages the network to rely on multiple pathways instead of memorizing patterns.

    It helps reduce overfitting, especially in large models.

    Like training a team where random members sit out each practice, so no one becomes a single point of failure.
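A minimal sketch of inverted dropout, the standard formulation: survivors are rescaled by 1/(1-p) so expected activations match evaluation time, when dropout is disabled:

```python
import random

def dropout(xs, p=0.5, training=True, seed=None):
    # Zero each unit with probability p during training; scale survivors
    # by 1/(1-p) so the expected activation is unchanged.
    if not training or p == 0.0:
        return list(xs)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]
```

At inference (`training=False`) the layer is the identity, which is why the training-time rescaling matters.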

    2. RLHF (Reinforcement Learning from Human Feedback)

    A training approach where humans rank or compare model outputs to produce a reward signal. The model is then optimized to better match human preferences.

    This technique is central to aligning language models with human intent.

    Like teaching by grading essays instead of dictating every word.

    3. Inference

    The process of running a trained model to make predictions on new data. Training updates the model’s parameters; inference uses them.

    The distinction matters for optimization, deployment, and cost.

    Like the difference between studying for an exam and actually taking it.

    4. Quantization

    Reducing the numerical precision used to store and compute model weights. This can shrink model size and speed up inference, sometimes with a small accuracy tradeoff.

    Essential for deploying large models on limited hardware.

    Like compressing a high-resolution photo into a smaller file that still looks good.
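A sketch of symmetric linear quantization to signed integers (one global scale for simplicity; production schemes use per-channel scales and calibration data):

```python
def quantize(xs, bits=8):
    # Map floats to signed ints in [-qmax, qmax], remembering the scale.
    qmax = 2 ** (bits - 1) - 1          # 127 for int8
    scale = max(abs(x) for x in xs) / qmax or 1.0
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; error is bounded by scale / 2 per value.
    return [v * scale for v in q]
```

Round-tripping loses a little precision (the "small accuracy tradeoff") but each value now fits in one byte instead of four.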

    5. Flash Attention

    An optimized attention algorithm designed to reduce memory usage. It avoids materializing the full attention matrix by computing attention in blocks.

    This enables longer sequences and faster training.

    Like reading a book chapter by chapter instead of photocopying the whole thing first.
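The block-wise trick can be sketched with an online softmax: keep a running max and denominator so the full row of attention scores is never materialized. One query row, pure Python, illustrative only (the real algorithm also tiles for GPU memory hierarchy):

```python
import math

def blocked_attention(q, keys, values, block=2):
    # Online softmax over key/value blocks: rescale the running output and
    # denominator whenever a new block raises the running max.
    m, denom = float("-inf"), 0.0
    out = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        ks = keys[start:start + block]
        vs = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        new_m = max(m, max(scores))
        corr = math.exp(m - new_m) if m != float("-inf") else 0.0
        denom *= corr
        out = [o * corr for o in out]
        for s, v in zip(scores, vs):
            w = math.exp(s - new_m)
            denom += w
            out = [o + w * vi for o, vi in zip(out, v)]
        m = new_m
    return [o / denom for o in out]
```

The result is exact, matching full-matrix softmax attention to floating-point precision; only the memory footprint changes.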

    Quick Reference

    Concept One-liner
    Dropout Random disabling to prevent overfitting
    RLHF Learn from human preference comparisons
    Inference Using a trained model for predictions
    Quantization Lower precision for smaller, faster models
    Flash Attention Block-wise attention for memory efficiency

    Short, accurate ML explainers. Follow for more.

    Part 9 of the Five ML Concepts series. View all parts | Next: Part 10 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 482 words3 min readAbstract

    Five ML Concepts - #8

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #8
    Comments Discord

    References

    Concept Reference
    Bias-Variance The Elements of Statistical Learning (Hastie et al. 2009), Chapter 7
    Diffusion Denoising Diffusion Probabilistic Models (Ho et al. 2020)
    KV Cache Fast Transformer Decoding: One Write-Head is All You Need (Shazeer 2019)
    Mixed Precision Mixed Precision Training (Micikevicius et al. 2017)
    MLA DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (DeepSeek-AI 2024)

    Today’s Five

    1. Bias-Variance Tradeoff

    A fundamental tension where simpler models tend to underfit (high bias), and more flexible models can overfit (high variance). The goal is finding a balance that generalizes well to unseen data.

    One of the oldest ideas in machine learning, still relevant today.

    Like choosing between a ruler that only draws straight lines and one so flexible it traces every bump.

    2. Diffusion Models

    A generative approach that trains a model to reverse a gradual noising process. During generation, the model starts from noise and removes it step by step.

    The foundation of image generators like Stable Diffusion and DALL-E.

    Like learning to restore a photo by practicing on progressively more damaged versions.

    3. KV Cache

    A technique that stores attention key and value tensors from earlier tokens so they don’t need to be recomputed during generation. This significantly speeds up autoregressive inference.

    Essential for efficient LLM serving.

    Like keeping notes from earlier in a conversation instead of rereading everything.
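A toy sketch of the cache's role in a decode loop (real caches hold per-layer, per-head tensors; the point is that each step appends once instead of recomputing all earlier keys and values):

```python
def decode_step(cache, new_k, new_v):
    # Append this token's key/value once; attention for the new token then
    # reads the whole cache rather than recomputing K and V from scratch.
    cache["k"].append(new_k)
    cache["v"].append(new_v)
    return len(cache["k"])    # number of entries attention now reads

cache = {"k": [], "v": []}
sizes = [decode_step(cache, [0.1 * t], [0.2 * t]) for t in range(4)]
```

Per-step work for the projections becomes O(1) in sequence length; only the attention read grows with the cache.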

    4. Mixed Precision

    A training strategy that uses lower-precision math for most operations, while keeping some calculations in higher precision for stability. This reduces memory use and often speeds up training with little accuracy loss.

    Standard practice for modern deep learning.

    Like drafting in pencil and only using ink for the final signature.

    5. MLA (Multi-head Latent Attention)

    An attention variant that compresses key and value information into a lower-dimensional latent space. This reduces memory usage for long sequences while retaining useful context.

    Used in DeepSeek-V2 and related architectures.

    Like summarizing meeting notes instead of recording every word verbatim.

    Quick Reference

    Concept One-liner
    Bias-Variance Balance underfitting vs overfitting
    Diffusion Generate by learning to denoise
    KV Cache Store past keys/values for fast inference
    Mixed Precision Lower precision for speed, higher for stability
    MLA Compress attention into latent space

    Short, accurate ML explainers. Follow for more.

    Part 8 of the Five ML Concepts series. View all parts | Next: Part 9 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1038 words6 min readAbstract

    Deepseek Papers (3/3): Engram Revisited - From Emulation to Implementation

    We started by training models to act like they had memory. Then we found an open source implementation that does it for real. This is what we learned.

    Resource Link
    Paper arXiv:2601.07372
    Our Code engram-poc
    Reference weagan/Engram
    Video Engram Revisited
    Playlist All Engram Videos
    Comments Discord

    The Journey

    Phase 1: Behavioral Emulation

    Part 2 described our first approach: LoRA fine-tuning to make a model behave like it has memory. Train on patterns, and the model learns to respond consistently.

    Metric Baseline LoRA-tuned
    Accuracy 8.6% 14.1%
    Improvement - +63% relative

    It worked, but the architecture was unchanged. We were approximating Engram benefits, not implementing them.

    Phase 2: The Discovery

    Then we found weagan/Engram on GitHub—real hash-based memory in ~300 lines of Python:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EnhancedEngramModule(nn.Module):
        def __init__(self, table_size=50000, d_model=512):
            super().__init__()
            # Large learnable memory table
            self.memory_table = nn.Parameter(torch.zeros(table_size, d_model))
    
            # Gate decides when to trust memory
            self.gate = nn.Sequential(
                nn.Linear(d_model * 2, d_model),
                nn.ReLU(),
                nn.Linear(d_model, 1),
                nn.Sigmoid()
            )
    
        def forward(self, hidden_states, input_ids):
            # O(1) hash lookup (multi_head_hash is defined elsewhere in the repo)
            indices = self.multi_head_hash(input_ids)
            retrieved = F.embedding(indices, self.memory_table)
    
            # Gated injection
            gate_score = self.gate(torch.cat([hidden_states, retrieved], dim=-1))
            return hidden_states + gate_score * retrieved
    

    The key insight: the gate decides when to trust the lookup. Not every token needs memory.

    Phase 3: Integration with HuggingFace

    We ported the module to work with HuggingFace models:

    SmolLM-135M (frozen)
            ↓
    EnhancedEngramModule (per layer)
      - 50K slot memory table
      - O(1) hash-based lookup
      - Learned gating
            ↓
    Output
    

    Evidence that it works—per-token lookup cost stays flat regardless of sequence length:

    Sequence Length Lookup Time Expected if O(n)
    64 tokens 0.15 ms -
    2048 tokens 2.77 ms 4.8 ms

    Scaling well below the O(n) projection is consistent with a constant-time per-token hash lookup.

    The Reality Check

    Here’s where it gets interesting. Real Engram memory excels at some tasks and hurts others.

    Where Engram Helps

    Task Type Baseline Engram Change
    Acronym expansion 25% 75% +200%
    Element symbols 33% 67% +103%
    Long-term fact recall 90% 100% +11%

    For exact-match lookups with structured keys, Engram dominates.

    Where Engram Hurts

    Task Type Baseline Engram Change
    World capitals 83% 67% -19%
    Pattern completion 14% 11% -21%

    For tasks where the base model already knows the answer, Engram’s hash collisions add noise.

    The Key Insight

    Engram is a specialized tool, not a general enhancement.

    Use Engram For Don’t Use Engram For
    FAQ responses Creative generation
    Terminology lookup Novel combinations
    Entity facts Context-dependent answers
    Code boilerplate Reasoning tasks

    The gating mechanism is critical: it must learn to suppress memory when it doesn’t help. Without proper gating, hash collisions inject noise into every token.

    Obstacles Encountered

    1. Hash Collisions

    Different inputs can map to the same memory slot. The gate must learn to ignore irrelevant retrievals.

    2. Parameter Explosion

    50K slots × 768 dimensions × 30 layers = 1.2B additional parameters. We had to inject selectively (every 4th layer) to stay practical.

    3. Training Dynamics

    Memory tables start at zero. They need higher learning rates (10x) to develop meaningful representations before the model learns to use them.

    4. Evaluation Mismatch

    Our pattern completion task wasn’t ideal for hash-based memory. Engram shines on exact-match retrieval, not generalization.

    Combined Approach

    The best results came from combining both methods:

    Base Model (SmolLM-135M)
            ↓
    EnhancedEngramModule
      - Long-term fact storage
      - O(1) lookup for known patterns
            ↓
    LoRA Adapters
      - Pattern completion
      - Domain-specific behaviors
            ↓
    Output
    

    This gives you:

    • Long-term memory from hash tables
    • Pattern consistency from behavioral training
    • Flexibility to disable either component

    What We Learned

    1. Emulation vs Implementation: LoRA fine-tuning approximates memory behavior; hash tables implement it. Both have their place.

    2. Gating is Essential: The learned gate prevents hash collisions from degrading performance. Never use Engram without gating.

    3. Match Task to Tool: Hash-based memory excels at exact lookups, not pattern generalization. Use it where applicable.

    4. Selective Application: Don’t inject Engram everywhere. Target layers and use cases where it helps.

    5. The Gate as a Safety Valve: When the gate learns to output near-zero for a task, that’s the model telling you Engram doesn’t help there. Listen to it.

    Resources

    Series Recap

    Part Topic Key Insight
    1 mHC Doubly-stochastic constraints bound signal amplification
    2 Engram Intro O(1) lookup beats recomputing through attention
    3 Engram Revisited Use Engram where applicable; gate to avoid worse results

    Hash-based memory is powerful but specialized. The gate decides when to use it—and when not to.

    Part 3 of the Deepseek Papers series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 474 words3 min readAbstract

    Five ML Concepts - #7

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #7
    Comments Discord

    References

    Concept Reference
    Cross-Validation A Study of Cross-Validation and Bootstrap (Kohavi 1995)
    GPT Language Models are Unsupervised Multitask Learners (Radford et al. 2019)
    GQA GQA: Training Generalized Multi-Query Transformer Models (Ainslie et al. 2023)
    Context Window Attention Is All You Need (Vaswani et al. 2017)
    Self-Attention Attention Is All You Need (Vaswani et al. 2017)

    Today’s Five

    1. Cross-Validation

    A technique that splits data into multiple folds to evaluate model performance on data it wasn’t trained on. By rotating which data is held out, it gives a more reliable estimate of generalization.

    Essential for honest model evaluation.

    Like practicing with different sets of flashcards to see if you actually learned the material.
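A minimal k-fold index splitter, the core of what libraries like scikit-learn provide (fold sizes differ by at most one when k doesn't divide n):

```python
def kfold(n, k):
    # Yield (train_indices, test_indices) for k folds over n samples,
    # rotating which contiguous slice is held out.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

Every sample lands in the test set exactly once, which is what makes the averaged score an honest generalization estimate.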

    2. GPT

    Generative Pre-trained Transformer. A family of autoregressive language models trained to predict the next token in a sequence.

    Many AI assistants and chatbots are built on this approach.

    Like autocomplete, but scaled up and trained on vast text data.

    3. GQA (Grouped Query Attention)

    An attention variant where multiple query heads share key and value projections. This reduces memory usage and can speed up inference compared to standard multi-head attention.

    Widely adopted in efficient transformer architectures.

    Like several students sharing one set of notes instead of copying everything separately.
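The sharing pattern is just a head-to-group mapping (a sketch; real implementations share the K/V projection weights per group rather than computing an index):

```python
def gqa_groups(num_query_heads, num_kv_heads):
    # Map each query head to the key/value head its group shares.
    group = num_query_heads // num_kv_heads
    return [h // group for h in range(num_query_heads)]
```

With 8 query heads and 2 K/V heads, the KV cache shrinks 4x: only 2 key/value sets are stored instead of 8.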

    4. Context Window

    The maximum number of tokens a model can process in a single forward pass. Larger context windows allow longer inputs, but increase memory and compute costs.

    A key constraint in language model design.

    Like the size of a desk that limits how many papers you can spread out at once.

    5. Self-Attention

    A mechanism where each token computes attention scores with other tokens in the same sequence. This lets the model weigh which parts of the input are most relevant to each position.

    The core operation inside transformers.

    Like everyone in a meeting deciding who to listen to based on the conversation.
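A toy single-head self-attention with identity Q/K/V projections (real layers learn those projections and divide scores by the square root of the head dimension):

```python
import math

def self_attention(x):
    # Each token's output is a softmax-weighted average of all tokens,
    # weighted by dot-product similarity.
    def softmax(row):
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        return [v / z for v in e]

    out = []
    for q in x:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in x])
        out.append([sum(w * v[j] for w, v in zip(scores, x))
                    for j in range(len(x[0]))])
    return out
```

Each output row mixes the whole sequence, with the mixing weights decided by the content itself rather than by position.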

    Quick Reference

    Concept One-liner
    Cross-Validation Rotate held-out data for reliable evaluation
    GPT Predict next token, at scale
    GQA Shared keys/values for efficient attention
    Context Window How much the model sees at once
    Self-Attention Each token attends to all others

    Short, accurate ML explainers. Follow for more.

    Part 7 of the Five ML Concepts series. View all parts | Next: Part 8 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 496 words3 min readAbstract

    Five ML Concepts - #6

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #6
    Comments Discord

    References

    Concept Reference
    Regularization Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
    BERT BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018)
    RoPE RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al. 2021)
    Prompting Language Models are Few-Shot Learners (Brown et al. 2020)
    Positional Encoding Attention Is All You Need (Vaswani et al. 2017)

    Today’s Five

    1. Regularization

    Techniques that reduce overfitting by adding constraints or penalties during training. Common examples include L2 weight decay, L1 sparsity, dropout, and early stopping.

    The goal is better generalization, not just fitting the training set.

    Like adding friction so a model can’t take the easiest overfit path.

    2. BERT

    Bidirectional Encoder Representations from Transformers. A transformer encoder trained with masked language modeling: predicting hidden tokens using context from both sides.

    It was a major step forward for many NLP tasks after its 2018 release.

    Like filling in blanks by reading the whole sentence, not just the words before it.

    3. RoPE (Rotary Positional Embeddings)

    A way to represent token position inside attention by rotating query and key vectors as a function of position. This gives attention information about relative order and distance.

    It’s widely used in modern transformer models.

    Like turning a dial differently for each position so the model can tell where tokens are.
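A sketch of the rotation, illustrating the key property: dot products between rotated queries and keys depend only on relative position, not absolute position (frequencies follow the usual base-10000 schedule):

```python
import math

def rope(vec, position, base=10000.0):
    # Rotate each (even, odd) pair of dimensions by a position-dependent
    # angle; lower dimensions rotate faster than higher ones.
    out = list(vec)
    for i in range(0, len(vec), 2):
        theta = position / (base ** (i / len(vec)))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

Shifting both positions by the same offset leaves the query-key dot product unchanged, which is exactly the relative-position behavior attention needs.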

    4. Prompting

    Crafting inputs to steer a model toward the output you want. Small changes in instructions, examples, or format can change behavior significantly.

    A key skill for working effectively with language models.

    Like asking a question in just the right way to get a useful answer.

    5. Positional Encoding

    Transformers need a way to represent token order, because attention alone doesn’t include sequence position. Different methods do this, including learned embeddings and rotary approaches like RoPE.

    Without it, “the cat sat on the mat” would be indistinguishable from “mat the on sat cat the.”

    Like numbering the pages of a shuffled book so you can read them in order.

    Quick Reference

    Concept One-liner
    Regularization Add constraints to prevent overfitting
    BERT Bidirectional masked language modeling
    RoPE Position info via rotation in attention
    Prompting Craft inputs to steer model outputs
    Positional Encoding Tell the model where tokens are in sequence

    Short, accurate ML explainers. Follow for more.

    Part 6 of the Five ML Concepts series. View all parts | Next: Part 7 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 498 words3 min readAbstract

    Five ML Concepts - #5

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #5
    Comments Discord

    References

    Concept Reference
    Perceptron The Perceptron: A Probabilistic Model (Rosenblatt 1958)
    Pre-training BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al. 2018)
    Speculative Decoding Fast Inference from Transformers via Speculative Decoding (Leviathan et al. 2022)
    ICL Language Models are Few-Shot Learners (Brown et al. 2020)
    Latent Space Auto-Encoding Variational Bayes (Kingma & Welling 2013)

    Today’s Five

    1. Perceptron

    The simplest neural network: a single linear unit with weights and a bias. It computes a weighted sum and applies a threshold or activation.

    It inspired modern neural networks, even though today’s models are far more complex.

    Like a single voter weighing inputs before deciding yes or no.
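The whole unit fits in a few lines (a sketch, configured here as an AND gate, one of the classic perceptron-learnable functions):

```python
def perceptron(inputs, weights, bias):
    """A single linear unit: weighted sum plus bias, then a hard threshold."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Weights and bias chosen so the unit fires only when both inputs are 1.
w, b = [1.0, 1.0], -1.5
assert perceptron([1, 1], w, b) == 1
assert perceptron([1, 0], w, b) == 0
```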

    2. Pre-training

    Training a model on a large, general dataset before adapting it to a specific task. This gives the model broad patterns that later training can refine.

    BERT, GPT, and most modern LLMs use this approach.

    Like going to medical school before choosing a specialty.

    3. Speculative Decoding

    A technique where a small, fast model proposes tokens, and a larger model verifies or rejects them in parallel. This can speed up inference without changing final outputs.

    A key optimization for production LLM deployments.

    Like a junior writer drafting text for a senior editor to approve in batches.
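A toy sketch of the accept/verify loop (greedy version; real systems verify the whole draft in one batched forward pass and use probabilistic acceptance, so the calls below stand in for model invocations):

```python
def speculative_step(draft_next, target_next, context, k=4):
    """Greedy speculative decoding sketch: the draft proposes k tokens, the
    target accepts the longest matching prefix, then supplies one token of
    its own -- so the output always matches pure target decoding."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        if target_next(ctx) == t:   # the target would have chosen the same token
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # the target's own next token
    return accepted

# Toy models: the draft always guesses "a"; the target spells out "aab".
target_text = "aab"
draft = lambda ctx: "a"
target = lambda ctx: target_text[len(ctx)]
assert speculative_step(draft, target, [], k=2) == ["a", "a", "b"]
```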

    4. In-Context Learning (ICL)

    When a model adapts its behavior using examples in the prompt, without updating its weights. It allows flexible task behavior at inference time.

    This emergent capability surprised researchers when GPT-3 demonstrated it.

    Like solving a new puzzle after seeing a few worked examples.

    5. Latent Space

    The internal representations a model learns as it processes data. In this space, similar inputs tend to be located near each other.

    It’s not a literal place, but a useful way to think about how models organize information.

    Like a map where cities are arranged by similarity instead of geography.

    Quick Reference

    Concept One-liner
    Perceptron Single linear unit—the neural network ancestor
    Pre-training Learn general patterns before specializing
    Speculative Decoding Draft fast, verify in parallel
    ICL Adapt from prompt examples without training
    Latent Space Internal representations where similar things cluster


    Short, accurate ML explainers. Follow for more.

    Part 5 of the Five ML Concepts series. View all parts | Next: Part 6 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 458 words3 min readAbstract

    Five ML Concepts - #4

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #4
    Comments Discord

    References

    Concept Reference
    Activation Functions Deep Learning (Goodfellow et al. 2016), Chapter 6
    Transfer Learning A Survey on Transfer Learning (Pan & Yang 2010)
    VLM Learning Transferable Visual Models (CLIP) (Radford et al. 2021)
    Adam Adam: A Method for Stochastic Optimization (Kingma & Ba 2014)
    Superposition Toy Models of Superposition (Elhage et al. 2022)

    Today’s Five

    1. Activation Functions

    Functions like ReLU, sigmoid, and tanh that transform neuron outputs. They introduce nonlinearity, allowing networks to learn complex patterns beyond simple linear relationships.

    Without them, stacking layers would just be matrix multiplication.

    Like an on-off switch that can also dim the lights.
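The three named above, sketched directly from their definitions:

```python
import math

def relu(x):
    """Pass positives through, clamp negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Squash any real number into (-1, 1)."""
    return math.tanh(x)

# All three are nonlinear, which is what lets stacked layers model
# more than a single matrix multiply could.
assert relu(-2.0) == 0.0 and relu(3.0) == 3.0
assert abs(sigmoid(0.0) - 0.5) < 1e-12
assert abs(tanh(0.0)) < 1e-12
```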

    2. Transfer Learning

    Using knowledge a model learned on one task to improve performance on a related task. This often reduces training time and data requirements dramatically.

    Pre-trained models can be fine-tuned for specific applications.

    Like a chef who already knows French cooking learning Japanese cuisine faster.

    3. VLM (Vision-Language Models)

    Models trained to work with both images and text. They learn shared representations that connect visual and language understanding.

    CLIP, GPT-4V, and LLaVA are examples of this approach.

    Like someone who can look at a photo and describe what’s happening.

    4. Adam

    An optimizer that adapts learning rates for each parameter using information from past gradients. It combines ideas from momentum and adaptive learning-rate methods.

    One of the most popular optimizers in deep learning.

    Like a hiker who adjusts step size for each part of the trail, steep or flat.
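One Adam update for a single scalar parameter, sketched from the paper's formulas (defaults match the commonly used values):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus a running average of squared
    gradients (v), both bias-corrected for the early steps."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)        # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Repeated steps on f(w) = w^2 (gradient 2w) walk w toward the minimum at 0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
assert abs(w) < 0.5
```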

    5. Superposition

    A way neural networks represent many concepts using overlapping directions in the same space. This allows models to pack more information into fewer neurons than expected.

    It’s why interpretability is hard—features aren’t neatly separated.

    Like discovering a painting has hidden layers that appear under the right light.

    Quick Reference

    Concept One-liner
    Activation Functions Introduce nonlinearity to enable complex patterns
    Transfer Learning Reuse knowledge from one task for another
    VLM Joint understanding of images and text
    Adam Adaptive per-parameter learning rates
    Superposition Many concepts packed into overlapping representations

    Short, accurate ML explainers. Follow for more.

    Part 4 of the Five ML Concepts series. View all parts | Next: Part 5 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 529 words3 min readAbstract

    Five ML Concepts - #3

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #3
    Comments Discord

    References

    Concept Reference
    Loss Function A Survey of Loss Functions for Deep Neural Networks (Janocha & Czarnecki 2017)
    Overfitting Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al. 2014)
    Fine-tuning A Survey on Transfer Learning (Zhuang et al. 2020)
    LoRA LoRA: Low-Rank Adaptation of Large Language Models (Hu et al. 2021)
    Tokenization Neural Machine Translation of Rare Words with Subword Units (Sennrich et al. 2015)

    Today’s Five

    1. Loss Function

    A formula that measures how far off the model’s predictions are from the correct answers. It quantifies the gap between what the model predicted and what it should have predicted.

    Training a neural network means minimizing this function.

    Like a scorecard that tells the model how badly it messed up.
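Mean squared error, one of the most common choices, makes the idea concrete:

```python
def mse(predictions, targets):
    """Mean squared error: average squared gap between prediction and truth."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

assert mse([1.0, 2.0], [1.0, 2.0]) == 0.0    # perfect predictions, zero loss
assert mse([3.0, 5.0], [1.0, 1.0]) == 10.0   # (4 + 16) / 2
```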

    2. Overfitting

    When a model learns the training data too well, including noise and outliers, and fails on new data. The model performs great on examples it has seen but poorly on anything new.

    One of the most common pitfalls in machine learning.

    Like memorizing the answers to a test instead of understanding the subject.

    3. Fine-tuning

    Taking a pre-trained model and training it further on a specific task or dataset. Instead of training from scratch, you start from a model that already understands language or images, then specialize it.

    This makes powerful models accessible without massive compute budgets.

    Like teaching a chef who already knows cooking to specialize in sushi.

    4. LoRA (Low-Rank Adaptation)

    An efficient fine-tuning method that trains a small number of added parameters instead of the full model. It inserts small trainable matrices into each layer while keeping the original weights frozen.

    This dramatically reduces the memory and compute needed for fine-tuning.

    Like adding sticky notes to a textbook instead of rewriting the whole thing.
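A tiny numerical sketch of the low-rank trick (plain Python lists, not the PEFT library): the effective weight is W plus the product of two small matrices, and only those small matrices train.

```python
def matmul(X, Y):
    """Naive matrix multiply on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, B, A, scale=1.0):
    """Effective weight under LoRA: W stays frozen; only the low-rank
    factors B (d_out x r) and A (r x d_in) are trained."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(rw, rd)]
            for rw, rd in zip(W, delta)]

# d_out = d_in = 4, rank r = 1: the adapter adds 8 trainable numbers
# instead of retraining all 16 entries of W.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[1.0], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]
W_eff = lora_weight(W, B, A)
assert W_eff[0][1] == 2.0 and W_eff[2][2] == 1.0
```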

    5. Tokenization

    The process of breaking text into smaller units called tokens that a model can process. Most modern models use subword tokenization, splitting words into common pieces rather than individual characters or whole words.

    It determines what the model actually “sees” and affects everything from vocabulary size to multilingual performance.

    Like chopping sentences into bite-sized pieces a model can digest.
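A greedy byte-pair-style sketch (the merge table below is hypothetical; real tokenizers learn merges from corpus statistics):

```python
def bpe_tokenize(word, merges):
    """Start from characters, then apply learned merges in priority order,
    in the style of byte-pair encoding."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                tokens[i:i + 2] = [a + b]   # fuse the adjacent pair
            else:
                i += 1
    return tokens

# Hypothetical merges: common pieces like "ing" become single tokens.
merges = [("i", "n"), ("in", "g"), ("t", "o"), ("k", "e"), ("ke", "n")]
assert bpe_tokenize("tokening", merges) == ["to", "ken", "ing"]
```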

    Quick Reference

    Concept One-liner
    Loss Function How far off the model’s predictions are
    Overfitting Memorizing the test instead of learning the subject
    Fine-tuning Specializing a pre-trained model for a new task
    LoRA Efficient fine-tuning with small added matrices
    Tokenization Breaking text into the pieces a model actually reads

    Short, accurate ML explainers. Follow for more.

    Part 3 of the Five ML Concepts series. View all parts | Next: Part 4 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1784 words9 min readAbstract

    TBT (2/?): Pipelines on OS/390

    Unix invented pipes. Mainframes reinvented them—for records, not bytes.

    This is the second Throwback Thursday post—revisiting technologies that shaped how I think about programming. This time: CMS/TSO Pipelines, and a vibe coding project that brings them back to life in Rust for education, fun, and nostalgic reasons.

    Resource Link
    Code pipelines-rs
    Demo Live Demo
    Video Pipelines on OS/390 #TBT
    Comments Discord

    The 1996 Olympics and a Pair of Mainframes

    In 1996, IBM hosted the Olympics Web Server—one of the largest public web properties at the time. Many distributed IBM systems in different regions served dynamic web pages. The logs from all of them were funneled to a pair of IBM S/390 mainframes I was in charge of, running OS/390 (formerly MVS).

    When you’re processing millions of log records for statistics and forensics, you need tools that think in records, not lines. That’s where Pipelines for TSO/E came in.

Pipelines for TSO/E was the MVS/ESA port of CMS Pipelines, which ran on VM/ESA. Both let you chain stages together to filter, transform, and aggregate record-oriented data—pipelines that evolved in parallel with Unix’s byte-stream pipes.

    Two Traditions of Piping

    Unix pipes came first—Thompson and McIlroy at Bell Labs, 1969–1974. Byte streams, file descriptors, the | operator. Brutally simple. Explosively powerful. POSIX.1-1988 standardized pipe(2) and shell pipelines, though POSIX work began in the mid-1980s.

    CMS Pipelines emerged on IBM mainframes in the mid-to-late 1980s. They weren’t a Unix clone—they were convergent evolution under different pressures. Where Unix piped bytes between small programs, CMS piped records through declarative stages. Pipelines for TSO/E followed in the late 1980s and early 1990s, porting CMS concepts to the MVS multi-user environment. Unlike CMS Pipelines (which ships with z/VM), the TSO/E port is typically installed separately on z/OS.

    Neither tradition was “behind.” They were optimizing different dimensions:

      Unix Pipes CMS/TSO Pipelines
    Era 1969–1974 Mid-to-late 1980s
    Data unit Byte stream Records (fixed or variable length)
    Stage input stdin (bytes) Record buffer
    Field access awk, cut (text parsing) Column positions (direct)
    Execution Typically a process per stage Stages in one address space
    Topology Linear by default; fan-out/fan-in via tee, FIFOs, or process substitution Multi-stream, fan-out/fan-in built in
    Philosophy Small tools, ad hoc composition Declarative data transformation

    Many datasets on mainframes are record-structured. Records can be fixed-length or variable-length. CMS and TSO/E Pipelines treat records as byte arrays—character-oriented stages assume EBCDIC text, while position/length stages are binary-safe. A fixed-length 80-byte record isn’t arbitrary text—columns 1-8 are the name, 9-18 are the department, 19-26 are the salary. You don’t parse. You just read the right columns.

    Unix won culturally—cheap hardware, academic distribution, C portability. But IBM’s record-oriented pipelines were better at structured dataflow, and they anticipate or parallel patterns seen in ETL frameworks like Spark and Beam.

    CMS Pipelines ships with z/VM and is still used; Pipelines for TSO/E exists for z/OS but isn’t universally installed. These are not historical curiosities—mainframes continue to process a significant share of high-value transactions, and pipelines remain an available tool for data transformation on those systems.

    What a Pipeline Looks Like

    CMS Pipelines uses a DSL with PIPE as the command, | to chain stages, and ? as a command terminator (it suppresses the console from being used as implicit input):

    PIPE CONSOLE
    | FILTER 18,10 = "SALES"
    | SELECT 0,8,0; 8,10,8
    | CONSOLE
    ?
    

    This reads input records, keeps only those where columns 18–27 equal “SALES”, extracts the name fields, and writes the result. No regex. No string splitting. Just column positions.

    Note: pipelines-rs uses 0-based offsets (e.g., SELECT 0,8,0). Historical CMS Pipelines uses 1-based column positions.

    Compare with the Unix equivalent:

    cat input.txt | awk '$3 == "SALES" {print $1, $2}'
    

    The Unix version looks simpler—until your fields contain spaces, or your records contain non-text bytes, or you need to chain 15 stages without spawning 15 processes.

    Bringing It Back in Rust (Vibe Coding)

    pipelines-rs is a nostalgia-driven vibe coding project—my attempt to emulate Pipelines for TSO/E in Rust, not because it’s practical, but because these ideas deserve to be celebrated. It supports a subset of stages and features two execution models:

    The Two Executors

    Batched processes all records through one stage before moving to the next:

    All records → Stage 1 → All records → Stage 2 → All records → Stage 3
    

This produces the correct output and runs faster, but it doesn’t demonstrate record-oriented dataflow well.

    Record-At-a-Time (RAT) sends each record through the entire pipeline before reading the next:

    Record 1 → Stage 1 → Stage 2 → Stage 3 → Output
    Record 2 → Stage 1 → Stage 2 → Stage 3 → Output
    Record 3 → Stage 1 → Stage 2 → Stage 3 → Output
    

    RAT is the implementation shown in the video. It’s a naive approach—more buffers, more copying—but it shows the dataflow concepts clearly and enables the visual debugger. Both run in linear time (records × stages) and produce identical output for all 23 test specifications.

    A future version will aim for fewer buffers and fewer copy operations. Whether it’s faster than Batched remains to be seen.

    The 80-Byte Record

    The Rust implementation supports fixed-length records only. The fundamental data type is the Record—exactly 80 bytes, matching historical punch card width. Variable-length input lines are accepted and padded to 80 bytes:

    pub const RECORD_WIDTH: usize = 80;
    
    pub struct Record {
        data: [u8; RECORD_WIDTH],
    }
    

    Fields are accessed by column position and length. No parsing, no delimiters. The data is always right where you expect it.
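To illustrate what column addressing buys you (pipelines-rs itself is Rust; this is a hypothetical Python sketch with a made-up `field` helper, using the project's 0-based offsets):

```python
RECORD_WIDTH = 80

def field(record, start, length):
    """Read a field from a fixed-width record by 0-based offset and length,
    the way FILTER/SELECT address columns: no parsing, no delimiters."""
    return record[start:start + length]

# Name in columns 0-7, department in columns 18-27.
record = "SMITH   JOHN      SALES     00050000".ljust(RECORD_WIDTH)
assert field(record, 0, 8).rstrip() == "SMITH"
assert field(record, 18, 10).rstrip() == "SALES"
```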

    Supported Stages

    The current implementation supports 14 stages:

    Stage Purpose Example
    FILTER Keep/reject records by field value FILTER 18,10 = "SALES"
    LOCATE Keep records containing a pattern LOCATE "ERROR"
    NLOCATE Keep records NOT containing a pattern NLOCATE "DEBUG"
    SELECT Extract and reposition fields SELECT 0,8,0; 8,10,8
    CHANGE Text replacement CHANGE "SALES" "MKTG"
    COUNT Count records COUNT
    TAKE Keep first N records TAKE 5
    SKIP Skip first N records SKIP 2
    DUPLICATE Repeat each record N times DUPLICATE 3
    LITERAL Append a literal record LITERAL "--- END ---"
    UPPER/LOWER Case conversion UPPER
    REVERSE Reverse record text REVERSE
    HOLE Discard all input HOLE
    CONSOLE Driver stage: source or sink depending on position CONSOLE

    The CLI

    Both executors have identical CLIs:

    # Batch executor
    pipe-run specs/filter-sales.pipe specs/input-fixed-80.data -v
    
    # Record-at-a-time executor
    pipe-run-rat specs/filter-sales.pipe specs/input-fixed-80.data -v
    

    Given this input data:

    SMITH   JOHN      SALES     00050000
    JONES   MARY      ENGINEER  00075000
    DOE     JANE      SALES     00060000
    WILSON  ROBERT    MARKETING 00055000
    CHEN    LISA      ENGINEER  00080000
    GARCIA  CARLOS    SALES     00045000
    TAYLOR  SUSAN     MARKETING 00065000
    BROWN   MICHAEL   ENGINEER  00090000
    

    And this pipeline:

    PIPE CONSOLE
    | FILTER 18,10 = "SALES"
    | CONSOLE
    ?
    

    The output is:

    SMITH   JOHN      SALES     00050000
    DOE     JANE      SALES     00060000
    GARCIA  CARLOS    SALES     00045000
    Records:  8 in -> 3 out
    

    Exactly what I’d have gotten on OS/390 in 1996, but with Web Server log data showing client IP address, OS, browser type/version, user cookies, timestamps, URLs, and more, instead of accounting data. 😊

    The Web UI for Two pipelines-rs Implementations

    The web interface runs entirely in the browser via WebAssembly. It has three panels: input records with an 80-column ruler, the pipeline editor, and the output.

    Tutorial Mode

    The tutorial walks through each stage with examples, running pipelines automatically to show results. You can step through manually or let it auto-advance.

    The Visual Debugger

    The debugger is the reason RAT exists. It lets you:

    • Step through execution one pipe point at a time
    • Watch data at specific pipe points between stages
    • Set breakpoints to pause at specific stages
    • See stage state for stateful stages like COUNT

    You load a pipeline, click Run, then Step to watch each record flow through each stage. The debugger highlights which stages have been reached with a green border. For COUNT and other aggregation stages, you can watch the flush phase where accumulated state becomes output.

    What’s Next

    The current RAT executor is intentionally naive—it uses a buffer at every pipe point and copies each record between them. A better implementation would minimize buffers and copy operations while preserving the record-at-a-time semantics.

    Multi-pipe features are also planned—CMS Pipelines supported fan-out (one input, multiple output streams) and fan-in (multiple inputs merged), which enabled complex processing topologies beyond simple linear chains.

    How pipelines-rs Differs from IBM Pipelines

      IBM CMS/TSO/E Pipelines pipelines-rs
    Indexing 1-based column positions 0-based offsets
    Record format Fixed or variable length, EBCDIC Fixed 80-byte ASCII only (variable-length input padded)
    Stages Hundreds of built-in stages 14 implemented so far
    Topology Multi-stream: fan-out, fan-in, multi-pipe Linear only (multi-pipe planned)
    Environment z/VM, z/OS mainframes CLI (native) and browser (WASM)
    Character set EBCDIC ASCII/UTF-8

    This is a teaching tool and nostalgia project, not a production replacement.

    Implementation Details

    Metric Value
    Language Rust (2024 edition)
    Web UI Yew framework, compiled to WASM
    Stages 14 implemented
    Test Specs 23 pipeline specifications
    Tests 60+ (including batch/RAT equivalence)
    License MIT
    Live Demo sw-comp-history.github.io/pipelines-rs

    Resources

    Credits

    Role Who
    Concept & direction Mike Wright
    Content creation Claude (Anthropic)
    Editorial review ChatGPT (OpenAI)

    Mainframe ideas, modern tools. Follow for more.

    Part 2 of the Throwback Thursday series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 979 words5 min readAbstract

    Small Models (6/6): Which Small AI Fits YOUR Laptop?

    Maximum AI capability on minimum hardware. The 2-3B efficient frontier.

    This is Part 6 (the finale) of the Small Models, Big Brains series. We’re benchmarking the best small models to help you choose the right one for your laptop.

    The Efficient Frontier

    In economics, the “efficient frontier” is the set of optimal portfolios offering the highest return for a given level of risk.

    In AI, it’s the models offering the best capability for a given size.

    The Contenders

    Model Params Source Key Strength
    Phi-2 2.7B Microsoft Reasoning, synthetic data
    Gemma-2B 2B Google Distillation, multilingual
    SmolLM2-1.7B 1.7B HuggingFace 11T tokens, fast inference
    SmolLM3-3B 3B HuggingFace Dual reasoning, 6 languages

    Benchmark Results

    Actual measurements on Apple Silicon (M-series) from efficient-llm:

    Model MMLU GSM8K HumanEval Speed (CPU) Memory
    Phi-2 61.7% 57.0% 50.0% 7.1 tok/s 5.2GB
    Gemma-2B 38.9% 18.0% 90.0% 8.5 tok/s 4.7GB
    SmolLM2 55.6% * * 3.7 tok/s 3.2GB

    *SmolLM2 GSM8K/HumanEval scores reflect prompt format incompatibility, not capability.

    The Key Insight: Data Quality Beats Parameters

    Phi-2 achieves 61.7% MMLU with only 2.7B parameters.

    For comparison:

    • Llama-2-7B: ~46% MMLU
    • Llama-2-13B: ~55% MMLU

    Phi-2 beats models 5x its size. The secret? Synthetic textbook training.

    Microsoft generated high-quality educational content specifically designed to teach reasoning. Quality data > quantity data > model size.

    Model Profiles

    Phi-2: The Reasoning Champion

    Strengths: Math, logic, code understanding
    Weakness:  Less conversational
    Best for:  Technical tasks, chain-of-thought
    

    Phi-2 was trained on “textbook quality” synthetic data. It thinks like a textbook explains.

    Gemma-2B: The Distillation Expert

    Strengths: Multilingual, edge deployment
    Weakness:  Lower benchmark scores
    Best for:  Production apps, Google ecosystem
    

    Google distilled knowledge from larger models into this compact package. Great tooling and documentation.

    SmolLM2-1.7B: The Speed Demon

    Strengths: Fastest inference, smallest footprint
    Weakness:  Prompt format sensitivity
    Best for:  Memory-constrained environments
    

    HuggingFace trained on 11T tokens—massive overtraining like TinyLlama but at a slightly larger scale.

    SmolLM3-3B: The Balanced Choice

    Strengths: Dual reasoning modes, 6 languages
    Weakness:  Newest, less battle-tested
    Best for:  General-purpose small model needs
    

    The latest from HuggingFace, designed to be the go-to small model.

    Decision Framework

    ├── Need best reasoning?           → Phi-2
    ├── Need instruction following?    → SmolLM2 or SmolLM3
    ├── Need multilingual?             → Gemma-2B or SmolLM3
    ├── Memory constrained (<4GB)?     → SmolLM2 + INT4
    ├── Need Google ecosystem?         → Gemma-2B
    ├── General purpose?               → SmolLM3
    └── Maximum quality per byte?      → Phi-2
    

    Running the Benchmarks

    git clone https://github.com/softwarewrighter/efficient-llm
    cd efficient-llm
    
    # Setup
    uv venv && source .venv/bin/activate
    uv pip install torch transformers accelerate bitsandbytes datasets tqdm
    
    # HuggingFace login (required for Gemma)
    huggingface-cli login
    
    # Download and benchmark
    python download_models.py
    python benchmark_quality.py
    python benchmark_speed.py
    python benchmark_memory.py
    
    # Interactive demos
    python demo_reasoning.py
    python demo_code.py
    python demo_chat.py
    

    Hardware Requirements

    Setup Models You Can Run
    4GB RAM SmolLM2 (INT4)
    8GB RAM All models (INT4)
    16GB RAM All models (FP16)
    Apple Silicon All models (MPS)

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 7 .py files
    Estimated Size ~1.4 KLOC
    Framework Transformers, PyTorch
    Build System uv / pip
    Key Features MMLU/GSM8K/HumanEval benchmarks, demos

    Good for you if: You want to benchmark 2-3B models, compare quality vs speed tradeoffs, or run interactive comparisons between Phi-2, Gemma, and SmolLM.

    Complexity: Low. Similar structure to billion-llm. Standalone Python scripts for each benchmark and demo. Requires HuggingFace authentication for Gemma access.

    Series Recap

    Over six parts, we’ve explored the cutting edge of small model research:

    Part Model/Topic Key Insight
    1 TRM (<1K params) Iteration beats scale
    2 MobileLLM (350M) Offline AI is practical
    3 HRM (27M) Hierarchy enables reasoning
    4 BDH Sparsity enables interpretability
    5 1B models The efficiency sweet spot
    6 2-3B models Data quality beats parameters

    Key Takeaways

    1. Data quality beats parameter count. Phi-2 proves careful curation outperforms brute scaling.

    2. The 2-3B range is remarkably capable. These models handle real tasks, not just demos.

    3. Each model has its niche. Match the model to your use case.

    4. Quantization makes everything accessible. INT4 lets you run 3B models on 4GB RAM.

    5. The frontier keeps moving. SmolLM3 is weeks old. Better models are coming.

    What We’ve Learned

    Small models aren’t a compromise—they’re a different optimization target. When you can’t throw compute at a problem, you’re forced to be clever:

    • Recursive reasoning (TRM)
    • Mobile-optimized architectures (MobileLLM)
    • Hierarchical decomposition (HRM)
    • Sparse interpretable activations (BDH)
    • Overtraining on quality data (TinyLlama, Phi-2)

    These techniques will eventually feed back into large models too. Small model research isn’t a dead end—it’s the frontier.

    Resources


    Part 6 of 6 in the Small Models, Big Brains series. Thanks for following along!

    Part 6 of the Small Models, Big Brains series. View all parts

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 451 words3 min readAbstract

    Five ML Concepts - #2

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #2
    Comments Discord

    References

    Concept Reference
    Gradient Descent An overview of gradient descent optimization algorithms (Ruder 2016)
    Attention Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al. 2014)
    DPO Direct Preference Optimization (Rafailov et al. 2023)
    Learning Rate Cyclical Learning Rates (Smith 2015)
    Temperature On the Properties of Neural Machine Translation (Cho et al. 2014)

    Today’s Five

    1. Gradient Descent

    A general optimization method used across machine learning. It improves a model by taking small steps in the direction that reduces error the most.

    Many learning algorithms rely on it, especially neural networks.

    Like walking downhill in fog, adjusting each step based on the slope beneath your feet.
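The whole method is a loop (a minimal sketch on a one-dimensional function):

```python
def gradient_descent(grad, w, lr=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# f(w) = (w - 3)^2 has gradient 2*(w - 3) and its minimum at w = 3.
w = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
assert abs(w - 3) < 1e-6
```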

    2. Attention

    A mechanism that lets models weigh different parts of the input by importance. Instead of treating everything equally, attention highlights what matters most.

    This was key to breakthroughs in translation and language models.

    Like reading a sentence and focusing more on the important words.
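The weighting itself is just scores plus a softmax (a sketch of scaled dot-product attention weights for one query):

```python
import math

def attention_weights(query, keys):
    """Score each key against the query by scaled dot product, then
    softmax so the weights sum to 1."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

# The key most aligned with the query gets the most weight.
w = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
assert abs(sum(w) - 1.0) < 1e-12
assert w[0] == max(w)
```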

    3. DPO (Direct Preference Optimization)

    A method for aligning language models with human preferences. Unlike RLHF, it trains directly on preference comparisons and avoids an explicit reward model.

    This simplifies training while achieving comparable alignment.

    Like learning preferences by observing choices, not by designing a scoring system.

    4. Learning Rate

    Controls how large each update step is during training. Too large and learning becomes unstable. Too small and training is slow or gets stuck.

    One of the most important hyperparameters to tune.

    Like choosing how fast to walk downhill without losing balance.

    5. Temperature

    A parameter that controls randomness during text generation. Low temperature favors predictable, high-probability outputs. Higher temperature increases variety and surprise.

    A tradeoff between consistency and creativity.

    Like adjusting a dial from cautious to adventurous.
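Mechanically, temperature just divides the logits before the softmax (a minimal sketch):

```python
import math

def sample_distribution(logits, temperature=1.0):
    """Softmax with temperature: dividing logits by T sharpens the
    distribution when T < 1 and flattens it when T > 1."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cold = sample_distribution(logits, temperature=0.5)
hot = sample_distribution(logits, temperature=2.0)
# Low temperature concentrates mass on the top token; high spreads it out.
assert cold[0] > hot[0]
```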

    Quick Reference

    Concept One-liner
    Gradient Descent Walk downhill to minimize error
    Attention Focus on what matters in the input
    DPO Align models from preference pairs directly
    Learning Rate Step size that balances speed and stability
    Temperature Dial between predictable and creative

    Short, accurate ML explainers. Follow for more.

    Part 2 of the Five ML Concepts series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 844 words5 min readAbstract

    Small Models (5/6): Max AI Per Watt

    One billion parameters. The sweet spot for AI.

    Big enough to reason. Small enough to run anywhere. Maximum capability per watt.

    This is Part 5 of the Small Models, Big Brains series, comparing four models at the 1B parameter point.

    Resource Link
    Code billion-llm
    TinyLlama jzhang38/TinyLlama
    Llama 3.2 ai.meta.com/llama
    Pythia EleutherAI/pythia
    Video Max AI Per Watt
    Comments Discord

    Why One Billion?

    Range Reality
    Below 1B Models struggle with complex reasoning
    Above 1B Hardware requirements increase significantly
    At 1B Maximum capability per watt

    1B parameters is where you get:

    • Real language understanding
    • Ability to follow instructions
    • Fine-tuning in minutes on a laptop
    • Deployment anywhere (phone, Raspberry Pi, browser)

    The Contenders

    Model Params Key Strength Training Data
    TinyLlama 1.1B Overtrained on 3T tokens Community
    Llama-3.2-1B 1B Official Meta ecosystem Meta
    StableLM-1.6B 1.6B Multilingual, 2T tokens Stability AI
    Pythia-1B 1.08B 154 research checkpoints EleutherAI

    TinyLlama: The Overtraining Champion

    TinyLlama breaks the rules. The Chinchilla scaling laws suggest training tokens should scale with parameters. TinyLlama uses 100x more data than optimal.

    Chinchilla-optimal for 1B: ~30B tokens
    TinyLlama actual:          3T tokens (3,000B)
    

    The result? A tiny model that punches well above its weight.

    Benchmarks

    From the billion-llm repository:

    Model MMLU HumanEval Speed Memory
    TinyLlama 25.3% 12.2% Fast 2.2GB
    Llama-3.2-1B 32.1% 18.5% Fast 2.4GB
    StableLM-1.6B 30.8% 15.1% Medium 3.2GB
    Pythia-1B 26.4% 10.3% Fast 2.2GB

    Llama-3.2-1B leads on quality. TinyLlama offers the best value when you factor in the open training recipe.

    LoRA Fine-Tuning in Minutes

    All these models can be fine-tuned on a laptop using LoRA:

    cd billion-llm
    python finetune_demo.py --model tinyllama --epochs 3
    

    LoRA adds small trainable adapters without modifying base weights:

    Base Model (frozen): 1.1B parameters
    LoRA Adapters:       ~4M parameters (0.4%)
    Training time:       5-10 minutes on M1 Mac
    

    Speculative Decoding: 2-3x Speedup

    Use a fast 1B model to draft tokens, verify with a slower 7B model:

    Draft (1B):   "The quick brown fox" → [jumps, over, the, lazy]
    Verify (7B):  Accept [jumps, over, the] → Reject [lazy] → Generate [sleepy]
    

    The 1B model generates candidates quickly. The 7B model only needs to verify, not generate from scratch.

    python speculative_demo.py
    

    Results: 2-3x speedup on autoregressive generation.

    Hardware Requirements

    Setup What You Can Run
    CPU only All models (slower, INT4 quantized)
    4GB VRAM All models (INT4 quantized)
    8GB VRAM All models (FP16)
    Apple Silicon All models (MPS acceleration)

    Quick Start

    git clone https://github.com/softwarewrighter/billion-llm
    cd billion-llm
    
    # Setup
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    
    # Download models
    python download_models.py
    
    # Run benchmarks
    python benchmark.py
    
    # Interactive comparison
    python demo_chat.py --compare tinyllama llama3.2-1b
    

    Which Model Should You Choose?

    ├── Need Meta ecosystem compatibility? → Llama-3.2-1B
    ├── Need multilingual support?         → StableLM-1.6B
    ├── Need research reproducibility?     → Pythia-1B (154 checkpoints)
    ├── Need maximum performance/size?     → TinyLlama
    └── Just getting started?              → Any of them work!
    

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 8 .py files
    Estimated Size ~1.4 KLOC
    Framework Transformers, PyTorch
    Build System uv / pip
    Key Features Benchmarking, LoRA fine-tuning, speculative decoding

    Good for you if: You want to benchmark small LLMs, learn LoRA fine-tuning, experiment with speculative decoding, or compare models head-to-head.

    Complexity: Low. Clean Python scripts with HuggingFace Transformers. Each script is standalone—run benchmarks, chat demos, or fine-tuning independently. Well-documented with shell scripts for common tasks.

    Key Takeaways

    1. 1B is the efficiency sweet spot. Below this, capability drops. Above, hardware costs rise.

    2. Overtraining works. TinyLlama proves you can compensate for size with data.

    3. LoRA makes fine-tuning accessible. Customize models on consumer hardware.

    4. Speculative decoding is free speed. Use small models to accelerate large ones.

    5. All roads lead to open weights. Every model here is fully open.

    What’s Next

    Part 6 explores the 2-3B efficient frontier—Phi-2, Gemma, and SmolLM pushing the limits of small model capability.

    Part 5 of the Small Models, Big Brains series. View all parts | Next: Part 6 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 416 words3 min readAbstract

    Five ML Concepts - #1

    5 machine learning concepts. Under 30 seconds each.

    Resource Link
    Papers Links in References section
    Video Five ML Concepts #1
    Comments Discord

    References

    Concept Reference
    Backprop Learning representations by back-propagating errors (Rumelhart, Hinton, Williams 1986)
    Transformer Attention Is All You Need (Vaswani et al. 2017)
    Mamba Mamba: Linear-Time Sequence Modeling (Gu & Dao 2023)
    Hallucination Survey of Hallucination in NLG (Ji et al. 2023)
    Embedding Word2Vec (Mikolov et al. 2013)

    Today’s Five

    1. Backpropagation

    Short for “backward propagation of errors.” It’s how neural networks learn—flowing error backward through the network to adjust each weight.

    Without it, modern deep learning wouldn’t be practical.

    Think of it like retracing your steps to see which earlier choices caused the mistake.

    2. Transformer

    The architecture behind GPT, Claude, and most modern language models. Instead of processing words one at a time, transformers use attention to weigh relationships between all tokens.

    This enables parallel training and rich context awareness.

    Like understanding a sentence by seeing how every word relates to every other.

    3. Mamba (State Space Models)

    A newer alternative to transformers that processes sequences in linear time instead of quadratic.

    This allows scaling to very long documents with much lower memory use.

    Like a smart conveyor belt that carries forward only what matters.

    4. Hallucination

    When a model generates confident-sounding nonsense. It happens because language models predict plausible next words, not true facts.

    They optimize for likelihood, not correctness.

    Like a student who writes confidently without verifying sources.

    5. Embedding

    Turning words, images, or concepts into vectors of numbers. Similar meanings end up close together in this space.

    This lets math capture semantic relationships.

    Think of it as a coordinate system for meaning.
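
    A three-dimensional toy version of that coordinate system, with hand-picked (not learned) vectors:

```python
import numpy as np

# Tiny embedding intuition: similar meanings end up close together.
# The vectors are made up for illustration, not learned.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.9, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    # Standard similarity measure: angle between the two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))  # -> True
```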

    Quick Reference

    Concept One-liner
    Backprop Learn by flowing error backward
    Transformer Attention over all tokens at once
    Mamba Linear-time sequence modeling
    Hallucination Confident nonsense from likelihood optimization
    Embedding Meaning as coordinates in vector space

    Short, accurate ML explainers. Follow for more.

    Part 1 of the Five ML Concepts series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 849 words5 min readAbstract

    Small Models (4/6): This AI Has a Visible Brain

    LLMs are black boxes. Baby Dragon Hatchling (BDH) is different—a brain-inspired language model with sparse, interpretable activations.

    Train it on Shakespeare and actually see what’s happening inside.

    This is Part 4 of the Small Models, Big Brains series, exploring interpretability through sparsity.

    Resource Link
    Paper Pathway (Sparse Coding)
    Original Code pathwaycom/bdh
    Fork (with tools) softwarewrighter/bdh
    Video This AI Has a Visible Brain
    Comments Discord

    The Black Box Problem

    Modern neural networks are opaque:

    • Billions of parameters
    • Dense activations everywhere
    • No clear mapping from neurons to concepts
    • “It works, but we don’t know why”

    This isn’t just an academic concern. We’re deploying AI systems we don’t understand.

    Baby Dragon Hatchling: A Different Approach

    BDH takes inspiration from biological brains, which use sparse coding:

    Biological Brains Dense Neural Networks
    ~1-5% neurons active ~100% neurons active
    Energy efficient Computationally expensive
    Interpretable patterns Distributed, opaque
    Robust to noise Brittle

    Sparse Activations

    BDH enforces 80% sparsity—only 20% of neurons are active for any given token.

    Dense Network:    [████████████████████] 100% active
    BDH:              [████░░░░░░░░░░░░░░░░]  20% active
    

    This constraint forces the network to learn meaningful, localized representations.
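
    A hypothetical top-k version of that constraint (BDH's actual mechanism may differ; this just shows the 80/20 split):

```python
import numpy as np

# Keep only the top 20% of units by magnitude, zero the rest.
# A hypothetical sketch of an 80%-sparse activation, not BDH's exact code.
def sparse_activation(h, sparsity=0.8):
    k = max(1, int(round(h.size * (1 - sparsity))))  # units allowed to fire
    threshold = np.sort(np.abs(h))[-k]               # k-th largest magnitude
    return np.where(np.abs(h) >= threshold, h, 0.0)

rng = np.random.default_rng(0)
h = rng.standard_normal(100)
out = sparse_activation(h)
print(int((out != 0).sum()))  # -> 20
```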

    Training on Shakespeare

    The demo trains BDH on Shakespeare’s works:

    Training Progress:
    Epoch 1:   Loss 0.86
    Epoch 50:  Loss 0.54
    Epoch 100: Loss 0.38
    Epoch 200: Loss 0.22
    

    Loss drops from 0.86 to 0.22—the architecture works.

    Seeing Inside the Model

    With sparse activations, you can actually inspect what neurons mean:

    # Which neurons fire for "love"?
    activations = model.forward("love")
    active_neurons = activations.nonzero()
    
    # Neuron 47: fires for emotional words
    # Neuron 112: fires for abstract nouns
    # Neuron 203: fires for relationship terms
    

    When only 20% of neurons fire, each one carries interpretable meaning.

    Running the Code

    The bdh repository is a fork of Pathway’s original with added inspection tools:

    git clone https://github.com/softwarewrighter/bdh
    cd bdh
    pip install -r requirements.txt
    
    # Train on Shakespeare
    python train.py --dataset shakespeare --sparsity 0.8
    
    # Inspect activations
    python inspect.py --model checkpoint.pt --text "To be or not to be"
    

    GPU recommended (Nvidia or Apple Silicon) for reasonable training times.

    Why Sparsity Enables Interpretability

    Dense Networks

    Every neuron participates in every computation. The “meaning” of any single neuron is distributed across all inputs it ever sees.

    Input: "cat"  → All neurons contribute → Output
    Input: "dog"  → All neurons contribute → Output
    Input: "love" → All neurons contribute → Output
    

    Trying to understand one neuron means understanding everything.

    Sparse Networks

    Only a small subset of neurons fire for each input. Neurons develop specialization.

    Input: "cat"  → Neurons [12, 47, 89] fire → Output
    Input: "dog"  → Neurons [12, 52, 89] fire → Output
    Input: "love" → Neurons [47, 112, 203] fire → Output
    

    Neuron 12 might mean “animal.” Neuron 47 might mean “emotional/living.” You can actually trace meaning.

    Comparison with Other Sparse Architectures

    Model Sparsity Type Purpose
    Mixture of Experts Routing sparsity Efficiency
    Top-k attention Attention sparsity Memory
    BDH Activation sparsity Interpretability

    BDH’s sparsity is specifically designed for understanding, not just efficiency.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 9 .py files
    Estimated Size ~1.5 KLOC
    Framework PyTorch
    Build System pip / requirements.txt
    GPU Support CUDA, MPS (Apple Silicon)

    Good for you if: You want to experiment with sparse neural architectures, study interpretability techniques, or train small language models with visible internals.

    Complexity: Low-Moderate. Standard PyTorch project structure. The sparse activation mechanism is well-documented. Fork includes additional inspection tools not in the original.

    Key Takeaways

    1. Sparsity enables interpretability. When fewer neurons fire, each one means more.

    2. Brain-inspired design works. Biological neural coding principles transfer to AI.

    3. Interpretability doesn’t require sacrifice. BDH learns effectively despite constraints.

    4. We can build AI we understand. Black boxes aren’t inevitable.

    Current Limitations

    • Early research stage
    • Smaller scale than production models
    • Training requires more epochs
    • Not yet competitive with dense models on benchmarks

    But the principle is sound: constraint breeds clarity.

    What’s Next

    Part 5 dives into the 1B parameter sweet spot—comparing TinyLlama, Llama 3.2, StableLM, and Pythia.

    Part 4 of the Small Models, Big Brains series. View all parts | Next: Part 5 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1478 words8 min readAbstract

    Solving Sparse Rewards with Many Eyes

    Single explorer: 0% success. Five explorers: 60% success.

    Learning often fails not because models are slow, but because they see too little. In sparse-reward environments, a single explorer is likely to miss the rare feedback entirely. The solution? Put many eyes on the problem.

    The Problem: Sparse Rewards Create Blindness

    As IRPO formalizes: in sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal.

    A 7x7 grid with a single goal demonstrates this perfectly:

    • Random agent success rate: ~9%
    • With limited training (75 episodes), a single learner exploring alone never finds the goal

    This isn’t a compute problem. It’s an information problem.

    Challenge Effect Paper Connection
    Rare rewards Weak gradient signal IRPO’s core problem statement
    Single explorer Limited coverage Why multiple scouts help
    Random exploration Misses valuable states Why intrinsic rewards matter
    No feedback structure Can’t distinguish “almost right” from “nonsense” Reagent’s motivation

    The Solution: Many Eyes

    Instead of one explorer, use multiple scouts—independent exploratory agents that gather diverse information.

    ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
    │   Scout 1   │  │   Scout 2   │  │   Scout N   │
    │ (strategy A)│  │ (strategy B)│  │ (strategy N)│
    └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
           │                │                │
           v                v                v
    ┌─────────────────────────────────────────────────┐
    │              Experience Buffer                   │
    └─────────────────────────────────────────────────┘
                           │
                           v
    ┌─────────────────────────────────────────────────┐
    │               Shared Learner                     │
    └─────────────────────────────────────────────────┘
    

    Each scout explores with its own strategy. Their discoveries are aggregated to improve a shared learner.

    Results

    On a 7x7 sparse grid with 75 training episodes:

    Method Success Rate
    Random baseline 9%
    Single scout 0%
    Many eyes (3 scouts) 40%
    Many eyes (5 scouts) 60%

    Same total environment steps. Dramatically better outcomes.

    Why It Works

    Single Scout Fails Because:

    In IRPO terms: sparse reward → sparse gradient signal → no learning.

    1. Random exploration rarely reaches the goal (~9%)
    2. Insufficient successful trajectories
    3. DQN can’t learn from sparse positive examples
    4. The policy gradient has near-zero magnitude

    Many Eyes Succeeds Because:

    IRPO’s key insight: multiple exploratory policies manufacture signal.

    1. More coverage: Different scouts explore different regions (intrinsic rewards drive novelty-seeking)
    2. More discoveries: Higher probability of reaching goal (scouts find extrinsic reward)
    3. Signal routing: Scout discoveries update the shared learner (surrogate gradient in IRPO, experience pooling in many-eyes)
    4. Better gradients: Aggregated experience provides meaningful learning signal
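
    A stub version of that pipeline: several scouts, one shared buffer, and a discovery count standing in for the learner's gradient signal. The environment and policies are toys, not the repo's code:

```python
import random

# Toy many-eyes loop: scouts walk a 1-D chain with a sparse goal reward
# at the far end; every transition lands in one shared experience buffer.
random.seed(0)
GOAL, STEPS = 9, 12

def rollout(policy):
    pos, traj = 0, []
    for _ in range(STEPS):
        a = policy(pos)
        pos = max(0, min(GOAL, pos + a))
        r = 1.0 if pos == GOAL else 0.0
        traj.append((pos, a, r))
        if r:
            break
    return traj

scouts = [
    lambda s: random.choice([-1, 1]),              # pure-random scout
    lambda s: 1 if random.random() < 0.9 else -1,  # biased "curious" scout
]

buffer = []  # shared experience pool, consumed by one learner
for scout in scouts:
    for _ in range(10):
        buffer.extend(rollout(scout))

discoveries = sum(r for _, _, r in buffer)
print(f"pooled transitions: {len(buffer)}, goal discoveries: {int(discoveries)}")
```

    The point of the sketch: even if the random scout never reaches the goal, its transitions still contribute coverage, and any scout's discovery feeds the same buffer.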

    Scout Strategies (Intrinsic Rewards)

    IRPO uses intrinsic rewards to drive exploration. The many-eyes-learning project implements several strategies:

    Strategy Intrinsic Motivation IRPO Connection
    Epsilon-greedy Random action with probability ε Simple exploration noise
    Curious Bonus for novel states: 1/√(count+1) Count-based intrinsic reward
    Optimistic High initial Q-values Optimism under uncertainty
    Random Pure random baseline Maximum entropy exploration

    # CuriousScout intrinsic reward (simplified)
    from math import sqrt

    def intrinsic_reward(self, state):
        count = self.state_counts[state]           # visits to this state so far
        return self.bonus_scale / sqrt(count + 1)  # novelty bonus decays with revisits
    

    Scouts can be homogeneous (same strategy, different seeds) or heterogeneous (different strategies). IRPO supports swapping intrinsic reward functions—many-eyes makes this concrete with pluggable scout types.

    Running the Demo

    git clone https://github.com/softwarewrighter/many-eyes-learning
    cd many-eyes-learning
    
    # Setup
    uv venv .venv
    source .venv/bin/activate
    uv pip install -e ".[dev]"
    
    # Interactive CLI demo
    python experiments/cli_demo.py
    
    # Full experiment
    python experiments/run_experiment.py --episodes 75 --scouts 1 3 5
    
    # Generate plots
    python experiments/plot_results.py
    

    Results appear in ~5-10 minutes on a laptop.

    Diversity Experiment

    Does diversity of strategies matter, or just number of scouts?

    Configuration Success Rate
    5 random scouts 20%
    5 epsilon-greedy scouts 40%
    5 diverse scouts (mixed strategies) 40%

    Finding: In simple environments, strategy quality matters more than diversity. Epsilon-greedy beats random regardless of diversity.

    Key Insight

    The problem isn’t that learning is slow. The problem is that learning is blind.

    Many eyes make learning better.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files ~12 .py files
    Estimated Size ~1.5 KLOC
    Framework PyTorch, NumPy
    Platform CPU (no GPU required)

    Good for you if: You want to understand exploration in RL, experiment with sparse-reward environments, or see a clean implementation of scout-based learning.

    Complexity: Low-Moderate. Clean codebase with CLI demos. Runs on a laptop in minutes.

    Design Philosophy

    The project prioritizes clarity over performance:

    • Single-file implementations where practical
    • Minimal dependencies
    • Sequential mode is first-class (parallel optional)
    • Reproducible experiments with fixed seeds

    Simplifications from IRPO

    Full IRPO computes Jacobians to route gradients from exploratory policies back to the base policy. Many-eyes-learning simplifies this:

    IRPO Many-Eyes-Learning
    Jacobian chain rule Experience pooling
    Surrogate gradient Standard DQN updates
    Learned intrinsic rewards Hand-designed strategies

    The core insight remains: scouts explore with intrinsic motivation, discoveries benefit the shared learner. The math is simpler, the demo runs on a laptop, and the concept is clear.

    Key Takeaways

    1. Sparse rewards create information bottlenecks. Learning fails not from lack of compute, but lack of signal.

    2. More eyes = more information. Multiple scouts increase coverage and discovery rate.

    3. Diversity helps, but quality matters more. In simple environments, good exploration strategy beats diversity.

    4. Same compute, better outcomes. Many-eyes improves sample efficiency, not wall-clock speed.

    The Papers Behind Many-Eyes

    This project builds on two recent papers that attack the same fundamental problem: sparse rewards starve learning of signal.

    IRPO: Intrinsic Reward Policy Optimization

    IRPO (Cho & Tran, UIUC) formalizes the scouts concept mathematically.

    The core insight: In sparse-reward RL, the true policy gradient is basically uninformative most of the time. No reward signal → no gradient signal. Learning stalls.

    IRPO’s solution:

    ┌─────────────────────────────────────────────────┐
    │  1. Train exploratory policies (scouts)         │
    │     using INTRINSIC rewards                     │
    ├─────────────────────────────────────────────────┤
    │  2. Scouts discover EXTRINSIC rewards           │
    │     through exploration                         │
    ├─────────────────────────────────────────────────┤
    │  3. Route extrinsic signal back to base policy  │
    │     via surrogate gradient (Jacobian chain)     │
    └─────────────────────────────────────────────────┘
    
    IRPO Concept What It Means
    Intrinsic rewards “Explore what’s new” - reward novelty
    Exploratory policies Scouts driven by intrinsic motivation
    Surrogate gradient Trade bias for signal - approximate gradient that actually has magnitude
    Base policy The learner that benefits from scout discoveries

    How many-eyes-learning demonstrates this:

    • Scouts implement intrinsic motivation (CuriousScout uses count-based novelty bonuses)
    • Multiple exploration strategies create diverse coverage
    • Aggregated experience routes discoveries to the shared DQN learner
    • Simplified gradient routing - we pool experiences rather than compute full Jacobians

    Reagent: Reasoning Reward Models for Agents

    Reagent (Fan et al., CUHK/Meituan) takes a different approach: make feedback richer and more structured.

    The problem with sparse rewards: They can’t distinguish “almost right, failed at the end” from “complete nonsense.” Both get the same zero reward.

    Reagent’s solution: Build a Reasoning Reward Model that emits:

    Signal Purpose
    <think> Explicit reasoning trace
    <critique> Targeted natural-language feedback
    <score> Overall scalar reward

    This provides dense-ish supervision without hand-labeling every step.

    How many-eyes-learning relates:

    • Both papers recognize sparse rewards as an information problem
    • Reagent enriches the reward signal; IRPO multiplies the exploration
    • Many-eyes takes the IRPO path: more explorers finding the sparse signal
    • Future work could combine both: scouts + richer feedback per trajectory

    The Shared Meta-Lesson

    Both papers are saying the same thing:

    Sparse signals are a tragedy. Let’s smuggle in richer ones.

    • IRPO: via intrinsic-reward exploration gradients
    • Reagent: via language-based reward feedback

    Many-eyes-learning demonstrates the IRPO intuition in a simple, visual, reproducible way.

    Sparse rewards are an information problem. Many eyes provide the solution.

    Part 1 of the Machine Learning series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 666 words4 min readAbstract

    MCP: Teaching Claude to Play (and Trash Talk)

    Claude learned to play tic-tac-toe. And trash talk. Using one protocol that works with any language model.

    Resource Link
    Code game-mcp-poc
    MCP Spec modelcontextprotocol.io
    Video Claude Plays Tic-Tac-Toe
    Comments Discord

    The Problem

    Language models are stuck in text. They can’t click buttons, make moves, or interact with real systems. Every integration is custom—different for Claude, GPT, Gemini.

    The Solution: MCP

    Model Context Protocol is a standard way for models to use tools. Define your tools once, they work with Claude, GPT, or any MCP-compatible agent.

    The protocol is simple:

    • JSON-RPC 2.0 over stdio
    • No HTTP needed
    • Clean request/response cycle
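
    Here's what one request looks like on the wire. The tools/call method name follows the MCP spec; the transport lines are sketched as comments since they depend on where the server binary lives:

```python
import json

# One MCP tool invocation as a JSON-RPC 2.0 message. "tools/call" is the
# MCP method for invoking a named tool; the tool name comes from the demo.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "view_game_state", "arguments": {}},
}
line = json.dumps(request)

# A real client writes `line` to the server's stdin and reads one JSON
# response line back from stdout -- no HTTP involved:
#   proc.stdin.write(line + "\n"); proc.stdin.flush()
#   response = json.loads(proc.stdout.readline())
print(line)
```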

    The Demo: Trash Talkin’ Tic Tac Toe

    This proof-of-concept implements 6 MCP tools:

    Tool Purpose
    view_game_state See the board, players, status
    get_turn Whose turn is it?
    make_move Play a square (row, col)
    taunt_player Send trash talk to opponent
    restart_game Start a new game
    get_game_history All moves with timestamps

    The AI calls tools, the server responds. Claude can play a full game AND talk trash—all through the same protocol.

    Architecture

    ┌─────────────────────────────────────────────┐
    │            Claude Code (AI)                 │
    │              (MCP Client)                   │
    └──────────────────┬──────────────────────────┘
                       │ JSON-RPC 2.0 via stdio
                       ▼
    ┌─────────────────────────────────────────────┐
    │         MCP Server (Rust Binary)            │
    │  ┌───────────────────────────────────────┐  │
    │  │  6 Tools: view, turn, move, taunt,   │  │
    │  │           restart, history            │  │
    │  └───────────────────────────────────────┘  │
    │                   ▼                         │
    │  ┌───────────────────────────────────────┐  │
    │  │      SQLite (game.db)                 │  │
    │  │  • Games • Moves • Taunts             │  │
    │  └───────────────────────────────────────┘  │
    └─────────────────────────────────────────────┘
             ▲                           ▲
             │ REST API                  │ MCP
             │                           │
        Browser (UI)              AI Agent
        (Yew/WASM)              (Claude Code)
    

    Running It

    git clone https://github.com/sw-game-dev/game-mcp-poc
    cd game-mcp-poc
    
    # Development mode (with hot-reload)
    ./scripts/dev.sh
    
    # Or production build
    ./scripts/build.sh
    ./scripts/serve.sh
    

    The server runs on http://localhost:7397 serving:

    • REST API for UI interactions
    • MCP endpoint for AI agents
    • SSE for real-time updates
    • Yew/WASM frontend

    Configuring Claude Code

    Add to ~/.config/claude-code/mcp.json:

    {
      "mcpServers": {
        "tic-tac-toe": {
          "command": "/path/to/game-mcp-poc/target/release/game-mcp-server",
          "args": [],
          "env": {
            "GAME_DB_PATH": "/path/to/game.db"
          }
        }
      }
    }
    

    Restart Claude Code, then:

    You: "Let's play tic-tac-toe! Show me the board."
    You: "I'll take the center."
    You: "Your turn!"
    You: "Can you taunt me?"
    

    Implementation Details

    Metric Value
    Language Rust 2024 Edition
    Frontend Yew + WebAssembly
    Database SQLite
    Tests 175+ passing
    LOC ~2,500 (backend) + ~1,500 (tests)
    Binary Size ~8 MB

    Good for you if: You want to learn MCP, build AI-tool integrations, or see a production-quality Rust game server.

    Complexity: Moderate. Clean architecture with TDD. Requires Rust toolchain and understanding of JSON-RPC.

    Key Takeaways

    1. MCP standardizes AI tools. Define once, works with any compatible model.

    2. JSON-RPC over stdio is elegant. No HTTP complexity for local tools.

    3. Rust + WASM = fast everywhere. Same language for server and (via Yew) frontend.

    4. Trash talk is essential. Games without taunting are just… exercises.

    MCP turns language models into tool users. This demo proves it works—and that AI can talk trash.

    Part 1 of the General Technology series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 794 words4 min readAbstract

    Small Models (3/6): Planner + Doer = Genius

    27 million parameters beats o3-mini on ARC.

    The hardest reasoning benchmark. Most LLMs score under 5 percent. This tiny model scores 40 percent.

    This is Part 3 of the Small Models, Big Brains series, exploring the Hierarchical Reasoning Model (HRM)—a brain-inspired architecture that separates planning from execution.

    Resource Link
    Paper Hierarchical Reasoning Model
    Original Code sapientinc/HRM
    Visualization viz-hrm-ft
    Video Planner + Doer = Genius
    Comments Discord

    The ARC Challenge

    The Abstraction and Reasoning Corpus (ARC) tests:

    • Abstract reasoning
    • Pattern matching
    • Spatial logic
    • Puzzles requiring real thinking

    These aren’t problems you can memorize. Each puzzle is unique, requiring genuine understanding of the underlying pattern.

    Why LLMs Struggle

    Challenge LLM Limitation
    Novel patterns Can’t rely on training data
    Spatial reasoning Text-based thinking is linearized
    Multi-step logic Each step compounds errors
    Abstraction Pattern matching isn’t generalization

    Meet HRM: The Hierarchical Reasoning Model

    HRM uses just 27 million parameters but achieves remarkable results by mimicking how the brain thinks: plan first, then act.

    Two-Module Architecture

    ┌─────────────────────────────────────┐
    │           PLANNER                   │
    │   Thinks slow and abstract          │
    │   Sets goals and strategies         │
    └─────────────┬───────────────────────┘
                  │ Goals
                  ▼
    ┌─────────────────────────────────────┐
    │            DOER                     │
    │   Moves fast                        │
    │   Takes concrete actions            │
    └─────────────────────────────────────┘
    
    Module Speed Function
    Planner Slow Abstract thinking, goal setting
    Doer Fast Concrete actions, execution

    This mirrors the brain’s dual-process theory: System 1 (fast, intuitive) and System 2 (slow, deliberate).

    Results

    Benchmark HRM (27M) o3-mini GPT-4
    ARC 40% <40% <5%
    Hard Mazes 99% - ~0%
    Complex Sudoku 99% - -

    A 27M parameter model outperforming models 1000x larger on reasoning tasks.

    The Visualization Tool

    The viz-hrm-ft repository provides a React app to visualize HRM’s reasoning process:

    • Watch the Planner form strategies
    • See the Doer execute actions
    • Visualize the feedback loop between modules
    • Simulate fine-tuning on BabyAI tasks
    git clone https://github.com/softwarewrighter/viz-hrm-ft
    cd viz-hrm-ft
    npm install
    npm start
    

    Why Hierarchy Works

    Traditional Flat Models

    Input → [Single Network] → Output
    

    Everything happens in one pass. Complex problems overwhelm the network.

    Hierarchical Models

    Input → [Planner] → Strategy
                      ↓
    Strategy → [Doer] → Action
                      ↓
    Action → [Environment] → Feedback
                           ↓
    Feedback → [Planner] → Refined Strategy
                         ↓
                        ...
    

    The Planner doesn’t worry about details. The Doer doesn’t worry about strategy. Each module focuses on what it does best.
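
    The division of labor can be caricatured in a few lines: a slow planner that re-plans every third tick, and a fast doer that moves one step per tick. Pure illustration, not the paper's model:

```python
# Toy planner-doer loop in the spirit of HRM. Numbers and update rules
# are invented for illustration; the real model uses neural modules.
TARGET = 10

def planner(state):
    # Slow module: choose an abstract subgoal partway toward the target.
    return min(TARGET, state + 3)

def doer(state, subgoal):
    # Fast module: one concrete unit step toward the current subgoal.
    return state + 1 if state < subgoal else state

state, subgoal = 0, 0
for t in range(12):
    if t % 3 == 0:  # the planner only "thinks" every third tick
        subgoal = planner(state)
    state = doer(state, subgoal)

print(state)  # -> 10
```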

    Key Insights

    1. Separation of concerns scales. Splitting planning from execution lets each module specialize.

    2. Iteration enables refinement. The Planner-Doer loop allows course correction.

    3. Small can beat big. 27M parameters with good architecture beats 100B+ with brute force.

    4. Brain-inspired design works. Mimicking cognitive architecture yields better results.

    Comparison with Part 1 (TRM)

    Aspect TRM HRM
    Parameters <1,000 27M
    Architecture Think-Act cycles Planner-Doer hierarchy
    Strength Maze solving Abstract reasoning
    Key insight Iteration Hierarchical decomposition

    Both use recursive reasoning, but HRM adds hierarchical structure for more complex tasks.

    Implementation Details

    Metric Value
    Primary Language TypeScript
    Source Files 26 .ts/.tsx, 7 .js
    Estimated Size ~4 KLOC
    Framework React
    Build System npm / Create React App
    Visualization Canvas-based rendering

    Good for you if: You want to visualize neural reasoning processes, build interactive ML demos, or learn React with a real project.

    Complexity: Low-Moderate. Standard React/TypeScript project. No ML training code—this is a visualization tool for understanding the HRM architecture. Easy to extend with new visualizations.

    Key Takeaways

    1. Plan, then act. Separating strategy from execution mirrors effective human thinking.

    2. Hierarchy enables complexity. Multi-level reasoning handles problems flat networks can’t.

    3. Architecture > Scale for reasoning tasks.

    4. ARC remains unsolved by brute-force scaling—clever architectures are the path forward.

    What’s Next

    Part 4 explores Baby Dragon Hatchling (BDH)—a brain-inspired model with visible, interpretable activations.

    Part 3 of the Small Models, Big Brains series. View all parts | Next: Part 4 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 710 words4 min readAbstract

    Deepseek Papers (2/3): Engram - Conditional Memory for Transformers

    Deepseek publishes papers. I implement them. This paper tackles another fundamental transformer problem: redundant computation.

    This post covers my implementation of Engram (Conditional Memory via Scalable Lookup)—running on both Apple Silicon and NVIDIA GPUs.

    Resource Link
    Paper arXiv:2601.07372
    Code engram-poc
    Video 1 Engram Part 1
    Video 2 Engram Part 2
    Comments Discord

    The Problem: Redundant Computation

    LLMs waste compute reconstructing patterns they’ve seen before:

    • Style rules repeated across files
    • Common code idioms re-derived each call
    • Boilerplate knowledge injected repeatedly

    Attention computes everything from scratch every time. For recurring patterns, this is wasteful.

    The Engram Solution: O(1) Lookup

    Engram introduces conditional memory as a complementary sparsity axis. Instead of recomputing common patterns through attention, look them up in O(1) time.

    Think of it as a cache for the model’s learned patterns:

    Without Engram With Engram
    Recompute pattern every call Look up cached result
    O(n²) attention O(1) deterministic lookup
    Implicit knowledge Explicit, inspectable memory
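
    The cache analogy in miniature: a dict as O(1) conditional memory in front of an expensive fallback path. The entries and the fallback are illustrative stand-ins:

```python
# A dict as conditional memory: O(1) hit for cached patterns, slow
# fallback otherwise. Keys and values are illustrative, not real engrams.
engram = {
    "for i in range(": "len(items)):",
    "http status for Not Found": "404",
}

def run_full_model(prompt):
    # Stand-in for the expensive path (full attention over the context).
    return "<recomputed from scratch>"

def complete(prompt):
    if prompt in engram:           # hit: constant-time lookup
        return engram[prompt]
    return run_full_model(prompt)  # miss: fall back to the slow path

print(complete("for i in range("))  # -> len(items)):
print(complete("something novel"))  # -> <recomputed from scratch>
```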

    The PoC Approach

    The full Engram paper describes in-model memory. The engram-poc repo approximates the benefits through behavioral fine-tuning:

    1. Pattern Injection: Training data encodes lookup-like patterns
    2. LoRA Adapters: Learn to recognize the patterns and respond to them consistently
    3. Evaluation: Compare baseline vs tuned model

    Pattern Categories

    The PoC includes 131 patterns across 4 categories:

    Category Examples
    Code Idioms for i in range(len(items)):
    Factual Recall HTTP status for 'Not Found'? → 404
    Format Transforms snake_case: getUserName → get_user_name
    Error Fixes Fix: if x = 5: → if x == 5:

    Results

    Training on SmolLM-135M-Instruct:

    Metric Value
    Training Examples 337
    Training Time ~10 seconds (M-series Mac)
    Loss Reduction 58.2% (4.34 → 1.82)

    Behavioral change:

    Prompt: Complete: for i in range(
    
    Baseline:     "Here is a Python function that implements this approach..."
    Engram-tuned: "len(items)):"
    

    The tuned model produces direct, pattern-completing responses instead of verbose explanations.

    Running the Engram Demo

    git clone https://github.com/softwarewrighter/engram-poc
    cd engram-poc
    
    # Apple Silicon
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    ./scripts/run_all.sh
    
    # NVIDIA GPU (separate directory)
    cd unsloth-nvidia
    uv venv && source .venv/bin/activate
    uv pip install torch --index-url https://download.pytorch.org/whl/cu124
    uv pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
    ./scripts/run_all.sh
    

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 24 .py, 10 .sh, 6 .yaml
    Estimated Size ~3.0 KLOC
    Frameworks MLX-LM, Unsloth
    Platforms Apple Silicon, NVIDIA CUDA
    Key Features LoRA fine-tuning, pattern evaluation, interactive demo

    Good for you if: You want to experiment with LoRA fine-tuning, understand behavioral pattern injection, or compare MLX vs Unsloth workflows.

    Complexity: Moderate. Includes extensive documentation and video recording guides. Pattern data is human-readable YAML.

    Key Takeaways

    1. Engram reduces redundant computation. O(1) lookup for recurring patterns beats recomputing through attention.

    2. LoRA makes experimentation accessible. Fine-tune small models in seconds on a laptop.

    3. Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, with different tooling for each.

    4. Deepseek publishes useful research. Their papers address real problems with practical solutions.

    What’s Next

    Part 3 will cover Engram Revisited—what happened when we moved from behavioral emulation to real hash-based memory implementation. Spoiler: it works, but not everywhere.

    Resources


    Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.

    Part 2 of the Deepseek Papers series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 697 words · 4 min read

    Multi-Hop Reasoning (1/2): Training Wheels for Small LLMs

    A tiny 135M parameter model goes from 0% to 75% accuracy in 5 minutes of training. The secret? Knowledge graph-guided training with rejection sampling.

    The Problem: Multi-Hop Reasoning

    LLMs struggle with questions requiring multiple reasoning steps. “What’s the fix for a crash caused by a corrupted config file on a system running outdated firmware?” requires connecting several facts:

    1. Corrupted config → need config reset
    2. Outdated firmware → need firmware update
    3. Crash context → check dependencies between these fixes

    Standard fine-tuning teaches pattern matching. Multi-hop reasoning requires following logical chains.

    The Paper’s Approach

    Training wheels

    Learn with training wheels, remove them after learning completes.

    Knowledge Graph-Guided RAG from Princeton proposes using knowledge graphs during training to score reasoning quality—then removing the graph at inference.

    The key insight: train with scaffolding, test without it.

    My Implementation

    The repo implements this for a software troubleshooting domain:

    Component Details
    Knowledge Graph ~200 entities, ~600 edges (symptoms, causes, fixes)
    Training Data MCQs with 1-3 hop paths
    Eval Data MCQs with 4-5 hop paths (harder)
    Model SmolLM-135M-Instruct
    Framework MLX (Apple Silicon native)

    The Training Pipeline

    ┌─────────────────────────────────────────┐
    │  1. SFT: Learn output format            │
    │     TRACE: <reasoning>                  │
    │     ANSWER: A|B|C|D                     │
    ├─────────────────────────────────────────┤
    │  2. RSFT: Rejection Sampling FT         │
    │     - Generate multiple answers         │
    │     - Score with knowledge graph        │
    │     - Keep only correct traces          │
    │     - Train on winners                  │
    └─────────────────────────────────────────┘
    
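The RSFT stage's control flow can be sketched with the model and the graph scorer stubbed out (all names here are illustrative stand-ins, not the repo's code):

```python
# Sketch of rejection-sampling fine-tuning (RSFT): sample several
# candidate answers per question, score each with the knowledge
# graph, and keep only winning traces for the next training round.
# generate() and kg_score() are toy stand-ins.
import random

def rsft_round(questions, generate, kg_score, n_samples=4, threshold=0.0):
    keep = []
    for q in questions:
        candidates = [generate(q) for _ in range(n_samples)]
        scored = [(kg_score(q, c), c) for c in candidates]
        best_score, best = max(scored)
        if best_score > threshold:     # reject low-scoring traces
            keep.append((q, best))     # train only on winners
    return keep

# Toy demo: a "model" that sometimes answers correctly.
random.seed(0)
gold = {"q1": "A", "q2": "C"}
generate = lambda q: random.choice("ABCD")
kg_score = lambda q, c: 1.0 if c == gold[q] else -2.0

winners = rsft_round(list(gold), generate, kg_score)
```

Everything kept by this loop is, by construction, a trace the graph scored as correct, which is what makes the winners safe to train on.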

    The Reward Function

    The knowledge graph scores outputs during training:

    • R_corr: +1.0 correct answer, -2.0 incorrect
    • R_path: Entity coverage (did the trace mention relevant nodes?)
    • P_spam: -0.5 penalty for repeating entities (prevents gaming)

    At inference, the graph is removed. The model must reason from learned patterns.
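Assuming the three terms are simply summed, the scorer might look like this toy sketch (the trace and graph representations are simplified stand-ins, not the repo's data structures):

```python
# Sketch of the composite reward using the weights quoted above
# (R_corr, R_path, P_spam). Representations are simplified.

def score_trace(answer, gold, trace_entities, relevant_entities):
    # R_corr: +1.0 for the right answer, -2.0 otherwise
    r_corr = 1.0 if answer == gold else -2.0

    # R_path: fraction of relevant graph nodes the trace mentions
    covered = set(trace_entities) & set(relevant_entities)
    r_path = len(covered) / max(len(relevant_entities), 1)

    # P_spam: -0.5 penalty for repeated entities (prevents gaming)
    p_spam = -0.5 if len(trace_entities) != len(set(trace_entities)) else 0.0

    return r_corr + r_path + p_spam

# A correct answer whose trace covers 2 of 3 relevant nodes, no repeats:
score = score_trace("B", "B", ["config", "firmware"],
                    ["config", "firmware", "crash"])
# score = 1.0 + 2/3 + 0.0
```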

    Results

    Phase Accuracy Training Time
    Base model 0% -
    After SFT 30% ~2 min
    After RSFT 75% ~3 min

    The critical finding: distribution matching matters.

    Training on easy examples (1-2 hops) hurt performance on hard eval (4-5 hops). Training on examples matching the eval distribution achieved 75%.

    Running It

    git clone https://github.com/softwarewrighter/multi-hop-reasoning
    cd multi-hop-reasoning
    
    # Setup (Apple Silicon)
    make setup-mlx
    
    # Full pipeline
    make train
    

    Results appear in ~5 minutes on an M-series Mac.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 12 .py files
    Estimated Size ~1.5 KLOC
    Framework MLX, Transformers
    Platform Apple Silicon (MLX native)

    Good for you if: You want to understand knowledge graph-guided training, experiment with rejection sampling fine-tuning, or see how small models can learn reasoning patterns.

    Complexity: Moderate. Clean codebase with Make targets for each step. Requires understanding of fine-tuning concepts.

    Key Takeaways

    1. Scaffolded training works. Use structured feedback during training, remove it at inference.

    2. Distribution matching matters. Train on examples that match your eval distribution.

    3. Small models can reason. 135M parameters is enough for 75% accuracy on 4-5 hop questions.

    4. MLX makes iteration fast. Full pipeline runs in 5 minutes on a MacBook.

    Resources


    Knowledge graphs as training wheels—helping small models learn to reason, then letting go.

    Part 1 of the Multi-Hop Reasoning series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 770 words · 4 min read

    Small Models (2/6): AI in Your Pocket

    AI on your phone. All day. No internet required.

    This is Part 2 of the Small Models, Big Brains series. Today we’re putting a language model in your pocket with Pocket Eliza++—a modern AI therapist that runs completely offline on Android.

    Resource Link
    Paper MobileLLM (ICML 2024)
    Code pocket-llm
    Runtime llama.cpp
    Video AI in Your Pocket
    Comments Discord

    Why Offline Matters

    Benefit Description
    Privacy Data never leaves your device
    Speed No network latency
    Cost No API fees
    Offline Works without internet
    Battery Efficient on-device inference

    Cloud AI is convenient, but sometimes you want a conversation that stays on your device.

    MobileLLM: Meta’s Edge Champion

    MobileLLM is Meta’s sub-500M parameter model optimized specifically for on-device inference.

    Architecture Optimizations

    Technique Benefit
    Deep-thin design More layers, fewer parameters per layer
    SwiGLU activation Better performance than ReLU
    Embedding sharing Saves 30% of parameters
    Grouped-query attention Faster inference

    The result: a 260MB quantized model (Q4_K_M) that runs smoothly on phones.

    Pocket Eliza++

    Eliza taking notes

    The original ELIZA (1966) used pattern matching to simulate a Rogerian therapist. Pocket Eliza++ uses the same therapeutic approach but with actual language understanding.

    Therapeutic Design

    The system prompt instructs the model to:

    • Ask one short question at a time
    • Never repeat questions
    • Vary question types (feelings, motivations, specifics)
    • Never give advice or explanations

    It’s a reflective listener, not a problem solver.
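A sketch of how those rules might be wired into a chat-style prompt; the exact wording and message format here are illustrative, not the app's actual prompt:

```python
# Illustrative system prompt encoding the four rules above, plus a
# helper that assembles a chat-completion style message list (the
# shape llama.cpp's chat API and most runtimes accept). Wording is
# hypothetical, not the app's actual prompt.

SYSTEM_PROMPT = (
    "You are a reflective listener in the style of a Rogerian therapist. "
    "Ask exactly one short question at a time. Never repeat a question. "
    "Vary question types: feelings, motivations, specifics. "
    "Never give advice or explanations."
)

def build_messages(history, user_text):
    """Assemble the message list for a chat-completion style API."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    messages.extend(history)
    messages.append({"role": "user", "content": user_text})
    return messages

msgs = build_messages([], "I've been feeling stuck at work lately.")
```

Keeping the rules in the system message, with the running conversation appended after it, is what lets a 350M model stay in character across turns.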

    Technical Stack

    ┌─────────────────────────────────┐
    │     Kotlin + Jetpack Compose    │  UI Layer
    ├─────────────────────────────────┤
    │            JNI Bridge           │
    ├─────────────────────────────────┤
    │           llama.cpp             │  Inference Engine
    ├─────────────────────────────────┤
    │    MobileLLM-350M (Q4_K_M)      │  Model (260MB)
    └─────────────────────────────────┘
    
    • Model: MobileLLM-350M quantized to Q4_K_M (260MB GGUF)
    • Runtime: llama.cpp compiled for Android via NDK
    • Interface: Kotlin + Jetpack Compose
    • Bridge: JNI bindings connect Kotlin to native llama.cpp

    Building the App

    # Clone the repository
    git clone https://github.com/softwarewrighter/pocket-llm
    cd pocket-llm/android-demo
    
    # Clone llama.cpp into native source
    git clone https://github.com/ggerganov/llama.cpp.git \
        app/src/main/cpp/llama.cpp
    
    # Download the model (260MB)
    mkdir -p app/src/main/assets
    curl -L -o app/src/main/assets/MobileLLM-376M-Q4_K_M.gguf \
        "https://huggingface.co/pjh64/MobileLLM-350M-GGUF/resolve/main/MobileLLM-376M-Q4_K_M.gguf"
    
    # Build and install
    ./gradlew assembleDebug
    adb install -r app/build/outputs/apk/debug/app-debug.apk
    

    Build Requirements

    Requirement Value
    Target SDK 35 (Android 15)
    Min SDK 28 (Android 9.0)
    ABI arm64-v8a
    NDK CMake for native build
    Kotlin 2.0.0

    Quick CLI Demo

    Don’t want to build the Android app? Test with Ollama:

    pip install -r requirements.txt
    ollama pull smollm:360m
    python3 eliza.py
    

    Performance

    On a mid-range Android phone (Snapdragon 7 series):

    • First token: ~500ms
    • Generation: ~10 tokens/second
    • Memory: ~400MB RAM
    • Battery: Minimal impact for short sessions

    Implementation Details

    Metric Value
    Languages Kotlin (UI), Python (CLI), C++ (JNI)
    Source Files 6 .kt, 4 .py, 2 .cpp
    Estimated Size ~1.3 KLOC
    Android Target SDK 28+ (Android 9.0)
    Build System Gradle + CMake (NDK)
    Key Dependency llama.cpp (vendored)

    Good for you if: You want to deploy LLMs on Android, learn JNI/NDK integration, or build privacy-focused mobile AI apps.

    Complexity: Moderate-High. Requires Android Studio, NDK setup, and understanding of JNI bridges. The llama.cpp integration is the tricky part; the Kotlin UI is straightforward Jetpack Compose.

    Key Takeaways

    1. Sub-500M models are phone-ready. MobileLLM proves useful AI fits in your pocket.

    2. llama.cpp is the universal runtime. Same engine runs on Mac, Linux, Windows, and Android.

    3. Privacy doesn’t require sacrifice. Offline AI can still be conversational and helpful.

    4. Quantization is essential. Q4_K_M brings 350M parameters down to 260MB with minimal quality loss.

    What’s Next

    Part 3 explores the Hierarchical Reasoning Model (HRM)—a 27M parameter model that beats o3-mini on abstract reasoning.

    Resources


    Part 2 of the Small Models, Big Brains series. View all parts | Next: Part 3 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 765 words · 4 min read

    Deepseek Papers (1/3): mHC - Training Stability at Any Depth

    Deepseek publishes papers. I implement them. This paper tackles a fundamental transformer problem: training stability in deep networks.

    This post covers my implementation of mHC (Manifold-Constrained Hyper-Connections)—running on both Apple Silicon and NVIDIA GPUs.

    Resource Link
    Paper arXiv:2512.24880
    Code mHC-poc
    ELI5 eli5-mHC.md
    ELI4 eli4-mHC.md
    Video 1 mHC Demo
    Video 2 mHC Explained
    Video 3 mHC Results
    Comments Discord

    The Problem: Deep Networks Explode

    Residual connections revolutionized deep learning. Skip connections let gradients flow through hundreds of layers. But there’s a catch.

    Standard residual connections:

    output = layer(input) + input
    

    This works, but the signal accumulates. With many layers, small amplifications compound into instability.

    Hyper-Connections (HC) tried to fix this by learning connection weights:

    output = α₁ × layer(input) + α₂ × input
    

    Better expressiveness, but learned weights can still cause explosion. At 48 layers, HC becomes unstable.

    The mHC Solution: Doubly-Stochastic Constraints

    mHC constrains the connection weights using Sinkhorn-Knopp iteration—a mathematical technique that ensures weights form a doubly-stochastic matrix.

    What does “doubly-stochastic” mean?

    • Each row sums to 1
    • Each column sums to 1

    This bounds the total signal flow. No matter how deep the network, amplification stays controlled.

    # Sinkhorn-Knopp iteration (simplified)
    def make_doubly_stochastic(weights, iterations=5):
        for _ in range(iterations):
            weights = weights / weights.sum(dim=0, keepdim=True)  # Column normalize
            weights = weights / weights.sum(dim=1, keepdim=True)  # Row normalize
        return weights
    
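The constraint is easy to sanity-check numerically. A standalone NumPy sketch (not the repo's code) runs the same iteration on a random positive matrix and confirms that rows and columns both sum to 1:

```python
# Numerically sanity-check the doubly-stochastic property: after
# enough Sinkhorn-Knopp iterations on a positive matrix, every row
# and every column sums to ~1, which is what bounds total signal
# flow at any depth. Standalone sketch, not the repo's code.
import numpy as np

def sinkhorn(w, iterations=100):
    for _ in range(iterations):
        w = w / w.sum(axis=0, keepdims=True)  # column normalize
        w = w / w.sum(axis=1, keepdims=True)  # row normalize
    return w

rng = np.random.default_rng(0)
w = sinkhorn(rng.uniform(0.1, 1.0, size=(4, 4)))

rows_ok = np.allclose(w.sum(axis=1), 1.0, atol=1e-6)
cols_ok = np.allclose(w.sum(axis=0), 1.0, atol=1e-6)
```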

    Results: Stability at Depth

    The mHC-poc repo stress-tests this with a depth sweep:

    Depth Baseline HC mHC
    12 layers Stable Stable Stable
    24 layers Struggling Stable Stable
    48 layers Oscillating Explodes Stable

    At 48 layers:

    • HC gain proxy: 10²⁷ (catastrophic amplification)
    • mHC gain proxy: 10⁻⁰·⁶ (bounded, healthy)

    HC’s final loss at 48 layers: 5.54 (never learns).
    mHC’s final loss at 48 layers: 0.0002 (perfect convergence).

    Cross-Platform Validation

    The implementation runs on both Apple Silicon (MLX) and NVIDIA (PyTorch/CUDA):

    Metric MLX (Apple) CUDA (NVIDIA)
    Gain Proxy (24L) -0.6 -0.602
    Gradient Stability Stable Stable
    NaN Events 0 0

    Identical results confirm the Sinkhorn-Knopp projection works correctly on both platforms.

    Running the mHC Demo

    git clone https://github.com/softwarewrighter/mHC-poc
    cd mHC-poc
    
    # Apple Silicon (MLX)
    uv venv && source .venv/bin/activate
    uv pip install -r mlx/requirements.txt
    bash scripts/run_depth_sweep.sh
    
    # NVIDIA (CUDA)
    cd cuda
    uv venv && source .venv/bin/activate
    uv pip install -r requirements.txt
    bash scripts/run_cuda_depth_sweep.sh
    

    Results go to runs/ with plots showing loss, gradient norms, and gain proxy across depths.

    Implementation Details

    Metric Value
    Primary Language Python
    Source Files 29 .py, 3 .sh, 10 .yaml
    Estimated Size ~2.5 KLOC
    Frameworks MLX, PyTorch
    Platforms Apple Silicon, NVIDIA CUDA
    Key Features Depth sweep, cross-platform validation, visualization

    Good for you if: You want to understand mHC’s stability benefits, compare MLX vs PyTorch implementations, or experiment with residual connection variants.

    Complexity: Moderate. Well-documented with ELI5 explanations in docs/. Requires understanding of residual connections and matrix constraints.

    Key Takeaways

    1. mHC solves deep network instability. Doubly-stochastic constraints bound signal amplification at any depth.

    2. Cross-platform matters. The repo runs on Apple Silicon and NVIDIA, validated to produce identical results.

    3. Deepseek publishes useful research. Their papers address real problems with practical solutions.

    What’s Next

    Part 2 covers Engram—Deepseek’s approach to reducing redundant computation through conditional memory.

    Resources


    Implementing papers is the best way to understand them. Clone the repo and run the demo yourself.

    Part 1 of the Deepseek Papers series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 708 words · 4 min read

    Small Models (1/6): 976 Parameters Beat Billions

    The best large language models score zero on hard mazes. A model with under 1,000 parameters scores 85 percent.

    This is Part 1 of the Small Models, Big Brains series, exploring how tiny models with clever architectures outperform massive ones on specific tasks.

    Why LLMs Fail at Mazes

    Large language models generate one token at a time. They cannot backtrack. One wrong move and the entire solution fails.

    Maze solving requires:

    • Exploring dead ends
    • Backtracking when stuck
    • Maintaining spatial awareness
    • Planning multiple steps ahead

    Autoregressive generation is fundamentally incompatible with these requirements.

    Meet TRM: The Tiny Recursive Model

    The Tiny Recursive Model uses under 1,000 parameters. Instead of being bigger, it thinks in loops.

    Input → Think → Act → Think → Act → ... → Output
    

    A simple two-layer network that iterates until the solution emerges.

    The Architecture

    TRM alternates between two phases:

    Phase Purpose
    Think Update internal latent state by processing input, current answer, and previous state
    Act Update the answer based on the refined latent state

    This process repeats for multiple cycles, progressively improving the output.

    TRMConfig {
        input_dim: 5,
        output_dim: 5,
        hidden_dim: 16,
        latent_dim: 16,
        l_layers: 2,      // Network depth
        h_cycles: 3,      // Outer think-act cycles
        l_cycles: 4,      // Inner think cycles
    }
    
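The think-act control flow, with deep supervision recording a loss at every act step, can be sketched with toy fixed updates standing in for TRM's learned networks (purely illustrative, not the repo's implementation):

```python
# Toy sketch of TRM's think-act loop with deep supervision. Real TRM
# uses small learned networks; here "think" and "act" are fixed toy
# updates so the control flow and per-step losses are visible.

def trm_forward(x, target, h_cycles=3, l_cycles=4):
    z = 0.0          # latent state
    y = 0.0          # current answer
    losses = []      # deep supervision: a loss at every act step
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            z = 0.5 * z + 0.5 * (x - y)   # "think": refine latent
        y = y + 0.5 * z                    # "act": update the answer
        losses.append((y - target) ** 2)   # supervise this step too
    return y, losses

y, losses = trm_forward(x=1.0, target=1.0)
# losses shrink cycle by cycle as the answer is refined
```

Even with these fixed updates, the per-cycle losses decrease monotonically, which is the property deep supervision trains for: make progress at every step, not just at the end.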

    The Secret: Deep Supervision

    The key insight isn’t just recursion—it’s supervising every step, not just the final answer.

    Traditional training:

    Input → [black box] → Final Output → Loss
    

    TRM training:

    Input → Step 1 → Loss₁
          → Step 2 → Loss₂
          → Step 3 → Loss₃
          → ...
          → Final  → Loss_n
    

    Every iteration gets feedback. The model learns to make progress at each step.

    Results

    Model Maze Accuracy
    GPT-4 ~0% on hard mazes
    Claude ~0% on hard mazes
    TRM (976 params) 85%

    Iteration beats scale.

    Running the Code

    The train-trm repo provides a complete Rust implementation:

    # Clone and build
    git clone https://github.com/softwarewrighter/train-trm
    cd train-trm
    ./scripts/build.sh --release
    
    # Train a model
    ./scripts/train.sh --epochs 1000 --lr 0.01
    
    # Evaluate
    ./scripts/eval.sh
    
    # Or launch the web UI
    cargo install --locked trunk
    ./scripts/web-serve.sh
    

    The web UI includes interactive maze visualization with solution paths and real-time training charts.

    Implementation Details

    Metric Value
    Primary Language Rust
    Source Files 21 .rs files
    Estimated Size ~2.5 KLOC
    Also Includes HTML (web UI), Shell scripts
    Build System Cargo + Trunk (WASM)
    Dependencies ndarray, serde, clap, wasm-bindgen

    Good for you if: You want to learn Rust ML from scratch, experiment with recursive architectures, or need a web-based training visualization.

    Complexity: Moderate. Clean Rust code with good documentation. The neural network is implemented from scratch (no PyTorch/TensorFlow), making it educational but requiring Rust familiarity.

    Key Takeaways

    1. Parameter count isn’t everything. Architecture and training strategy matter more for certain tasks.

    2. Recursion enables backtracking. By iterating, TRM can explore and refine solutions.

    3. Deep supervision accelerates learning. Feedback at every step, not just the end.

    4. Task-specific models excel. TRM isn’t a general-purpose LLM—it’s optimized for maze-like reasoning.

    What’s Next

    Part 2 explores MobileLLM and running AI completely offline on your Android phone.

    Resources


    Part 1 of the Small Models, Big Brains series. View all parts | Next: Part 2 →

    Watch the Video

    Unmute to hear narration.

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1018 words · 6 min read

    Welcome to Software Wrighter Lab

    Welcome to Software Wrighter Lab—a blog, YouTube channel, Discord server, and GitHub repos for exploring the intersection of AI coding agents, systems programming, and practical machine learning.

    I’m Mike Wright, a software engineer with over four decades of experience, currently focused on AI-assisted development with Rust and WebAssembly.

    Quick Links  
    YouTube @SoftwareWrighter
    GitHub softwarewrighter
    Discord SW Lab
    Comments Discord


    About Me

    I’ve been writing code professionally for over 35 years—an Emacs user since 1989, still going strong.

    My background spans mainframes to startups:

    • IBM Data Processing Division - MVS Dynamic Reconfiguration and Standalone Dump (SADUMP)
    • IBM T.J. Watson Research - Advisory Programmer on MVS Batch Pipes, Automatic Restart Manager, Java Record I/O, and IMS Data Sharing
    • Forte Software / Sun Microsystems - Senior Programmer on Forte 4GL/Conductor/Fusion, Open Enterprise Service Bus, and Glassfish
    • Startups - Individual contributor and management roles including LogiCoy (Open ESB), Likestream (Facebook Clojure App), Guidewire (Platform), Illumio (Network Security Web UI), and Signifyd (Gradle/Docker performance tuning)

    Areas I’ve worked in: mainframe O/S development, EAI/B2B middleware, platform engineering, build/release engineering, and embedded programming.

    Programming Languages

    Over the years, I’ve written production code in:

    Era Languages
    Mainframe APL, Assembler (S/370, S/390), IBM PL/S, PL/AS, PL/X, CMS/TSO Pipelines
    Systems C, C++
    Enterprise Java, Forte 4GL, Guidewire Gosu, Groovy
    Web/Modern JavaScript, TypeScript, Go, Clojure, ClojureScript
    Current Elisp, JavaScript, Kotlin, Python, Rust, WebAssembly

    Each language taught me something different about how to think about problems. APL taught me array thinking. Assembler taught me what the machine is actually doing. CMS/TSO Pipelines taught me dataflow composition (an area I plan to revisit in Throwback Thursday posts). Lisp (via Clojure) taught me functional composition. Rust is teaching me ownership and fearless concurrency.

    I’m a lifelong learner. When Rust emerged as a modern systems language, I dove in. When AI coding agents became capable enough to be genuine collaborators, I started exploring how they change the way we build software.

    This blog and the accompanying YouTube channel document that exploration.

    What This Blog Covers

    Software Wrighter Lab focuses on three main areas:

    1. AI Coding Agents

    How do tools like Claude Code, Cursor, and other AI assistants actually perform on real projects? I build the same applications with different agents to compare their strengths and weaknesses.

    • Vibe coding comparisons (Claude vs GLM, different models)
    • Practical workflows (parallel coding with git worktrees, hooks, custom scripts)
    • Tool development (guardian-cli, proact, ralphy)

    2. Machine Learning Research Implementation

    When interesting ML papers come out, I implement them to understand how they work. The goal isn’t to compete with research labs—it’s to learn by building.

    Recent implementations include:

    • Tiny Recursive Model (TRM) - Under 1,000 parameters solving mazes
    • Hierarchical Reasoning Model (HRM) - Planner-Doer architecture for abstract reasoning
    • MobileLLM - Running LLMs offline on Android
    • Deepseek papers (mHC, Engram) - Novel architectures for efficient inference
    • MIT’s Recursive Language Model - Implemented in Rust with WASM

    3. Rust, WebAssembly, and Practical Tools

    Rust is my language of choice for new projects. Combined with WebAssembly, it enables building tools that run anywhere—CLI, browser, or embedded.

    Topics include:

    • Rust/Yew/WASM web applications
    • Visualization (Three.js, d3.js, pure CSS approaches)
    • Video production tools (TTS, lip sync, explainer generation)
    • Developer utilities (installation scripts, repo assistants, modularizers)

    Why “Software Wrighter”?

    A “wright” is a craftsperson—someone who builds things. A wheelwright builds wheels. A playwright builds plays.

    A Software Wrighter builds software, with attention to craft.

    The name reflects my belief that good software comes from treating programming as a craft: learning continuously, choosing tools deliberately, and building things that work well and last.

    What to Expect

    Posts on this blog will typically include:

    • Links to papers, repos, and videos (above the fold)
    • Implementation details (language, LOC, complexity assessment)
    • Working code you can clone and run
    • Honest assessments of what works and what doesn’t

    I’m not trying to sell you anything. This is a lab notebook—a record of experiments, some successful, some not.

    Current Projects

    As of February 2026, I’m actively working on:

    Project Description Status
    Small Models, Big Brains 6-part series on efficient LLMs Publishing
    Deepseek papers mHC and Engram implementations In progress
    Explainer pipeline AI-generated video production Ongoing
    RLM implementations Recursive Language Models in Rust Complete

    Technology Stack

    Most of my current work uses:

    Layer Technology
    Systems Rust
    Web Yew, WASM, TypeScript
    ML Python, PyTorch, HuggingFace
    AI Agents Claude Code, Cursor
    Video OBS, FFmpeg, TTS tools

    Get Involved

    If any of this resonates with you, join the SW Lab Discord or subscribe on YouTube @SoftwareWrighter. I’m always interested in discussing these topics with other engineers exploring similar territory.

    What’s Next

    The first content series, Small Models, Big Brains, starts tomorrow. It’s a 6-part deep dive into small language models that outperform much larger ones on specific tasks:

    1. TRM: 976 parameters beating GPT-4 on mazes
    2. MobileLLM: AI running offline on your phone
    3. HRM: 27M parameters beating o3-mini on abstract reasoning
    4. BDH: A language model with visible, interpretable activations
    5. Billion-parameter models: The efficiency sweet spot
    6. The 2-3B efficient frontier: Phi-2, Gemma, SmolLM

    Each post maps to a YouTube video, a GitHub repo, and working code you can run yourself.

    Thanks for reading. Let’s build something interesting.


    Mike Wright Software Wrighter LLC San Francisco Bay Area

    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.

  • 1131 words · 6 min read

    TBT (1/?): My First Program Was a Horse Race

    My first program was a horse race. Written in APL. On a mainframe. In 1972.

    This is the first Throwback Thursday post—a series where I revisit the technologies, languages, and ideas that shaped how I think about programming.

    Resource Link
    Code apl-horse-race
    Demo Live Demo
    GNU APL gnu.org/software/apl
    Video Greek Code, No Lowercase #TBT
    Comments Discord

    APL: A Programming Language

    APL was created by Kenneth Iverson at IBM in the 1960s. The name literally means “A Programming Language”—Iverson was a mathematician who designed it as a notation for describing algorithms before it became an actual programming language.

    What made APL special:

    Feature Description
    Array-oriented Operations work on entire arrays, not single values
    Symbolic notation Greek letters and mathematical symbols as operators
    Interactive REPL-style development decades before it was common
    Terse Complex operations in a few characters

    APL programs look like nothing else:

    POS←POS+?5⍴3
    

    This single line adds random values (1-3) to all five horse positions simultaneously. No loops. No iteration. The operation just happens across the entire array.

    The IBM 2741 Experience

    In 1972, APL\360 ran on IBM mainframes. You accessed it through terminals like the IBM 2741—essentially a modified Selectric typewriter with a special APL typeball.

    IBM Selectric APL typeball
    APL typeball for IBM Selectric

    The typeball had all the APL glyphs: ⍴ ⍳ ∇ ⎕ ← ⌈ ⌊ ⍵ ⍺ ∘ ⊃ ⊂ and dozens more. You physically typed these symbols. The keyboard layout was completely different from anything you’d seen before.

    When you made an error, there was no backspace in the modern sense. You’d overstrike characters or start the line over. Programs were stored in workspaces, saved to tape or disk.

    The terminal printed on paper. Every interaction left a physical record.

    The Horse Race Program

    Horse race simulations were popular APL demonstrations. They showed off several things:

    1. Random number generation (? roll operator)
    2. Array operations (updating all positions at once)
    3. Character graphics (crude but effective visualization)
    4. Interactive output (watching the race unfold)

    Here’s the verbose version from the repo:

    ∇ RACE;HORSES;POS;FINISH;ROUND;_
      HORSES←'LUCKY  ' 'THUNDER' 'SHADOW ' 'COMET  ' 'BLAZE  '
      POS←5⍴0
      FINISH←15
      ROUND←0
      ⎕←'══════════════════════════════════════════'
      ⎕←'           THE RACE IS ON!'
      ⎕←'══════════════════════════════════════════'
    LOOP:ROUND←ROUND+1
      ⎕←'--- ROUND ',(⍕ROUND),' ---'
      POS←POS+?5⍴3
      SHOWHORSES
      →DONE×⍳∨/POS≥FINISH
      →LOOP
    DONE:⎕←'WINNER: ',((⊃(POS=⌈/POS)/⍳5)⊃HORSES),'!'
    ∇
    

    Key APL Idioms

    Array creation:

    POS←5⍴0    ⍝ Create array of 5 zeros
    

    The ⍴ (rho) operator reshapes. 5⍴0 means “reshape 0 into a 5-element array.”

    Random numbers:

    ?5⍴3       ⍝ Roll 5 dice, each 1-3
    

    The ? operator is “roll”—like rolling dice. ?5⍴3 rolls five 3-sided dice.

    Finding the winner:

    (⊃(POS=⌈/POS)/⍳5)⊃HORSES
    

    This reads right-to-left:

    • ⌈/POS — maximum of all positions
    • POS=⌈/POS — boolean mask: which horses are at max?
    • /⍳5 — compress: keep only those indices
    • ⊃ — take the first one
    • ⊃HORSES — select that horse’s name

    One line. No loops. Pure array thinking.
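For readers who don't think in glyphs yet, here is a rough Python transcription of the same right-to-left pipeline, step by step (0-indexed where APL's ⍳5 is 1-indexed):

```python
# Step-by-step Python transcription of the APL winner idiom
# (⊃(POS=⌈/POS)/⍳5)⊃HORSES, with illustrative race data.

HORSES = ["LUCKY", "THUNDER", "SHADOW", "COMET", "BLAZE"]
POS = [12, 15, 9, 15, 11]

mx = max(POS)                                  # ⌈/POS — maximum position
mask = [p == mx for p in POS]                  # POS=⌈/POS — boolean mask
idxs = [i for i, m in enumerate(mask) if m]    # /⍳5 — compress to indices
first = idxs[0]                                # ⊃ — take the first
winner = HORSES[first]                         # ⊃HORSES — select the name
# winner == "THUNDER"
```

Five named intermediate steps in Python; one unbroken expression in APL. That compression is exactly the trade the language makes.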

    The Idiomatic Version

    APL programmers pride themselves on terseness. The idiomatic version does the same thing in fewer characters:

    HORSES←'LUCKY  ' 'THUNDER' 'SHADOW ' 'COMET  ' 'BLAZE  '
    
    ∇ SHOW;I
      I←1
    N:⎕←(I⊃HORSES),'│',((I⊃POS)⍴'░'),'▓'
      I←I+1
      →N×⍳I≤5
    ∇
    
    ∇ RACE;POS;_
      POS←5⍴0
      ⎕←'THE RACE IS ON!'
    L:_←⎕DL 0.3
      POS←POS+?5⍴3
      SHOW
      ⎕←''
      →L×⍳~∨/POS≥15
      ⎕←'WINNER: ',(⊃(POS=⌈/POS)/⍳5)⊃HORSES
    ∇
    

    The entire program fits on a single screen. This was the APL aesthetic: powerful ideas expressed concisely.

    Running It Today

    GNU APL implements ISO 13751 (Extended APL) and runs on modern systems:

    # macOS
    brew install gnu-apl
    
    # Arch Linux
    yay -S gnu-apl
    
    # Run the race
    git clone https://github.com/sw-comp-history/apl-horse-race
    cd apl-horse-race
    apl -f src/race.apl
    

    Sample output:

    ══════════════════════════════════════════
               THE RACE IS ON!
    ══════════════════════════════════════════
    
    --- ROUND 1 ---
    LUCKY   │▓▓▓◄
    THUNDER │▓▓◄
    SHADOW  │▓◄
    COMET   │▓▓▓◄
    BLAZE   │▓▓◄
    

    The horses advance randomly until one crosses the finish line.

    What APL Taught Me

    APL shaped how I think about programming in ways that persist today:

    1. Think in arrays, not loops.

    When I see a problem, I ask: can this be expressed as an operation on a whole collection? Languages like NumPy, R, and Julia carry this forward.

    2. Notation matters.

    Good notation can make complex ideas simple. Bad notation obscures them. APL’s symbols were controversial, but they made array operations visible in ways that verbose syntax doesn’t.

    3. The REPL is powerful.

    Interactive development—type an expression, see the result immediately—was central to APL decades before it became fashionable again with Jupyter notebooks and modern REPLs.

    4. Terseness has value.

    Not obfuscation for its own sake, but the ability to see an entire algorithm at once. When your program fits on one screen, you can reason about the whole thing.

    APL’s Legacy

    APL influenced many languages:

    Language Year APL Influence
    J 1990 Iverson’s ASCII-only redesign
    K/Q 1993 Powers financial systems at Kx
    A+ 1988 Morgan Stanley’s open-source APL
    BQN 2020 Modern APL with cleaner semantics
    NumPy 2006 Array operations in Python
    R 1993 Vector operations for statistics

    The ideas live on, even if the glyphs don’t.

    Implementation Details

    Metric Value
    Primary Language APL
    Source Files 2 .apl files
    Lines of Code ~50 lines total
    Runtime GNU APL
    Also Includes Documentation, plus PNG renderings of the source for readers whose fonts lack the APL glyphs

    Good for you if: You want to understand array programming origins, learn basic APL, or experience what programming felt like in the 1970s.

    Complexity: Low. The program is intentionally simple—a teaching example, not production code. The repo includes extensive documentation explaining every line.

    Why Throwback Thursday?

    Programming didn’t start with Python and JavaScript. Every abstraction we use today was invented by someone solving a real problem.

    TBT posts will explore:

    • Languages that shaped my thinking (APL, Lisp, Forth)
    • Technologies that were ahead of their time (CMS/TSO Pipelines, dataflow)
    • Ideas worth revisiting with modern tools

    Understanding where we came from helps us see where we’re going.

    Part 1 of the Throwback Thursday series. View all parts | Next: Part 2 →


    Comments or questions? SW Lab Discord or YouTube @SoftwareWrighter.
