Eliminate Dolt write amplification by separating transient single-flight locks from durable audit history.
Status: proposal · Tracking: beads gc-klq · Author: gastown.mayor · Date: 2026-05-13
Related upstream: #1978 · #1510 · #1248 · #1709 · #1850 (PG backend) · #1977
- TL;DR
- Evidence — what's happening today
- Root cause: 5 Whys
- Current architecture
- Proposed architecture
- Sequence diagrams — current vs proposed
- Crash recovery state machine
- Order body contract: `didWork` signal
- Phased rollout
- Risk matrix
- Effort & impact summary
- Downstream consumers & feed compatibility
- Alternative without changing storage layer
- Reframe — factory vs human-scale writes
- Convergent upstream work
The order dispatcher conflates three lifetimes into one Dolt bead written up-front before the order body runs:
- 🟡 transient: single-flight lock, lives seconds to minutes
- 🟡 short: `lastRun` timestamp, lives one cooldown interval
- 🟢 durable: audit history, lives forever
Result: every cooldown fire produces a 4-commit bead lifecycle in Dolt, even when the order body did nothing. At the configured cadence that is ~16 lifecycles/min × 4 Dolt commits/lifecycle = ~64 unnecessary commits/min on a 3-rig idle city (a 20-minute events.jsonl sample shows ~6.4 fires/min actually observed).
❌ The fix is not config tuning. Interval bumps reduce frequency; they cannot drive the per-fire cost to zero. Cost is structural.
✅ The fix is decoupling. Use an in-memory map + WAL for the transient lock. Use the existing `rememberLastRun` cache for the `lastRun` timestamp. Create the audit bead lazily on completion, only when the order body did real work. No-op gate-sweep → zero Dolt writes.
| Metric | Value | Notes |
|---|---|---|
| Order fires / 5min | ~18 | Most produce no real work |
| Dolt commits / order fire | 4 | Even for no-op orders |
| Disk growth (pre-fix) | 5.3 MB/min | ~7.6 GB/day untrimmed |
| hq journal (pre-GC) | 294 MB | 86% of hq's 342 MB |
Per-rig cadence in .gc/system/packs/maintenance/orders/ (3 active rigs: focuster, intervaltree, gastown):
| Order | Interval | Rigs | Bead fires/min |
|---|---|---|---|
| `gate-sweep` | 30s | 3 + city | 8 |
| `order-tracking-sweep` | 1m | 3 + city | 4 |
| `dolt-health`, `beads-health` | (varies) | 1 each | ~2 |
| `cross-rig-deps`, `orphan-sweep`, `spawn-storm-detect` | 5m | 3 each | ~1.8 |
| `mol-dog-jsonl` | 15m | 1 | 0.07 |
| `mol-dog-reaper` | 30m | 1 | 0.03 |
| Total tracking-bead lifecycles / min | | | ~16 |
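The total in the table can be reproduced from the intervals and scope counts. The following is a minimal sketch of that arithmetic; the `orderCadence` type and its field names are hypothetical, and the ~2/min figure for `dolt-health` and `beads-health` is taken directly from the table because their intervals vary.

```go
package main

import "fmt"

// orderCadence is a hypothetical row type mirroring the cadence table.
type orderCadence struct {
	name        string
	intervalSec float64 // cooldown interval in seconds
	instances   float64 // number of (order, scope) pairs firing on this interval
}

// firesPerMin returns how many tracking-bead lifecycles the row produces per minute.
func (o orderCadence) firesPerMin() float64 {
	return o.instances * 60 / o.intervalSec
}

// totalFiresPerMin sums the per-row rates and adds a fixed rate for rows
// whose interval is not known (only their combined rate is).
func totalFiresPerMin(rows []orderCadence, fixed float64) float64 {
	total := fixed
	for _, r := range rows {
		total += r.firesPerMin()
	}
	return total
}

func main() {
	rows := []orderCadence{
		{"gate-sweep", 30, 4},           // 3 rigs + city
		{"order-tracking-sweep", 60, 4}, // 3 rigs + city
		{"cross-rig-deps, orphan-sweep, spawn-storm-detect", 300, 9}, // 3 orders x 3 rigs
		{"mol-dog-jsonl", 900, 1},
		{"mol-dog-reaper", 1800, 1},
	}
	// dolt-health and beads-health: the table gives ~2/min combined.
	total := totalFiresPerMin(rows, 2)
	fmt.Printf("lifecycles/min: %.1f\n", total)                         // prints "lifecycles/min: 15.9"
	fmt.Printf("Dolt commits/min at 4 per lifecycle: %.0f\n", total*4) // prints "... 64"
}
```

The result lines up with the ~16 lifecycles/min and ~64 commits/min quoted elsewhere in this proposal.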
Observed in events.jsonl over a 20-minute window (128 order fires; ~6.4/min before any real work events):
20 dolt-health · 20 beads-health · 19 gate-sweep (city)
18 gate-sweep:rig:intervaltree · 17 gate-sweep:rig:focuster
12 order-tracking-sweep:rig:intervaltree · 12 order-tracking-sweep (city)
10 order-tracking-sweep:rig:focuster · ...
Why 1 — Dolt is being hammered. Each cooldown order writes a full bead lifecycle (create → 1–2 updates → close) per interval. 16 lifecycles/min × 4 Dolt commits = 64 commits/min just from sweeps.
Why 2 — Every order fire writes a bead, even no-op ones.
order_dispatch.go:322 creates the bead synchronously, before the order body runs. The dispatcher commits "this order ran" before it can know whether anything actually happened.
Why 3 — The bead must exist up-front because it is the single-flight lock.
Comment at line 320: "Create tracking bead synchronously BEFORE dispatch goroutine. This prevents the cooldown trigger from re-firing on the next tick." hasOpenWorkInStoresStrict (line 311) queries open beads to decide whether to skip. No bead → no lock → cooldown re-fires on the next 1-second tick.
Why 4 — The lock is a persisted bead, not in-memory state, because it must survive controller restarts. In-memory locks die with the process. A crash mid-dispatch must not cause double-fire on next boot.
Why 5 (root) — Crash recovery requires a Dolt commit per fire because the design conflates 3 concerns in one bead. Single-flight lock (transient), lastRun timestamp (short-lived), and audit history (durable) all bundled into one bead written up-front. Every transient lock pays the full audit cost — even when no audit is warranted.
🔴 Root cause statement — Architectural conflation of transient lifecycle (single-flight + lastRun) with durable history (audit) in a single bead, both written up-front before the order body can signal whether audit is needed.
The tracking bead carries three responsibilities. They have wildly different durability requirements but share one storage write path.
flowchart LR
classDef hot fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2
classDef warm fill:#78350f,stroke:#fcd34d,color:#fef3c7
classDef cold fill:#064e3b,stroke:#6ee7b7,color:#d1fae5
A[Cooldown tick] --> B{Order due?}
B -->|no| Z[skip]
B -->|yes| C{open tracking bead?}
C -->|yes| Z
C -->|no| D[CREATE tracking bead in Dolt]
D --> E[rememberLastRun in-mem]
E --> F[goroutine: dispatchOne]
F --> G[Order body runs]
G --> H[UPDATE bead with outcome label]
H --> I[CLOSE bead]
class D,H,I hot
class E warm
class C hot
Every red box is a Dolt round trip: the open-bead query plus the create, update, and close writes. With auto-commit flushes that comes to ~4 commits per order fire, regardless of whether the order body did any real work.
graph TB
classDef transient fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2
classDef short fill:#78350f,stroke:#fcd34d,color:#fef3c7
classDef durable fill:#064e3b,stroke:#6ee7b7,color:#d1fae5
TB[("Tracking Bead<br/>one Dolt row")]:::durable
TB --> SF[Single-flight lock<br/>lifetime: seconds-minutes]:::transient
TB --> LR[lastRun timestamp<br/>lifetime: 1 cooldown interval]:::short
TB --> AH[Audit history<br/>lifetime: forever]:::durable
Bundling them means every transient lock pays full audit cost — even when no audit is warranted.
Split the concerns into the storage that fits their lifetime. Most order fires produce zero Dolt writes.
flowchart LR
classDef hot fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2
classDef warm fill:#78350f,stroke:#fcd34d,color:#fef3c7
classDef cold fill:#064e3b,stroke:#6ee7b7,color:#d1fae5
A[Cooldown tick] --> B{Order due?}
B -->|no| Z[skip]
B -->|yes| C{"inFlight name<br/>present and fresh?"}
C -->|yes| Z
C -->|no| D[set inFlight in-mem map]
D --> E[appendWAL <1KB local fs]
E --> F[rememberLastRun in-mem]
F --> G[goroutine: dispatchOne]
G --> H[Order body runs]
H --> I{didWork or error?}
I -->|no| J[delete inFlight, truncate WAL entry]
I -->|yes| K[CREATE+CLOSE audit bead<br/>one transaction]
K --> J
class K hot
class E warm
class D,F,J cold
Fast path (no-op gate-sweep, the common case) does zero Dolt writes. Slow path (order did real work, or failed) writes one audit bead in a single transaction — down from 4.
graph TB
classDef transient fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2
classDef short fill:#78350f,stroke:#fcd34d,color:#fef3c7
classDef durable fill:#064e3b,stroke:#6ee7b7,color:#d1fae5
SF[Single-flight lock]:::transient --> SFS[("in-mem map<br/>+ local WAL")]
LR[lastRun timestamp]:::short --> LRS[("in-mem<br/>rememberLastRun cache")]
AH[Audit history]:::durable --> AHS[("Dolt bead<br/>lazy on completion")]
SFS -.crash recovery.-> AHS
Side-by-side trace of a single tick where the order body does no real work (the common case).
sequenceDiagram
participant T as Tick loop
participant D as Dispatcher
participant S as Beads/Dolt
participant O as Order body (gate-sweep.sh)
T->>D: cooldown elapsed
D->>S: query open tracking beads
S-->>D: none
D->>S: CREATE tracking bead
Note over S: 🔥 Dolt commit #1
D->>D: rememberLastRun (in-mem)
D->>O: spawn goroutine, run script
O->>O: bd gate check --type=timer
Note over O: nothing to do
O-->>D: exit 0
D->>S: UPDATE bead (outcome label)
Note over S: 🔥 Dolt commit #2
D->>S: CLOSE bead
Note over S: 🔥 Dolt commit #3
Note right of S: + auto-commit flushes<br/>= ~4 commits total
sequenceDiagram
participant T as Tick loop
participant D as Dispatcher
participant M as inFlight map
participant W as WAL (local fs)
participant S as Beads/Dolt
participant O as Order body (gate-sweep.sh)
T->>D: cooldown elapsed
D->>M: check inFlight[scoped]
M-->>D: not present
D->>M: insert lease
D->>W: append entry (sync, <1KB)
D->>D: rememberLastRun (in-mem)
D->>O: spawn goroutine, run script
O->>O: bd gate check --type=timer
Note over O: nothing to do
O-->>D: exit 100 (no-op)
D->>M: delete inFlight[scoped]
D->>W: truncate entry
Note right of S: ✅ zero Dolt commits
One audit bead, one transaction (create + close in same SQL):
sequenceDiagram
participant D as Dispatcher
participant O as Order body
participant S as Beads/Dolt
D->>O: run
O-->>D: exit 0, didWork
D->>S: single txn — INSERT audit bead, set labels, CLOSE, COMMIT
Note over S: 🔥 1 Dolt commit vs 4 today
D->>D: release inFlight + WAL
The WAL plus boot scan guarantees no double-fire on controller restart.
stateDiagram-v2
[*] --> Boot
Boot: Controller starting
Boot --> ReadWAL
ReadWAL: Read .gc/runtime/order-locks.wal
ReadWAL --> SweepOrphans
SweepOrphans: sweepOrphanedOrderTracking()<br/>closes leftover audit beads
SweepOrphans --> SeedLastRun
SeedLastRun: rememberLastRun() for each<br/>WAL entry (prevents instant re-fire)
SeedLastRun --> TruncateWAL
TruncateWAL --> Running
Running --> TickFire: order due
TickFire --> InFlight: lock acquired, WAL written
InFlight --> Completed: body returned
Completed --> Running
InFlight --> Crash: process dies
Crash --> Boot
Per-crash-point behavior:
| Crash point | Result | Severity |
|---|---|---|
| Between WAL append and goroutine spawn | Boot seeds lastRun; cooldown waits one interval; re-fires normally. | 🟢 benign |
| Inside order body | Same as above. Side effects may have happened. Investigator sees WAL entry but no matching audit bead. | 🟡 acceptable |
| Between body completion and audit write | WAL still has entry. Boot's `sweepOrphanedOrderTracking` already handles the inverse. | 🟢 benign |
| Between audit write and WAL truncate | Audit bead exists; WAL replay seeds lastRun. Acceptable (one extra cooldown wait). | 🟢 benign |
| Disk full (WAL won't write) | Order refuses to fire. `gc doctor` check surfaces the issue. | 🟡 surfaced |
Order bodies need a way to tell the dispatcher "I did nothing audit-worthy." Three options considered:
| Option | Mechanism | Pros | Cons |
|---|---|---|---|
| Exit code 100 ✅ recommended | `exit 0` = did work; `exit 100` = no-op success; other = failure | Idiomatic "success-with-info"; backwards compatible (legacy `exit 0` just always writes audit) | Scripts must learn the convention; `set -e` foot-guns |
| Stdout sentinel | Last line of stdout: `GC_DIDWORK: false` | No exit-code semantics conflict | Parsing fragility; mixed with normal output |
| Sidecar file | Touch `${GC_ORDER_TMPDIR}/didwork` | Explicit; orthogonal to stdout/exit | Extra fs op per order; tempdir wiring |
#!/usr/bin/env bash
set -euo pipefail
# Track whether bd gate check actually fired anything.
DID=0
bd gate check --type=timer --escalate && DID=1 || true
bd gate check --type=gh --escalate && DID=1 || true
# Exit 100 = no-op success → dispatcher skips audit bead.
[ "$DID" -eq 0 ] && exit 100
exit 0
⚠️ Validation needed: confirm `bd gate check` exit semantics distinguish "nothing closed" from "error". May need a `--did-work` exit-code flag added to `bd`.
gantt
title Rollout phases
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 1
Dispatcher refactor (always didWork=true) :p1, 2026-05-14, 2d
Tests + soak :after p1, 1d
section Phase 2
Add exit-100 convention :p2, after p1, 1d
Convert gate-sweep, order-tracking-sweep :after p2, 1d
Convert dolt-health, beads-health :p3, after p2, 1d
section Phase 3
Convert remaining maintenance orders :p4, after p3, 2d
Doctor check for WAL health :after p4, 1d
Phase 1:
- Add `inFlight` map + WAL writer/reader.
- Boot-time recovery: read WAL → seed `rememberLastRun` → reuse existing `sweepOrphanedOrderTracking`.
- Always treat order body as `didWork=true`: audit beads still written as before. Validates lock correctness without behavior change.

Phase 2:
- Define `exit 100` contract; document in pack guide.
- Convert `gate-sweep`, `order-tracking-sweep`, `dolt-health`, `beads-health` to opt in.
- Expected: ~80–90% of Dolt writes from sweep orders eliminated.

Phase 3:
- Convert remaining maintenance orders (`cross-rig-deps`, `orphan-sweep`, `spawn-storm-detect`, …).
- `gc doctor` check for WAL health / orphaned entries.
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Consumers of audit beads lose "the dispatcher ran" signal for no-ops. Confirmed consumers: `/v0/city/{name}/orders/feed` API, `/v0/city/{name}/order/history/{bead_id}` API, `gc order history` CLI. Dashboard panels not yet built but generated TS types exist. | 🔴 high | 🟡 med | Pair with API field `lastFiredAt` sourced from in-mem `rememberLastRun` cache. Feed renders dimmed entries for no-op orders, real audit beads for orders that did work. Details in §12. |
| WAL corruption on disk-full | 🟢 low | 🟡 med | gc doctor check; refuse to fire orders until disk recovered. |
| WAL race between rig dispatchers | 🟡 med | 🟢 low | Per-rig WAL files. No cross-rig sharing of lock state. |
| Migration regressions in cooldown semantics | 🟡 med | 🔴 high | Phase 1 keeps audit writes unchanged — only the lock storage changes. Validates correctness before changing emit behavior. |
| Order body buggy: returns 100 but actually did side effects | 🟡 med | 🟡 med | Side effects are visible elsewhere (closed gates, sent mail). Add doctor lint for orders that exec mutating commands without auditing. |
| Metric | Value |
|---|---|
| Effort | ~700 LOC, 2–3 days focused work, 3 PRs |
| Dolt commits eliminated | ~85% (from ~64/min to ~10/min on idle city) |
| Disk growth at idle | ~80% reduction (5.3 MB/min → projected <1 MB/min) |
| Wake amplification (dispatchers) | ~75% reduction (fewer bead.* events → fewer dispatcher wakes) |
| Upstream | Layer it fixes | This proposal |
|---|---|---|
| #1978 (bd daemon / batching) | Transport (per-write shell-out cost) | Composable — they reduce per-write cost, we reduce write count |
| #1510 (CachingStore.SetMetadata no-op skip) | Cache layer | Composable — even fewer writes reach Dolt |
| #1248 Shift B (connection pool) | Connection efficiency | Composable — pools amortize the writes we still emit |
| #1709 Orchestration v3 | Run-as-primitive | Does NOT address this — explicitly preserves orders as the trigger layer. May make this more urgent (Run reuses tracking bead) |
| #1850 + #1792/#1796 (PG backend) | Storage engine | Orthogonal — even on Postgres, writing 64 commits/min when 0-1 are needed remains wasteful |
- `cmd/gc/order_dispatch.go`: dispatcher refactor, ~150 LOC + ~300 LOC tests
- `cmd/gc/order_wal.go` (new): WAL writer/reader, ~80 LOC + ~120 LOC tests
- `internal/doctor/checks_wal.go` (new): WAL health check, ~40 LOC
- `examples/gastown/packs/maintenance/orders/assets/scripts/*.sh`: opt-in exit-100, ~30 LOC across files
- `engdocs/architecture/orders.md`: document `didWork` contract
Audit beads are read by these consumers today. Eliminating no-op audit beads creates visible gaps unless we pair the change with an API surface for in-mem lastRun.
| Consumer | Endpoint / file | What it reads | Break risk |
|---|---|---|---|
| Orders feed API | `internal/api/orders_feed.go:309` · `GET /v0/city/{name}/orders/feed` | All beads with `order-tracking` label, sorted by `createdAt` | 🔴 high: no-op runs vanish from feed |
| Order history detail API | `internal/api/huma_handlers_orders.go:299` · `GET /v0/city/{name}/order/history/{bead_id}` | Specific bead by ID | 🟡 med: only existing beads queryable (already true) |
| CLI | `cmd/gc/cmd_order.go:811` · `gc order history <name>` | `order-run:<scopedName>` label query, includes closed | 🔴 high: operators lose "did this order tick?" answer |
| Wisp GC | `cmd/gc/wisp_gc.go:62` | Closed tracking beads with TTL | 🟢 improvement: fewer beads to clean up |
| Dashboard SPA | `cmd/gc/dashboard/web/src/generated/types.gen.ts` | Generated types exist (`OrderHistoryEntry`, `OrdersFeedBody`); no panel imports them yet | 🟡 latent: panel under construction needs redesign anyway |
Add a sibling field to the orders feed response that surfaces the in-mem rememberLastRun cache:
flowchart LR
classDef hot fill:#7f1d1d,stroke:#fca5a5,color:#fee2e2
classDef cold fill:#064e3b,stroke:#6ee7b7,color:#d1fae5
A[orders feed handler] --> B[list audit beads<br/>order-tracking label]:::hot
A --> C[read rememberLastRun<br/>in-mem cache]:::cold
B --> D[merge by scopedName]
C --> D
D --> E[response items<br/>auditBeadID? lastFiredAt? status?]
Response shape (illustrative):
{
"scopedName": "gate-sweep:rig:focuster",
"lastFiredAt": "2026-05-13T17:21:30Z",
"auditBeadID": null,
"status": "no-op",
"title": "gate-sweep (rig:focuster)"
}
Dashboard renders no-op runs as dimmed entries with timestamp only; runs with audit beads render fully.
# Current: lists beads only
gc order history gate-sweep
# After: lists beads + synthesizes no-op entries from in-mem cache
gc order history gate-sweep # default: includes synthetic
gc order history gate-sweep --audited-only # only beads, no synthetic
- Ephemeral tracking beads with short TTL (5-10min) + aggressive compaction. Pros: zero API/CLI changes. Cons: still writes 4 commits per fire, so it only saves storage long-tail. Rejected: doesn't address root cause.
- Rebuild feed from `events.jsonl`. Dispatcher already emits `events.OrderFired` at `order_dispatch.go:494`. Feed could be rebuilt from event stream. Pros: best architecture, every fire visible. Cons: bigger refactor; `events.jsonl` rotation/archival semantics need tightening. Defer to follow-up.
If the full decoupling is too invasive, there's a less-aggressive bundle that keeps tracking beads in Dolt but attacks per-write cost and per-cycle commit count:
| Lever | Effort | Cycle commits | Wall time | Disk |
|---|---|---|---|---|
| Today | — | 4 | ~320ms | grows 5MB/min |
| 1: collapse update+close into one txn | ~30 LOC | 2 | ~160ms | same |
| 2: in-process dispatcher SQL conn | ~150 LOC | 2 | ~20ms | same |
| 3: open-bead cache | ~40 LOC | 2 | ~10ms | same |
| 4: periodic `dolt gc` order | ~10 LOC | 2 | ~10ms | capped |
| 5: CachingStore no-op skip (#1510 upstream) | upstream | 2 | ~10ms | capped + quieter |
Net delta with bundle: 50% fewer commits, ~30x faster per cycle, capped disk, no API/CLI/dashboard break. Tracking beads remain in Dolt; consumers untouched.
What this DOESN'T solve: the bundle still writes 2 Dolt commits per fire even for no-op orders. Going to zero requires the structural fix above. The bundle is a low-risk landing pad; the proposal is the durable fix.
Gas City is a software factory. Dolt is designed for human-scale commits (git-like semantics: narrated, intentional, versioned, time-travel queryable). The mismatch is real:
Human-scale (Dolt's sweet spot)
- Real bugs filed by witnesses → audit forever
- Polecat completes work → capability ledger entry
- Decisions recorded with rationale → time-travel relevant
- Rate: hundreds/day per city
Factory-scale (Dolt's anti-pattern)
- 30s gate-sweep ticks → tracking bead
- 60s order-tracking-sweep → tracking bead
- Per-session heartbeats, lastSeen, watchdogs
- bd CLI session metadata writes from idle agents
- Rate: thousands/hour per city
Gas City uses one storage primitive (Dolt beads) for both scales. The "capability ledger" framing in the Mayor prompt explicitly conflates them — "Every completion is recorded. Every handoff is logged."
The ledger metaphor is right for real work. It's wrong for tick beads. A factory doesn't put every piston stroke in the corporate ledger.
Beyond the bd shell-out issue, Dolt has fundamental per-commit overhead:
- Content-addressable Prolly trees: every commit rewrites O(log N) chunks up to the root. PG dirties one heap page in shared buffers.
- Git-like commit objects: every commit creates a real commit with parent/timestamp/message + new root tree hash. PG `COMMIT` = flush WAL + mark visible.
- No group commit: Dolt fsyncs per commit; PG batches N concurrent commits into one fsync.
- No in-place updates: every Dolt write creates new chunks; old ones orphaned until manual `dolt gc`. PG HOT updates rewrite the same page.
- Single-writer engine: writes serialize through one engine. PG has MVCC with row-level locking.
Empirical: Dolt commit ~10-80ms (incl. shell-out), Postgres ~0.5-2ms. ~10-40x gap per commit.
Postgres support is actively being plumbed (PRs #1792, #1796, #1850 by quad341). When it lands, the dispatcher waste becomes tolerable on PG — but still wasteful. The architectural fix here is engine-independent: same code emits fewer commits regardless of backend.
v3 explicitly preserves orders as the trigger layer ("orders remain as the trigger layer that fires Runs"). The write-amp problem is orthogonal to v3. Worse: if v3 reuses tracking beads as Run roots, the problem deepens. The decoupling should land first so v3 builds on a clean primitive.
- Architectural fix vs storage swap — is the Dolt → PG migration (#1850) intended to make this kind of waste tolerable, or is the dispatcher waste considered a separate problem to fix regardless?
- #1709 + this proposal — does Orchestration v3 want to absorb the decoupling, or are they independent landings? (cc @csells)
- Factory vs ledger framing — should Gas City formalize two storage tiers (transient + durable) at the bead-library level, or keep it per-subsystem?
- Multi-tier Run primitive in v3 — should Run be one entity that can live in either tier, or two distinct types?
- Risk appetite — Phase 1 dispatcher refactor only (storage-stays bundle) vs full proposal — which is the right first move?
Track work via beads gc-klq. Original HTML version with interactive Mermaid: engdocs/proposals/order-dispatch-decouple.html in gastownhall/gascity fork.
A maintainer independently surfaced the same problem from the storage-classification angle:
"I recently identified that we are categorizing all beads (task and orchestration) as 'permanent' that adds the overhead of full history tracking and the tax that comes with it. Amazing Dolt feature, but we don't need that feature for most of our beads usage. Gas Town, on the other hand, only did that for task beads and pushed the orchestration ones to the ephemeral wisp tables. I'm currently working on the design and migration right now ... My city was generating around 100k permanent beads a day and ground to a complete halt under load."
| Layer | Their fix (storage tier) | This proposal (dispatcher) |
|---|---|---|
| Per-write cost | Lower — ephemeral wisp tables skip Dolt history tax | Same |
| Number of writes per no-op order fire | 4 (unchanged) | 0 |
| Net effect | Every fire becomes cheap | Most fires don't write at all |
Multiplicative. Their refactor + this dispatcher fix:
- No-op order → 0 writes (dispatcher skipped the lifecycle)
- Real-work order → 1 write to cheap ephemeral tier (audit bead created lazily)
If the ephemeral wisp tables remain queryable for gc order history and /v0/orders/feed, the API-break risk in §10 collapses. The lastFiredAt mitigation in §12 may be unnecessary. The dispatcher refactor becomes much lower-cost to land.
Worth aligning the two migrations so they ship as a coherent pair rather than re-litigating consumer compatibility twice.