# Retrospective: Agent Failure Patterns in wit-what

**Date:** 2026-04-15
**Scope:** Project-wide retrospective on the recurring pattern of agents introducing new bugs while fixing old ones.
**Inputs:**
- Phase 1 — full survey of `code_tips/`, `.planning/DATA-FLOW.md`, `.planning/STATE.md`, `.planning/debug/`, recent phase REVIEW/VERIFICATION docs, CLAUDE.md, BUGSWEEPER GUIDE.
- Phase 2 — `/trace` over 17 sessions in `~/.claude/projects/C--Users-decid-Documents-projects-wit-what/`, focused on cascade-bug language.
- Phase 3 — git log analysis across last ~300 commits.

---

## 1. Headline finding

**The dominant root cause of agent-introduced cascade bugs is parallel-code-path drift.** A fix updates one site (the obvious one) but misses one or more secondary sites that perform the same logical operation. The bug returns through the unpatched path. This pattern accounts for the majority of "you fixed it then broke it" complaints.

Three structural amplifiers:
1. **Modal state machine** is the single most fragile area — 71% of recent gap-closure fixes touched modal-related code; two debug docs were marked resolved, then re-opened.
2. **Sync-merge logic** is a latent-defect reservoir — local-only fields (assignments, archive state, notes metadata) are silently destroyed by sync passes that update one merge site but not another.
3. **Large refactor commits** (300+ lines) tend to spawn 3–5 follow-up "gap closure" commits within days, though the git analysis in section 4 confirms only one such chain so far. Smaller focused commits (≤20 lines) are stable.

**Process amplifier:** Plans (PLAN.md) are written reactively as checklists of UAT-discovered gaps, not proactively as forward designs that enumerate all parallel sites. In phase 19.1 the plan:fix ratio is roughly 1:3 (8 plans to 26 fix commits).

---

## 2. Documented surface area (recurring bug classes)

Catalogued from existing project docs. Counts indicate distinct documented occurrences.

### A. Domain / data correctness (5 patterns)
- **A-1** `INSERT OR REPLACE` wipes columns owned by other write paths. (Source: `code_tips/SQLITE_TIPS.md`)
- **A-2** `DELETE`-then-re-`INSERT` destroys metadata (esp. `synced_at`). (Source: `code_tips/SQLITE_TIPS.md`, `STATE.md`)
- **A-3** Shipment status placed on `Recipient` instead of `Card`. (`DATA-FLOW.md` RULE-02)
- **A-4** `github_profile_url` invented as a GH column. (`DATA-FLOW.md` RULE-01, `CLAUDE.md`)
- **A-5** Field added without source documented. (`DATA-FLOW.md` RULE-06)

### B. UI / Slint (7 patterns)
- **B-1** `VerticalLayout` inside `Flickable` bottom-aligns. (`code_tips/SLINT_TIPS.md`)
- **B-2** `HorizontalLayout alignment:center` blocks child stretching.
- **B-3** `TouchArea` over `TextInput` blocks click-drag selection.
- **B-4** Hover-conditional element flicker (focus feedback loop).
- **B-5** Window background defaults to white (startup flash).
- **B-6** Font embedding via `slint_build` instead of bare imports.
- **B-7** `ListView` requires single `for` child.

### C. Async / threading (5 patterns)
- **C-1** Synchronous callback hides intermediate UI state from the event loop.
- **C-2** Closure captures `None` handle at wire time; never updates after init.
- **C-3** Modal close drops focus; global key handler dies.
- **C-4** Secondary sync paths skip `invoke_sync_cards_updated`; `all_cards_ref` permanently empty.
- **C-5** Windows `autocrlf` corrupts refinery migration checksums.
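
C-2 can be reproduced without Slint. The sketch below is illustrative only: `Handle`, `wire_stale`, and `wire_live` are hypothetical names, and the shared slot is assumed to be an `Rc<RefCell<Option<_>>>` rather than whatever the project actually uses. It contrasts copying the slot's value when the callback is wired with reading the slot when the callback fires.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a handle that only becomes available after init.
#[derive(Clone, Debug, PartialEq)]
struct Handle(u32);

/// Buggy wiring: copies the slot's current value (still `None`) into the closure.
fn wire_stale(slot: &Rc<RefCell<Option<Handle>>>) -> impl Fn() -> Option<Handle> {
    let snapshot = slot.borrow().clone(); // evaluated once, at wire time
    move || snapshot.clone()
}

/// Fixed wiring: the closure keeps the shared slot and reads it on every call.
fn wire_live(slot: &Rc<RefCell<Option<Handle>>>) -> impl Fn() -> Option<Handle> {
    let slot = Rc::clone(slot);
    move || slot.borrow().clone() // evaluated at call time
}
```

If both closures are wired while the slot is still `None` and the slot is filled afterwards, only the `wire_live` closure observes the handle.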

### D. Test / verification gaps (4 patterns)
- **D-1** `apply_optimistic_*` called but `apply_filters` never follows.
- **D-2** Optimistic update assumes a `Vec<Item>` model when struct only has a single string.
- **D-3** Composite key (`recipient_name + status_pill`) used to match cards, collides on duplicate orders.
- **D-4** Slint focus / Esc handler priority not testable statically.
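
The D-3 collision is easiest to see with two orders for the same recipient in the same status. This is a minimal sketch under assumptions: the `Card` struct and the `order_id` field are hypothetical stand-ins for whatever stable identifier the real schema provides.

```rust
/// Hypothetical card model; `order_id` stands in for a stable unique key.
#[derive(Debug, Clone, PartialEq)]
struct Card {
    order_id: u64,
    recipient_name: String,
    status_pill: String,
}

/// Fragile: two orders for the same recipient in the same status are
/// indistinguishable, so this always returns the first one.
fn find_by_composite<'a>(cards: &'a [Card], name: &str, status: &str) -> Option<&'a Card> {
    cards
        .iter()
        .find(|c| c.recipient_name == name && c.status_pill == status)
}

/// Stable: a unique identifier selects exactly one card.
fn find_by_id(cards: &[Card], order_id: u64) -> Option<&Card> {
    cards.iter().find(|c| c.order_id == order_id)
}
```

With duplicate orders, an optimistic update routed through `find_by_composite` would modify the wrong card whenever the user acted on the second one.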

### E. Workflow / process (4 patterns)
- **E-1** `DATA-FLOW.md` not updated in same commit as data changes.
- **E-2** `code_tips/` not consulted before writing Slint or SQL code.
- **E-3** BUGSWEEPER not run before UAT — bugs surface in front of user.
- **E-4** Callback chain wired but missing one intermediate step.

### F. Tooling / environment (2 patterns)
- **F-1** Scripts use `jq` (absent on Windows).
- **F-2** Credentials read from env vars instead of Windows Credential Manager.

### G. Other recurring (3 patterns)
- **G-1** Connection-status `enum` handled with implicit `else`, hides new variants.
- **G-2** Modal layout uses absolute `y` coordinates; overflows on content change.
- **G-3** `image-fit` set in one component but not its sibling.
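
G-1 is the one pattern in this group that the compiler can catch. The following sketch assumes a hypothetical `ConnectionStatus` enum (the real variants may differ) and shows why an exhaustive `match` without a wildcard arm surfaces new variants while an implicit `else` hides them.

```rust
/// Hypothetical connection states; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ConnectionStatus {
    Connected,
    Disconnected,
    Reconnecting,
}

/// Fragile: anything that is not `Connected` falls into the implicit `else`,
/// so a newly added variant is silently reported as "offline".
fn label_implicit(status: ConnectionStatus) -> &'static str {
    if status == ConnectionStatus::Connected {
        "online"
    } else {
        "offline"
    }
}

/// Exhaustive: there is no wildcard arm, so adding a variant is a compile
/// error here until the new case is handled deliberately.
fn label_exhaustive(status: ConnectionStatus) -> &'static str {
    match status {
        ConnectionStatus::Connected => "online",
        ConnectionStatus::Disconnected => "offline",
        ConnectionStatus::Reconnecting => "reconnecting",
    }
}
```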

---

## 3. How the user actually reports the failure

From Phase 2 trace of 17 sessions. Vocabulary is terse and consistent — frustration shows up as architectural questions, not profanity.

**Primary signals (each found in 4 distinct sessions):**
- Numbered UAT lists with literal `fail` tokens: *"4. fail.", "5. fail. the bounding boxes is still only about 33%…"*
- `still …` continuations: *"still happening", "still crashes", "STILL does nothing"*
- Modal-related complaints (4 sessions span lookup/state/settings modals).

**Secondary signals (2–3 sessions each):**
- *"this plan is incomplete"* — at plan stage, before any code.
- *"why was the old X replaced?"* — agent rewrote-instead-of-refactored, silently dropped a documented requirement.
- *"unable to test/validate because of issue X"* — cascade where one broken thing blocks UAT of others.
- *"opposite the behavior that was defined in phase 19"* — explicit regression callout against documented spec.
- *"why ARE there two different states for the same thing?"* — frustration after a desync bug.
- *"should have been debugged with Bugsweeper / did you test this using bugsweeper?"* — repeated 3+ times in one session.

**Cascade timing:** ~70% same-session (UAT catches it immediately), ~25% next-session (bug reappears next day on same area), ~5% days-later regression. Same-session dominance means the user is the test harness.

**Marathon evidence:** Apr 8 2026 — three back-to-back sessions over 9 hours (`0b0888de`, `02237f00`, `787237af`), each opening with a numbered fail-list against the previous round.

---

## 4. Quantified code-writing patterns

From Phase 3 git analysis.

| Pattern | Measurement | Value |
|---|---|---|
| Highest-churn phase | commits in single phase | 26 (19.1), 28 (20.1.1.1.1) |
| Top-churn file | edits in last 200 commits | `crates/app/src/main.rs` — 51 |
| Modal-related fix density | % of gap-closure commits touching modal code | 71% |
| Reopened debug docs | files in `/debug/` AND `/debug/resolved/` | 2 (`item-modal-focus-and-esc`, `byproduct-multi-item-grouping`) |
| Gaps bundled per fix commit | average | 4–6 |
| Plan : fix-commit ratio | phase 19.1 | 8 plans : 26 fix commits |
| Marathon burst | commits in single hour during Apr 8 | 8 (00:xx) + 8 (03:xx) |
| Round markers | explicit "Round 2/3 UAT" commits | 4 |
| Explicit reverts | of prior fix commits | 0 (only 1 strategic revert: `6581aea`) |
| Hedging language in commits | "actually fix", "really", "for real", "oops" | 0 |
| Large-refactor regressions | 300+L commits followed by gap-closure within days | 1 confirmed (`d693614` → `29e8706`, `705124c`) |

**Concrete parallel-path-drift commits** (in each, a sibling site had to be patched alongside or after the primary site):
- `e6003e1` (19.1-08) — `assignment.rs` AND `live_client.rs::run_sync_cycle` had to be patched together. Sync was wiping locally-assigned units.
- `dc622640` debug session — `on_settings_save_clicked` and `on_settings_shopify_clear_clicked` both missed `invoke_sync_cards_updated`, while the primary path had it.
- `bac9584` (19.1-05) — init crash in lookup-modal also required state reset in dashboard parent.
- `af4c8b8` (19.1-05) — item-square click required new callback chain in card.slint, dashboard.slint, and main.rs simultaneously.
- `d9bfe49` (19.1-05) — `unit_is_on_card` and `product_is_on_card` had to be wired in both data and display paths.

---

## 5. Failure pattern taxonomy

Combining the three lenses yields six concrete failure modes. Mitigations target these.

| ID | Failure mode | Why it happens | Where it shows |
|---|---|---|---|
| **F1** | Parallel-code-path drift | Agent grep-fixes one site; doesn't search for siblings doing the same logical operation | Sync paths, optimistic-update callbacks, modal close handlers, image-fit settings |
| **F2** | Optimistic update without `apply_filters` | Agent thinks mutating `all_cards_ref` is enough; doesn't trace the model→view pipeline end-to-end | Note save, item add/remove, archive toggle |
| **F3** | Replacement-instead-of-refactoring | Agent rewrites a component (e.g. LookupModal) and silently drops requirements present in the original | Lookup modal, search input, modal layouts |
| **F4** | Verification skipped: BUGSWEEPER, code_tips, DATA-FLOW.md | Required reads/runs in CLAUDE.md treated as optional under time pressure | All recent UAT marathons |
| **F5** | Static-only verification of runtime-coupled behavior | Reviewer/checker can't tell which Esc handler fires first; agent assumes code structure = runtime behavior | Modal Esc / focus, image fit, intermediate UI state |
| **F6** | Reactive plans (gap-closure as primary planning mode) | PLAN.md written after UAT discovers gaps, not before; doesn't enumerate parallel sites or local-only fields the sync could destroy | All "round 2", "round 3", "gap closure" phases |
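
One structural way to rule out F2 is to make the mutation and the re-filter a single operation. The sketch below is an assumption-laden illustration, not the project's code: `CardStore`, its fields, and the substring filter are all hypothetical stand-ins for `all_cards_ref` and the real `apply_filters`.

```rust
/// Hypothetical stand-ins for the real card model and the filtered view.
struct CardStore {
    all_cards: Vec<String>,
    visible: Vec<String>,
    query: String,
}

impl CardStore {
    /// The only way to mutate `all_cards`: runs the closure, then re-derives
    /// the view, so a caller cannot forget the second step.
    fn mutate(&mut self, change: impl FnOnce(&mut Vec<String>)) {
        change(&mut self.all_cards);
        self.apply_filters();
    }

    /// Rebuilds `visible` from `all_cards` using the current query.
    fn apply_filters(&mut self) {
        self.visible = self
            .all_cards
            .iter()
            .filter(|c| c.contains(self.query.as_str()))
            .cloned()
            .collect();
    }
}
```

The design choice is that callers never touch `all_cards` directly, so the model-to-view pipeline is traced once, inside `mutate`, instead of at every call site.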

---

## 6. Mitigation plan (proposed — awaiting approval)

Each mitigation is **scoped, concrete, and targets one or more failure modes**. Marked priority **P0** (do first), **P1** (do soon), **P2** (do when convenient). Many are doc/process changes; some are code/hook additions.

### M-1 (P0) — Parallel-Sites checklist in `PLAN.md` template — targets F1, F6
Add a mandatory "Parallel Sites" section to every `PLAN.md`. Before any change, the planner must list:
- Every other code site doing the same logical operation (with file:line)
- Every "local-only" field that sync passes might destroy
- Every callback that the new wiring assumes already exists end-to-end

The `gsd-plan-checker` agent gates the plan on this section being non-empty when the change touches sync, modals, or callbacks.

**Concrete artifact:** Update `gsd-planner` and `gsd-plan-checker` prompts. Add a one-page `.planning/templates/PARALLEL-SITES-CHECKLIST.md`.
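
The non-empty gate could be checked mechanically. The following is a rough sketch under stated assumptions: the heading text "Parallel Sites" and the `- ` bullet style are guesses at the template, and `parallel_sites_filled` is a hypothetical helper, not part of `gsd-plan-checker`.

```rust
/// Returns true when the plan has a "Parallel Sites" heading (any level)
/// followed by at least one non-empty bullet before the next heading.
/// Heading text and bullet style are assumptions about the template.
fn parallel_sites_filled(plan: &str) -> bool {
    let mut in_section = false;
    for line in plan.lines() {
        let trimmed = line.trim();
        if trimmed.starts_with('#') {
            // Any heading either opens the section or closes it.
            in_section = trimmed.trim_start_matches('#').trim() == "Parallel Sites";
            continue;
        }
        if in_section {
            if let Some(item) = trimmed.strip_prefix("- ") {
                if !item.trim().is_empty() {
                    return true;
                }
            }
        }
    }
    false
}
```

A check like this only proves the section is populated; whether the listed sites are complete still needs the planner's judgment.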

### M-2 (P0) — Pre-fix "find-all-sites" rule — targets F1, F4
Add a CLAUDE.md rule: **before** writing any fix that touches sync, callbacks, modals, or optimistic updates, the agent must:
1. Grep for the function name being modified across all sources
2. Grep for the most relevant string literal or property name across `crates/app/src/` and `crates/app/ui/`
3. Document the search results in the commit body or PLAN.md

This costs ~2 grep calls per fix and would have prevented the `invoke_sync_cards_updated` cascade and the `image-fit` inconsistency.

**Concrete artifact:** New section in CLAUDE.md titled "Pre-fix sibling search". Optionally enforced via a PreToolUse hook on Edit/Write that warns if the pattern hasn't been searched for in this session.
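
The sibling search itself is just a substring scan that reports `path:line` hits. As a minimal sketch (the helper name `find_sites` and the file names in the usage below are hypothetical, and sources are passed in memory rather than read from disk to keep the example self-contained):

```rust
/// Lists every `path:line` where `needle` occurs in the given sources.
/// A real check would read files from disk; this sketch takes
/// (path, contents) pairs instead.
fn find_sites(sources: &[(&str, &str)], needle: &str) -> Vec<String> {
    let mut hits = Vec::new();
    for (path, text) in sources {
        for (idx, line) in text.lines().enumerate() {
            if line.contains(needle) {
                // Line numbers are 1-based, matching editor and grep output.
                hits.push(format!("{}:{}", path, idx + 1));
            }
        }
    }
    hits
}
```

Calling it with `invoke_sync_cards_updated` as the needle would list every handler that already calls it; any sync-related handler absent from that list is a candidate sibling site.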

### M-3 (P0) — BUGSWEEPER gate before UAT-claim — targets F4, F5
Promote BUGSWEEPER from "should run before UAT" to **mandatory verification step**. The `gsd-verifier` agent must include a BUGSWEEPER smoke-test step for any phase touching: data flow, modal state, sync, callbacks, or filter pipeline. Verification report cannot be marked complete without a BUGSWEEPER transcript.

CLAUDE.md already pairs BUGSWEEPER with screen-timelapse — formalize the pairing in the verifier prompt: BUGSWEEPER for state assertions, screen-timelapse for visual regression.

**Concrete artifact:** Update `gsd-verifier` agent prompt. Add a "BUGSWEEPER Coverage" subsection to VERIFICATION.md template.

### M-4 (P1) — Modal-state architectural review — targets F1, F2, F5
The modal layer is the single most fragile component (71% of recent gap-closure fixes; 2 reopened bugs). Spawn a one-shot "Modal State Audit" phase:
- Inventory every modal (`lookup-modal`, `state-transition-modal`, `settings-modal`, etc.)
- Document every callback in/out of each
- Identify every place modal state is read or mutated
- Produce a single canonical state diagram in `.planning/MODAL-STATE.md`

Subsequent modal changes must update this doc.

**Concrete artifact:** Add a phase under the current milestone. Output: `.planning/MODAL-STATE.md`.

### M-5 (P1) — Sync-merge contract doc — targets F1, F2
Add `.planning/SYNC-MERGE.md` documenting:
- Every field that is "local-only" (not from Shopify/GH)
- Every sync entry point and what it merges
- Explicit rule: any field added to a struct must declare `local_only: true|false` and, if true, be listed in this doc

The `dc622640` cascade and the `e6003e1` (G-10/G-13) commit both stem from the absence of this contract.

**Concrete artifact:** Create `.planning/SYNC-MERGE.md`. Cross-link from DATA-FLOW.md.
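
The contract can also be expressed in types, so that a sync pass is unable to touch local-only fields. This is a sketch under assumptions: `CardRecord`, `RemoteCard`, `merge_remote`, and every field name are hypothetical and do not reflect the real structs in `assignment.rs` or `live_client.rs`.

```rust
/// Hypothetical card record. `title` and `status` come from the remote
/// source; `assigned_unit`, `archived`, and `note` are local-only and must
/// survive a sync.
#[derive(Debug, Clone, PartialEq)]
struct CardRecord {
    id: u64,
    title: String,
    status: String,
    assigned_unit: Option<String>,
    archived: bool,
    note: Option<String>,
}

/// Remote payload: deliberately has no local-only fields, so a merge cannot
/// overwrite them by accident.
#[derive(Debug, Clone)]
struct RemoteCard {
    id: u64,
    title: String,
    status: String,
}

/// The single merge site. Destructuring `remote` without `..` means a new
/// remote field fails to compile until it is merged here.
fn merge_remote(local: &mut CardRecord, remote: RemoteCard) {
    let RemoteCard { id, title, status } = remote;
    debug_assert_eq!(local.id, id);
    local.title = title;
    local.status = status;
    // assigned_unit, archived, and note are intentionally untouched.
}
```

Routing every sync entry point through one function like `merge_remote` would turn the "update one merge site but not another" failure into a single place to review.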

### M-6 (P1) — Commit size + parallel-path verification — targets F6
Adopt a soft commit-size rule: any single commit > 200 net lines must include in its body either (a) an explicit "parallel sites checked" list, or (b) a link to the PLAN.md section that enumerated them. Enforce via a hook that intercepts `git commit` and prompts the agent (an agent-side PreToolUse hook, not a git `pre-commit` hook).

**Concrete artifact:** New hook in `.claude/hooks/` that runs on PreToolUse for Bash `git commit` and surfaces the rule when diff size is large. Soft-block (warn, don't fail).
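
The size check can be driven by the output of `git diff --cached --numstat`, which prints one `added<TAB>deleted<TAB>path` line per file and `-` for binary files. The sketch below is one possible reading of the rule, not a finished hook: the function names are hypothetical, and "net lines" is assumed to mean added minus deleted, floored at zero.

```rust
/// Sums added and deleted lines from `git diff --cached --numstat` output.
/// Each line is "<added>\t<deleted>\t<path>"; binary files show "-" in the
/// numeric columns and are skipped.
fn diff_totals(numstat: &str) -> (u64, u64) {
    let mut added = 0;
    let mut deleted = 0;
    for line in numstat.lines() {
        let mut cols = line.split('\t');
        let a = cols.next().and_then(|s| s.parse::<u64>().ok());
        let d = cols.next().and_then(|s| s.parse::<u64>().ok());
        if let (Some(a), Some(d)) = (a, d) {
            added += a;
            deleted += d;
        }
    }
    (added, deleted)
}

/// Assumption: "net lines" means added minus deleted, floored at zero.
fn needs_parallel_sites_note(numstat: &str, threshold: u64) -> bool {
    let (added, deleted) = diff_totals(numstat);
    added.saturating_sub(deleted) > threshold
}
```

Under this reading a refactor that adds and removes 300 lines each has a net of zero and would not trigger the warning; summing both columns instead is a stricter alternative.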

### M-7 (P1) — Promote `apply_filters` and `invoke_sync_cards_updated` to invariants — targets F2, F4
Add a `code_tips/CALLBACK_PIPELINE.md` codifying:
- Every mutation to `all_cards_ref` MUST be followed by `apply_filters(&w, …)`
- Every secondary sync path MUST end in `invoke_sync_cards_updated()`
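- A check for reviewers/agents to spot violations: flag any match of `apply_optimistic_[a-z_]+\(` that is not followed within 30 lines by a match of `apply_filters\(` (this needs two patterns plus a line window; a single grep regex cannot express it)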

**Concrete artifact:** New `code_tips/CALLBACK_PIPELINE.md`. Linked from CLAUDE.md.
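
The first invariant can be checked with a small line-window scan. The sketch below is a hedged illustration rather than a proposed tool: the function names are hypothetical, it uses plain substring matching instead of a regex engine, and it will also flag the definition of an `apply_optimistic_*` function or a mention inside a comment.

```rust
/// Returns the 1-based line numbers of `apply_optimistic_*(` occurrences that
/// are not followed by `apply_filters(` on the same line or within the next
/// `window` lines. Substring matching only, so false positives are possible.
fn unfiltered_optimistic_calls(source: &str, window: usize) -> Vec<usize> {
    let lines: Vec<&str> = source.lines().collect();
    let mut violations = Vec::new();
    for (idx, line) in lines.iter().enumerate() {
        if !is_optimistic_call(line) {
            continue;
        }
        let end = (idx + window + 1).min(lines.len());
        let followed = lines[idx..end].iter().any(|l| l.contains("apply_filters("));
        if !followed {
            violations.push(idx + 1);
        }
    }
    violations
}

/// True when the line contains `apply_optimistic_<name>(`, where <name> is
/// one or more lowercase ASCII letters or underscores.
fn is_optimistic_call(line: &str) -> bool {
    const PREFIX: &str = "apply_optimistic_";
    line.match_indices(PREFIX).any(|(pos, _)| {
        let rest = &line[pos + PREFIX.len()..];
        let name_len = rest
            .chars()
            .take_while(|c| c.is_ascii_lowercase() || *c == '_')
            .count();
        // The counted characters are ASCII, so the count is also a byte offset.
        name_len > 0 && rest[name_len..].starts_with('(')
    })
}
```

Running it with `window = 30` mirrors the 30-line rule; a hit is a prompt to trace the handler by hand, not proof of a bug.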

### M-8 (P2) — Reopened-bug ledger — targets F4, F6
Maintain `.planning/REOPENED.md` listing every bug whose debug doc has been moved out of `/resolved/` more than once. Each reopen requires writing a "why this returned" note. Patterns will surface (and the doc will become a forcing function).

**Concrete artifact:** New `.planning/REOPENED.md`. One-line policy in CLAUDE.md.

### M-9 (P2) — UAT checklist generator — targets F5
For every phase, the `gsd-verifier` should auto-generate a runnable UAT checklist (Markdown numbered list, mirroring the user's own format) covering: every modified callback, every modal Esc/click-outside path, every optimistic update, every sync entry point. The user can run through it in 5 minutes; agent runs the BUGSWEEPER equivalent first.

**Concrete artifact:** Add to `gsd-verifier` agent. Output: `.planning/phases/<id>/UAT-CHECKLIST.md`.

### M-10 (P2) — Replace-vs-refactor decision gate — targets F3
Add a CLAUDE.md rule: any time an agent proposes replacing a non-trivial existing component (e.g. LookupModal), the planner must produce a one-page "Requirements Preserved" doc listing every behavior of the original that the replacement must keep. Catches the LookupModal regression class.

**Concrete artifact:** Add to PLAN.md template + `gsd-planner` prompt.

---

## 7. Mitigation impact summary

| Mitigation | Targets failure modes | Effort | Expected impact |
|---|---|---|---|
| M-1 Parallel-Sites checklist | F1, F6 | Low (doc + agent prompt) | High — addresses dominant root cause |
| M-2 Pre-fix sibling search | F1, F4 | Low (CLAUDE.md + optional hook) | High — cheap, every fix benefits |
| M-3 BUGSWEEPER gate | F4, F5 | Medium (verifier rewrite) | High — most-cited user complaint |
| M-4 Modal-state audit | F1, F2, F5 | High (one full phase) | High in long term — single fragile area |
| M-5 Sync-merge contract | F1, F2 | Medium (new doc + struct discipline) | Medium-high — prevents data destruction class |
| M-6 Commit size rule | F6 | Low (hook) | Medium — soft signal |
| M-7 Callback invariants | F2, F4 | Low (code_tips file) | Medium-high — codifies known anti-pattern |
| M-8 Reopened-bug ledger | F4, F6 | Very low (single file + policy) | Medium over time |
| M-9 UAT checklist generator | F5 | Medium (verifier agent change) | Medium-high — closes "user is the test harness" loop |
| M-10 Replace-vs-refactor gate | F3 | Low (PLAN template) | Medium — narrow but important class |

---

## 8. Open questions for approval

1. **Scope:** Approve all 10 mitigations, or a subset? Recommended starting set: M-1, M-2, M-3, M-7 (the three P0 items plus M-7, a low-effort P1).
2. **Modal audit (M-4):** Run as the next phase, defer to next milestone, or skip in favor of incremental improvement?
3. **Hooks (M-2, M-6):** Implement as soft warnings (recommended) or hard blocks?
4. **Doc home:** Should the new artifacts (SYNC-MERGE.md, MODAL-STATE.md, REOPENED.md, CALLBACK_PIPELINE.md) live at `.planning/` root, or under a new `.planning/architecture/` subdirectory?
5. **CLAUDE.md edits:** OK to add rules directly, or prefer a separate `AGENT-RULES.md` referenced from CLAUDE.md?