--- status: resolved trigger: "Wrapper consumed `.claude/{self_id}-commune.md` file-drop control, deleted file (correct), then 2s later a poll DRAIN of ~649 queued spool messages (sender = self) re-triggered the file_drop control for the same path. Wrapper re-dispatched, tried to read the gone file, got `os error 2`, and per current code path 'logged retaining' instead of dropping." created: 2026-05-20 updated: 2026-05-20 resolved_at: 2026-05-20 milestone: v1.8 quick_id: 260520-pra --- ## Current Focus hypothesis: Two compounding defects produce the symptom: (a) Self's listener has no dedupe — every poll iteration during a long wrapper resume turn queues another identical `` envelope into the wrapper's spool; (b) the wrapper's `process_file_drop` retain-on-read-error path treats `ENOENT` (file gone) the same as a transient read error, scheduling a retry that can never succeed. test: Static code reading of `src/owl/poll.rs::scan_drop_files` emit site, `src/common/spool.rs::spool_message`, `src/live/wrapper/mod.rs::process_file_drop` read+retain logic. expecting: Confirm no dedupe at any layer (producer/spool/consumer) AND confirm the read-error branch unconditionally returns `Continue` (= retain). Both confirmed. next_action: Diagnosis complete — return to parent for fix decision. ## Symptoms expected: One file_drop EVENT per `.claude/{self_id}-commune.md` lifecycle. After successful consume+delete, the wrapper sees no further file_drop dispatches for that path. actual: Long resume turn (~6 min) blocks the wrapper while Self's listener (separate process, ~1s loop cadence) queues ~649 identical file_drop EVENTs into the wrapper's spool. Wrapper completes one dispatch (file consumed + deleted), then drains the spool — each queued envelope re-fires `process_file_drop`, which calls `fs::read_to_string(path)` against the now-gone file, hits `os error 2` (ENOENT), and logs `[FILE-DROP] read failed kind=commune path=... err=...; retaining`. The "retain" outcome means the next iteration of the producing listener will queue yet more copies (the file is still absent so no new emit — but the spool backlog isn't cleared either; each queued row still triggers a useless dispatch). errors: `[FILE-DROP] read failed kind=commune path=C:/.../.claude/doyle-commune.md err=The system cannot find the file specified. (os error 2); retaining` reproduction: Drop a `.claude/{self_id}-commune.md` file while wrapper is mid-resume on a slow turn (>1s of blocking). Self's listener queues ≥2 copies before wrapper consumes the first. After wrapper deletes the file, remaining queued copies fire repeatedly with ENOENT-retain. started: Phase 30 Plan 02 Task 3 — `process_file_drop` was authored with retain-on-read-error semantics that don't distinguish "transient I/O failure" from "file is gone." ## Evidence - timestamp: investigation checked: `src/owl/poll.rs` lines 262–296 (Self listener's per-iteration file-drop scan/emit) found: Listener does `scan_drop_files(cwd, id)` every loop iteration (after the TCP `check_message_blocking` returns or any other arm completes). On hit it calls `compose_file_drop_event(...)` then `send::deliver_body(&psyche_id, id, &envelope)` with NO dedupe, NO suppress-while-pending, NO mtime/inode tracking. Every loop iteration where the file exists emits a fresh envelope. implication: The producer queues unbounded duplicates while the file is present and the consumer (wrapper) is blocked. With Self's listener tick ~1s and a wrapper resume of ~6 min, the observed ~649 duplicates is consistent with one emit per ~0.55s — well within plausible listener tick rate during active polling. - timestamp: investigation checked: `src/owl/send.rs::deliver_message` lines 66–97 found: TCP-first; falls back to `spool::spool_message(target, from, body, &owlery)` if TCP delivery is unavailable. When the wrapper is mid-resume, its perch listener is `close_busy`/unable to accept new TCP connections, so deliveries spool. Spool insert is unconditional. implication: All 649 emits land in the wrapper's spool.db as separate rows. - timestamp: investigation checked: `src/common/spool.rs::spool_message` lines 55–65 found: `INSERT INTO messages (from_id, body, created_at, ttl_seconds, deferred) VALUES (?1, ?2, ?3, ?4, 0)` — pure blind INSERT, no `INSERT OR IGNORE`, no body-hash uniqueness, no scan for existing-pending. Grep across the whole spool module for `dedup|unique|duplicate|already_queued|pending` confirmed no existing dedupe logic. implication: Producer-side dedupe is not present anywhere in the data path. Each emit becomes a row. - timestamp: investigation checked: `src/live/wrapper/mod.rs::process_file_drop` lines 1203–1218 (read step) found: ```rust let body = match std::fs::read_to_string(path) { Ok(s) => s, Err(e) => { self.log(&format!( "[FILE-DROP] read failed kind={} path={} err={}; retaining", kind, path, e )); return FileDropOutcome::Continue; } }; ``` All read errors are funneled into the same "retain" branch. There is no `e.kind() == ErrorKind::NotFound` discriminator. ENOENT is treated identically to a real I/O failure (disk full, permission denied, etc.). implication: When the file has already been consumed and deleted by an earlier iteration of the SAME wrapper, every subsequent dispatch for a stale spool row hits ENOENT and "retains." Note: "retain" here only means the in-memory dispatch returned `Continue` and the wrapper did NOT call `fs::remove_file` (which is moot because the file is gone). The spool ROW itself has already been marked `delivered=1` by `drain_all` BEFORE the dispatch ran, so the row will not re-emit. The symptom is therefore NOT an infinite retry loop — it is one wasteful no-op `resume_session_checked` … wait, let me re-read. - timestamp: investigation checked: `src/live/wrapper/mod.rs::process_file_drop` lines 1209–1217 retain-and-return-early flow vs lines 1264–1270 resume_session call site found: The retain branch returns `FileDropOutcome::Continue` BEFORE reaching the `resume_session_with_exit` call. So on ENOENT we don't spawn claude. Good. However, the wrapper still drained the message off the spool (drain_all marks delivered=1 first), and it still LOGGED `[FILE-DROP] read failed ... retaining` which is misleading observability — and it still spent time inside `process_file_drop` for every one of the 649 stale rows. implication: Cost per stale row is small (one stat-fail-as-open + one log line) BUT the log spam is severe and the semantics are wrong. The word "retaining" implies retry-later, which is false: the spool row is gone and the file is gone. Nothing is retained. Misleading log is operationally damaging during incident triage. - timestamp: investigation checked: `src/owl/poll.rs::drain_all_with_metadata` flow (lines 148–193 of poll.rs) and `src/common/spool.rs::drain_all` (line 128) — confirm rows are marked delivered=1 on drain (not deleted) found: `drain_all` UPDATEs delivered=1 atomically before returning rows. So a drained row never re-emits in a future drain. The 649 dispatches occur in a SINGLE drain pass (all queued rows flush at once when the wrapper's listener comes back online). implication: This is a one-shot burst per resume-turn, not an unbounded loop. But each burst is N=duration-of-blocked-resume × listener-tick-rate, which scales linearly with how slow Psyche is on that turn. On a 6-min turn at ~1s cadence that's ~360 expected and 649 observed (consistent with listener doing 2+ iterations/sec during active drain). - timestamp: investigation checked: `src/live/wrapper/mod.rs::handle_file_drop_arm` lines 435–460 — predicate/parse layer above process_file_drop found: No dedupe here either. Every drained envelope that contains `` envelope on EVERY loop iteration where `scan_drop_files` reports the file present. There is no per-emit memory (no "I just emitted this", no mtime track, no in-flight flag). While the wrapper is mid-resume (~6 min), this produces hundreds of identical envelopes queued into the wrapper's spool via `send::deliver_body` → `spool::spool_message`. The spool itself does blind INSERTs (`src/common/spool.rs::spool_message` line 58–61, no `INSERT OR IGNORE`, no body-hash uniqueness). **(2) Consumer-side: read-error branch treats ENOENT identically to transient I/O failure.** `src/live/wrapper/mod.rs::process_file_drop` lines 1209–1218: every `fs::read_to_string` error falls into the same `[FILE-DROP] read failed ... retaining` branch. There is no `ErrorKind::NotFound` discriminator. When the wrapper drains 648 stale duplicates after consuming the first one and deleting the file, every drained envelope fires this branch, producing 648 misleading "retaining" log lines for a file that has been definitively consumed. The 17:44:05 → 17:44:07 timing in doyle's brief lines up exactly: 17:44:05 wrapper consumes the first envelope, reads file, runs resume, deletes file. ~2s later the wrapper's listener accepts a new TCP connection (or completes the next inner-poll iteration) and the queued spool drains — 649 envelopes flush through `handle_file_drop_arm` → `process_file_drop`, every one hitting ENOENT on the read. fix: | **Recommendation: BOTH surfaces (defense in depth). Consumer-side fix is the minimum correctness fix; producer-side fix prevents the spool log spam and reduces wasted work.** ### (A) Consumer-side fix — discriminate ENOENT from real I/O errors **File:** `src/live/wrapper/mod.rs` **Lines:** 1209–1218 (the `let body = match std::fs::read_to_string(path)` block) Replace the catch-all retain with an explicit ENOENT branch: ```rust let body = match std::fs::read_to_string(path) { Ok(s) => s, Err(e) if e.kind() == std::io::ErrorKind::NotFound => { // File already consumed (e.g., earlier iteration of THIS wrapper // processed the file_drop and deleted it, then we drained queued // duplicates from the spool — see debug: // spool-retain-replays-file-drop-after-delete.md). Definitive drop: // there is no file to retry against, and the spool row has already // been marked delivered=1 by drain_all. self.log(&format!( "[FILE-DROP] dropped kind={} path={} (file already consumed; stale spool duplicate)", kind, path )); return FileDropOutcome::Continue; } Err(e) => { self.log(&format!( "[FILE-DROP] read failed kind={} path={} err={}; retaining", kind, path, e )); return FileDropOutcome::Continue; } }; ``` Note: both branches still return `Continue` (the spool row is already drained, there is no retry to gate). The change is purely semantic clarity in the log + intent: ENOENT is no longer mis-described as "retaining." Transient errors (permission, I/O) keep the old wording. This is a 6-line surgical change. ### (B) Producer-side fix — suppress duplicate emits while file is present **File:** `src/owl/poll.rs` **Lines:** 271–295 (the per-iteration `scan_drop_files` → `deliver_body` loop) Track per-listener-instance "emitted paths" in a `HashSet` carried in the poll-loop scope. Suppress emit when the path is already in the set; clear the path from the set when the file disappears (stat returns absent in a subsequent scan). Sketch (additions only): ```rust // Before the loop: let mut emitted_drops: std::collections::HashSet = std::collections::HashSet::new(); // Inside the loop, replace the existing block: if let Some(ref cwd) = listener_cwd { let drops = scan_drop_files(cwd, id); let current_paths: std::collections::HashSet = drops.iter().map(|(_, p)| p.clone()).collect(); // Drop tracked paths that no longer exist (wrapper consumed + deleted them). emitted_drops.retain(|p| current_paths.contains(p)); for (kind, abs_fs) in &drops { if !emitted_drops.insert(abs_fs.clone()) { // Already emitted this lifecycle of the file. Skip. continue; } let envelope = compose_file_drop_event(kind, abs_fs, id); // ... rest of existing emit block unchanged } } ``` Semantics: emit-once per file lifecycle. If the wrapper fails to consume (file still present next iteration), the set still contains the path — no re-emit. This is safe because the wrapper's own D8 retain-on-error path already handles the retry: when the wrapper genuinely retains (e.g., resume_session non-zero exit), the message body is already in the wrapper's spool (delivered=1 but the in-memory dispatch already ran). A *new* envelope is not needed. **Caveat for surface (B):** If the wrapper crashes/restarts WITHOUT consuming the file, the new wrapper instance won't see a queued envelope, and Self's listener has the path "stuck" in `emitted_drops`. Mitigation: the `emitted_drops` set is per-poll-process-instance. When Self's listener restarts (or the wrapper is restarted via signoff/orphan/handoff and Self's listener also recycles), the set is fresh and the file is re-emitted. The cross-process case (wrapper crash, Self listener survives) needs a sentinel — out of scope for this fix; the consumer-side fix (A) already neutralises the symptom. verification: | ### Test plan **Unit tests** (no network, no subprocess): 1. `process_file_drop_drops_on_enoent` — call `process_file_drop("commune", "/nonexistent/path.md")` on a `WrapperState`-equivalent harness; assert log contains `dropped` and NOT `retaining`, return value is `Continue`, no spawn occurred. Add to `file_drop_handler_tests` mod at bottom of `src/live/wrapper/mod.rs`. 2. `process_file_drop_retains_on_permission_denied` — Unix only: chmod 000 a tempfile, call `process_file_drop` against it, assert log contains `retaining` (legacy wording preserved for true I/O errors). Windows analogue via a locked-handle tempfile if feasible; otherwise unix-only `#[cfg(unix)]`. 3. (Producer fix only) `scan_drop_files_emit_once_per_lifecycle` — drive the new emit-tracking loop with a mock that simulates: file present → emit, file present → suppressed, file absent → set clears, file present again → emit. Asserts `emitted_drops.len()` transitions {0,1,0,1}. **Integration test** (existing harness, no new infrastructure): 4. Extend `tests/native_owl.rs` or add a new `tests/file_drop_dedupe.rs`: spawn a real Self listener, drop a commune file, sleep 3s (let listener emit several times), then stat the wrapper's spool.db and assert `pending_count <= 1` (producer fix) or — without producer fix — assert that draining the spool produces N envelopes but only ONE log line containing `dispatching` and the rest containing `dropped` (consumer fix). **Golden test:** 5. Update `tests/golden/` for the new `[FILE-DROP] dropped` log line. The wording is new so existing fixtures won't drift unless they happened to capture an ENOENT replay (none should — the resolved-debug list does not show one). **Manual regression:** 6. Reproduce doyle's gen 49 scenario: deliberately stall Psyche resume (e.g., temporarily set a long sleep in `resume_session_checked` or use a known slow prompt), drop a commune file, observe wrapper log. Expect: ONE `dispatching` line, ONE `consumed ... (deleted)` line, then N-1 `dropped ... (file already consumed; stale spool duplicate)` lines instead of `retaining` lines. files_changed: [] ``` ## Risk / Alternatives Considered ### Alternative 1: Spool-level dedupe via `INSERT OR IGNORE` + body-hash index Add a `body_hash` column with a UNIQUE index. `spool_message` becomes `INSERT OR IGNORE`. **Rejected because:** (i) breaks legitimate duplicate user prose ("hello" sent twice should arrive twice); (ii) schema migration risk on existing spool.dbs across all deployed perches; (iii) overkill — the duplication is structural to file_drop only, not a general spool concern; (iv) `from_id == self_id` + identical body is also a valid pattern for self-pulse heartbeats in other phases. ### Alternative 2: Listener-side mtime tracking on the source file Stat `.claude/{id}-{kind}.md` mtime, only emit when mtime advances. **Rejected because:** the file is consumed-then-deleted by the wrapper, so the file is normally absent. mtime tracking adds complexity (need to remember last-seen mtime even across "file absent" gaps) and doesn't address the core issue: emit-once-per-lifecycle is what we want, and the HashSet-of-paths sketch in fix (B) is the minimal expression of that. ### Alternative 3: Consumer-side: drop the row but ALSO clear remaining file_drop envelopes from spool Walk the wrapper's spool on ENOENT and delete all queued envelopes with `kind=` matching this kind. **Rejected because:** races with concurrent producer emits (we'd delete a fresh emit that's actually about a new file lifecycle); requires the wrapper to hold a spool write handle while draining (currently it does not); adds non-local state mutation to what should be a per-envelope handler. ### Risk of fix (A) alone (consumer-side only) **Low.** The change is 6 lines, return value identical (`Continue`), no fan-out. The only behavioural diff is the log wording: `[FILE-DROP] dropped` vs `[FILE-DROP] read failed ... retaining`. Operators reading logs see the correct semantic. Spool-row already-marked-delivered means no retry-loop ever existed; we're just renaming the no-op. ### Risk of fix (B) alone (producer-side only) **Medium-low.** The `HashSet` is per-process; on listener restart the set is empty. If a wrapper crashes mid-consume (file still present), Self listener restart re-emits — that's correct behaviour. The edge case where Self listener restarts WHILE wrapper is alive AND file is present produces a re-emit even though one is already in the wrapper's spool — but the wrapper has dedupe via fix (A) (the second envelope would just hit ENOENT after first consume). So fix (A) covers fix (B)'s edge cases. ### Recommendation **Land both, in order: (A) first (correctness + observability), (B) second (efficiency + log noise reduction).** They are independent commits. (A) alone fully neutralises the operational symptom doyle reported. (B) reduces wasted spool I/O and improves throughput during long resume turns. ## Specialist Hint **Specialist Hint:** rust ## Resolution Note - **resolved_at:** 2026-05-20 - **milestone:** v1.8 - **quick_id:** 260520-pra - **surface_A_commit:** `c1adb43` (Surface A — consumer ENOENT discriminator in `process_file_drop`) - **surface_B_commit:** `a9ba7d0` (Surface B — producer-side `dedupe_drops` HashSet in `owl::poll::run`) **Summary:** Both surfaces landed; ENOENT no longer logs misleading `retaining`, and Self listener emits at most once per file lifecycle (HashSet-based path dedupe).