---
status: resolved
trigger: "Wrapper consumed `.claude/{self_id}-commune.md` file-drop control, deleted file (correct), then 2s later a poll DRAIN of ~649 queued spool messages (sender = self) re-triggered the file_drop control for the same path. Wrapper re-dispatched, tried to read the gone file, got `os error 2`, and per current code path 'logged retaining' instead of dropping."
created: 2026-05-20
updated: 2026-05-20
resolved_at: 2026-05-20
milestone: v1.8
quick_id: 260520-pra
---

## Current Focus

hypothesis: Two compounding defects produce the symptom: (a) Self's listener has no dedupe — every poll iteration during a long wrapper resume turn queues another identical `<EVENT type="file_drop" ...>` envelope into the wrapper's spool; (b) the wrapper's `process_file_drop` retain-on-read-error path treats `ENOENT` (file gone) the same as a transient read error, scheduling a retry that can never succeed.
test: Static code reading of `src/owl/poll.rs::scan_drop_files` emit site, `src/common/spool.rs::spool_message`, `src/live/wrapper/mod.rs::process_file_drop` read+retain logic.
expecting: Confirm no dedupe at any layer (producer/spool/consumer) AND confirm the read-error branch unconditionally returns `Continue` (= retain). Both confirmed.
next_action: Diagnosis complete — return to parent for fix decision.

## Symptoms

expected: One file_drop EVENT per `.claude/{self_id}-commune.md` lifecycle. After successful consume+delete, the wrapper sees no further file_drop dispatches for that path.
actual: Long resume turn (~6 min) blocks the wrapper while Self's listener (separate process, ~1s loop cadence) queues ~649 identical file_drop EVENTs into the wrapper's spool. Wrapper completes one dispatch (file consumed + deleted), then drains the spool — each queued envelope re-fires `process_file_drop`, which calls `fs::read_to_string(path)` against the now-gone file, hits `os error 2` (ENOENT), and logs `[FILE-DROP] read failed kind=commune path=... err=...; retaining`. The "retain" outcome means the next iteration of the producing listener will queue yet more copies (the file is still absent so no new emit — but the spool backlog isn't cleared either; each queued row still triggers a useless dispatch).
errors: `[FILE-DROP] read failed kind=commune path=C:/.../.claude/doyle-commune.md err=The system cannot find the file specified. (os error 2); retaining`
reproduction: Drop a `.claude/{self_id}-commune.md` file while wrapper is mid-resume on a slow turn (>1s of blocking). Self's listener queues ≥2 copies before wrapper consumes the first. After wrapper deletes the file, remaining queued copies fire repeatedly with ENOENT-retain.
started: Phase 30 Plan 02 Task 3 — `process_file_drop` was authored with retain-on-read-error semantics that don't distinguish "transient I/O failure" from "file is gone."

## Evidence

- timestamp: investigation
  checked: `src/owl/poll.rs` lines 262–296 (Self listener's per-iteration file-drop scan/emit)
  found: Listener does `scan_drop_files(cwd, id)` every loop iteration (after the TCP `check_message_blocking` returns or any other arm completes). On hit it calls `compose_file_drop_event(...)` then `send::deliver_body(&psyche_id, id, &envelope)` with NO dedupe, NO suppress-while-pending, NO mtime/inode tracking. Every loop iteration where the file exists emits a fresh envelope.
  implication: The producer queues unbounded duplicates while the file is present and the consumer (wrapper) is blocked. With Self's listener tick ~1s and a wrapper resume of ~6 min, the observed ~649 duplicates is consistent with one emit per ~0.55s — well within plausible listener tick rate during active polling.

- timestamp: investigation
  checked: `src/owl/send.rs::deliver_message` lines 66–97
  found: TCP-first; falls back to `spool::spool_message(target, from, body, &owlery)` if TCP delivery is unavailable. When the wrapper is mid-resume, its perch listener is `close_busy`/unable to accept new TCP connections, so deliveries spool. Spool insert is unconditional.
  implication: All 649 emits land in the wrapper's spool.db as separate rows.

- timestamp: investigation
  checked: `src/common/spool.rs::spool_message` lines 55–65
  found: `INSERT INTO messages (from_id, body, created_at, ttl_seconds, deferred) VALUES (?1, ?2, ?3, ?4, 0)` — pure blind INSERT, no `INSERT OR IGNORE`, no body-hash uniqueness, no scan for existing-pending. Grep across the whole spool module for `dedup|unique|duplicate|already_queued|pending` confirmed no existing dedupe logic.
  implication: Producer-side dedupe is not present anywhere in the data path. Each emit becomes a row.

- timestamp: investigation
  checked: `src/live/wrapper/mod.rs::process_file_drop` lines 1203–1218 (read step)
  found: ```rust
  let body = match std::fs::read_to_string(path) {
      Ok(s) => s,
      Err(e) => {
          self.log(&format!(
              "[FILE-DROP] read failed kind={} path={} err={}; retaining",
              kind, path, e
          ));
          return FileDropOutcome::Continue;
      }
  };
  ```
  All read errors are funneled into the same "retain" branch. There is no `e.kind() == ErrorKind::NotFound` discriminator. ENOENT is treated identically to a real I/O failure (disk full, permission denied, etc.).
  implication: When the file has already been consumed and deleted by an earlier iteration of the SAME wrapper, every subsequent dispatch for a stale spool row hits ENOENT and "retains." Note: "retain" here only means the in-memory dispatch returned `Continue` and the wrapper did NOT call `fs::remove_file` (which is moot because the file is gone). The spool ROW itself has already been marked `delivered=1` by `drain_all` BEFORE the dispatch ran, so the row will not re-emit. The symptom is therefore NOT an infinite retry loop — it is one wasteful no-op `resume_session_checked` … wait, let me re-read.

- timestamp: investigation
  checked: `src/live/wrapper/mod.rs::process_file_drop` lines 1209–1217 retain-and-return-early flow vs lines 1264–1270 resume_session call site
  found: The retain branch returns `FileDropOutcome::Continue` BEFORE reaching the `resume_session_with_exit` call. So on ENOENT we don't spawn claude. Good.
  However, the wrapper still drained the message off the spool (drain_all marks delivered=1 first), and it still LOGGED `[FILE-DROP] read failed ... retaining` which is misleading observability — and it still spent time inside `process_file_drop` for every one of the 649 stale rows.
  implication: Cost per stale row is small (one stat-fail-as-open + one log line) BUT the log spam is severe and the semantics are wrong. The word "retaining" implies retry-later, which is false: the spool row is gone and the file is gone. Nothing is retained. Misleading log is operationally damaging during incident triage.

- timestamp: investigation
  checked: `src/owl/poll.rs::drain_all_with_metadata` flow (lines 148–193 of poll.rs) and `src/common/spool.rs::drain_all` (line 128) — confirm rows are marked delivered=1 on drain (not deleted)
  found: `drain_all` UPDATEs delivered=1 atomically before returning rows. So a drained row never re-emits in a future drain. The 649 dispatches occur in a SINGLE drain pass (all queued rows flush at once when the wrapper's listener comes back online).
  implication: This is a one-shot burst per resume-turn, not an unbounded loop. But each burst is N=duration-of-blocked-resume × listener-tick-rate, which scales linearly with how slow Psyche is on that turn. On a 6-min turn at ~1s cadence that's ~360 expected and 649 observed (consistent with listener doing 2+ iterations/sec during active drain).

- timestamp: investigation
  checked: `src/live/wrapper/mod.rs::handle_file_drop_arm` lines 435–460 — predicate/parse layer above process_file_drop
  found: No dedupe here either. Every drained envelope that contains `<event type="file_drop"` (case-insensitive) parses and dispatches.
  implication: No mitigation at the wrapper-side arm dispatch. Every stale envelope reaches `process_file_drop`.

## Eliminated

- hypothesis: The retain-on-error branch re-emits the file_drop into the wrapper's own spool (infinite loop).
  evidence: `process_file_drop` retain branch only logs and returns `Continue`. It does NOT call `send::deliver_body` or `spool::spool_message`. The drained spool row stays `delivered=1`. There is no loop — just a one-shot burst of stale dispatches per drain pass.
  timestamp: investigation

- hypothesis: Self's listener detects the deletion and stops scanning.
  evidence: `scan_drop_files` is a stateless stat-check per loop iteration. It doesn't track previous emissions. But once the wrapper deletes the file, `scan_drop_files` returns empty on subsequent iterations — so the producer DOES stop emitting once the file is gone. The 649 figure is bounded by the wrapper's resume-turn duration, not unbounded.
  timestamp: investigation

## Resolution

root_cause: |
  Two layered defects compound:

  **(1) Producer-side: unbounded duplicate emit during blocked consumer.**
  `src/owl/poll.rs` lines 271–295: Self's listener emits a fresh `<EVENT type="file_drop">` envelope on EVERY loop iteration where `scan_drop_files` reports the file present. There is no per-emit memory (no "I just emitted this", no mtime track, no in-flight flag). While the wrapper is mid-resume (~6 min), this produces hundreds of identical envelopes queued into the wrapper's spool via `send::deliver_body` → `spool::spool_message`. The spool itself does blind INSERTs (`src/common/spool.rs::spool_message` line 58–61, no `INSERT OR IGNORE`, no body-hash uniqueness).

  **(2) Consumer-side: read-error branch treats ENOENT identically to transient I/O failure.**
  `src/live/wrapper/mod.rs::process_file_drop` lines 1209–1218: every `fs::read_to_string` error falls into the same `[FILE-DROP] read failed ... retaining` branch. There is no `ErrorKind::NotFound` discriminator. When the wrapper drains 648 stale duplicates after consuming the first one and deleting the file, every drained envelope fires this branch, producing 648 misleading "retaining" log lines for a file that has been definitively consumed.

  The 17:44:05 → 17:44:07 timing in doyle's brief lines up exactly: 17:44:05 wrapper consumes the first envelope, reads file, runs resume, deletes file. ~2s later the wrapper's listener accepts a new TCP connection (or completes the next inner-poll iteration) and the queued spool drains — 649 envelopes flush through `handle_file_drop_arm` → `process_file_drop`, every one hitting ENOENT on the read.

fix: |
  **Recommendation: BOTH surfaces (defense in depth). Consumer-side fix is the minimum correctness fix; producer-side fix prevents the spool log spam and reduces wasted work.**

  ### (A) Consumer-side fix — discriminate ENOENT from real I/O errors

  **File:** `src/live/wrapper/mod.rs`
  **Lines:** 1209–1218 (the `let body = match std::fs::read_to_string(path)` block)

  Replace the catch-all retain with an explicit ENOENT branch:

  ```rust
  let body = match std::fs::read_to_string(path) {
      Ok(s) => s,
      Err(e) if e.kind() == std::io::ErrorKind::NotFound => {
          // File already consumed (e.g., earlier iteration of THIS wrapper
          // processed the file_drop and deleted it, then we drained queued
          // duplicates from the spool — see debug:
          // spool-retain-replays-file-drop-after-delete.md). Definitive drop:
          // there is no file to retry against, and the spool row has already
          // been marked delivered=1 by drain_all.
          self.log(&format!(
              "[FILE-DROP] dropped kind={} path={} (file already consumed; stale spool duplicate)",
              kind, path
          ));
          return FileDropOutcome::Continue;
      }
      Err(e) => {
          self.log(&format!(
              "[FILE-DROP] read failed kind={} path={} err={}; retaining",
              kind, path, e
          ));
          return FileDropOutcome::Continue;
      }
  };
  ```

  Note: both branches still return `Continue` (the spool row is already drained, there is no retry to gate). The change is purely semantic clarity in the log + intent: ENOENT is no longer mis-described as "retaining." Transient errors (permission, I/O) keep the old wording. This is a 6-line surgical change.

  ### (B) Producer-side fix — suppress duplicate emits while file is present

  **File:** `src/owl/poll.rs`
  **Lines:** 271–295 (the per-iteration `scan_drop_files` → `deliver_body` loop)

  Track per-listener-instance "emitted paths" in a `HashSet<String>` carried in the poll-loop scope. Suppress emit when the path is already in the set; clear the path from the set when the file disappears (stat returns absent in a subsequent scan).

  Sketch (additions only):
  ```rust
  // Before the loop:
  let mut emitted_drops: std::collections::HashSet<String> = std::collections::HashSet::new();

  // Inside the loop, replace the existing block:
  if let Some(ref cwd) = listener_cwd {
      let drops = scan_drop_files(cwd, id);
      let current_paths: std::collections::HashSet<String> =
          drops.iter().map(|(_, p)| p.clone()).collect();

      // Drop tracked paths that no longer exist (wrapper consumed + deleted them).
      emitted_drops.retain(|p| current_paths.contains(p));

      for (kind, abs_fs) in &drops {
          if !emitted_drops.insert(abs_fs.clone()) {
              // Already emitted this lifecycle of the file. Skip.
              continue;
          }
          let envelope = compose_file_drop_event(kind, abs_fs, id);
          // ... rest of existing emit block unchanged
      }
  }
  ```

  Semantics: emit-once per file lifecycle. If the wrapper fails to consume (file still present next iteration), the set still contains the path — no re-emit. This is safe because the wrapper's own D8 retain-on-error path already handles the retry: when the wrapper genuinely retains (e.g., resume_session non-zero exit), the message body is already in the wrapper's spool (delivered=1 but the in-memory dispatch already ran). A *new* envelope is not needed.

  **Caveat for surface (B):** If the wrapper crashes/restarts WITHOUT consuming the file, the new wrapper instance won't see a queued envelope, and Self's listener has the path "stuck" in `emitted_drops`. Mitigation: the `emitted_drops` set is per-poll-process-instance. When Self's listener restarts (or the wrapper is restarted via signoff/orphan/handoff and Self's listener also recycles), the set is fresh and the file is re-emitted. The cross-process case (wrapper crash, Self listener survives) needs a sentinel — out of scope for this fix; the consumer-side fix (A) already neutralises the symptom.

verification: |
  ### Test plan

  **Unit tests** (no network, no subprocess):

  1. `process_file_drop_drops_on_enoent` — call `process_file_drop("commune", "/nonexistent/path.md")` on a `WrapperState`-equivalent harness; assert log contains `dropped` and NOT `retaining`, return value is `Continue`, no spawn occurred. Add to `file_drop_handler_tests` mod at bottom of `src/live/wrapper/mod.rs`.

  2. `process_file_drop_retains_on_permission_denied` — Unix only: chmod 000 a tempfile, call `process_file_drop` against it, assert log contains `retaining` (legacy wording preserved for true I/O errors). Windows analogue via a locked-handle tempfile if feasible; otherwise unix-only `#[cfg(unix)]`.

  3. (Producer fix only) `scan_drop_files_emit_once_per_lifecycle` — drive the new emit-tracking loop with a mock that simulates: file present → emit, file present → suppressed, file absent → set clears, file present again → emit. Asserts `emitted_drops.len()` transitions {0,1,0,1}.

  **Integration test** (existing harness, no new infrastructure):

  4. Extend `tests/native_owl.rs` or add a new `tests/file_drop_dedupe.rs`: spawn a real Self listener, drop a commune file, sleep 3s (let listener emit several times), then stat the wrapper's spool.db and assert `pending_count <= 1` (producer fix) or — without producer fix — assert that draining the spool produces N envelopes but only ONE log line containing `dispatching` and the rest containing `dropped` (consumer fix).

  **Golden test:**

  5. Update `tests/golden/` for the new `[FILE-DROP] dropped` log line. The wording is new so existing fixtures won't drift unless they happened to capture an ENOENT replay (none should — the resolved-debug list does not show one).

  **Manual regression:**

  6. Reproduce doyle's gen 49 scenario: deliberately stall Psyche resume (e.g., temporarily set a long sleep in `resume_session_checked` or use a known slow prompt), drop a commune file, observe wrapper log. Expect: ONE `dispatching` line, ONE `consumed ... (deleted)` line, then N-1 `dropped ... (file already consumed; stale spool duplicate)` lines instead of `retaining` lines.

files_changed: []
```

## Risk / Alternatives Considered

### Alternative 1: Spool-level dedupe via `INSERT OR IGNORE` + body-hash index
Add a `body_hash` column with a UNIQUE index. `spool_message` becomes `INSERT OR IGNORE`.

**Rejected because:** (i) breaks legitimate duplicate user prose ("hello" sent twice should arrive twice); (ii) schema migration risk on existing spool.dbs across all deployed perches; (iii) overkill — the duplication is structural to file_drop only, not a general spool concern; (iv) `from_id == self_id` + identical body is also a valid pattern for self-pulse heartbeats in other phases.

### Alternative 2: Listener-side mtime tracking on the source file
Stat `.claude/{id}-{kind}.md` mtime, only emit when mtime advances.

**Rejected because:** the file is consumed-then-deleted by the wrapper, so the file is normally absent. mtime tracking adds complexity (need to remember last-seen mtime even across "file absent" gaps) and doesn't address the core issue: emit-once-per-lifecycle is what we want, and the HashSet-of-paths sketch in fix (B) is the minimal expression of that.

### Alternative 3: Consumer-side: drop the row but ALSO clear remaining file_drop envelopes from spool
Walk the wrapper's spool on ENOENT and delete all queued envelopes with `kind=` matching this kind.

**Rejected because:** races with concurrent producer emits (we'd delete a fresh emit that's actually about a new file lifecycle); requires the wrapper to hold a spool write handle while draining (currently it does not); adds non-local state mutation to what should be a per-envelope handler.

### Risk of fix (A) alone (consumer-side only)
**Low.** The change is 6 lines, return value identical (`Continue`), no fan-out. The only behavioural diff is the log wording: `[FILE-DROP] dropped` vs `[FILE-DROP] read failed ... retaining`. Operators reading logs see the correct semantic. Spool-row already-marked-delivered means no retry-loop ever existed; we're just renaming the no-op.

### Risk of fix (B) alone (producer-side only)
**Medium-low.** The `HashSet` is per-process; on listener restart the set is empty. If a wrapper crashes mid-consume (file still present), Self listener restart re-emits — that's correct behaviour. The edge case where Self listener restarts WHILE wrapper is alive AND file is present produces a re-emit even though one is already in the wrapper's spool — but the wrapper has dedupe via fix (A) (the second envelope would just hit ENOENT after first consume). So fix (A) covers fix (B)'s edge cases.

### Recommendation
**Land both, in order: (A) first (correctness + observability), (B) second (efficiency + log noise reduction).** They are independent commits. (A) alone fully neutralises the operational symptom doyle reported. (B) reduces wasted spool I/O and improves throughput during long resume turns.

## Specialist Hint

**Specialist Hint:** rust

## Resolution Note

- **resolved_at:** 2026-05-20
- **milestone:** v1.8
- **quick_id:** 260520-pra
- **surface_A_commit:** `c1adb43` (Surface A — consumer ENOENT discriminator in `process_file_drop`)
- **surface_B_commit:** `a9ba7d0` (Surface B — producer-side `dedupe_drops` HashSet in `owl::poll::run`)

**Summary:** Both surfaces landed; ENOENT no longer logs misleading
`retaining`, and Self listener emits at most once per file lifecycle
(HashSet-based path dedupe).
