1. **Resolve the current contract contradictions (`preflight-only`, `dry-run`, `sync_runs`)**

Why this improves the plan:
- Right now the constraints conflict: “zero DB writes before commit” vs. inserting `sync_runs` during preflight.
- This ambiguity will cause implementation drift and flaky acceptance tests.
- Splitting control-plane writes from content-plane writes keeps the safety guarantees strict while preserving observability.

```diff
@@ ## Design Constraints
-6. **Preflight-then-commit**: All remote fetches happen BEFORE any DB writes. If any IID fetch fails (404, network error), the entire operation aborts with zero DB mutations.
+6. **Preflight-then-commit (content-plane)**: All remote fetches happen BEFORE any writes to content tables (`issues`, `merge_requests`, `discussions`, `resource_events`, `documents`, `embeddings`).
+7. **Control-plane exception**: `sync_runs` / `sync_run_entities` writes are allowed during preflight for observability and crash diagnostics.
@@
-11. **Preflight-only mode**: `--preflight-only` validates remote entity existence and permissions with zero DB writes.
+11. **Preflight-only mode**: `--preflight-only` performs zero content writes; control-plane run-ledger writes are allowed.
@@ ### For me to evaluate (functional):
-24. **Preflight-only mode** ... no DB mutations beyond the sync_runs ledger entry
+24. **Preflight-only mode** ... no content DB mutations; only run-ledger rows may be written
```

---

2. **Add stale-write protection to avoid TOCTOU regressions during unlocked preflight**

Why this improves the plan:
- You intentionally preflight without a lock; that's good for throughput but introduces race risk.
- Without a guard, a slower surgical run can overwrite newer data ingested by a concurrent normal sync.
- This is a correctness bug under contention, not a nice-to-have.

```diff
@@ ## Design Constraints
+12. **Stale-write protection**: Surgical ingest MUST NOT overwrite fresher local rows. If local `updated_at` is newer than the preflight payload's `updated_at`, skip that entity and record `skipped_stale`.
@@ ## Step 7: Create `src/ingestion/surgical.rs`
-    let labels_created = process_single_issue(conn, config, project_id, issue)?;
+    // Skip stale payloads to avoid TOCTOU overwrite after unlocked preflight.
+    if is_local_newer_issue(conn, project_id, issue.iid, issue.updated_at)? {
+        result.skipped_stale += 1;
+        return Ok(result);
+    }
+    let labels_created = process_single_issue(conn, config, project_id, issue)?;
@@
+    // same guard for MR path
@@ ## Step 15: Update `SyncResult`
+    /// Entities skipped because local row was newer than preflight payload
+    pub skipped_stale: usize,
@@ ### Edge cases to verify:
+38. **TOCTOU safety**: if a normal sync updates an entity after preflight but before ingest, the surgical run skips the stale payload (no overwrite)
```

---

3. **Make dirty-source scoping exact (do not capture pre-existing rows for the same entity)**

Why this improves the plan:
- The current approach (“query dirty rows by `source_id` after ingest”) can accidentally include older dirty rows for the same entity.
- That silently violates strict run scoping and can delete unrelated backlog rows.
- You can fix this without adding `origin_run_id` to `dirty_sources` (which you already rejected).

```diff
@@ ## Step 7: Create `src/ingestion/surgical.rs`
-    // Collect dirty_source rows for this entity
-    let mut stmt = conn.prepare(
-        "SELECT id FROM dirty_sources WHERE source_type = 'issue' AND source_id = ?1"
-    )?;
+    // Capture only rows inserted by THIS call using a high-water mark.
+    let before_dirty_id: i64 = conn.query_row(
+        "SELECT COALESCE(MAX(id), 0) FROM dirty_sources",
+        [], |r| r.get(0),
+    )?;
+    // ... call process_single_issue ...
+    let mut stmt = conn.prepare(
+        "SELECT id FROM dirty_sources
+         WHERE id > ?1 AND source_type = 'issue' AND source_id = ?2"
+    )?;
@@
+    // same pattern for MR
@@ ### 1d. Scoping invariant tests
+#[test]
+fn surgical_docs_scope_ignores_preexisting_dirty_rows_for_same_entity() {
+    // pre-insert dirty row for iid=7, then surgical ingest iid=7
+    // assert result.dirty_source_ids only contains newly inserted rows
+}
```

---

4. **Fix embed-stage leakage when `--no-docs` is used in surgical mode**

Why this improves the plan:
- The current design can run the global embed stage even when the docs stage is skipped, which may embed unrelated backlog docs.
- That breaks the surgical “scope only this run” promise.
- This is both correctness- and operator-trust-critical.

```diff
@@ ## Step 9: Create `run_sync_surgical`
-    if !options.no_embed {
+    // Surgical embed only runs when surgical docs actually regenerated docs in this run.
+    if !options.no_embed && !options.no_docs && result.documents_regenerated > 0 {
@@ ## Step 4: Wire new fields in `handle_sync_cmd`
+    if options.is_surgical() && options.no_docs && !options.no_embed {
+        return Err(Box::new(LoreError::Other(
+            "In surgical mode, --no-docs requires --no-embed (to preserve scoping guarantees)".to_string()
+        )));
+    }
@@ ### For me to evaluate
+39. **No embed leakage**: `sync --issue X --no-docs` never embeds unrelated unembedded docs
```

---

5. **Add queue-failure hygiene so scoped jobs do not leak forever**

Why this improves the plan:
- Scoped drains prevent accidental processing, but failed runs can strand pending jobs permanently.
- You need an explicit terminal state (`aborted`) and optional replay mechanics.
- Otherwise queue bloat and confusing diagnostics accumulate.
```diff
@@ ## Step 8a: Add `sync_runs` table migration
+ALTER TABLE dependent_queue ADD COLUMN aborted_reason TEXT;
+-- status domain now includes: pending, claimed, done, failed, aborted
@@ ## Step 9: run_sync_surgical failure paths
+// On run failure/cancel:
+conn.execute(
+    "UPDATE dependent_queue
+     SET status='aborted', aborted_reason=?1
+     WHERE project_id=?2 AND scope_run_id=?3 AND status='pending'",
+    rusqlite::params![failure_summary, project_id, run_id],
+)?;
@@ ## Acceptance Criteria
+40. **No stranded scoped jobs**: failed surgical runs leave no `pending` rows for their `scope_run_id`
```

---

6. **Persist per-entity lifecycle (`sync_run_entities`) for real observability and deterministic retry**

Why this improves the plan:
- `sync_runs` alone gives aggregate counters, but not which IID failed at which stage.
- Per-entity records make retries deterministic and robot output far more useful.
- This is the missing piece for your stated “deterministic retry decisions.”

```diff
@@ ## Step 8a: Add `sync_runs` table migration
+CREATE TABLE IF NOT EXISTS sync_run_entities (
+    id INTEGER PRIMARY KEY,
+    run_id TEXT NOT NULL REFERENCES sync_runs(run_id),
+    entity_type TEXT NOT NULL CHECK(entity_type IN ('issue','merge_request')),
+    iid INTEGER NOT NULL,
+    stage TEXT NOT NULL,
+    status TEXT NOT NULL CHECK(status IN ('ok','failed','skipped_stale')),
+    error_code TEXT,
+    error_message TEXT,
+    updated_at INTEGER NOT NULL
+);
+CREATE INDEX IF NOT EXISTS idx_sync_run_entities_run ON sync_run_entities(run_id, entity_type, iid);
@@ ## Step 15: Update `SyncResult`
+    pub failed_iids: Vec<(String, u64)>,
+    pub skipped_stale_iids: Vec<(String, u64)>,
@@ ## CLI Interface
+lore --robot sync-runs --run-id <RUN_ID>
+lore --robot sync-runs --run-id <RUN_ID> --retry-failed
```

---

7. **Use an explicit error type for surgical preflight failures (not `LoreError::Other`)**

Why this improves the plan:
- `Other(String)` loses machine semantics, weakens robot mode, and leads to bad exit-code behavior.
- A typed error preserves structured failures and enables actionable recovery commands.

```diff
@@ ## Step 9: run_sync_surgical
-    return Err(LoreError::Other(
-        format!("Surgical preflight failed for {} of {} IIDs: {}", ...)
-    ).into());
+    return Err(LoreError::SurgicalPreflightFailed {
+        run_id: run_id.to_string(),
+        total: total_items,
+        failures: preflight.failures.clone(),
+    }.into());
@@ ## Step 15: Update `SyncResult`
+    /// Machine-actionable error summary for robot mode
+    pub error_code: Option<String>,
@@ ## Acceptance Criteria
+41. **Typed failure**: preflight failures serialize structured errors (not generic `Other`) with machine-usable codes/actions
```

---

8. **Strengthen tests for rollback, contention, and stale-skip guarantees**

Why this improves the plan:
- Current tests cover many happy paths and scoping invariants, but key race/rollback behaviors are still under-tested.
- These are exactly where regressions will appear first in production.

```diff
@@ ## Step 1: TDD — Write Failing Tests First
+### 1f. Transactional rollback + TOCTOU tests
+1. `preflight_success_then_ingest_failure_rolls_back_all_content_writes`
+2. `stale_payload_is_skipped_when_local_updated_at_is_newer`
+3. `failed_run_aborts_pending_scoped_jobs`
+4. `surgical_no_docs_requires_no_embed`
@@ ### Automated scoping invariants
-38. **Scoped queue/docs invariants are enforced by automated tests**
+42. **Rollback and race invariants are enforced by automated tests** (no partial writes on ingest failure, no stale overwrite)
```

---

These eight revisions keep your core approach intact, avoid your explicitly rejected ideas, and close the biggest correctness/operability gaps before implementation.
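To make the stale-write semantics of revision 2 concrete, here is a minimal, dependency-free sketch of the guard's decision rule. The function name `is_local_newer_issue` follows the diff; the `HashMap` is a stand-in for the `issues` table (in the real plan this would be a rusqlite query against local `updated_at`), so treat it as an illustration of the comparison logic only:

```rust
use std::collections::HashMap;

/// Stand-in for the local `issues` table: (project_id, iid) -> local updated_at.
type LocalIssues = HashMap<(i64, i64), i64>;

/// True when the local row is STRICTLY newer than the preflight payload,
/// i.e. a concurrent normal sync won the race and surgical ingest must
/// skip this entity and count it as `skipped_stale`.
fn is_local_newer_issue(
    local: &LocalIssues,
    project_id: i64,
    iid: i64,
    payload_updated_at: i64,
) -> bool {
    match local.get(&(project_id, iid)) {
        // Equal timestamps are NOT stale: re-ingesting the same payload is idempotent.
        Some(&ts) => ts > payload_updated_at,
        // No local row at all can never be "newer"; ingest proceeds.
        None => false,
    }
}

fn main() {
    let mut local = LocalIssues::new();
    local.insert((1, 7), 2000); // a concurrent normal sync wrote updated_at = 2000

    // Preflight payload captured at t=1000, before the concurrent sync: skip.
    assert!(is_local_newer_issue(&local, 1, 7, 1000));
    // Payload fresher than the local row: safe to overwrite.
    assert!(!is_local_newer_issue(&local, 1, 7, 3000));
    // Entity never seen locally: ingest proceeds.
    assert!(!is_local_newer_issue(&local, 1, 99, 1000));
    println!("stale-guard checks passed");
}
```

The strict `>` comparison is deliberate: an equal timestamp means the payload is the same revision the local row already holds, so re-ingesting it cannot lose data, whereas skipping on `>=` would make legitimate re-syncs of unchanged entities report `skipped_stale`.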