Below are the strongest **new** revisions I’d make (excluding everything in your rejected list), with rationale and plan-level diffs. ### 1. Add a durable run ledger (`sync_runs`) with phase state This makes surgical sync crash-resumable, auditable, and safer under Ctrl+C. Right now `run_id` is mostly ephemeral; persisting phase state removes ambiguity about what completed. ```diff @@ Design Constraints +9. **Durable run state**: Surgical sync MUST persist a `sync_runs` row keyed by `run_id` + with phase transitions (`preflight`, `ingest`, `dependents`, `docs`, `embed`, `done`, `failed`). + This is required for crash recovery, observability, and deterministic retries. @@ Step 9: Create `run_sync_surgical` +Before Stage 0, insert `sync_runs(run_id, project_id, mode='surgical', requested_counts, started_at)`. +After each stage, update `sync_runs.phase`, counters, and `last_error` if present. +On success/failure, set terminal state (`done`/`failed`) and `finished_at`. ``` ### 2. Add `--preflight-only` (network validation without writes) `--dry-run` is intentionally zero-network, so it cannot validate IIDs. `--preflight-only` is high-value for agents: verifies existence/permissions quickly with no DB mutation. ```diff @@ CLI Interface lore sync --dry-run --issue 123 -p myproject +lore sync --preflight-only --issue 123 -p myproject @@ Step 2: Add `--issue`, `--mr`, `-p` to `SyncArgs` + /// Validate remote entities and auth without any DB writes + #[arg(long, default_value_t = false)] + pub preflight_only: bool, @@ Step 10: Add branch in `run_sync` +if options.preflight_only && options.is_surgical() { + return run_sync_surgical_preflight_only(config, &options, run_id, signal).await; +} ``` ### 3. Preflight should aggregate all missing/failed IIDs, not fail-fast Fail-fast causes repeated reruns. Aggregating errors gives one-shot correction and better robot automation. ```diff @@ Step 7: Create `src/ingestion/surgical.rs` -/// Returns the fetched payloads. If ANY fetch fails, the entire operation should abort. +/// Returns fetched payloads plus per-IID failures; caller aborts writes if failures exist. pub async fn preflight_fetch(...) -> Result { @@ #[derive(Debug, Default)] pub struct PreflightResult { pub issues: Vec, pub merge_requests: Vec, + pub failures: Vec, // stage="fetch" } @@ Step 9: Create `run_sync_surgical` -let preflight = preflight_fetch(...).await?; +let preflight = preflight_fetch(...).await?; +if !preflight.failures.is_empty() { + result.entity_failures = preflight.failures; + return Err(LoreError::Other("Surgical preflight failed for one or more IIDs".into()).into()); +} ``` ### 4. Stop filtering scoped queue drains with raw `json_extract` scans `json_extract(payload_json, '$.scope_run_id')` in hot drain queries will degrade as queue grows. Use indexed scope metadata. ```diff @@ Step 9b: Implement scoped drain helpers -// claim query adds: -// AND json_extract(payload_json, '$.scope_run_id') = ? +// Add migration: +// 1) Add `scope_run_id` generated/stored column derived from payload_json (or explicit column) +// 2) Create index on (project_id, job_type, scope_run_id, status, id) +// Scoped drains filter by indexed `scope_run_id`, not full-table JSON extraction. ``` ### 5. Replace `dirty_source_ids` collection-by-query with explicit run scoping Current approach can accidentally include prior dirty rows for same source and can duplicate work. Tag dirty rows with `origin_run_id` and consume by run. ```diff @@ Design Constraints -2. **Dirty queue scoping**: ... MUST call ... `run_generate_docs_for_dirty_ids` +2. **Dirty queue scoping**: Surgical sync MUST scope docs by `origin_run_id` on `dirty_sources` + (or equivalent exact run marker) and MUST NOT drain unrelated dirty rows. @@ Step 7: `SurgicalIngestResult` - pub dirty_source_ids: Vec, + pub origin_run_id: String, @@ Step 9a: Implement `run_generate_docs_for_dirty_ids` -pub fn run_generate_docs_for_dirty_ids(config: &Config, dirty_source_ids: &[i64]) -> Result<...> +pub fn run_generate_docs_for_run_id(config: &Config, run_id: &str) -> Result<...> ``` ### 6. Enforce transaction safety at the type boundary `unchecked_transaction()` + `&Connection` signatures is fragile. Accept `&Transaction` for ingest internals and use `TransactionBehavior::Immediate` for deterministic lock behavior. ```diff @@ Step 7: Create `src/ingestion/surgical.rs` -pub fn ingest_issue_by_iid_from_payload(conn: &Connection, ...) +pub fn ingest_issue_by_iid_from_payload(tx: &rusqlite::Transaction<'_>, ...) -pub fn ingest_mr_by_iid_from_payload(conn: &Connection, ...) +pub fn ingest_mr_by_iid_from_payload(tx: &rusqlite::Transaction<'_>, ...) -let tx = conn.unchecked_transaction()?; +let tx = conn.transaction_with_behavior(rusqlite::TransactionBehavior::Immediate)?; ``` ### 7. Acquire sync lock only for mutation phases, not remote preflight This materially reduces lock contention and keeps normal sync throughput higher, while still guaranteeing mutation serialization. ```diff @@ Design Constraints +10. **Lock window minimization**: Preflight fetch runs without sync lock; lock is acquired immediately + before first DB mutation and held through all mutation stages. @@ Step 9: Create `run_sync_surgical` -// ── Acquire sync lock ── -... -// ── Stage 0: Preflight fetch ── +// ── Stage 0: Preflight fetch (no lock, no writes) ── ... +// ── Acquire sync lock just before Stage 1 mutation ── ``` ### 8. Add explicit transient retry policy beyond 429 Client already handles rate limits; surgical reliability improves a lot if 5xx/timeouts are retried with bounded backoff. ```diff @@ Design Constraints +11. **Transient retry policy**: Preflight and dependent remote fetches MUST retry boundedly on + timeout/5xx with jittered backoff; permanent errors (404/401/403) fail immediately. @@ Step 5: Add `get_issue_by_iid` / `get_mr_by_iid` +Document retry behavior for transient transport/server failures. ``` ### 9. Tighten automated tests around scoping invariants You already list manual checks; these should be enforced in unit/integration tests to prevent regressions. ```diff @@ Step 1: TDD — Write Failing Tests First +### 1d. New invariants tests +- `surgical_docs_scope_ignores_preexisting_dirty_rows` +- `scoped_queue_drain_ignores_orphaned_jobs` +- `preflight_aggregates_multiple_missing_iids` +- `preflight_only_performs_zero_writes` +- `dry_run_performs_zero_network_calls` +- `lock_window_does_not_block_during_preflight` @@ Acceptance Criteria +32. Scoped queue/docs invariants are covered by automated tests (not manual-only verification). ``` ### 10. Make robot-mode surgical output first-class For agent workflows, include full stage telemetry and actionable recovery commands. ```diff @@ Step 15: Update `SyncResult` for robot mode structured output + /// Per-stage elapsed ms for deterministic performance tracking + pub stage_timings_ms: std::collections::BTreeMap, + /// Suggested recovery commands (robot ergonomics) + pub recovery_actions: Vec, @@ Step 14: Update `robot-docs` manifest +Document surgical-specific error codes and `actions` schema for automated recovery. ``` If you want, I can now produce a fully rewritten **iteration 3** plan that merges these into your current structure end-to-end.