Highest-impact revisions after reviewing your v5 plan: 1. **Fix a real scoping hole: embed can still process unrelated docs** Rationale: Current plan assumes scoped docs implies scoped embed, but that only holds while no other run creates unembedded docs. You explicitly release sync lock before embed, so another sync can enqueue/regenerate docs in between, and `run_embed` may embed unrelated backlog. This breaks surgical isolation and can hide backlog debt. ```diff diff --git a/plan.md b/plan.md @@ Design Constraints -3. Embed scoping: Embedding runs only for documents regenerated by this surgical run. Because `run_embed` processes only unembedded docs, scoping is automatic IF docs are scoped correctly... +3. Embed scoping: Embedding MUST be explicitly scoped to documents regenerated by this surgical run. + `run_generate_docs_for_sources` returns regenerated `document_ids`; surgical mode calls + `run_embed_for_document_ids(document_ids)` and never global `run_embed`. + This remains true even after lock release and under concurrent normal sync activity. @@ Step 9a: Implement `run_generate_docs_for_sources` -pub fn run_generate_docs_for_sources(...) -> Result { +pub fn run_generate_docs_for_sources(...) -> Result { + // Return regenerated document IDs for scoped embedding. + // GenerateDocsResult { regenerated, errored, regenerated_document_ids: Vec } @@ Step 9: Embed stage - match run_embed(config, false, false, None, signal).await { + match run_embed_for_document_ids(config, &result.regenerated_document_ids, signal).await { ``` 2. **Make run-ledger lifecycle actually durable (and consistent with your own constraint 10)** Rationale: Plan text says “reuse `SyncRunRecorder`”, but Step 9 writes raw SQL directly. That creates lifecycle drift, missing heartbeats, and inconsistent failure handling as code evolves. ```diff diff --git a/plan.md b/plan.md @@ Design Constraints -10. Durable run state: ... Reuses `SyncRunRecorder` row lifecycle ... +10. Durable run state: surgical sync MUST use `SyncRunRecorder` end-to-end (no ad-hoc SQL updates). + Add recorder APIs for `set_mode`, `set_phase`, `set_counters`, `finish_succeeded`, + `finish_failed`, `finish_cancelled`, and periodic `heartbeat`. @@ Step 9: Create `run_sync_surgical` - conn.execute("INSERT INTO sync_runs ...") - conn.execute("UPDATE sync_runs SET phase = ...") + let mut recorder = SyncRunRecorder::start_surgical(...)?; + recorder.set_phase("preflight")?; + recorder.heartbeat_if_due()?; + recorder.set_phase("ingest")?; + ... + recorder.finish_succeeded_with_warnings(...)?; ``` 3. **Add explicit `cancelled` terminal state** Rationale: Current early cancellation branches return `Ok(result)` without guaranteed run-row finalization. That leaves misleading `running` rows and weak crash diagnostics. ```diff diff --git a/plan.md b/plan.md @@ Design Constraints +15. Cancellation semantics: If shutdown is observed after run start, phase is set to `cancelled`, + status is `cancelled`, `finished_at` is written, and lock is released before return. @@ Step 8a migration +ALTER TABLE sync_runs ADD COLUMN warnings_count INTEGER NOT NULL DEFAULT 0; +ALTER TABLE sync_runs ADD COLUMN cancelled_at INTEGER; @@ Acceptance Criteria +47. Cancellation durability: Ctrl+C during surgical sync records `status='cancelled'`, + `phase='cancelled'`, and `finished_at` in `sync_runs`. ``` 4. **Reduce lock contention further by separating dependent fetch and dependent write** Rationale: You currently hold lock through network-heavy dependent stages. That maximizes contention and increases lock timeout risk. Better: fetch dependents unlocked, write in short locked transactions with per-entity freshness guards. ```diff diff --git a/plan.md b/plan.md @@ Design Constraints -11. Lock window minimization: ... held through ingest, dependents, and docs stages. +11. Lock window minimization: lock is held only for DB mutation windows. + Dependents run in two phases: + (a) fetch from GitLab without lock, + (b) write results under lock in short transactions. + Apply per-entity freshness checks before dependent writes. @@ Step 9: Dependent stages - // All dependents run INLINE per-entity ... while lock is held + // Dependents fetch outside lock, then write under lock with CAS-style watermark guards. ``` 5. **Introduce stage timeout budgets to prevent hung surgical runs** Rationale: A single slow GitLab endpoint can stall the whole run and hold resources too long. Timeout budgets plus per-entity failure recording keep the run bounded and predictable. ```diff diff --git a/plan.md b/plan.md @@ Design Constraints +16. Stage timeout budgets: each dependent fetch has a per-entity timeout and a global stage budget. + Timed-out entities are recorded in `entity_failures` with code `TIMEOUT` and run continues best-effort. @@ Step 9 notes + - Wrap dependent network calls with `tokio::time::timeout`. + - Add config knobs: + `sync.surgical_entity_timeout_seconds` (default 20), + `sync.surgical_dependents_budget_seconds` (default 120). ``` 6. **Add payload integrity checks (project mismatch hard-fail)** Rationale: Surgical mode is precision tooling. If API/proxy misconfiguration returns payloads from wrong project, you should fail preflight loudly, not trust downstream assumptions. ```diff diff --git a/plan.md b/plan.md @@ Step 7: preflight_fetch + // Integrity check: payload.project_id must equal requested gitlab_project_id. + // On mismatch, record EntityFailure { code: "PROJECT_MISMATCH", stage: "fetch" }. @@ Step 9d: error codes +PROJECT_MISMATCH -> usage/config data integrity failure (typed, machine-readable) @@ Acceptance Criteria +48. Project integrity: payloads with unexpected `project_id` are rejected in preflight + and produce zero content writes. ``` 7. **Upgrade robot output from aggregate-only to per-entity lifecycle** Rationale: `entity_failures` alone is not enough for robust automation. Agents need a complete entity outcome map (fetched, ingested, stale-skipped, dependent failures) to retry deterministically. ```diff diff --git a/plan.md b/plan.md @@ Step 15: Update `SyncResult` +pub struct EntityOutcome { + pub entity_type: String, + pub iid: u64, + pub fetched: bool, + pub ingested: bool, + pub stale_skipped: bool, + pub dependent_failures: Vec, +} @@ +pub entity_outcomes: Vec, +pub completion_status: String, // succeeded | succeeded_with_warnings | failed | cancelled @@ Robot mode - enables agents to detect partial failures via `entity_failures` + enables deterministic, per-IID retry and richer UI messaging. ``` 8. **Index `sync_runs` for real observability at scale** Rationale: You’re adding mode/phase/counters and then querying recent surgical runs. Without indexes, this degrades as run history grows. ```diff diff --git a/plan.md b/plan.md @@ Step 8a migration +CREATE INDEX IF NOT EXISTS idx_sync_runs_mode_started + ON sync_runs(mode, started_at DESC); +CREATE INDEX IF NOT EXISTS idx_sync_runs_status_phase_started + ON sync_runs(status, phase, started_at DESC); ``` 9. **Add tests specifically for the new failure-prone paths** Rationale: Current tests are strong on ingest and scoping, but still miss new high-risk runtime behavior (cancel state, timeout handling, scoped embed under concurrency). ```diff diff --git a/plan.md b/plan.md @@ Step 1f tests +#[tokio::test] +async fn cancellation_marks_sync_run_cancelled() { ... } + +#[tokio::test] +async fn dependent_timeout_records_entity_failure_and_continues() { ... } + +#[tokio::test] +async fn scoped_embed_does_not_embed_unrelated_docs_created_after_docs_stage() { ... } @@ Acceptance Criteria +49. Scoped embed isolation under concurrency is verified by automated test. +50. Timeout path is verified (TIMEOUT code + continued processing). ``` These revisions keep your core direction intact, avoid every rejected recommendation, and materially improve correctness under concurrency, operational observability, and agent automation quality.