gitlore/docs/plan-surgical-sync.feedback-4.md at git-5-search-output

Files

teernisse a1bca10408 feat(cli): implement 'lore file-history' command (bd-z94)

Adds file-history command showing which MRs touched a file, with:
- Rename chain resolution via BFS (resolve_rename_chain from bd-1yx)
- DiffNote discussion snippets with --discussions flag
- --merged filter, --no-follow-renames, -n limit
- Human output with styled MR list and rename chain display
- Robot JSON output with {ok, data, meta} envelope
- Autocorrect registry and robot-docs manifest entry
- Fixes pre-existing --no-status missing from sync autocorrect registry

2026-02-17 12:57:56 -05:00

9.3 KiB

Raw Permalink Blame History

Resolve the current contract contradictions (preflight-only, dry-run, sync_runs)

Why this improves the plan:

Right now constraints conflict: “zero DB writes before commit” vs inserting sync_runs during preflight.
This ambiguity will cause implementation drift and flaky acceptance tests.
Splitting control-plane writes from content-plane writes keeps safety guarantees strict while preserving observability.

@@ ## Design Constraints
-6. **Preflight-then-commit**: All remote fetches happen BEFORE any DB writes. If any IID fetch fails (404, network error), the entire operation aborts with zero DB mutations.
+6. **Preflight-then-commit (content-plane)**: All remote fetches happen BEFORE any writes to content tables (`issues`, `merge_requests`, `discussions`, `resource_events`, `documents`, `embeddings`).
+7. **Control-plane exception**: `sync_runs` / `sync_run_entities` writes are allowed during preflight for observability and crash diagnostics.
@@
-11. **Preflight-only mode**: `--preflight-only` validates remote entity existence and permissions with zero DB writes.
+11. **Preflight-only mode**: `--preflight-only` performs zero content writes; control-plane run-ledger writes are allowed.
@@ ### For me to evaluate (functional):
-24. **Preflight-only mode** ... no DB mutations beyond the sync_runs ledger entry
+24. **Preflight-only mode** ... no content DB mutations; only run-ledger rows may be written

Add stale-write protection to avoid TOCTOU regressions during unlocked preflight

Why this improves the plan:

You intentionally preflight without lock; that’s good for throughput but introduces race risk.
Without a guard, a slower surgical run can overwrite newer data ingested by a concurrent normal sync.
This is a correctness bug under contention, not a nice-to-have.

@@ ## Design Constraints
+12. **Stale-write protection**: Surgical ingest MUST NOT overwrite fresher local rows. If local `updated_at` is newer than the preflight payload’s `updated_at`, skip that entity and record `skipped_stale`.
@@ ## Step 7: Create `src/ingestion/surgical.rs`
-    let labels_created = process_single_issue(conn, config, project_id, issue)?;
+    // Skip stale payloads to avoid TOCTOU overwrite after unlocked preflight.
+    if is_local_newer_issue(conn, project_id, issue.iid, issue.updated_at)? {
+        result.skipped_stale += 1;
+        return Ok(result);
+    }
+    let labels_created = process_single_issue(conn, config, project_id, issue)?;
@@
+// same guard for MR path
@@ ## Step 15: Update `SyncResult`
+    /// Entities skipped because local row was newer than preflight payload
+    pub skipped_stale: usize,
@@ ### Edge cases to verify:
+38. **TOCTOU safety**: if a normal sync updates entity after preflight but before ingest, surgical run skips stale payload (no overwrite)

Make dirty-source scoping exact (do not capture pre-existing rows for same entity)

Why this improves the plan:

Current “query dirty rows by source_id after ingest” can accidentally include older dirty rows for the same entity.
That silently violates strict run scoping and can delete unrelated backlog rows.
You can fix this without adding origin_run_id to dirty_sources (which you already rejected).

@@ ## Step 7: Create `src/ingestion/surgical.rs`
-    // Collect dirty_source rows for this entity
-    let mut stmt = conn.prepare(
-        "SELECT id FROM dirty_sources WHERE source_type = 'issue' AND source_id = ?1"
-    )?;
+    // Capture only rows inserted by THIS call using high-water mark.
+    let before_dirty_id: i64 = conn.query_row(
+        "SELECT COALESCE(MAX(id), 0) FROM dirty_sources",
+        [], |r| r.get(0),
+    )?;
+    // ... call process_single_issue ...
+    let mut stmt = conn.prepare(
+        "SELECT id FROM dirty_sources
+         WHERE id > ?1 AND source_type = 'issue' AND source_id = ?2"
+    )?;
@@
+    // same pattern for MR
@@ ### 1d. Scoping invariant tests
+#[test]
+fn surgical_docs_scope_ignores_preexisting_dirty_rows_for_same_entity() {
+    // pre-insert dirty row for iid=7, then surgical ingest iid=7
+    // assert result.dirty_source_ids only contains newly inserted rows
+}

Fix embed-stage leakage when --no-docs is used in surgical mode

Why this improves the plan:

Current design can run global embed even when docs stage is skipped, which may embed unrelated backlog docs.
That breaks the surgical “scope only this run” promise.
This is both correctness and operator-trust critical.

@@ ## Step 9: Create `run_sync_surgical`
-        if !options.no_embed {
+        // Surgical embed only runs when surgical docs actually regenerated docs in this run.
+        if !options.no_embed && !options.no_docs && result.documents_regenerated > 0 {
@@ ## Step 4: Wire new fields in `handle_sync_cmd`
+        if options.is_surgical() && options.no_docs && !options.no_embed {
+            return Err(Box::new(LoreError::Other(
+                "In surgical mode, --no-docs requires --no-embed (to preserve scoping guarantees)".to_string()
+            )));
+        }
@@ ### For me to evaluate
+39. **No embed leakage**: `sync --issue X --no-docs` never embeds unrelated unembedded docs

Add queue-failure hygiene so scoped jobs do not leak forever

Why this improves the plan:

Scoped drains prevent accidental processing, but failed runs can strand pending jobs permanently.
You need explicit terminalization (aborted) and optional replay mechanics.
Otherwise queue bloat and confusing diagnostics accumulate.

@@ ## Step 8a: Add `sync_runs` table migration
+ALTER TABLE dependent_queue ADD COLUMN aborted_reason TEXT;
+-- status domain now includes: pending, claimed, done, failed, aborted
@@ ## Step 9: run_sync_surgical failure paths
+// On run failure/cancel:
+conn.execute(
+  "UPDATE dependent_queue
+   SET status='aborted', aborted_reason=?1
+   WHERE project_id=?2 AND scope_run_id=?3 AND status='pending'",
+  rusqlite::params![failure_summary, project_id, run_id],
+)?;
@@ ## Acceptance Criteria
+40. **No stranded scoped jobs**: failed surgical runs leave no `pending` rows for their `scope_run_id`

Persist per-entity lifecycle (sync_run_entities) for real observability and deterministic retry

Why this improves the plan:

sync_runs alone gives aggregate counters but not which IID failed at which stage.
Per-entity records make retries deterministic and robot output far more useful.
This is the missing piece for your stated “deterministic retry decisions.”

@@ ## Step 8a: Add `sync_runs` table migration
+CREATE TABLE IF NOT EXISTS sync_run_entities (
+  id INTEGER PRIMARY KEY,
+  run_id TEXT NOT NULL REFERENCES sync_runs(run_id),
+  entity_type TEXT NOT NULL CHECK(entity_type IN ('issue','merge_request')),
+  iid INTEGER NOT NULL,
+  stage TEXT NOT NULL,
+  status TEXT NOT NULL CHECK(status IN ('ok','failed','skipped_stale')),
+  error_code TEXT,
+  error_message TEXT,
+  updated_at INTEGER NOT NULL
+);
+CREATE INDEX IF NOT EXISTS idx_sync_run_entities_run ON sync_run_entities(run_id, entity_type, iid);
@@ ## Step 15: Update `SyncResult`
+    pub failed_iids: Vec<(String, u64)>,
+    pub skipped_stale_iids: Vec<(String, u64)>,
@@ ## CLI Interface
+lore --robot sync-runs --run-id <id>
+lore --robot sync-runs --run-id <id> --retry-failed

Use explicit error type for surgical preflight failures (not LoreError::Other)

Why this improves the plan:

Other(String) loses machine semantics, weakens robot mode, and leads to bad exit-code behavior.
A typed error preserves structured failures and enables actionable recovery commands.

@@ ## Step 9: run_sync_surgical
-            return Err(LoreError::Other(
-                format!("Surgical preflight failed for {} of {} IIDs: {}", ...)
-            ).into());
+            return Err(LoreError::SurgicalPreflightFailed {
+                run_id: run_id.to_string(),
+                total: total_items,
+                failures: preflight.failures.clone(),
+            }.into());
@@ ## Step 15: Update `SyncResult`
+    /// Machine-actionable error summary for robot mode
+    pub error_code: Option<String>,
@@ ## Acceptance Criteria
+41. **Typed failure**: preflight failures serialize structured errors (not generic `Other`) with machine-usable codes/actions

Strengthen tests for rollback, contention, and stale-skip guarantees

Why this improves the plan:

Current tests cover many happy-paths and scoping invariants, but key race/rollback behaviors are still under-tested.
These are exactly where regressions will appear first in production.

@@ ## Step 1: TDD — Write Failing Tests First
+### 1f. Transactional rollback + TOCTOU tests
+1. `preflight_success_then_ingest_failure_rolls_back_all_content_writes`
+2. `stale_payload_is_skipped_when_local_updated_at_is_newer`
+3. `failed_run_aborts_pending_scoped_jobs`
+4. `surgical_no_docs_requires_no_embed`
@@ ### Automated scoping invariants
-38. **Scoped queue/docs invariants are enforced by automated tests**
+42. **Rollback and race invariants are enforced by automated tests** (no partial writes on ingest failure, no stale overwrite)

These eight revisions keep your core approach intact, avoid your explicitly rejected ideas, and close the biggest correctness/operability gaps before implementation.

9.3 KiB Raw Permalink Blame History Unescape Escape

9.3 KiB

Raw Permalink Blame History