Add the ability to sync specific issues or merge requests by IID without
running a full incremental sync. This enables fast, targeted data refresh
for individual entities — useful for agent workflows, debugging, and
real-time investigation of specific issues or MRs.
Architecture:
- New CLI flags: --issue <IID> and --mr <IID> (repeatable, up to 100 total)
scoped to a single project via -p/--project
- Preflight phase validates all IIDs exist on GitLab before any DB writes,
with TOCTOU-aware soft verification at ingest time
- 6-stage pipeline: preflight -> fetch -> ingest -> dependents -> docs -> embed
- Each stage is cancellation-aware via ShutdownSignal
- Dedicated SyncRunRecorder extensions track surgical-specific counters
(issues_fetched, mrs_ingested, docs_regenerated, etc.)
New modules:
- src/ingestion/surgical.rs: Core surgical fetch/ingest/dependent logic
with preflight_fetch(), ingest_issue_by_iid(), ingest_mr_by_iid(),
and fetch_dependents_for_{issue,mr}()
- src/cli/commands/sync_surgical.rs: Full CLI orchestrator with progress
spinners, human/robot output, and cancellation handling
- src/embedding/pipeline.rs: embed_documents_by_ids() for scoped embedding
- src/documents/regenerator.rs: regenerate_dirty_documents_for_sources()
for scoped document regeneration
Database changes:
- Migration 027: Extends sync_runs with mode, phase, surgical_iids_json,
per-entity counters, and cancelled_at column
- New indexes: idx_sync_runs_mode_started, idx_sync_runs_status_phase_started
GitLab client:
- get_issue_by_iid() and get_mr_by_iid() single-entity fetch methods
Error handling:
- New SurgicalPreflightFailed error variant with entity_type, iid, project,
and reason fields. Shares exit code 6 with GitLabNotFound.
Includes comprehensive test coverage:
- 645 lines of surgical ingestion tests (wiremock-based)
- 184 lines of scoped embedding tests
- 85 lines of scoped regeneration tests
- 113 lines of GitLab client single-entity tests
- 236 lines of sync_run surgical column/counter tests
- Unit tests for SyncOptions, error codes, and CLI validation
Move inline #[cfg(test)] mod tests { ... } blocks from 22 source files
into dedicated _tests.rs companion files, wired via:
#[cfg(test)]
#[path = "module_tests.rs"]
mod tests;
This keeps implementation-focused source files leaner and more scannable
while preserving full access to private items through `use super::*;`.
Modules extracted:
core: db, note_parser, payloads, project, references, sync_run,
timeline_collect, timeline_expand, timeline_seed
cli: list (55 tests), who (75 tests)
documents: extractor (43 tests), regenerator
embedding: change_detector, chunking
gitlab: graphql (wiremock async tests), transformers/issue
ingestion: dirty_tracker, discussions, issues, mr_diffs
Also adds conflicts_with("explain_score") to the --detail flag in the
who command to prevent mutually exclusive flags from being combined.
All 629 unit tests pass. No behavior changes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- regenerator: Include project_id in the ON CONFLICT UPDATE clause for
document upserts. Previously, if a document moved between projects
(e.g., during re-ingestion), the project_id would remain stale.
- truncation: Compute the omission marker ("N notes omitted") before
checking whether first+last notes fit in the budget. The old order
computed the marker after the budget check, meaning the marker's byte
cost was unaccounted for and could cause over-budget output.
- ollama: Tighten model name matching to require either an exact match
or a colon-delimited tag prefix (model == name or name starts with
"model:"). The prior starts_with check would false-positive on
"nomic-embed-text-v2" when looking for "nomic-embed-text". Tests
updated to cover exact match, tagged, wrong model, and prefix
false-positive cases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The document regenerator was making two queries per document:
1. get_existing_hash() — SELECT content_hash
2. upsert_document_inner() — SELECT id, content_hash, labels_hash, paths_hash
Query 2 already returns the content_hash needed for change detection.
Remove get_existing_hash() entirely and compute content_changed inside
upsert_document_inner() from the existing row data.
upsert_document_inner now returns Result<bool> (true = content changed)
which propagates up through upsert_document and regenerate_one,
replacing the separate pre-check. The triple-hash fast-path (all three
hashes match → return Ok(false) with no writes) is preserved.
This halves the query count for unchanged documents, which dominate
incremental syncs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace individual INSERT-per-label and INSERT-per-path loops in
upsert_document_inner with single multi-row INSERT statements. For a
document with 5 labels, this reduces 5 SQL round-trips to 1.
Replace format!()+push_str() with writeln!() in all three document
extractors (issue, MR, discussion). writeln! writes directly into the
String buffer, avoiding the intermediate allocation that format!
creates. Benchmarked at ~1.9x faster for string building and ~1.6x
faster for batch inserts (measured over 5k iterations in-memory).
Also switch get_existing_hash from prepare() to prepare_cached() since
it is called once per document during regeneration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three defensive improvements from peer code review:
Replace unreachable!() in GitLab client retry loops:
Both request() and request_with_headers() had unreachable!() after
their for loops. While the logic was sound (the final iteration always
reaches the return/break), any refactor to the loop condition would
turn this into a runtime panic. Restructured both to store
last_response with explicit break, making the control flow
self-documenting and the .expect() message useful if ever violated.
Doctor model name comparison asymmetry:
Ollama model names were stripped of their tag (:latest, :v1.5) for
comparison, but the configured model name was compared as-is. A config
value like "nomic-embed-text:v1.5" would never match. Now strips the
tag from both sides before comparing.
Regenerator savepoint cleanup and progress accuracy:
- upsert_document's error path did ROLLBACK TO but never RELEASE,
leaving a dangling savepoint that could nest on the next call. Added
RELEASE after rollback so the connection is clean.
- estimated_total for progress reporting was computed once at start but
the dirty queue can grow during processing. Now recounts each loop
iteration with max() so the progress fraction never goes backwards.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add end-to-end observability to the sync and ingest pipelines:
Sync command:
- Generate UUID-based run_id for each sync invocation, propagated through
all child spans for log correlation across stages
- Accept MetricsLayer reference to extract hierarchical StageTiming data
after pipeline completion for robot-mode performance output
- Record sync runs in DB via SyncRunRecorder (start/succeed/fail lifecycle)
- Wrap entire sync execution in a root tracing span with run_id field
Ingest command:
- Wrap run_ingest in an instrumented root span with run_id and resource_type
- Add project path prefix to discussion progress bars for multi-project clarity
- Reset resource_events_synced_for_updated_at on --full re-sync
Sync status:
- Expand from single last_run to configurable recent runs list (default 10)
- Parse and expose StageTiming metrics from stored metrics_json
- Add run_id, total_items_processed, total_errors to SyncRunInfo
- Add mr_count to DataSummary for complete entity coverage
Orchestrator:
- Add #[instrument] with structured fields to issue and MR ingestion functions
- Record items_processed, items_skipped, errors on span close for MetricsLayer
- Emit granular progress events (IssuesFetchStarted, IssuesFetchComplete)
- Pass project_id through to drain_resource_events for scoped job claiming
Document regenerator and embedding pipeline:
- Add #[instrument] spans with items_processed, items_skipped, errors fields
- Record final counts on span close for metrics extraction
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the documents module that transforms raw ingested entities
(issues, MRs, discussions) into searchable document blobs stored in
the documents table. This is the foundation for both FTS5 lexical
search and vector embedding.
Key components:
- documents::extractor: Renders entities into structured text documents.
Issues include title, description, labels, milestone, assignees, and
threaded discussion summaries. MRs additionally include source/target
branches, reviewers, and approval status. Discussions are rendered
with full note threading.
- documents::regenerator: Drains the dirty_queue table to regenerate
only documents whose source entities changed since last sync. Supports
full rebuild mode (seeds all entities into dirty queue first) and
project-scoped regeneration.
- documents::truncation: Safety cap at 2MB per document to prevent
pathological outliers from degrading FTS or embedding performance.
- ingestion::dirty_tracker: Marks entities as dirty inside the
ingestion transaction so document regeneration stays consistent
with data changes. Uses INSERT OR IGNORE to deduplicate.
- ingestion::discussion_queue: Queue-based discussion fetching that
isolates individual discussion failures from the broader ingestion
pipeline, preventing a single corrupt discussion from blocking
an entire project sync.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>