Add the ability to sync specific issues or merge requests by IID without
running a full incremental sync. This enables fast, targeted data refresh
for individual entities — useful for agent workflows, debugging, and
real-time investigation of specific issues or MRs.
Architecture:
- New CLI flags: --issue <IID> and --mr <IID> (repeatable, up to 100 total)
scoped to a single project via -p/--project
- Preflight phase validates all IIDs exist on GitLab before any DB writes,
with TOCTOU-aware soft verification at ingest time
- 6-stage pipeline: preflight -> fetch -> ingest -> dependents -> docs -> embed
- Each stage is cancellation-aware via ShutdownSignal
- Dedicated SyncRunRecorder extensions track surgical-specific counters
(issues_fetched, mrs_ingested, docs_regenerated, etc.)
New modules:
- src/ingestion/surgical.rs: Core surgical fetch/ingest/dependent logic
with preflight_fetch(), ingest_issue_by_iid(), ingest_mr_by_iid(),
and fetch_dependents_for_{issue,mr}()
- src/cli/commands/sync_surgical.rs: Full CLI orchestrator with progress
spinners, human/robot output, and cancellation handling
- src/embedding/pipeline.rs: embed_documents_by_ids() for scoped embedding
- src/documents/regenerator.rs: regenerate_dirty_documents_for_sources()
for scoped document regeneration
Database changes:
- Migration 027: Extends sync_runs with mode, phase, surgical_iids_json,
per-entity counters, and cancelled_at column
- New indexes: idx_sync_runs_mode_started, idx_sync_runs_status_phase_started
GitLab client:
- get_issue_by_iid() and get_mr_by_iid() single-entity fetch methods
Error handling:
- New SurgicalPreflightFailed error variant with entity_type, iid, project,
and reason fields. Shares exit code 6 with GitLabNotFound.
Includes comprehensive test coverage:
- 645 lines of surgical ingestion tests (wiremock-based)
- 184 lines of scoped embedding tests
- 85 lines of scoped regeneration tests
- 113 lines of GitLab client single-entity tests
- 236 lines of sync_run surgical column/counter tests
- Unit tests for SyncOptions, error codes, and CLI validation
Miscellaneous fixes across CLI and core modules:
- Timeline: widen TAG_WIDTH from 10 to 11 to accommodate longer event
type labels without truncation
- render.rs: save and restore LORE_ICONS env var in glyph_mode test to
prevent interference from the test environment leaking into or from
other tests that set LORE_ICONS
- logging.rs: adjust verbose=1 to info level (was debug), verbose=2 to
debug — this reduces noise at -v while keeping -vv as the full debug
experience
- issues.rs, merge_requests.rs: use infodebug! macro consistently for
ingestion summary logging
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Migrate all ingestion completion logs to use nonzero_summary() for compact,
zero-suppressed output. Before: 8-14 individual key=value structured fields
per completion message. After: a single summary field like
'42 fetched · 3 labels · 12 notes' that only shows non-zero counters.
Also downgrade all 'Shutdown requested...' messages from info! to debug!.
These are emitted on every Ctrl+C and add noise to the partial results
output that immediately follows. They remain visible at -vv for debugging
graceful shutdown behavior.
Affected modules:
- issues.rs: issue ingestion completion
- merge_requests.rs: MR ingestion completion, full-sync cursor reset
- mr_discussions.rs: discussion ingestion completion
- orchestrator.rs: project-level issue and MR completion summaries,
all shutdown-requested checkpoints across discussion sync, resource
events drain, closes-issues drain, and MR diffs drain
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Label upsert (issues + merge_requests): Replace INSERT ... ON CONFLICT DO
UPDATE RETURNING with INSERT OR IGNORE + SELECT. The prior RETURNING-based
approach relied on last_insert_rowid() matching the returned id, which is
not guaranteed when ON CONFLICT triggers an update (SQLite may return 0).
The new two-step approach is unambiguous and correctly tracks created_count.
Init: Add ON CONFLICT(gitlab_project_id) DO UPDATE to the project insert
so re-running `lore init` updates path/branch/url instead of failing with
a unique constraint violation.
MR discussions sync: Reset discussions_sync_attempts to 0 when clearing a
sync health error, so previously-failed MRs get a fresh retry budget after
successful sync.
Count: format_number now handles negative numbers correctly by extracting
the sign before inserting thousand-separators.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, drain_resource_events, drain_mr_closes_issues, and
drain_mr_diffs each opened a transaction only for the job-complete +
watermark update, but the store operation ran outside that transaction.
If the process crashed between the store and the watermark update, data
would be persisted without the watermark advancing, causing silent
duplicates on the next sync.
Now each drain function opens the transaction before the store call and
commits it only after both the store and the watermark update succeed.
On error, the transaction is explicitly dropped so the connection is
not left in a half-committed state.
Also:
- store_resource_events no longer manages its own transaction; the caller
passes in a connection (which is actually the transaction)
- upsert_mr_file_changes wraps DELETE + INSERT in a transaction internally
- reset_discussion_watermarks now also clears diffs_synced_for_updated_at
- Orchestrator error span now includes closes_issues_failed + mr_diffs_failed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The orchestrator already accepted a ShutdownSignal but only checked it
between phases (after all issues fetched, before discussions). The inner
loops in ingest_issues() and ingest_merge_requests() consumed entire
paginated streams without checking for cancellation.
On a large initial sync (thousands of issues/MRs), Ctrl+C could be
unresponsive for minutes while the current entity type finished draining.
Now both functions accept &ShutdownSignal and check is_cancelled() at
the top of each iteration, breaking out promptly and committing the
cursor for whatever was already processed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Migration 015 adds merge_commit_sha/squash_commit_sha to merge_requests
(Gate 4/5 prerequisites), closes_issues_synced_for_updated_at watermark
for incremental sync, and the missing idx_label_events_label index.
The MR transformer and ingestion pipeline now populate commit SHAs during
sync. The orchestrator uses watermark-based filtering for closes_issues
jobs instead of re-enqueuing all MRs every sync.
The Phase B PRD is updated to match the actual codebase: corrected
migration numbering (011-015), documented nullable label/milestone
fields (migration 012), watermark patterns (013), observability
infrastructure (014), simplified source_method values, and updated
entity_references schema to match implementation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Targeted fixes across multiple subsystems:
dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
(None returns global counts, Some(pid) returns per-project counts)
gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
behavior, not warnings) and add structured fields (path, status_code)
for better log filtering and aggregation
gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
to epoch 0, making data quality issues visible in logs
ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx
search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
eliminating the NaN edge case entirely (total_cmp treats NaN consistently)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
11 isomorphic performance fixes from deep audit (no behavior changes):
- Eliminate double serialization: store_payload now accepts pre-serialized
bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integrates the dirty tracking system into all four ingestion paths
(issues, MRs, issue discussions, MR discussions). After each entity
is upserted within its transaction, a corresponding dirty_queue entry
is inserted so the document regenerator knows which documents need
rebuilding.
This ensures that document generation stays transactionally consistent
with data changes: if the ingest transaction rolls back, the dirty
marker rolls back too, preventing stale document regeneration attempts.
Also updates GiError references to LoreError in these files as part
of the codebase-wide rename, and adjusts issue discussion logging
from info to debug level to reduce noise during normal sync runs.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds complete merge request ingestion pipeline with a novel two-phase
discussion sync strategy optimized for throughput.
New modules:
- merge_requests.rs: MR upsert with labels/assignees/reviewers handling,
stale MR cleanup, and watermark-based incremental sync
- mr_discussions.rs: Parallel prefetch strategy for MR discussions
Two-phase MR discussion sync:
1. PREFETCH PHASE: Spawn concurrent tasks to fetch discussions for
multiple MRs simultaneously (configurable concurrency, default 8).
Transform and validate in parallel, storing results in memory.
2. WRITE PHASE: Serial database writes to avoid lock contention.
Each MR's discussions written in a single transaction, with
proper stale discussion cleanup.
This approach achieves ~4-8x throughput vs serial fetching while
maintaining database consistency. Transform errors are tracked per-MR
to prevent partial writes from corrupting watermarks.
Orchestrator updates:
- ingest_merge_requests(): Coordinates MR fetch -> discussion sync flow
- Progress callbacks emit MR-specific events for UI feedback
- Respects --full flag to reset discussion watermarks for full resync
The prefetch strategy is critical for MRs which typically have more
discussions than issues, and where API latency dominates sync time.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>