- regenerator: Include project_id in the ON CONFLICT UPDATE clause for
document upserts. Previously, if a document moved between projects
(e.g., during re-ingestion), the project_id would remain stale.
- truncation: Compute the omission marker ("N notes omitted") before
checking whether first+last notes fit in the budget. The old order
computed the marker after the budget check, meaning the marker's byte
cost was unaccounted for and could cause over-budget output.
- ollama: Tighten model name matching to require either an exact match
or a colon-delimited tag prefix (model == name or name starts with
"model:"). The prior starts_with check would false-positive on
"nomic-embed-text-v2" when looking for "nomic-embed-text". Tests
updated to cover exact match, tagged, wrong model, and prefix
false-positive cases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The document regenerator was making two queries per document:
1. get_existing_hash() — SELECT content_hash
2. upsert_document_inner() — SELECT id, content_hash, labels_hash, paths_hash
Query 2 already returns the content_hash needed for change detection.
Remove get_existing_hash() entirely and compute content_changed inside
upsert_document_inner() from the existing row data.
upsert_document_inner now returns Result<bool> (true = content changed)
which propagates up through upsert_document and regenerate_one,
replacing the separate pre-check. The triple-hash fast-path (all three
hashes match → return Ok(false) with no writes) is preserved.
This halves the query count for unchanged documents, which dominate
incremental syncs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace individual INSERT-per-label and INSERT-per-path loops in
upsert_document_inner with single multi-row INSERT statements. For a
document with 5 labels, this reduces 5 SQL round-trips to 1.
Replace format!()+push_str() with writeln!() in all three document
extractors (issue, MR, discussion). writeln! writes directly into the
String buffer, avoiding the intermediate allocation that format!
creates. Benchmarked at ~1.9x faster for string building and ~1.6x
faster for batch inserts (measured over 5k iterations in-memory).
Also switch get_existing_hash from prepare() to prepare_cached() since
it is called once per document during regeneration.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change detection queries (embedding/change_detector.rs):
- Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check
- SQLite now scans embedding_metadata once instead of three times
- Semantically identical: returns docs needing embedding when no
embedding exists, hash changed, or config mismatch
Count queries (cli/commands/count.rs):
- Consolidate 3 separate COUNT queries for issues into single query
using conditional aggregation (CASE WHEN state = 'x' THEN 1)
- Same optimization for MRs: 5 queries reduced to 1
Search filter queries (search/filters.rs):
- Replace N separate EXISTS clauses for label filtering with single
IN() clause with COUNT/GROUP BY HAVING pattern
- For multi-label AND queries, this reduces N subqueries to 1
FTS tokenization (search/fts.rs):
- Replace collect-into-Vec-then-join pattern with direct String building
- Pre-allocate capacity hint for result string
Discussion truncation (documents/truncation.rs):
- Calculate total length without allocating concatenated string first
- Only allocate full string when we know it fits within limit
Embedding pipeline (embedding/pipeline.rs):
- Add Vec::with_capacity hints for chunk work and cleared_docs hashset
- Reduces reallocations during embedding batch processing
Backoff calculation (core/backoff.rs):
- Replace unchecked addition with saturating_add to prevent overflow
- Add test case verifying overflow protection
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three defensive improvements from peer code review:
Replace unreachable!() in GitLab client retry loops:
Both request() and request_with_headers() had unreachable!() after
their for loops. While the logic was sound (the final iteration always
reaches the return/break), any refactor to the loop condition would
turn this into a runtime panic. Restructured both to store
last_response with explicit break, making the control flow
self-documenting and the .expect() message useful if ever violated.
Doctor model name comparison asymmetry:
Ollama model names were stripped of their tag (:latest, :v1.5) for
comparison, but the configured model name was compared as-is. A config
value like "nomic-embed-text:v1.5" would never match. Now strips the
tag from both sides before comparing.
Regenerator savepoint cleanup and progress accuracy:
- upsert_document's error path did ROLLBACK TO but never RELEASE,
leaving a dangling savepoint that could nest on the next call. Added
RELEASE after rollback so the connection is clean.
- estimated_total for progress reporting was computed once at start but
the dirty queue can grow during processing. Now recounts each loop
iteration with max() so the progress fraction never goes backwards.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add end-to-end observability to the sync and ingest pipelines:
Sync command:
- Generate UUID-based run_id for each sync invocation, propagated through
all child spans for log correlation across stages
- Accept MetricsLayer reference to extract hierarchical StageTiming data
after pipeline completion for robot-mode performance output
- Record sync runs in DB via SyncRunRecorder (start/succeed/fail lifecycle)
- Wrap entire sync execution in a root tracing span with run_id field
Ingest command:
- Wrap run_ingest in an instrumented root span with run_id and resource_type
- Add project path prefix to discussion progress bars for multi-project clarity
- Reset resource_events_synced_for_updated_at on --full re-sync
Sync status:
- Expand from single last_run to configurable recent runs list (default 10)
- Parse and expose StageTiming metrics from stored metrics_json
- Add run_id, total_items_processed, total_errors to SyncRunInfo
- Add mr_count to DataSummary for complete entity coverage
Orchestrator:
- Add #[instrument] with structured fields to issue and MR ingestion functions
- Record items_processed, items_skipped, errors on span close for MetricsLayer
- Emit granular progress events (IssuesFetchStarted, IssuesFetchComplete)
- Pass project_id through to drain_resource_events for scoped job claiming
Document regenerator and embedding pipeline:
- Add #[instrument] spans with items_processed, items_skipped, errors fields
- Record final counts on span close for metrics extraction
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
11 isomorphic performance fixes from deep audit (no behavior changes):
- Eliminate double serialization: store_payload now accepts pre-serialized
bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two bug fixes:
1. extractor.rs: The content hash was computed on the pre-truncation
content, meaning the hash stored in the document didn't correspond
to the actual stored (truncated) content. This would cause change
detection to miss updates when content changed only within the
truncated portion. Hash is now computed after truncate_hard_cap()
so it always matches the persisted content.
2. dependent_queue.rs: claim_jobs() had a TOCTOU race between the
SELECT that found available jobs and the UPDATE that locked them.
Under concurrent callers, two drain runs could claim the same job.
Replaced with a single UPDATE ... RETURNING statement that
atomically selects and locks jobs in one operation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the documents module that transforms raw ingested entities
(issues, MRs, discussions) into searchable document blobs stored in
the documents table. This is the foundation for both FTS5 lexical
search and vector embedding.
Key components:
- documents::extractor: Renders entities into structured text documents.
Issues include title, description, labels, milestone, assignees, and
threaded discussion summaries. MRs additionally include source/target
branches, reviewers, and approval status. Discussions are rendered
with full note threading.
- documents::regenerator: Drains the dirty_queue table to regenerate
only documents whose source entities changed since last sync. Supports
full rebuild mode (seeds all entities into dirty queue first) and
project-scoped regeneration.
- documents::truncation: Safety cap at 2MB per document to prevent
pathological outliers from degrading FTS or embedding performance.
- ingestion::dirty_tracker: Marks entities as dirty inside the
ingestion transaction so document regeneration stays consistent
with data changes. Uses INSERT OR IGNORE to deduplicate.
- ingestion::discussion_queue: Queue-based discussion fetching that
isolates individual discussion failures from the broader ingestion
pipeline, preventing a single corrupt discussion from blocking
an entire project sync.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>