Commit Graph

9 Commits

Author SHA1 Message Date
Taylor Eernisse
3e9cf2358e perf(search+embed): zero-copy embedding API and deferred RRF mapping
Change OllamaClient::embed_batch to accept &[&str] instead of
Vec<String>. The EmbedRequest struct now borrows both model name and
input texts, eliminating per-batch cloning of chunk text (up to 32KB
per chunk x 32 chunks per batch). Serialization output is identical
since serde serializes &str and String to the same JSON.

In hybrid search, defer the RrfResult->HybridResult mapping until
after filter+take, so only `limit` items (typically 20) are
constructed instead of up to 1,500 at RECALL_CAP. Also switch
filtered_ids to into_iter() to avoid an extra .copied() pass.

Switch FTS search_fts from prepare() to prepare_cached() for statement
reuse across repeated searches. Benchmarked at ~1.6x faster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-05 17:35:53 -05:00
Taylor Eernisse
72f1cafdcf perf: Optimize SQL queries and reduce allocations in hot paths
Change detection queries (embedding/change_detector.rs):
- Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check
- SQLite now scans embedding_metadata once instead of three times
- Semantically identical: returns docs needing embedding when no
  embedding exists, hash changed, or config mismatch

Count queries (cli/commands/count.rs):
- Consolidate 3 separate COUNT queries for issues into single query
  using conditional aggregation (CASE WHEN state = 'x' THEN 1)
- Same optimization for MRs: 5 queries reduced to 1

Search filter queries (search/filters.rs):
- Replace N separate EXISTS clauses for label filtering with single
  IN() clause with COUNT/GROUP BY HAVING pattern
- For multi-label AND queries, this reduces N subqueries to 1

FTS tokenization (search/fts.rs):
- Replace collect-into-Vec-then-join pattern with direct String building
- Pre-allocate capacity hint for result string

Discussion truncation (documents/truncation.rs):
- Calculate total length without allocating concatenated string first
- Only allocate full string when we know it fits within limit

Embedding pipeline (embedding/pipeline.rs):
- Add Vec::with_capacity hints for chunk work and cleared_docs hashset
- Reduces reallocations during embedding batch processing

Backoff calculation (core/backoff.rs):
- Replace unchecked addition with saturating_add to prevent overflow
- Add test case verifying overflow protection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 11:21:28 -05:00
Taylor Eernisse
65583ed5d6 refactor: Remove redundant doc comments throughout codebase
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)

Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs

Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.

Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:04:32 -05:00
Taylor Eernisse
1fdc6d03cc fix: Savepoint leak in embedding pipeline, atomic fail_job, RRF dedup
Three correctness fixes found during peer code review:

Embedding pipeline savepoint leak (HIGH severity):
The SAVEPOINT embed_page / RELEASE embed_page pattern had ~10 `?`
propagation points between them. Any error from record_embedding_error,
clear_document_embeddings, or store_embedding would exit the function
without rolling back, leaving the SQLite connection in a broken
transactional state and causing cascading failures for the rest of the
session. Fixed by extracting page processing into `embed_page()` and
wrapping with explicit rollback-on-error handling.

Dependent queue fail_job race (MEDIUM severity):
fail_job performed a SELECT followed by a separate UPDATE on the
attempts counter without a transaction. Under concurrent lock
reclamation, the attempts value could be read stale. Replaced with a
single atomic UPDATE that increments attempts and computes exponential
backoff entirely in SQL, also halving DB round-trips. Added explicit
error when the job no longer exists.

RRF duplicate document score inflation (MEDIUM severity):
If a retriever returned the same document_id multiple times, the RRF
score accumulated multiple rank contributions while the rank only
recorded the first occurrence. Moved the score accumulation inside the
`if is_none` guard so only the first occurrence per list contributes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:16:38 -05:00
teernisse
f6d19a9467 feat(sync): Instrument pipeline with tracing spans, run_id correlation, and metrics
Add end-to-end observability to the sync and ingest pipelines:

Sync command:
- Generate UUID-based run_id for each sync invocation, propagated through
  all child spans for log correlation across stages
- Accept MetricsLayer reference to extract hierarchical StageTiming data
  after pipeline completion for robot-mode performance output
- Record sync runs in DB via SyncRunRecorder (start/succeed/fail lifecycle)
- Wrap entire sync execution in a root tracing span with run_id field

Ingest command:
- Wrap run_ingest in an instrumented root span with run_id and resource_type
- Add project path prefix to discussion progress bars for multi-project clarity
- Reset resource_events_synced_for_updated_at on --full re-sync

Sync status:
- Expand from single last_run to configurable recent runs list (default 10)
- Parse and expose StageTiming metrics from stored metrics_json
- Add run_id, total_items_processed, total_errors to SyncRunInfo
- Add mr_count to DataSummary for complete entity coverage

Orchestrator:
- Add #[instrument] with structured fields to issue and MR ingestion functions
- Record items_processed, items_skipped, errors on span close for MetricsLayer
- Emit granular progress events (IssuesFetchStarted, IssuesFetchComplete)
- Pass project_id through to drain_resource_events for scoped job claiming

Document regenerator and embedding pipeline:
- Add #[instrument] spans with items_processed, items_skipped, errors fields
- Record final counts on span close for metrics extraction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 13:39:00 -05:00
Taylor Eernisse
ee5c5f9645 perf: Eliminate double serialization, add SQLite tuning, optimize hot paths
11 isomorphic performance fixes from deep audit (no behavior changes):

- Eliminate double serialization: store_payload now accepts pre-serialized
  bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
  Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
  RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
  of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
  all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 08:12:37 -05:00
Taylor Eernisse
a50fc78823 style: Apply cargo fmt and clippy fixes across codebase
Automated formatting and lint corrections from parallel agent work:

- cargo fmt: import reordering (alphabetical), line wrapping to respect
  max width, trailing comma normalization, destructuring alignment,
  function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
  i64::from() instead of `as i64` casts, .clamp() instead of
  .max().min() chains, let-chain refactors (if-let with &&),
  #[allow(clippy::too_many_arguments)] and
  #[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace

No behavioral changes. All existing tests pass unmodified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 13:01:59 -05:00
Taylor Eernisse
7d07f95d4c fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:

- Config drift detection: find_pending_documents and
  count_pending_documents now take model_name and compare
  chunk_max_bytes, model, and dims against stored metadata. Documents
  embedded with stale config are automatically re-queued.

- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
  are skipped with a sentinel error recorded in embedding_metadata,
  preventing both rowid collision and infinite re-processing loops.

- Deferred clearing: old embeddings are no longer cleared before
  attempting new ones. clear_document_embeddings is deferred until the
  first successful chunk embedding, so if all chunks fail the document
  retains its previous embeddings rather than losing all data.

- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
  so a crash mid-page rolls back atomically instead of leaving partial
  state (cleared embeddings with no replacements).

- Per-chunk retry on context overflow: when a batch fails with a
  context-length error, each chunk is retried individually so one
  oversized chunk doesn't poison the entire batch.

- Adaptive dedup in vector search: replaces the static 3x over-fetch
  multiplier with a dynamic one based on actual max chunks per document
  (using the new chunk_count column with a fallback COUNT query for
  pre-migration data). Also replaces partial_cmp with total_cmp for
  f64 distance sorting.

- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
  embedding_metadata to support config drift detection and adaptive
  dedup without runtime queries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:08 -05:00
Taylor Eernisse
723703bed9 feat(embedding): Add Ollama-powered vector embedding pipeline
Implements the embedding module that generates vector representations
of documents using a local Ollama instance with the nomic-embed-text
model. These embeddings enable semantic (vector) search and the hybrid
search mode that fuses lexical and semantic results via RRF.

Key components:

- embedding::ollama: HTTP client for the Ollama /api/embeddings
  endpoint. Handles connection errors with actionable error messages
  (OllamaUnavailable, OllamaModelNotFound) and validates response
  dimensions.

- embedding::chunking: Splits long documents into overlapping
  paragraph-aware chunks for embedding. Uses a configurable max token
  estimate (8192 default for nomic-embed-text) with 10% overlap to
  preserve cross-chunk context.

- embedding::chunk_ids: Encodes chunk identity as
  doc_id * 1000 + chunk_index for the embeddings table rowid. This
  allows vector search to map results back to documents and
  deduplicate by doc_id efficiently.

- embedding::change_detector: Compares document content_hash against
  stored embedding hashes to skip re-embedding unchanged documents,
  making incremental embedding runs fast.

- embedding::pipeline: Orchestrates the full embedding flow: detect
  changed documents, chunk them, call Ollama in configurable
  concurrency (default 4), store results. Supports --retry-failed
  to re-attempt previously failed embeddings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:46:30 -05:00