Targeted fixes across multiple subsystems:
dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
(None returns global counts, Some(pid) returns per-project counts)
gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
behavior, not warnings) and add structured fields (path, status_code)
for better log filtering and aggregation
gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
to epoch 0, making data quality issues visible in logs
ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx
search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
eliminating the NaN edge case entirely (total_cmp treats NaN consistently)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:
- Config drift detection: find_pending_documents and
count_pending_documents now take model_name and compare
chunk_max_bytes, model, and dims against stored metadata. Documents
embedded with stale config are automatically re-queued.
- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
are skipped with a sentinel error recorded in embedding_metadata,
preventing both rowid collision and infinite re-processing loops.
- Deferred clearing: old embeddings are no longer cleared before
attempting new ones. clear_document_embeddings is deferred until the
first successful chunk embedding, so if all chunks fail the document
retains its previous embeddings rather than losing all data.
- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
so a crash mid-page rolls back atomically instead of leaving partial
state (cleared embeddings with no replacements).
- Per-chunk retry on context overflow: when a batch fails with a
context-length error, each chunk is retried individually so one
oversized chunk doesn't poison the entire batch.
- Adaptive dedup in vector search: replaces the static 3x over-fetch
multiplier with a dynamic one based on actual max chunks per document
(using the new chunk_count column with a fallback COUNT query for
pre-migration data). Also replaces partial_cmp with total_cmp for
f64 distance sorting.
- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
embedding_metadata to support config drift detection and adaptive
dedup without runtime queries.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the search module providing three search modes:
- Lexical (FTS5): Full-text search using SQLite FTS5 with safe query
sanitization. User queries are automatically tokenized and wrapped
in proper FTS5 syntax. Supports a "raw" mode for power users who
want direct FTS5 query syntax (NEAR, column filters, etc.).
- Semantic (vector): Embeds the search query via Ollama, then performs
cosine similarity search against stored document embeddings. Results
are deduplicated by doc_id since documents may have multiple chunks.
- Hybrid (default): Executes both lexical and semantic searches in
parallel, then fuses results using Reciprocal Rank Fusion (RRF) with
k=60. This avoids the complexity of score normalization while
producing high-quality merged rankings. Gracefully degrades to
lexical-only when embeddings are unavailable.
Additional components:
- search::filters: Post-retrieval filtering by source_type, author,
project, labels (AND logic), file path prefix, created_after, and
updated_after. Date filters accept relative formats (7d, 2w) and
ISO dates.
- search::rrf: Reciprocal Rank Fusion implementation with configurable
k parameter and optional explain mode that annotates each result
with its component ranks and fusion score breakdown.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>