Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three correctness fixes found during peer code review:
Embedding pipeline savepoint leak (HIGH severity):
The SAVEPOINT embed_page / RELEASE embed_page pattern had ~10 `?`
propagation points between them. Any error from record_embedding_error,
clear_document_embeddings, or store_embedding would exit the function
without rolling back, leaving the SQLite connection in a broken
transactional state and causing cascading failures for the rest of the
session. Fixed by extracting page processing into `embed_page()` and
wrapping with explicit rollback-on-error handling.
Dependent queue fail_job race (MEDIUM severity):
fail_job performed a SELECT followed by a separate UPDATE on the
attempts counter without a transaction. Under concurrent lock
reclamation, the attempts value could be read stale. Replaced with a
single atomic UPDATE that increments attempts and computes exponential
backoff entirely in SQL, also halving DB round-trips. Added explicit
error when the job no longer exists.
RRF duplicate document score inflation (MEDIUM severity):
If a retriever returned the same document_id multiple times, the RRF
score accumulated multiple rank contributions while the rank only
recorded the first occurrence. Moved the score accumulation inside the
`if is_none` guard so only the first occurrence per list contributes.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Targeted fixes across multiple subsystems:
dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
(None returns global counts, Some(pid) returns per-project counts)
gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
behavior, not warnings) and add structured fields (path, status_code)
for better log filtering and aggregation
gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
to epoch 0, making data quality issues visible in logs
ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx
search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
eliminating the NaN edge case entirely (total_cmp treats NaN consistently)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the search module providing three search modes:
- Lexical (FTS5): Full-text search using SQLite FTS5 with safe query
sanitization. User queries are automatically tokenized and wrapped
in proper FTS5 syntax. Supports a "raw" mode for power users who
want direct FTS5 query syntax (NEAR, column filters, etc.).
- Semantic (vector): Embeds the search query via Ollama, then performs
cosine similarity search against stored document embeddings. Results
are deduplicated by doc_id since documents may have multiple chunks.
- Hybrid (default): Executes both lexical and semantic searches in
parallel, then fuses results using Reciprocal Rank Fusion (RRF) with
k=60. This avoids the complexity of score normalization while
producing high-quality merged rankings. Gracefully degrades to
lexical-only when embeddings are unavailable.
Additional components:
- search::filters: Post-retrieval filtering by source_type, author,
project, labels (AND logic), file path prefix, created_after, and
updated_after. Date filters accept relative formats (7d, 2w) and
ISO dates.
- search::rrf: Reciprocal Rank Fusion implementation with configurable
k parameter and optional explain mode that annotates each result
with its component ranks and fusion score breakdown.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>