Peer code review found multiple panic-reachable paths and correctness bugs:
1. serde_json::to_string().unwrap() in 4 robot-mode output functions
(who.rs, main.rs x3). If serialization ever failed (e.g., a map with
non-string keys; note that serde_json writes NaN as null rather than
erroring), the CLI would panic with an unhelpful stack trace.
Replaced with unwrap_or_else that emits a structured JSON error fallback.
2. encode_rowid() in chunk_ids.rs used unchecked multiplication
(document_id * 1000). On extreme document IDs this could silently wrap
in release mode, causing embedding rowid collisions. Now uses
checked_mul + checked_add with a diagnostic panic message.
3. HTTP response body truncation at byte index 500 in client.rs could
split a multi-byte UTF-8 character, causing a panic. Now uses
floor_char_boundary(500) for safe truncation.
4. who.rs reviews mode: SQL used `m.author_username != ?1`, which silently
dropped MRs with NULL author_username (in SQL, NULL != x evaluates to
NULL, which a WHERE clause treats as false).
Changed to `(m.author_username IS NULL OR m.author_username != ?1)`
to match the pattern already used in expert mode.
5. handle_auth_test hardcoded exit code 5 for all errors regardless of
type. Config not found (20), token not set (4), and network errors (8)
all incorrectly returned 5. Now uses e.exit_code() from the actual
LoreError, with proper suggestion hints in human mode.
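For reference, items 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in chunk_ids.rs or client.rs: the multiplier value is taken from the `document_id * 1000` expression above, the signatures are assumed, and truncate_at_char_boundary is a hypothetical helper that uses a hand-rolled is_char_boundary loop in place of floor_char_boundary so that it does not depend on a recent toolchain.

```rust
/// Assumed value, taken from the `document_id * 1000` expression above.
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Encode (document_id, chunk_index) into a single rowid.
/// Panics with a diagnostic message instead of silently wrapping on overflow.
pub fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    document_id
        .checked_mul(CHUNK_ROWID_MULTIPLIER)
        .and_then(|base| base.checked_add(chunk_index))
        .unwrap_or_else(|| {
            panic!(
                "rowid overflow: document_id={}, chunk_index={}",
                document_id, chunk_index
            )
        })
}

/// Hypothetical helper: truncate `s` to at most `max_bytes` bytes without
/// splitting a multi-byte UTF-8 character.
pub fn truncate_at_char_boundary(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

With these definitions, encode_rowid(42, 7) yields 42007, and truncating "héllo" to 2 bytes yields "h" rather than panicking in the middle of the two-byte "é".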
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the "what" is clear from types)
- Implementation context (the "how" is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Targeted fixes across multiple subsystems:
dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
(None returns global counts, Some(pid) returns per-project counts)
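The scoping semantics can be illustrated with an in-memory stand-in for the jobs table. Everything here is hypothetical (the PendingJob struct, the slice-based signatures, the usize return type); the real functions run against the database, and only the names claim_jobs and count_pending_jobs and the None/Some(pid) behavior come from this change.

```rust
/// Hypothetical in-memory stand-in for a row in the pending-jobs table.
pub struct PendingJob {
    pub id: i64,
    pub project_id: i64,
    pub locked: bool,
}

/// Count unlocked jobs. `None` counts across all projects;
/// `Some(pid)` restricts the count to a single project.
pub fn count_pending_jobs(jobs: &[PendingJob], project_id: Option<i64>) -> usize {
    jobs.iter()
        .filter(|j| !j.locked)
        .filter(|j| project_id.map_or(true, |pid| j.project_id == pid))
        .count()
}

/// Claim up to `limit` unlocked jobs that belong to `project_id` only,
/// marking them locked and returning their ids.
pub fn claim_jobs(jobs: &mut [PendingJob], project_id: i64, limit: usize) -> Vec<i64> {
    jobs.iter_mut()
        .filter(|j| !j.locked && j.project_id == project_id)
        .take(limit)
        .map(|j| {
            j.locked = true;
            j.id
        })
        .collect()
}
```

Because claim_jobs filters on project_id before taking jobs, a worker ingesting one project can never lock jobs queued for another.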
gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
behavior, not warnings) and add structured fields (path, status_code)
for better log filtering and aggregation
gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
to epoch 0, making data quality issues visible in logs
ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx
search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
eliminating the NaN edge case entirely (total_cmp defines a total order
that includes NaN, so the comparator is always consistent)
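A minimal sketch of the sorting change, assuming results are (document_id, score) pairs; the function name and tuple shape are illustrative, not the actual types in search/rrf:

```rust
/// Sort (document_id, score) pairs by descending score.
/// `f64::total_cmp` is a total order over all f64 values, NaN included,
/// so no fallback Ordering is needed.
pub fn sort_by_score_desc(results: &mut [(i64, f64)]) {
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
}
```

With partial_cmp, any comparison involving NaN returns None, and whatever fallback Ordering is chosen makes the comparator inconsistent. With total_cmp the finite scores always end up in the correct relative order, even if a NaN is present in the slice.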
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:
- Config drift detection: find_pending_documents and
count_pending_documents now take model_name and compare
chunk_max_bytes, model, and dims against stored metadata. Documents
embedded with stale config are automatically re-queued.
- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
are skipped with a sentinel error recorded in embedding_metadata,
preventing both rowid collision and infinite re-processing loops.
- Deferred clearing: old embeddings are no longer cleared before
attempting new ones. clear_document_embeddings is deferred until the
first successful chunk embedding, so if all chunks fail the document
retains its previous embeddings rather than losing all data.
- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
so a crash mid-page rolls back atomically instead of leaving partial
state (cleared embeddings with no replacements).
- Per-chunk retry on context overflow: when a batch fails with a
context-length error, each chunk is retried individually so one
oversized chunk doesn't poison the entire batch.
- Adaptive dedup in vector search: replaces the static 3x over-fetch
multiplier with a dynamic one based on actual max chunks per document
(using the new chunk_count column with a fallback COUNT query for
pre-migration data). Also replaces partial_cmp with total_cmp for
f64 distance sorting.
- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
embedding_metadata to support config drift detection and adaptive
dedup without runtime queries.
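The per-chunk retry described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline code: EmbedError, embed_with_fallback, and the closure-based embed_batch parameter are hypothetical, and the sketch assumes a successful batch returns one vector per input chunk.

```rust
/// Hypothetical error type; the real pipeline's error enum is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub enum EmbedError {
    ContextLengthExceeded,
    Other(String),
}

/// Embed a batch of chunks. If the whole batch fails with a context-length
/// error, retry each chunk individually so that only the oversized chunk
/// fails. Returns one result per input chunk, in input order.
pub fn embed_with_fallback<F>(
    chunks: &[String],
    mut embed_batch: F,
) -> Vec<Result<Vec<f32>, EmbedError>>
where
    F: FnMut(&[String]) -> Result<Vec<Vec<f32>>, EmbedError>,
{
    match embed_batch(chunks) {
        // Assumes the backend returns exactly one vector per chunk.
        Ok(vectors) => vectors.into_iter().map(Ok).collect(),
        // Context overflow: fall back to single-chunk requests.
        Err(EmbedError::ContextLengthExceeded) => chunks
            .iter()
            .map(|chunk| {
                embed_batch(std::slice::from_ref(chunk)).and_then(|mut v| {
                    v.pop()
                        .ok_or_else(|| EmbedError::Other("empty response".to_string()))
                })
            })
            .collect(),
        // Any other failure applies to every chunk in the batch.
        Err(other) => chunks.iter().map(|_| Err(other.clone())).collect(),
    }
}
```

Only context-length errors trigger the fallback here; retrying chunk by chunk on, say, a connection error would multiply the number of failing requests without any benefit.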
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the embedding module that generates vector representations
of documents using a local Ollama instance with the nomic-embed-text
model. These embeddings enable semantic (vector) search and the hybrid
search mode that fuses lexical and semantic results via RRF.
Key components:
- embedding::ollama: HTTP client for the Ollama /api/embeddings
endpoint. Handles connection errors with actionable error messages
(OllamaUnavailable, OllamaModelNotFound) and validates response
dimensions.
- embedding::chunking: Splits long documents into overlapping
paragraph-aware chunks for embedding. Uses a configurable max token
estimate (8192 default for nomic-embed-text) with 10% overlap to
preserve cross-chunk context.
- embedding::chunk_ids: Encodes chunk identity as
doc_id * 1000 + chunk_index for the embeddings table rowid. This
allows vector search to map results back to documents and
deduplicate by doc_id efficiently.
- embedding::change_detector: Compares document content_hash against
stored embedding hashes to skip re-embedding unchanged documents,
making incremental embedding runs fast.
- embedding::pipeline: Orchestrates the full embedding flow: detect
changed documents, chunk them, call Ollama with configurable
concurrency (default 4), and store the results. Supports --retry-failed
to re-attempt previously failed embeddings.
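To illustrate how the doc_id * 1000 + chunk_index encoding lets vector search map hits back to documents, here is a minimal sketch. The names decode_rowid and dedup_by_document and the (rowid, distance) tuple shape are assumptions for illustration, and the hits are assumed to arrive sorted by ascending distance.

```rust
use std::collections::HashSet;

/// Multiplier from the `doc_id * 1000 + chunk_index` encoding above.
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Split an embeddings-table rowid back into (doc_id, chunk_index).
pub fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

/// Given (rowid, distance) hits sorted by ascending distance, keep only the
/// closest chunk per document and return at most `limit` (doc_id, distance)
/// pairs.
pub fn dedup_by_document(hits: &[(i64, f64)], limit: usize) -> Vec<(i64, f64)> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &(rowid, distance) in hits {
        if out.len() >= limit {
            break;
        }
        let (doc_id, _chunk_index) = decode_rowid(rowid);
        // `insert` returns false if this document was already emitted.
        if seen.insert(doc_id) {
            out.push((doc_id, distance));
        }
    }
    out
}
```

Because several chunks of one document can appear among the nearest neighbors, a search for N documents generally has to fetch more than N chunk rows before deduplicating.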
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>