Move inline #[cfg(test)] mod tests { ... } blocks from 22 source files
into dedicated _tests.rs companion files, wired via:
    #[cfg(test)]
    #[path = "module_tests.rs"]
    mod tests;
This keeps implementation-focused source files leaner and more scannable
while preserving full access to private items through `use super::*;`.
Modules extracted:
  core: db, note_parser, payloads, project, references, sync_run,
        timeline_collect, timeline_expand, timeline_seed
  cli: list (55 tests), who (75 tests)
  documents: extractor (43 tests), regenerator
  embedding: change_detector, chunking
  gitlab: graphql (wiremock async tests), transformers/issue
  ingestion: dirty_tracker, discussions, issues, mr_diffs
Also adds conflicts_with("explain_score") to the --detail flag in the
who command to prevent mutually exclusive flags from being combined.
All 629 unit tests pass. No behavior changes apart from the new flag
conflict.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Embedding pipeline improvements building on the concurrent batching
foundation:
- Track docs_embedded vs chunks_embedded separately. A document counts
as embedded only when ALL its chunks succeed, giving accurate
progress reporting. The sync command reads docs_embedded for its
document count.
- Reuse a single Vec<u8> buffer (embed_buf) across all store_embedding
calls instead of allocating per chunk. Eliminates ~3KB allocation per
768-dim embedding.
- Detect and record errors when Ollama silently returns fewer
embeddings than inputs (batch mismatch). Previously these dropped
chunks were invisible.
- Improve retry error messages: distinguish "retry returned unexpected
result" (wrong dims/count) from "retry request failed" (network
error) instead of emitting a generic "chunk too large" message.
- Convert all hot-path SQL from conn.execute() to prepare_cached() for
statement cache reuse (clear_document_embeddings, store_embedding,
record_embedding_error).
- Record embedding_metadata errors for empty documents so they don't
appear as perpetually pending on subsequent runs.
- Accept concurrency parameter (configurable via config.embedding.concurrency)
instead of hardcoded EMBED_CONCURRENCY=2.
- Add schema version pre-flight check in embed command to fail fast
with actionable error instead of cryptic SQL errors.
- Fix --retry-failed to use DELETE instead of UPDATE. UPDATE clears
last_error but the row still matches config params in the LEFT JOIN,
making the doc permanently invisible to find_pending_documents.
DELETE removes the row entirely so the LEFT JOIN returns NULL.
Regression test added (old_update_approach_leaves_doc_invisible).
- Add chunking forward-progress guard: after floor_char_boundary()
rounds backward, ensure start advances by at least one full
character to prevent infinite loops on multi-byte sequences
(box-drawing chars, smart quotes). Test cases cover the exact
patterns that caused production hangs on document 18526.
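The forward-progress guard in the last item can be sketched as below.
This is a simplified fixed-window chunker, not the project's
split_into_chunks: floor_boundary is a hand-written stand-in for
floor_char_boundary(), and split_with_guard and its parameters are
hypothetical names used only for illustration.

```rust
/// Largest char boundary <= idx (hand-rolled stand-in for floor_char_boundary).
fn floor_boundary(s: &str, idx: usize) -> usize {
    if idx >= s.len() {
        return s.len();
    }
    let mut i = idx;
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Hypothetical simplified chunker: byte windows of at most `max_bytes`
/// with `overlap` bytes of look-back between consecutive chunks.
fn split_with_guard(s: &str, max_bytes: usize, overlap: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < s.len() {
        let first_char_len = s[start..].chars().next().map_or(1, char::len_utf8);

        // Round the window end down to a char boundary; if the window is
        // smaller than the first character, take that character whole.
        let mut end = floor_boundary(s, start + max_bytes);
        if end <= start {
            end = start + first_char_len;
        }
        chunks.push(&s[start..end]);
        if end == s.len() {
            break;
        }

        // Step back for the overlap, again rounding down to a boundary.
        let mut next = floor_boundary(s, end.saturating_sub(overlap));
        // Guard: rounding backward may land at or before `start`.
        // Force an advance of one full character so the loop cannot stall.
        if next <= start {
            next = start + first_char_len;
        }
        start = next;
    }
    chunks
}
```

With six 3-byte box-drawing characters, max_bytes = 4 and overlap = 3,
the rounded overlap step lands back on start every time; without the
guard the loop would never advance.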
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes to the embedding pipeline:
1. Concurrent HTTP batching: fire EMBED_CONCURRENCY (2) Ollama requests
in parallel via join_all, then write results serially to SQLite.
~2x throughput improvement on GPU-bound workloads.
2. UTF-8 boundary safety: all computed byte offsets in split_into_chunks
(paragraph/sentence/word break finders + overlap advance) now use
floor_char_boundary() to prevent panics on multi-byte characters
like smart quotes and non-breaking spaces.
3. CHUNK_MAX_BYTES reduced from 6000 to 1500 to fit nomic-embed-text's
actual 2048-token context window, eliminating context-length retry
storms that were causing 10x slowdowns.
Also threads ShutdownSignal through embed pipeline for graceful Ctrl+C.
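The fan-out/serial-write shape from item 1 might look roughly like the
sketch below. It swaps in std::thread::scope for the async join_all
used in the pipeline so the example needs no external crates;
fake_embed and embed_all are hypothetical names, and the "store" step
is just a Vec push standing in for the SQLite writes.

```rust
use std::thread;

/// Hypothetical stand-in for one Ollama embedding request:
/// returns one tiny "embedding" per input text.
fn fake_embed(batch: &[String]) -> Vec<Vec<f32>> {
    batch.iter().map(|text| vec![text.len() as f32]).collect()
}

/// Runs up to `concurrency` batches in parallel, then stores each
/// group's results serially, preserving input order.
fn embed_all(batches: &[Vec<String>], concurrency: usize) -> Vec<Vec<f32>> {
    let mut stored = Vec::new();
    for group in batches.chunks(concurrency.max(1)) {
        // Fan out: one scoped thread per in-flight request.
        let results: Vec<Vec<Vec<f32>>> = thread::scope(|scope| {
            let handles: Vec<_> = group
                .iter()
                .map(|batch| scope.spawn(move || fake_embed(batch)))
                .collect();
            handles
                .into_iter()
                .map(|handle| handle.join().expect("worker panicked"))
                .collect()
        });

        // Serial phase: a single writer, as with a single SQLite connection.
        for batch_result in results {
            stored.extend(batch_result);
        }
    }
    stored
}
```

Keeping the write phase on one thread avoids contention on the
database connection while the slow network/GPU calls still overlap.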
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the "what" is clear from types)
- Implementation context (the "how" is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
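For illustration, the three mechanical clippy rewrites named above look
like this. The function names and bounds are hypothetical and not taken
from the codebase; each doc comment shows the pre-lint form.

```rust
/// Before: `page >= 1 && page <= 100`
fn is_valid_page(page: u32) -> bool {
    (1..=100).contains(&page)
}

/// Before: `count as i64`
/// `i64::from` only compiles for lossless conversions, so a later type
/// change cannot silently truncate.
fn to_row_count(count: u32) -> i64 {
    i64::from(count)
}

/// Before: `limit.max(1).min(100)`
fn clamp_limit(limit: usize) -> usize {
    limit.clamp(1, 100)
}
```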
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:
- Config drift detection: find_pending_documents and
count_pending_documents now take model_name and compare
chunk_max_bytes, model, and dims against stored metadata. Documents
embedded with stale config are automatically re-queued.
- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
are skipped with a sentinel error recorded in embedding_metadata,
preventing both rowid collision and infinite re-processing loops.
- Deferred clearing: old embeddings are no longer cleared before
attempting new ones. clear_document_embeddings is deferred until the
first successful chunk embedding, so if all chunks fail the document
retains its previous embeddings rather than losing all data.
- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
so a crash mid-page rolls back atomically instead of leaving partial
state (cleared embeddings with no replacements).
- Per-chunk retry on context overflow: when a batch fails with a
context-length error, each chunk is retried individually so one
oversized chunk doesn't poison the entire batch.
- Adaptive dedup in vector search: replaces the static 3x over-fetch
multiplier with a dynamic one based on actual max chunks per document
(using the new chunk_count column with a fallback COUNT query for
pre-migration data). Also replaces partial_cmp with total_cmp for
f64 distance sorting.
- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
embedding_metadata to support config drift detection and adaptive
dedup without runtime queries.
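A minimal sketch of the adaptive dedup idea, under stated assumptions:
CHUNK_ROWID_MULTIPLIER is taken to be 1000 (matching the
doc_id * 1000 + chunk_index scheme), hits are (rowid, distance) pairs,
and fetch_size and dedup_by_doc are hypothetical helper names rather
than the project's functions.

```rust
use std::collections::HashSet;

// Assumed value, based on the doc_id * 1000 + chunk_index rowid scheme.
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// How many raw chunk hits to request so that `limit` distinct documents
/// can survive dedup even if one document fills the top results.
fn fetch_size(limit: usize, max_chunks_per_doc: usize) -> usize {
    limit.saturating_mul(max_chunks_per_doc.max(1))
}

/// Sorts hits by ascending distance and keeps the best hit per document.
/// Returns (doc_id, distance) pairs.
fn dedup_by_doc(mut hits: Vec<(i64, f64)>, limit: usize) -> Vec<(i64, f64)> {
    // total_cmp is a total order over f64, so no unwrap on partial_cmp
    // is needed and a NaN distance cannot panic the sort.
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));

    let mut seen = HashSet::new();
    hits.into_iter()
        .map(|(rowid, dist)| (rowid / CHUNK_ROWID_MULTIPLIER, dist))
        .filter(|(doc_id, _)| seen.insert(*doc_id))
        .take(limit)
        .collect()
}
```

With a static 3x multiplier, a document with more than three
near-identical chunks could crowd out other documents; scaling by the
observed maximum avoids that at the cost of a larger fetch.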
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the embedding module that generates vector representations
of documents using a local Ollama instance with the nomic-embed-text
model. These embeddings enable semantic (vector) search and the hybrid
search mode that fuses lexical and semantic results via RRF.
Key components:
- embedding::ollama: HTTP client for the Ollama /api/embeddings
endpoint. Handles connection errors with actionable error messages
(OllamaUnavailable, OllamaModelNotFound) and validates response
dimensions.
- embedding::chunking: Splits long documents into overlapping
paragraph-aware chunks for embedding. Uses a configurable max token
estimate (8192 default for nomic-embed-text) with 10% overlap to
preserve cross-chunk context.
- embedding::chunk_ids: Encodes chunk identity as
doc_id * 1000 + chunk_index for the embeddings table rowid. This
allows vector search to map results back to documents and
deduplicate by doc_id efficiently.
- embedding::change_detector: Compares document content_hash against
stored embedding hashes to skip re-embedding unchanged documents,
making incremental embedding runs fast.
- embedding::pipeline: Orchestrates the full embedding flow: detect
changed documents, chunk them, call Ollama with configurable
concurrency (default 4), and store the results. Supports
--retry-failed to re-attempt previously failed embeddings.
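The chunk_ids encoding described above can be sketched as follows. The
function names are hypothetical; only the doc_id * 1000 + chunk_index
formula comes from this message. The bounds check is an added
assumption showing why a document with 1000 or more chunks would
collide with the next document's rowids.

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Packs (doc_id, chunk_index) into a single rowid.
/// Returns None if chunk_index would spill into the next doc's range
/// or if the multiplication overflows.
fn encode_rowid(doc_id: i64, chunk_index: i64) -> Option<i64> {
    if !(0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index) {
        return None;
    }
    doc_id
        .checked_mul(CHUNK_ROWID_MULTIPLIER)?
        .checked_add(chunk_index)
}

/// Recovers (doc_id, chunk_index) from a rowid.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    (
        rowid / CHUNK_ROWID_MULTIPLIER,
        rowid % CHUNK_ROWID_MULTIPLIER,
    )
}
```

Because doc_id is recoverable by integer division alone, vector search
results can be grouped by document without a join.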
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>