fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures

Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from 500
to 200 to stay within nomic-embed-text's 8,192-token context window. This
commit addresses all downstream consequences of that reduction:

- Config drift detection: find_pending_documents and
  count_pending_documents now take model_name and compare chunk_max_bytes,
  model, and dims against stored metadata. Documents embedded with stale
  config are automatically re-queued.

- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks are
  skipped with a sentinel error recorded in embedding_metadata, preventing
  both rowid collision and infinite re-processing loops.

- Deferred clearing: old embeddings are no longer cleared before attempting
  new ones. clear_document_embeddings is deferred until the first
  successful chunk embedding, so if all chunks fail the document retains
  its previous embeddings rather than losing all data.

- Savepoints: each page of DB writes is wrapped in a SQLite savepoint so a
  crash mid-page rolls back atomically instead of leaving partial state
  (cleared embeddings with no replacements).

- Per-chunk retry on context overflow: when a batch fails with a
  context-length error, each chunk is retried individually so one
  oversized chunk doesn't poison the entire batch.

- Adaptive dedup in vector search: replaces the static 3x over-fetch
  multiplier with a dynamic one based on actual max chunks per document
  (using the new chunk_count column with a fallback COUNT query for
  pre-migration data). Also replaces partial_cmp with total_cmp for f64
  distance sorting.

- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
  embedding_metadata to support config drift detection and adaptive dedup
  without runtime queries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
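
For illustration only, the adaptive dedup and total_cmp sorting described in the message could look roughly like the sketch below. The names ChunkHit, overfetch_multiplier, and dedup_by_document, as well as the clamp cap, are hypothetical and not taken from this commit; the actual search code may differ.

```rust
use std::collections::HashMap;

/// Hypothetical hit type: (document_id, distance) for one matched chunk.
pub type ChunkHit = (i64, f64);

/// Over-fetch multiplier derived from the largest chunk count of any document.
/// Clamping to 1..=cap is an assumption of this sketch, not the commit's logic.
pub fn overfetch_multiplier(max_chunks_per_doc: usize, cap: usize) -> usize {
    max_chunks_per_doc.clamp(1, cap.max(1))
}

/// Keep the closest chunk per document, sort ascending by distance,
/// and return at most `limit` documents.
pub fn dedup_by_document(hits: &[ChunkHit], limit: usize) -> Vec<ChunkHit> {
    let mut best: HashMap<i64, f64> = HashMap::new();
    for &(doc_id, dist) in hits {
        best.entry(doc_id)
            .and_modify(|d| {
                // Replace the stored distance only if the new one is smaller.
                if dist.total_cmp(d).is_lt() {
                    *d = dist;
                }
            })
            .or_insert(dist);
    }
    let mut out: Vec<ChunkHit> = best.into_iter().collect();
    // total_cmp defines a total order over f64, so no Option needs unwrapping
    // (unlike partial_cmp). Ties are broken by document id for determinism.
    out.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    out.truncate(limit);
    out
}
```

Under these assumptions a caller would request limit * overfetch_multiplier(max_chunks, cap) chunk hits from the vector index and pass them through dedup_by_document, so a document with many matching chunks cannot crowd other documents out of the top results.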
2026-02-03 09:35:08 -05:00
parent 2a52594a60
commit 7d07f95d4c
5 changed files with 275 additions and 59 deletions
--- a/src/embedding/chunking.rs
+++ b/src/embedding/chunking.rs
@@ -2,11 +2,19 @@

 /// Maximum bytes per chunk.
 /// Named `_BYTES` because `str::len()` returns byte count; multi-byte UTF-8
-/// sequences mean byte length ≥ char count.
-pub const CHUNK_MAX_BYTES: usize = 32_000;
+/// sequences mean byte length >= char count.
+///
+/// nomic-embed-text has an 8,192-token context window. English prose averages
+/// ~4 chars/token, but technical content (code, URLs, JSON) can be 1-2
+/// chars/token. We use 6,000 bytes as a conservative limit that stays safe
+/// even for code-heavy chunks (~6,000 tokens worst-case).
+pub const CHUNK_MAX_BYTES: usize = 6_000;
+
+/// Expected embedding dimensions for nomic-embed-text.
+pub const EXPECTED_DIMS: usize = 768;

 /// Character overlap between adjacent chunks.
-pub const CHUNK_OVERLAP_CHARS: usize = 500;
+pub const CHUNK_OVERLAP_CHARS: usize = 200;

 /// Split document content into chunks suitable for embedding.
 ///