fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from 500 to 200 to stay within nomic-embed-text's 8,192-token context window. This commit addresses all downstream consequences of that reduction: - Config drift detection: find_pending_documents and count_pending_documents now take model_name and compare chunk_max_bytes, model, and dims against stored metadata. Documents embedded with stale config are automatically re-queued. - Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks are skipped with a sentinel error recorded in embedding_metadata, preventing both rowid collision and infinite re-processing loops. - Deferred clearing: old embeddings are no longer cleared before attempting new ones. clear_document_embeddings is deferred until the first successful chunk embedding, so if all chunks fail the document retains its previous embeddings rather than losing all data. - Savepoints: each page of DB writes is wrapped in a SQLite savepoint so a crash mid-page rolls back atomically instead of leaving partial state (cleared embeddings with no replacements). - Per-chunk retry on context overflow: when a batch fails with a context-length error, each chunk is retried individually so one oversized chunk doesn't poison the entire batch. - Adaptive dedup in vector search: replaces the static 3x over-fetch multiplier with a dynamic one based on actual max chunks per document (using the new chunk_count column with a fallback COUNT query for pre-migration data). Also replaces partial_cmp with total_cmp for f64 distance sorting. - Stores chunk_max_bytes and chunk_count (on sentinel rows) in embedding_metadata to support config drift detection and adaptive dedup without runtime queries. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -2,11 +2,19 @@
|
||||
|
||||
/// Maximum bytes per chunk.
|
||||
/// Named `_BYTES` because `str::len()` returns byte count; multi-byte UTF-8
|
||||
/// sequences mean byte length ≥ char count.
|
||||
pub const CHUNK_MAX_BYTES: usize = 32_000;
|
||||
/// sequences mean byte length >= char count.
|
||||
///
|
||||
/// nomic-embed-text has an 8,192-token context window. English prose averages
|
||||
/// ~4 chars/token, but technical content (code, URLs, JSON) can be 1-2
|
||||
/// chars/token. We use 6,000 bytes as a conservative limit that stays safe
|
||||
/// even for code-heavy chunks (~6,000 tokens worst-case).
|
||||
pub const CHUNK_MAX_BYTES: usize = 6_000;
|
||||
|
||||
/// Expected embedding dimensions for nomic-embed-text.
|
||||
pub const EXPECTED_DIMS: usize = 768;
|
||||
|
||||
/// Character overlap between adjacent chunks.
|
||||
pub const CHUNK_OVERLAP_CHARS: usize = 500;
|
||||
pub const CHUNK_OVERLAP_CHARS: usize = 200;
|
||||
|
||||
/// Split document content into chunks suitable for embedding.
|
||||
///
|
||||
|
||||
Reference in New Issue
Block a user