fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures

Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:

- Config drift detection: find_pending_documents and
  count_pending_documents now take model_name and compare
  chunk_max_bytes, model, and dims against stored metadata. Documents
  embedded with stale config are automatically re-queued.

- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
  are skipped with a sentinel error recorded in embedding_metadata,
  preventing both rowid collision and infinite re-processing loops.

- Deferred clearing: old embeddings are no longer cleared before
  attempting new ones. clear_document_embeddings is deferred until the
  first successful chunk embedding, so if all chunks fail the document
  retains its previous embeddings rather than losing all data.

- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
  so a crash mid-page rolls back atomically instead of leaving partial
  state (cleared embeddings with no replacements).

- Per-chunk retry on context overflow: when a batch fails with a
  context-length error, each chunk is retried individually so one
  oversized chunk doesn't poison the entire batch.

- Adaptive dedup in vector search: replaces the static 3x over-fetch
  multiplier with a dynamic one based on actual max chunks per document
  (using the new chunk_count column with a fallback COUNT query for
  pre-migration data). Also replaces partial_cmp with total_cmp for
  f64 distance sorting.

- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
  embedding_metadata to support config drift detection and adaptive
  dedup without runtime queries.
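
As a rough illustration of the adaptive dedup bullet, the sketch below
over-fetches by the largest per-document chunk count and then keeps the
closest chunk per document, sorting with total_cmp. The names
overfetch_limit and dedup_by_document are hypothetical, and using the max
chunk count directly as the multiplier is an assumption; the actual code
in this commit may differ. partial_cmp returns None when a distance is
NaN, whereas total_cmp always yields an ordering, so the sort needs no
unwrap.

```rust
use std::collections::HashMap;

/// Hypothetical helper: how many raw chunk rows to request so that about
/// `limit` distinct documents survive deduplication. Assumes the
/// multiplier is simply the largest chunk count of any document.
pub fn overfetch_limit(limit: usize, max_chunks_per_doc: usize) -> usize {
    // Treat 0 as 1 so an empty index never shrinks the request to zero.
    limit.saturating_mul(max_chunks_per_doc.max(1))
}

/// Hypothetical helper: collapse (document_id, distance) chunk hits to the
/// best hit per document, ordered by ascending distance.
pub fn dedup_by_document(hits: &[(i64, f64)], limit: usize) -> Vec<(i64, f64)> {
    let mut best: HashMap<i64, f64> = HashMap::new();
    for &(doc_id, dist) in hits {
        best.entry(doc_id)
            .and_modify(|d| {
                // Keep the smaller distance for this document.
                if dist.total_cmp(&*d).is_lt() {
                    *d = dist;
                }
            })
            .or_insert(dist);
    }
    let mut out: Vec<(i64, f64)> = best.into_iter().collect();
    // total_cmp is a total order on f64; ties break on document id so the
    // result is deterministic.
    out.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    out.truncate(limit);
    out
}
```

With a static 3x multiplier, a document that contributes more than three
chunks to the top of the result set could crowd out other documents;
scaling by the observed maximum avoids that at the cost of a larger fetch.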

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Taylor Eernisse

2026-02-03 09:35:08 -05:00

parent 2a52594a60

commit 7d07f95d4c

5 changed files with 275 additions and 59 deletions

									
src/embedding/chunk_ids.rs
									
@@ -1,5 +1,7 @@
 /// Multiplier for encoding (document_id, chunk_index) into a single rowid.
-/// Supports up to 1000 chunks per document (32M chars at 32k/chunk).
+/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
+/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
+/// The pipeline enforces chunk_count < CHUNK_ROWID_MULTIPLIER at runtime.
 pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
 
 /// Encode (document_id, chunk_index) into a sqlite-vec rowid.
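
The doc comment above explains why the chunk count must stay below the multiplier. A minimal sketch of that reasoning follows, assuming the rowid is computed as document_id * CHUNK_ROWID_MULTIPLIER + chunk_index. The function names encode_rowid, decode_rowid, and chunk_count_fits are hypothetical and are not taken from the diff, which shows only the constant and the start of the next doc comment.

```rust
/// Same value as the constant in the diff, repeated so the sketch compiles
/// on its own.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Assumed encoding: the document id occupies the high part of the rowid
/// and the chunk index the low three decimal digits.
pub fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

/// Inverse of the assumed encoding; valid only for non-negative ids with
/// chunk_index < CHUNK_ROWID_MULTIPLIER.
pub fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

/// Hypothetical form of the runtime guard: a document is embeddable only
/// if every chunk index stays strictly below the multiplier.
pub fn chunk_count_fits(chunk_count: usize) -> bool {
    (chunk_count as i64) < CHUNK_ROWID_MULTIPLIER
}
```

Under this encoding, chunk 1000 of document 42 and chunk 0 of document 43 would both map to rowid 43000, which is the collision the overflow guard is meant to prevent.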
