# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.

## Problem Statement

Three issues stem from the chunk size reduction:

1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for retry_failed and never passes `options.full` to embed. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.
2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- different-granularity vectors in the same KNN space degrade search quality. No mechanism detects this drift.
3. **Static dedup multiplier**: `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to the actual data.

## Decision Record

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |
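To make the dedup pressure in issue 3 concrete, here is a rough worst-case calculation. The helper is a hypothetical illustration, not code from the repo; the 5-6 chunks-per-document figure comes from the problem statement above:

```rust
/// Worst-case count of unique documents among `k` over-fetched KNN hits when
/// every hit document contributes `chunks_per_doc` chunks to the result set.
/// Hypothetical illustration only -- not part of the codebase.
fn worst_case_unique_docs(k: usize, chunks_per_doc: usize) -> usize {
    k / chunks_per_doc.max(1)
}

fn main() {
    // Old config (32KB chunks, often one chunk per document):
    // k = 20 * 8 = 160 slots comfortably cover 20 unique documents.
    assert_eq!(worst_case_unique_docs(160, 1), 160);
    // New config (6KB chunks, up to 6 chunks per document): the same 160
    // slots cover as few as 26 unique documents -- a thin margin over the
    // default limit of 20 once results cluster.
    assert_eq!(worst_case_unique_docs(160, 6), 26);
}
```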
---

## Changes

### Change 1: Wire `--full` flag through to embed

**Files:**

- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)

**main.rs `handle_embed`** (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at the top of main.rs from `run_embed` to `run_embed_full`.

**sync.rs `run_sync`** (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.

**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`). All callers should use `run_embed_full` directly. Rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`. Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result
```

---

### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version
SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);

-- Or if using INSERT pattern:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000,
        'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check the existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency.
---

### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id,
        chunk_index as i64,
        model_name,
        EXPECTED_DIMS as i64,
        doc_hash,
        chunk_hash,
        now,
        CHUNK_MAX_BYTES as i64
    ],
)?;
```

**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
         attempt_count = embedding_metadata.attempt_count + 1,
         last_error = ?8,
         last_attempt_at = ?7,
         chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id,
        chunk_index as i64,
        model_name,
        EXPECTED_DIMS as i64,
        doc_hash,
        chunk_hash,
        now,
        error,
        CHUNK_MAX_BYTES as i64
    ],
)?;
```

---

### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).
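The NULL handling of that third condition can be stated as a standalone predicate. This is a hypothetical helper for illustration only; the real check is the Case 3 clause in the SQL that follows:

```rust
/// Returns true when a stored embedding must be regenerated because it was
/// produced under a different chunk config. `stored` is the row's
/// chunk_max_bytes; None models SQL NULL (pre-migration rows).
/// Hypothetical helper mirroring the SQL predicate -- not part of the codebase.
fn chunk_config_drifted(stored: Option<i64>, current: i64) -> bool {
    match stored {
        None => true,            // pre-migration row: config unknown, re-embed
        Some(s) => s != current, // re-embed only if the config actually changed
    }
}

fn main() {
    let current = 6000; // CHUNK_MAX_BYTES after the reduction (see Change 6)
    assert!(chunk_config_drifted(None, current));         // NULL => stale
    assert!(chunk_config_drifted(Some(32_000), current)); // old 32KB config => stale
    assert!(!chunk_config_drifted(Some(6000), current));  // current config => fresh
}
```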
```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id
                  AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id
                  AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.

---

### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual maximum chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt
            FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<SearchResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with floor of 8x.
    // The 1.5x safety margin handles clustering in KNN results.
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

**Why `max_chunks * 1.5` with floor of 8**:

- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- Floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond

---

### Change 6: Update chunk_ids.rs comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```

---

## Files Modified (Summary)

| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |

## Verification

1. **Compile check**: `cargo build` -- no errors
2. **Unit tests**: `cargo test` -- all existing tests pass
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (embedded count should reset, then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without `--full`) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with the actual chunk distribution.

## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

**Downstream consequences identified:**

1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for local model

**Rejected alternatives:**

- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over adaptive. sqlite-vec doesn't support OFFSET in KNN queries, requiring a full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates the principle of least surprise.
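As a closing sanity check on Change 5, the adaptive `k` works out as follows. This is a standalone sketch of the multiplier arithmetic (`adaptive_k` is a hypothetical name; the logic is inline in `search_vector`):

```rust
/// Mirror of the Change 5 arithmetic: multiplier is 1.5x the max chunks per
/// document (integer math, so 3 / 2) with a floor of 8; k = limit * multiplier.
/// Hypothetical standalone helper -- the real code inlines this.
fn adaptive_k(limit: usize, max_chunks: i64) -> usize {
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    limit * multiplier
}

fn main() {
    // Single-chunk corpus: the floor of 8 reproduces the old `limit * 8`.
    assert_eq!(adaptive_k(20, 1), 160);
    // 6 chunks per document (the 6KB-chunk worst case): 6 * 3 / 2 = 9, k = 180.
    assert_eq!(adaptive_k(20, 6), 180);
    // Outlier document with 40 chunks: multiplier 60, k = 1200.
    assert_eq!(adaptive_k(20, 40), 1200);
}
```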