# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced `CHUNK_MAX_BYTES` from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.

## Problem Statement

Three issues stem from the chunk size reduction:

1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (it calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for `retry_failed` and never passes `options.full` to the embed step. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.

2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- vectors of different granularity in the same KNN space degrade search quality. No mechanism detects this drift.

3. **Static dedup multiplier**: `search_vector` over-fetches `limit * 8` rows to leave room for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust those slots before reaching `limit` unique documents. The multiplier should adapt to the actual data.
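Back-of-the-envelope arithmetic for problem 3 (illustrative only; the helper below is hypothetical, not code from the repo):

```rust
// Worst-case count of unique documents surviving dedup when the fetched
// chunks belong to as few distinct documents as possible.
fn worst_case_unique_docs(limit: usize, multiplier: usize, chunks_per_doc: usize) -> usize {
    (limit * multiplier) / chunks_per_doc.max(1)
}

fn main() {
    // Typical 6KB-chunk documents (~6 chunks): k = 20 * 8 = 160 slots still
    // cover 26 unique documents, enough for limit = 20.
    assert_eq!(worst_case_unique_docs(20, 8, 6), 26);
    // But a single 2MB document chunked at 6KB yields ~333 chunks, so one
    // cluster of hits inside it can swallow the entire over-fetch budget.
    assert_eq!(worst_case_unique_docs(20, 8, 333), 0);
}
```

The static 8x multiplier survives the typical case but fails the worst case, which motivates the adaptive multiplier in Change 5.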
## Decision Record

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes `full` to the embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |

---
## Changes

### Change 1: Wire `--full` flag through to embed

**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)

**main.rs `handle_embed`** (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at the top of main.rs from `run_embed` to `run_embed_full`.

**sync.rs `run_sync`** (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.

**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`); all callers should use `run_embed_full` directly. Then rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`, and update the call sites above to match the new name.

Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```

---
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);

-- Or, if the schema uses the INSERT pattern instead:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check the existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency, and keep only one of the two `schema_version` statements above (whichever matches it).

---
### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update the SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```

**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
       attempt_count = embedding_metadata.attempt_count + 1,
       last_error = ?8,
       last_attempt_at = ?7,
       chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```

---
### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings whose `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).

```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
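The case 3 predicate reduces to a simple rule that can be pinned down in isolation. The helper below is hypothetical (not part of change_detector.rs), shown only to make the NULL semantics explicit:

```rust
/// Mirrors the SQL predicate `chunk_max_bytes IS NULL OR chunk_max_bytes != ?3`:
/// a missing (pre-migration) value counts as drift, as does any mismatch.
fn chunk_config_drifted(stored: Option<i64>, current: i64) -> bool {
    match stored {
        None => true,            // pre-migration row: unknown config, re-embed
        Some(v) => v != current, // config changed since this row was embedded
    }
}

fn main() {
    let current = 6000; // CHUNK_MAX_BYTES per this plan
    assert!(chunk_config_drifted(None, current));         // NULL -> stale
    assert!(chunk_config_drifted(Some(32_768), current)); // old 32KB config -> stale
    assert!(!chunk_config_drifted(Some(6000), current));  // up to date
}
```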
---
### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual maximum chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns 1 (via COALESCE, or on query error) if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with a floor of 8x.
    // The 1.5x safety margin handles clustering in KNN results.
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

**Why `max_chunks * 1.5` with a floor of 8:**
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- The floor of `8` ensures reasonable over-fetch even with single-chunk documents
- The aggregate is a single query on an indexed column -- sub-millisecond
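Plugging sample values into the formula (illustrative arithmetic; the standalone helper is hypothetical):

```rust
// Integer-math form of the adaptive multiplier from the plan:
// multiplier = max(max_chunks * 3 / 2, 8), then k = limit * multiplier.
fn adaptive_multiplier(max_chunks: usize) -> usize {
    (max_chunks * 3 / 2).max(8)
}

fn main() {
    assert_eq!(adaptive_multiplier(1), 8);     // single-chunk corpus: floor applies
    assert_eq!(adaptive_multiplier(6), 9);     // typical 6KB-chunk documents
    assert_eq!(adaptive_multiplier(333), 499); // corpus containing one 2MB document
    // With limit = 20, k scales from 160 up to 9980 across these cases.
    assert_eq!(20 * adaptive_multiplier(333), 9980);
}
```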
---
### Change 6: Update chunk_ids.rs comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
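The encoding the comment describes implies pack/unpack helpers along these lines (a sketch inferred from the multiplier constant, not the actual chunk_ids.rs code):

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

// Pack (document_id, chunk_index) into one rowid; chunk_index must stay below 1000.
fn chunk_rowid(doc_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index));
    doc_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

// Recover both components from a packed rowid.
fn split_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    assert_eq!(chunk_rowid(42, 7), 42007);
    assert_eq!(split_rowid(42007), (42, 7));
    // ~333 chunks for a 2MB document at 6KB chunks fits well under the 1000-slot cap.
    assert!(2_000_000 / 6000 < CHUNK_ROWID_MULTIPLIER);
}
```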
---
## Files Modified (Summary)

| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification

1. **Compile check**: `cargo build` -- no errors
2. **Unit tests**: `cargo test` -- all existing tests pass
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (the embedded count should reset, then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without `--full`) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with the actual chunk distribution.
## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced `CHUNK_MAX_BYTES` from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

**Downstream consequences identified:**
1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for local model

**Rejected alternatives:**
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over the adaptive multiplier. sqlite-vec doesn't support OFFSET in KNN queries, so the second phase would require a full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates the principle of least surprise.