gitlore/docs/embedding-pipeline-hardening.md
Taylor Eernisse a417640faa docs: Overhaul AGENTS.md, update README, add pipeline spec and Phase B plan
AGENTS.md: Comprehensive rewrite adding file deletion safeguards,
destructive git command protocol, Rust toolchain conventions, code
editing discipline rules, compiler check requirements, TDD mandate,
MCP Agent Mail coordination protocol, beads/bv/ubs/ast-grep/cass
tool documentation, and session completion workflow.

README.md: Document NO_COLOR/CLICOLOR env vars, --since 1m duration,
project resolution cascading match logic, lore health and robot-docs
commands, exit codes 17 (not found) and 18 (ambiguous match),
--color/--quiet global flags, dirty_sources and
pending_discussion_fetches tables, and version command git hash output.

docs/embedding-pipeline-hardening.md: Detailed spec covering the three
problems from the chunk size reduction (broken --full wiring, mixed
chunk sizes in vector space, static dedup multiplier) with decision
records, implementation plan, and acceptance criteria.

docs/phase-b-temporal-intelligence.md: Draft planning document for
transforming gitlore from a search engine into a temporal code
intelligence system by ingesting structured event data from GitLab.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:51 -05:00


Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

Status: Proposed
Date: 2026-02-02
Context: Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.

Problem Statement

Three issues stem from the chunk size reduction:

  1. Broken --full wiring: handle_embed in main.rs ignores args.full (calls run_embed instead of run_embed_full). run_sync hardcodes false for retry_failed and never passes options.full to embed. Users running lore sync --full or lore embed --full don't get a full re-embed.

  2. Mixed chunk sizes in vector space: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- different granularity vectors in the same KNN space degrade search quality. No mechanism detects this drift.

  3. Static dedup multiplier: search_vector uses limit * 8 to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching limit unique documents. The multiplier should adapt to actual data.
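To make the failure mode concrete, here is a small standalone simulation of first-chunk-per-document dedup (a sketch for illustration only, not the actual search_vector code; the hit layout and helper name are invented):

```rust
/// A KNN hit: (document_id, distance). Chunk detail is omitted for the sketch.
/// Keep the first (best-ranked) hit per document, up to `limit` unique documents.
fn dedup_first_per_doc(hits: &[(i64, f32)], limit: usize) -> Vec<i64> {
    let mut seen = std::collections::HashSet::new();
    let mut out = Vec::new();
    for &(doc_id, _dist) in hits {
        if seen.insert(doc_id) {
            out.push(doc_id);
            if out.len() == limit {
                break;
            }
        }
    }
    out
}

fn main() {
    // Worst case: one 2MB document now yields ~333 chunks at 6KB each.
    // If its chunks dominate the KNN neighborhood, a static k = limit * 8
    // spends most of the over-fetch budget on a single document.
    let limit = 20;
    let k = limit * 8; // 160 over-fetched slots
    let mut hits: Vec<(i64, f32)> = (0..150).map(|i| (1_i64, i as f32)).collect();
    hits.extend((0..(k - 150)).map(|i| (2_i64 + i as i64, 200.0 + i as f32)));
    let unique = dedup_first_per_doc(&hits[..k], limit);
    // 150 of 160 slots go to document 1; only 11 unique documents survive,
    // well short of the requested 20.
    assert_eq!(unique.len(), 11);
    println!("unique docs: {}", unique.len());
}
```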

Decision Record

| Decision | Choice | Rationale |
| --- | --- | --- |
| Detect chunk config drift | Store chunk_max_bytes in embedding_metadata | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| --full propagation | sync --full passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for chunk_max_bytes column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |

Changes

Change 1: Wire --full flag through to embed

Files:

  • src/main.rs (line 1116)
  • src/cli/commands/sync.rs (line 105)

main.rs handle_embed (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at top of main.rs from run_embed to run_embed_full.

sync.rs run_sync (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from run_embed to run_embed_full.

Clean up embed.rs: Remove run_embed (the wrapper that hardcodes full: false) so that all callers use run_embed_full directly, then rename run_embed_full to run_embed with the 3-arg signature (config, full, retry_failed).

Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```

Change 2: Migration 010 -- add chunk_max_bytes to embedding_metadata

New file: migrations/010_chunk_config.sql

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
  WHERE version = (SELECT MAX(version) FROM schema_version);
-- Or if using INSERT pattern:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check existing migration pattern in src/core/db.rs for how migrations are applied -- follow that exact pattern for consistency.


Change 3: Store chunk_max_bytes when writing embeddings

File: src/embedding/pipeline.rs

store_embedding (lines 238-266): Add chunk_max_bytes to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```

record_embedding_error (lines 269-291): Also store chunk_max_bytes so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
       attempt_count = embedding_metadata.attempt_count + 1,
       last_error = ?8,
       last_attempt_at = ?7,
       chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```

Change 4: Detect chunk config drift in change detector

File: src/embedding/change_detector.rs

Add a third condition to the pending detection: embeddings where chunk_max_bytes differs from the current CHUNK_MAX_BYTES constant (or is NULL, meaning pre-migration embeddings).

```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to count_pending_documents -- add the third OR clause and the ?3 parameter.
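The NULL-or-mismatch rule from Case 3 boils down to this pure predicate (the helper name is hypothetical and shown only to pin down the semantics; the real check lives in the SQL above):

```rust
/// True when a stored chunk config means the embedding must be rebuilt:
/// either the column is NULL (pre-migration row) or it differs from the
/// current compile-time CHUNK_MAX_BYTES.
fn chunk_config_stale(stored: Option<i64>, current: i64) -> bool {
    stored.map_or(true, |s| s != current)
}

fn main() {
    let current = 6000; // mirrors the reduced CHUNK_MAX_BYTES
    assert!(chunk_config_stale(None, current));         // pre-migration NULL row
    assert!(chunk_config_stale(Some(32_768), current)); // old 32KB-era row (illustrative value)
    assert!(!chunk_config_stale(Some(6000), current));  // up to date
}
```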


Change 5: Adaptive dedup multiplier in vector search

File: src/search/vector.rs

Replace the static limit * 8 with an adaptive multiplier based on the actual max chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with floor of 8x
    // The 1.5x safety margin handles clustering in KNN results
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

Why max_chunks * 1.5 with floor of 8:

  • max_chunks is the worst case for a single document dominating results
  • * 1.5 adds margin for multiple clustered documents
  • Floor of 8 ensures reasonable over-fetch even with single-chunk documents
  • This is a single aggregate query on an indexed column -- sub-millisecond
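Plugging in a few values shows how the formula behaves (a standalone sketch of the same arithmetic, not the production code):

```rust
/// Same arithmetic as the adaptive over-fetch in search_vector:
/// 1.5x the observed max chunks per document, floored at 8.
fn adaptive_multiplier(max_chunks: i64) -> usize {
    (max_chunks as usize * 3 / 2).max(8)
}

fn main() {
    // Single-chunk corpus: the floor preserves the old 8x behavior.
    assert_eq!(adaptive_multiplier(1), 8);
    // Typical post-change document (5-6 chunks): barely above the floor.
    assert_eq!(adaptive_multiplier(6), 9);
    // Worst-case 2MB document (~333 chunks): k = 20 * 499 = 9980 for limit 20.
    assert_eq!(adaptive_multiplier(333), 499);
}
```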

Change 6: Update chunk_ids.rs comment

File: src/embedding/chunk_ids.rs (line 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
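Assuming the usual encode/decode shape for such a multiplier (a sketch; the actual helpers in chunk_ids.rs may be named differently), the rowid packing looks like:

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Pack (document_id, chunk_index) into one rowid; chunk_index must stay below 1000.
fn encode_rowid(doc_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index));
    doc_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

/// Recover (document_id, chunk_index) from a packed rowid.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    // A 2MB document at 6KB chunks needs ~333 slots, well under the 1000 cap.
    let rowid = encode_rowid(42, 333);
    assert_eq!(rowid, 42_333);
    assert_eq!(decode_rowid(rowid), (42, 333));
}
```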

Files Modified (Summary)

| File | Change |
| --- | --- |
| migrations/010_chunk_config.sql | NEW -- Add chunk_max_bytes column |
| src/embedding/pipeline.rs | Store CHUNK_MAX_BYTES in metadata writes |
| src/embedding/change_detector.rs | Detect chunk config drift (3rd OR clause) |
| src/search/vector.rs | Adaptive dedup multiplier from DB |
| src/cli/commands/embed.rs | Consolidate to single run_embed(config, full, retry_failed) |
| src/cli/commands/sync.rs | Pass options.full to embed, update import |
| src/main.rs | Call run_embed with args.full, update import |
| src/embedding/chunk_ids.rs | Comment update only |

Verification

  1. Compile check: cargo build -- no errors
  2. Unit tests: cargo test -- all existing tests pass
  3. Migration test: Run lore doctor or lore migrate -- migration 010 applies cleanly
  4. Full flag wiring: lore embed --full should clear all embeddings and re-embed. Verify by checking lore --robot stats before and after (embedded count should reset then rebuild).
  5. Chunk config drift: After migration, existing embeddings have chunk_max_bytes = NULL. Running lore embed (without --full) should detect all existing embeddings as stale and re-embed them automatically.
  6. Sync propagation: lore sync --full should produce the same embed behavior as lore embed --full
  7. Adaptive dedup: Run lore search "some query" and verify the result count matches the requested limit (default 20). Check with RUST_LOG=debug that the computed k value scales with actual chunk distribution.

Decision Record (for future reference)

Date: 2026-02-02
Trigger: Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

Downstream consequences identified:

  1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
  2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
  3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
  4. Mixed chunk sizes degrade search -- fixed with config drift detection
  5. Ollama API call volume increases proportionally -- acceptable for local model

Rejected alternatives:

  • Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over adaptive. sqlite-vec doesn't support OFFSET in KNN queries, requiring full re-query.
  • Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
  • Manual --full as the only drift remedy: requires users to understand chunk config internals. Violates principle of least surprise.