docs: Overhaul AGENTS.md, update README, add pipeline spec and Phase B plan
AGENTS.md: Comprehensive rewrite adding file deletion safeguards, destructive git command protocol, Rust toolchain conventions, code editing discipline rules, compiler check requirements, TDD mandate, MCP Agent Mail coordination protocol, beads/bv/ubs/ast-grep/cass tool documentation, and session completion workflow.

README.md: Document NO_COLOR/CLICOLOR env vars, --since 1m duration, project resolution cascading match logic, lore health and robot-docs commands, exit codes 17 (not found) and 18 (ambiguous match), --color/--quiet global flags, dirty_sources and pending_discussion_fetches tables, and version command git hash output.

docs/embedding-pipeline-hardening.md: Detailed spec covering the three problems from the chunk size reduction (broken --full wiring, mixed chunk sizes in vector space, static dedup multiplier), with decision records, an implementation plan, and acceptance criteria.

docs/phase-b-temporal-intelligence.md: Draft planning document for transforming gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.
## Problem Statement

Three issues stem from the chunk size reduction:

1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (it calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for `retry_failed` and never passes `options.full` to the embed step. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.

2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- vectors of different granularity in the same KNN space degrade search quality. No mechanism detects this drift.

3. **Static dedup multiplier**: `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to the actual data.
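To make the slot-exhaustion risk in problem 3 concrete, here is a small arithmetic sketch. The function name and the per-document chunk counts are illustrative, not taken from the codebase:

```rust
/// Worst-case unique documents reachable when every fetched KNN row
/// belongs to a maximally chunked document.
fn worst_case_unique_docs(k: usize, chunks_per_doc: usize) -> usize {
    k / chunks_per_doc.max(1)
}

fn main() {
    let limit = 20;
    let k = limit * 8; // the static over-fetch: 160 rows
    // At 5 chunks per doc there is still headroom: 160 / 5 = 32 unique docs.
    println!("{}", worst_case_unique_docs(k, 5));
    // At 8 chunks per doc the budget is exactly exhausted: 160 / 8 = 20.
    println!("{}", worst_case_unique_docs(k, 8));
}
```

The headroom shrinks as chunks per document grow, which is what motivates the adaptive multiplier in Change 5.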
## Decision Record

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |

---
## Changes

### Change 1: Wire `--full` flag through to embed

**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)

**main.rs `handle_embed`** (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at the top of main.rs from `run_embed` to `run_embed_full`.

**sync.rs `run_sync`** (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.

**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`) so all callers go through `run_embed_full` directly, then rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`. After the rename, the call sites above use the name `run_embed` again, now with the explicit `full` argument.

Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```

---
|
||||
|
||||
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);

-- Or, if the project uses the INSERT pattern instead:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check the existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency, and keep only one of the two `schema_version` statements above.

---
|
||||
|
||||
### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```

**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
       attempt_count = embedding_metadata.attempt_count + 1,
       last_error = ?8,
       last_attempt_at = ?7,
       chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```

---
|
||||
|
||||
### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).

```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
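The Case 3 condition can be mirrored as a small pure-Rust predicate, handy for unit-testing the rule in isolation. This is a sketch; the function name is illustrative, not from the codebase:

```rust
/// A stored chunk config needs re-embedding when it predates the
/// migration (NULL / None) or differs from the current constant.
fn chunk_config_stale(stored: Option<i64>, current: i64) -> bool {
    match stored {
        None => true,            // pre-migration row: unknown config
        Some(v) => v != current, // config drift
    }
}

fn main() {
    let current = 6000; // illustrative CHUNK_MAX_BYTES
    assert!(chunk_config_stale(None, current));         // NULL -> stale
    assert!(chunk_config_stale(Some(32_000), current)); // old 32KB chunks -> stale
    assert!(!chunk_config_stale(Some(6000), current));  // matches -> fresh
    println!("ok");
}
```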
---

### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual maximum chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with a floor of 8x.
    // The 1.5x safety margin handles clustering in KNN results.
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

**Why `max_chunks * 1.5` with a floor of 8**:
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- The floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond
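Worked numbers for the formula, assuming the default limit of 20 (a standalone sketch; `dedup_k` is an illustrative name, not a function in the codebase):

```rust
/// Compute the KNN over-fetch size from the observed max chunks per doc.
fn dedup_k(limit: usize, max_chunks: usize) -> usize {
    let multiplier = (max_chunks * 3 / 2).max(8); // 1.5x margin, floor of 8
    limit * multiplier
}

fn main() {
    // Single-chunk corpus: the floor kicks in, k = 20 * 8 = 160.
    println!("{}", dedup_k(20, 1));
    // 6 chunks per doc: 6 * 3 / 2 = 9, so k = 20 * 9 = 180.
    println!("{}", dedup_k(20, 6));
    // Large docs (40 chunks): 40 * 3 / 2 = 60, so k = 20 * 60 = 1200.
    println!("{}", dedup_k(20, 40));
}
```

Note that integer arithmetic (`* 3 / 2`) rounds the 1.5x margin down, which the floor of 8 absorbs for small chunk counts.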
---

### Change 6: Update chunk_ids.rs comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
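For reference, the packing that multiplier implies can be sketched as a pair of pure functions. This is the assumed encoding (document_id times the multiplier, plus chunk_index); verify against the actual helpers in `chunk_ids.rs` before relying on it:

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Pack (document_id, chunk_index) into one rowid. Requires
/// chunk_index < 1000, which a 2MB doc at 6KB chunks (~333) satisfies.
fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index));
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

/// Recover (document_id, chunk_index) from a packed rowid.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    let rowid = encode_rowid(42, 333);
    println!("{}", rowid); // 42333
    assert_eq!(decode_rowid(rowid), (42, 333));
}
```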
---

## Files Modified (Summary)

| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification

1. **Compile check**: `cargo build` -- no errors.
2. **Unit tests**: `cargo test` -- all existing tests pass.
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly.
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (the embedded count should reset, then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without `--full`) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`.
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with the actual chunk distribution.
## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

**Downstream consequences identified:**
1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased ~5x -- fixed with the adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for a local model

**Rejected alternatives:**
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over the adaptive multiplier. sqlite-vec doesn't support OFFSET in KNN queries, so the second phase would require a full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates the principle of least surprise.