# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced `CHUNK_MAX_BYTES` from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.

## Problem Statement

Three issues stem from the chunk size reduction:

1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (it calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for `retry_failed` and never passes `options.full` to the embed step. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.

2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- vectors of different granularity in the same KNN space degrade search quality. No mechanism detects this drift.

3. **Static dedup multiplier**: `search_vector` over-fetches `limit * 8` rows to leave room for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust those slots before reaching `limit` unique documents. The multiplier should adapt to the actual data.
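Back-of-the-envelope arithmetic for problem 3 (illustrative only; the helper below is hypothetical, not code from the repo):

```rust
// Worst-case count of unique documents surviving dedup when the fetched
// chunks belong to as few distinct documents as possible.
fn worst_case_unique_docs(limit: usize, multiplier: usize, chunks_per_doc: usize) -> usize {
    (limit * multiplier) / chunks_per_doc.max(1)
}

fn main() {
    // Typical 6KB-chunk documents (~6 chunks): k = 20 * 8 = 160 slots still
    // cover 26 unique documents, enough for limit = 20.
    assert_eq!(worst_case_unique_docs(20, 8, 6), 26);
    // But a single 2MB document chunked at 6KB yields ~333 chunks, so one
    // cluster of hits inside it can swallow the entire over-fetch budget.
    assert_eq!(worst_case_unique_docs(20, 8, 333), 0);
}
```

The static 8x multiplier survives the typical case but fails the worst case, which motivates the adaptive multiplier in Change 5.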
## Decision Record

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes `full` to the embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |

---
## Changes

### Change 1: Wire `--full` flag through to embed

**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)

**main.rs `handle_embed`** (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at the top of main.rs from `run_embed` to `run_embed_full`.

**sync.rs `run_sync`** (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.

**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`); all callers should use `run_embed_full` directly. Then rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`, and update the call sites above to match the new name.

Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```

---
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);

-- Or, if the schema uses the INSERT pattern instead:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check the existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency, and keep only one of the two `schema_version` statements above (whichever matches it).

---
### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update the SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```

**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
       attempt_count = embedding_metadata.attempt_count + 1,
       last_error = ?8,
       last_attempt_at = ?7,
       chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```

---
### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings whose `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).

```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
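The case 3 predicate reduces to a simple rule that can be pinned down in isolation. The helper below is hypothetical (not part of change_detector.rs), shown only to make the NULL semantics explicit:

```rust
/// Mirrors the SQL predicate `chunk_max_bytes IS NULL OR chunk_max_bytes != ?3`:
/// a missing (pre-migration) value counts as drift, as does any mismatch.
fn chunk_config_drifted(stored: Option<i64>, current: i64) -> bool {
    match stored {
        None => true,            // pre-migration row: unknown config, re-embed
        Some(v) => v != current, // config changed since this row was embedded
    }
}

fn main() {
    let current = 6000; // CHUNK_MAX_BYTES per this plan
    assert!(chunk_config_drifted(None, current));         // NULL -> stale
    assert!(chunk_config_drifted(Some(32_768), current)); // old 32KB config -> stale
    assert!(!chunk_config_drifted(Some(6000), current));  // up to date
}
```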
---
### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual maximum chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns 1 (via COALESCE, or on query error) if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with a floor of 8x.
    // The 1.5x safety margin handles clustering in KNN results.
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

**Why `max_chunks * 1.5` with a floor of 8:**
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- The floor of `8` ensures reasonable over-fetch even with single-chunk documents
- The aggregate is a single query on an indexed column -- sub-millisecond
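Plugging sample values into the formula (illustrative arithmetic; the standalone helper is hypothetical):

```rust
// Integer-math form of the adaptive multiplier from the plan:
// multiplier = max(max_chunks * 3 / 2, 8), then k = limit * multiplier.
fn adaptive_multiplier(max_chunks: usize) -> usize {
    (max_chunks * 3 / 2).max(8)
}

fn main() {
    assert_eq!(adaptive_multiplier(1), 8);     // single-chunk corpus: floor applies
    assert_eq!(adaptive_multiplier(6), 9);     // typical 6KB-chunk documents
    assert_eq!(adaptive_multiplier(333), 499); // corpus containing one 2MB document
    // With limit = 20, k scales from 160 up to 9980 across these cases.
    assert_eq!(20 * adaptive_multiplier(333), 9980);
}
```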
---
### Change 6: Update chunk_ids.rs comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
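The encoding the comment describes implies pack/unpack helpers along these lines (a sketch inferred from the multiplier constant, not the actual chunk_ids.rs code):

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

// Pack (document_id, chunk_index) into one rowid; chunk_index must stay below 1000.
fn chunk_rowid(doc_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index));
    doc_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

// Recover both components from a packed rowid.
fn split_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    assert_eq!(chunk_rowid(42, 7), 42007);
    assert_eq!(split_rowid(42007), (42, 7));
    // ~333 chunks for a 2MB document at 6KB chunks fits well under the 1000-slot cap.
    assert!(2_000_000 / 6000 < CHUNK_ROWID_MULTIPLIER);
}
```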
---
## Files Modified (Summary)

| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification

1. **Compile check**: `cargo build` -- no errors
2. **Unit tests**: `cargo test` -- all existing tests pass
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (the embedded count should reset, then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without `--full`) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with the actual chunk distribution.
## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced `CHUNK_MAX_BYTES` from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

**Downstream consequences identified:**
1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for local model

**Rejected alternatives:**
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over the adaptive multiplier. sqlite-vec doesn't support OFFSET in KNN queries, so the second phase would require a full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates the principle of least surprise.