docs: Overhaul AGENTS.md, update README, add pipeline spec and Phase B plan
AGENTS.md: Comprehensive rewrite adding file deletion safeguards, destructive git command protocol, Rust toolchain conventions, code editing discipline rules, compiler check requirements, TDD mandate, MCP Agent Mail coordination protocol, beads/bv/ubs/ast-grep/cass tool documentation, and session completion workflow.

README.md: Document NO_COLOR/CLICOLOR env vars, --since 1m duration, project resolution cascading match logic, lore health and robot-docs commands, exit codes 17 (not found) and 18 (ambiguous match), --color/--quiet global flags, dirty_sources and pending_discussion_fetches tables, and version command git hash output.

docs/embedding-pipeline-hardening.md: Detailed spec covering the three problems from the chunk size reduction (broken --full wiring, mixed chunk sizes in vector space, static dedup multiplier), with decision records, an implementation plan, and acceptance criteria.

docs/phase-b-temporal-intelligence.md: Draft planning document for transforming gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.
## Problem Statement

Three issues stem from the chunk size reduction:

1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (it calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for `retry_failed` and never passes `options.full` to the embed step. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.

2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- vectors of different granularity in the same KNN space degrade search quality. No mechanism detects this drift.

3. **Static dedup multiplier**: `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to the actual data.
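To make the slot-exhaustion risk in problem 3 concrete, here is a small arithmetic sketch. The function name and the per-document chunk counts are illustrative, not taken from the codebase:

```rust
/// Worst-case unique documents reachable when every fetched KNN row
/// belongs to a maximally chunked document.
fn worst_case_unique_docs(k: usize, chunks_per_doc: usize) -> usize {
    k / chunks_per_doc.max(1)
}

fn main() {
    let limit = 20;
    let k = limit * 8; // the static over-fetch: 160 rows
    // At 5 chunks per doc there is still headroom: 160 / 5 = 32 unique docs.
    println!("{}", worst_case_unique_docs(k, 5));
    // At 8 chunks per doc the budget is exactly exhausted: 160 / 8 = 20.
    println!("{}", worst_case_unique_docs(k, 8));
}
```

The headroom shrinks as chunks per document grow, which is what motivates the adaptive multiplier in Change 5.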
## Decision Record

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |

---
## Changes

### Change 1: Wire `--full` flag through to embed

**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)

**main.rs `handle_embed`** (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at the top of main.rs from `run_embed` to `run_embed_full`.

**sync.rs `run_sync`** (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.

**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`) so all callers go through `run_embed_full` directly, then rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`. After the rename, the call sites above use the name `run_embed` again, now with the explicit `full` argument.

Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```

---
|
||||
|
||||
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);

-- Or, if the project uses the INSERT pattern instead:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check the existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency, and keep only one of the two `schema_version` statements above.

---
|
||||
|
||||
### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```

**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
       attempt_count = embedding_metadata.attempt_count + 1,
       last_error = ?8,
       last_attempt_at = ?7,
       chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```

---
|
||||
|
||||
### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).

```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
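The Case 3 condition can be mirrored as a small pure-Rust predicate, handy for unit-testing the rule in isolation. This is a sketch; the function name is illustrative, not from the codebase:

```rust
/// A stored chunk config needs re-embedding when it predates the
/// migration (NULL / None) or differs from the current constant.
fn chunk_config_stale(stored: Option<i64>, current: i64) -> bool {
    match stored {
        None => true,            // pre-migration row: unknown config
        Some(v) => v != current, // config drift
    }
}

fn main() {
    let current = 6000; // illustrative CHUNK_MAX_BYTES
    assert!(chunk_config_stale(None, current));         // NULL -> stale
    assert!(chunk_config_stale(Some(32_000), current)); // old 32KB chunks -> stale
    assert!(!chunk_config_stale(Some(6000), current));  // matches -> fresh
    println!("ok");
}
```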
---

### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual maximum chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with a floor of 8x.
    // The 1.5x safety margin handles clustering in KNN results.
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

**Why `max_chunks * 1.5` with a floor of 8**:
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- The floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond
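Worked numbers for the formula, assuming the default limit of 20 (a standalone sketch; `dedup_k` is an illustrative name, not a function in the codebase):

```rust
/// Compute the KNN over-fetch size from the observed max chunks per doc.
fn dedup_k(limit: usize, max_chunks: usize) -> usize {
    let multiplier = (max_chunks * 3 / 2).max(8); // 1.5x margin, floor of 8
    limit * multiplier
}

fn main() {
    // Single-chunk corpus: the floor kicks in, k = 20 * 8 = 160.
    println!("{}", dedup_k(20, 1));
    // 6 chunks per doc: 6 * 3 / 2 = 9, so k = 20 * 9 = 180.
    println!("{}", dedup_k(20, 6));
    // Large docs (40 chunks): 40 * 3 / 2 = 60, so k = 20 * 60 = 1200.
    println!("{}", dedup_k(20, 40));
}
```

Note that integer arithmetic (`* 3 / 2`) rounds the 1.5x margin down, which the floor of 8 absorbs for small chunk counts.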
---

### Change 6: Update chunk_ids.rs comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
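For reference, the packing that multiplier implies can be sketched as a pair of pure functions. This is the assumed encoding (document_id times the multiplier, plus chunk_index); verify against the actual helpers in `chunk_ids.rs` before relying on it:

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Pack (document_id, chunk_index) into one rowid. Requires
/// chunk_index < 1000, which a 2MB doc at 6KB chunks (~333) satisfies.
fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index));
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

/// Recover (document_id, chunk_index) from a packed rowid.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    let rowid = encode_rowid(42, 333);
    println!("{}", rowid); // 42333
    assert_eq!(decode_rowid(rowid), (42, 333));
}
```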
---

## Files Modified (Summary)

| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification

1. **Compile check**: `cargo build` -- no errors.
2. **Unit tests**: `cargo test` -- all existing tests pass.
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly.
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (the embedded count should reset, then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without `--full`) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`.
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with the actual chunk distribution.
## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

**Downstream consequences identified:**
1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased ~5x -- fixed with the adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for a local model

**Rejected alternatives:**
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over the adaptive multiplier. sqlite-vec doesn't support OFFSET in KNN queries, so the second phase would require a full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates the principle of least surprise.