# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring
**Status:** Proposed
**Date:** 2026-02-02
**Context:** Reduced `CHUNK_MAX_BYTES` from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.
## Problem Statement
Three issues stem from the chunk size reduction:
- **Broken `--full` wiring:** `handle_embed` in `main.rs` ignores `args.full` (calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for `retry_failed` and never passes `options.full` to embed. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.
- **Mixed chunk sizes in vector space:** Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- different-granularity vectors in the same KNN space degrade search quality. No mechanism detects this drift.
- **Static dedup multiplier:** `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to actual data.
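To make the dedup-exhaustion risk concrete, here is a self-contained back-of-envelope sketch. This is not gitlore code and the chunk counts are illustrative; it models the worst case where each document in the KNN result occupies all of its chunk slots before the next document appears, and compares the static 8x budget against a multiplier sized from the actual chunk maximum (the adaptive approach proposed below).

```rust
// Worst-case number of unique documents surviving dedup, assuming each
// document in the KNN result occupies `chunks_per_doc` consecutive slots.
// All numbers here are illustrative, not measured from a real database.
fn worst_case_unique_docs(limit: usize, multiplier: usize, chunks_per_doc: usize) -> usize {
    let k = limit * multiplier; // over-fetched KNN hits
    k.div_ceil(chunks_per_doc) // each document can consume up to `chunks_per_doc` hits
}

fn main() {
    // Old config (32KB chunks, mostly 1 chunk per document): 8x is ample.
    assert_eq!(worst_case_unique_docs(20, 8, 1), 160);
    // New config: a ~300KB document at 6KB chunks has ~50 chunks, so the
    // static 8x budget (160 hits) can collapse to only 4 unique documents.
    assert_eq!(worst_case_unique_docs(20, 8, 50), 4);
    // Sizing the multiplier from the actual max keeps `limit` reachable.
    assert!(worst_case_unique_docs(20, (50 * 3 / 2).max(8), 50) >= 20);
}
```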
## Decision Record

| Decision | Choice | Rationale |
|---|---|---|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes `full` to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config", treated as needing re-embed. |
## Changes
### Change 1: Wire `--full` flag through to embed
**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)
**`main.rs` `handle_embed` (line 1116):**

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at the top of `main.rs` from `run_embed` to `run_embed_full`.
**`sync.rs` `run_sync` (line 105):**

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.
**Cleanup `embed.rs`:** Remove `run_embed` (the wrapper that hardcodes `full: false`). All callers should use `run_embed_full` directly. Rename `run_embed_full` to `run_embed` with the three-argument signature `(config, full, retry_failed)`.
Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`
```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);

-- Or, if the schema uses an INSERT pattern:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```
Check the existing migration pattern in `src/core/db.rs` for how migrations are applied, and follow that exact pattern for consistency.
### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding` (lines 238-266):** Add `chunk_max_bytes` to the INSERT:
```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update the SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```
**`record_embedding_error` (lines 269-291):** Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
         attempt_count = embedding_metadata.attempt_count + 1,
         last_error = ?8,
         last_attempt_at = ?7,
         chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```
### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).
```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;
    Ok(rows)
}
```
Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
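As a hedged sketch only -- the real `count_pending_documents` body may differ, and the drift bind is shown here as `?1` since the counting query may not take the pagination parameters -- the SQL would mirror the three-way condition above, with the placeholder bound to `CHUNK_MAX_BYTES`:

```sql
SELECT COUNT(*)
FROM documents d
WHERE NOT EXISTS (
        SELECT 1 FROM embedding_metadata em
        WHERE em.document_id = d.id AND em.chunk_index = 0
    )
    OR EXISTS (
        SELECT 1 FROM embedding_metadata em
        WHERE em.document_id = d.id AND em.chunk_index = 0
          AND em.document_hash != d.content_hash
    )
    -- Third clause: chunk config drift (bind CHUNK_MAX_BYTES here)
    OR EXISTS (
        SELECT 1 FROM embedding_metadata em
        WHERE em.document_id = d.id AND em.chunk_index = 0
          AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?1)
    )
```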
### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual maximum chunks-per-document in the database.
```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with a floor of 8x.
    // The 1.5x safety margin handles clustering in KNN results.
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```
Why `max_chunks * 1.5` with a floor of 8:

- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- A floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond
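A few worked data points for the formula (the chunk counts below are hypothetical examples, not measured values):

```rust
// multiplier = max(floor(max_chunks * 3 / 2), 8); k = limit * multiplier.
fn multiplier(max_chunks: usize) -> usize {
    (max_chunks * 3 / 2).max(8)
}

fn main() {
    assert_eq!(multiplier(1), 8);   // single-chunk docs: floor keeps the old 8x
    assert_eq!(multiplier(6), 9);   // typical 6KB-chunk data: barely above the floor
    assert_eq!(multiplier(50), 75); // a ~300KB document: 75x, so k = 20 * 75 = 1500
    assert_eq!(20 * multiplier(50), 1500);
}
```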
### Change 6: Update `chunk_ids.rs` comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:
```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
## Files Modified (Summary)

| File | Change |
|---|---|
| `migrations/010_chunk_config.sql` | NEW -- add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (third OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to a single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed; update import |
| `src/main.rs` | Call `run_embed` with `args.full`; update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification

- **Compile check:** `cargo build` -- no errors
- **Unit tests:** `cargo test` -- all existing tests pass
- **Migration test:** run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
- **Full flag wiring:** `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (the embedded count should reset, then rebuild).
- **Chunk config drift:** after migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without `--full`) should detect all existing embeddings as stale and re-embed them automatically.
- **Sync propagation:** `lore sync --full` should produce the same embed behavior as `lore embed --full`
- **Adaptive dedup:** run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with the actual chunk distribution.
## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced `CHUNK_MAX_BYTES` from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).
Downstream consequences identified:
- Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
- Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
- Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
- Mixed chunk sizes degrade search -- fixed with config drift detection
- Ollama API call volume increases proportionally -- acceptable for local model
Rejected alternatives:
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over adaptive. sqlite-vec doesn't support OFFSET in KNN queries, requiring full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates the principle of least surprise.