6 Commits

Author SHA1 Message Date
Taylor Eernisse
549a0646d7 chore: Add test-runner agent, agent-swarm-launcher skill, review artifacts, and beads updates
- .claude/agents/test-runner.md: New Claude Code agent definition for
  running cargo test suites and analyzing results, configured with
  haiku model for fast execution.

- skills/agent-swarm-launcher/: New skill for bootstrapping coordinated
  multi-agent workflows with AGENTS.md reconnaissance, Agent Mail
  coordination, and beads task tracking.

- api-review.html, phase-a-review.html: Self-contained HTML review
  artifacts for API audit and Phase A search pipeline review.

- .beads/issues.jsonl, .beads/last-touched: Updated issue tracker
  state reflecting current project work items.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:36:05 -05:00
Taylor Eernisse
a417640faa docs: Overhaul AGENTS.md, update README, add pipeline spec and Phase B plan
AGENTS.md: Comprehensive rewrite adding file deletion safeguards,
destructive git command protocol, Rust toolchain conventions, code
editing discipline rules, compiler check requirements, TDD mandate,
MCP Agent Mail coordination protocol, beads/bv/ubs/ast-grep/cass
tool documentation, and session completion workflow.

README.md: Document NO_COLOR/CLICOLOR env vars, --since 1m duration,
project resolution cascading match logic, lore health and robot-docs
commands, exit codes 17 (not found) and 18 (ambiguous match),
--color/--quiet global flags, dirty_sources and
pending_discussion_fetches tables, and version command git hash output.
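The `--since 1m` duration documented above can be sketched as a suffix parser. This is a hypothetical, std-only illustration (function name, month approximation, and supported suffixes are assumptions, not gitlore's actual parser):

```rust
use std::time::Duration;

// Hypothetical sketch of duration-suffix parsing for flags like
// `--since 7d`, `--since 2w`, `--since 1m`. Date forms such as
// `2024-01-01` fall through to None here and would be handled separately.
fn parse_since(s: &str) -> Option<Duration> {
    if s.is_empty() || !s.is_ascii() {
        return None;
    }
    let (num, suffix) = s.split_at(s.len() - 1);
    let n: u64 = num.parse().ok()?;
    let secs = match suffix {
        "d" => n * 86_400,
        "w" => n * 7 * 86_400,
        "m" => n * 30 * 86_400, // "1m" = one month, approximated as 30 days
        _ => return None,
    };
    Some(Duration::from_secs(secs))
}

fn main() {
    assert_eq!(parse_since("7d"), Some(Duration::from_secs(604_800)));
    assert_eq!(parse_since("1m"), Some(Duration::from_secs(2_592_000)));
    assert!(parse_since("2024-01-01").is_none());
    println!("duration suffixes parsed");
}
```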

docs/embedding-pipeline-hardening.md: Detailed spec covering the three
problems from the chunk size reduction (broken --full wiring, mixed
chunk sizes in vector space, static dedup multiplier) with decision
records, implementation plan, and acceptance criteria.

docs/phase-b-temporal-intelligence.md: Draft planning document for
transforming gitlore from a search engine into a temporal code
intelligence system by ingesting structured event data from GitLab.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:51 -05:00
Taylor Eernisse
f560e6bc00 test(embedding): Add regression tests for pipeline hardening bugs
Three targeted regression tests covering bugs fixed in the embedding
pipeline hardening:

- overflow_doc_with_error_sentinel_not_re_detected_as_pending: verifies
  that documents skipped for producing too many chunks have their
  sentinel error recorded in embedding_metadata and are NOT returned by
  find_pending_documents or count_pending_documents on subsequent runs
  (prevents infinite re-processing loop).

- count_and_find_pending_agree: exercises four states (empty DB, new
  document, fully-embedded document, config-drifted document) and
  asserts that count_pending_documents and find_pending_documents
  produce consistent results across all of them.

- full_embed_delete_is_atomic: confirms the --full flag's two DELETE
  statements (embedding_metadata + embeddings) execute atomically
  within a transaction.

Also updates test DB creation to apply migration 010.
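The agreement property behind count_and_find_pending_agree can be modeled without a database. The sketch below is a hypothetical in-memory stand-in (struct and field names are assumptions): the point is that counting and listing share one predicate, so they cannot disagree.

```rust
// Hypothetical stand-in for the pending-document queries. The real code
// runs SQL against embedding_metadata; here a shared predicate makes the
// count/find invariant explicit.
struct DocMeta {
    embedded: bool,
    chunk_max_bytes: u32,
    model: &'static str,
}

struct Config {
    chunk_max_bytes: u32,
    model: &'static str,
}

// Pending = never embedded, or embedded under drifted config.
fn is_pending(doc: &DocMeta, cfg: &Config) -> bool {
    !doc.embedded || doc.chunk_max_bytes != cfg.chunk_max_bytes || doc.model != cfg.model
}

fn find_pending<'a>(docs: &'a [DocMeta], cfg: &Config) -> Vec<&'a DocMeta> {
    docs.iter().filter(|d| is_pending(d, cfg)).collect()
}

fn count_pending(docs: &[DocMeta], cfg: &Config) -> usize {
    docs.iter().filter(|d| is_pending(d, cfg)).count()
}

fn main() {
    let cfg = Config { chunk_max_bytes: 6144, model: "nomic-embed-text" };
    let docs = vec![
        DocMeta { embedded: false, chunk_max_bytes: 0, model: "" },                   // new
        DocMeta { embedded: true, chunk_max_bytes: 6144, model: "nomic-embed-text" }, // current
        DocMeta { embedded: true, chunk_max_bytes: 32768, model: "nomic-embed-text" }, // drifted
    ];
    // The invariant under test: both entry points agree in every state.
    assert_eq!(count_pending(&docs, &cfg), find_pending(&docs, &cfg).len());
    assert_eq!(count_pending(&docs, &cfg), 2);
    println!("count and find agree");
}
```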

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:34 -05:00
Taylor Eernisse
aebbe6b795 feat(cli): Wire --full flag for embed, add sync stage spinners
- Add --full / --no-full flag pair to EmbedArgs with overrides_with
  semantics matching the existing flag pattern. When active, atomically
  DELETEs all embedding_metadata and embeddings before re-embedding.

- Thread the full flag through run_embed -> run_sync so that
  'lore sync --full' triggers a complete re-embed alongside the full
  re-ingest it already performs.

- Add indicatif spinners to sync stages with dynamic stage numbering
  that adjusts when --no-docs or --no-embed skip stages. Spinners are
  hidden in robot mode.

- Update robot-docs manifest to advertise the new --full flag on the
  embed command.

- Replace hardcoded schema version 9 in health check with the
  LATEST_SCHEMA_VERSION constant from db.rs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:22 -05:00
Taylor Eernisse
7d07f95d4c fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:

- Config drift detection: find_pending_documents and
  count_pending_documents now take model_name and compare
  chunk_max_bytes, model, and dims against stored metadata. Documents
  embedded with stale config are automatically re-queued.

- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
  are skipped with a sentinel error recorded in embedding_metadata,
  preventing both rowid collision and infinite re-processing loops.

- Deferred clearing: old embeddings are no longer cleared before
  attempting new ones. clear_document_embeddings is deferred until the
  first successful chunk embedding, so if all chunks fail the document
  retains its previous embeddings rather than losing all data.

- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
  so a crash mid-page rolls back atomically instead of leaving partial
  state (cleared embeddings with no replacements).

- Per-chunk retry on context overflow: when a batch fails with a
  context-length error, each chunk is retried individually so one
  oversized chunk doesn't poison the entire batch.

- Adaptive dedup in vector search: replaces the static 3x over-fetch
  multiplier with a dynamic one based on actual max chunks per document
  (using the new chunk_count column with a fallback COUNT query for
  pre-migration data). Also replaces partial_cmp with total_cmp for
  f64 distance sorting.

- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
  embedding_metadata to support config drift detection and adaptive
  dedup without runtime queries.
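The partial_cmp-to-total_cmp swap mentioned above can be shown with a std-only sketch of why it matters for f64 distance sorting:

```rust
fn main() {
    // With partial_cmp, sorting f64 typically looks like
    // `sort_by(|a, b| a.partial_cmp(b).unwrap())`, which panics if any
    // distance is NaN, because partial_cmp returns None for NaN.
    // total_cmp implements the IEEE 754 totalOrder predicate, so the
    // comparator is total and sorting never panics.
    let mut distances: Vec<f64> = vec![0.42, f64::NAN, 0.07, 0.99];
    distances.sort_by(|a, b| a.total_cmp(b));

    // Under total_cmp, positive NaN orders after every ordinary value.
    assert_eq!(distances[0], 0.07);
    assert!(distances[3].is_nan());
    println!("{:?}", &distances[..3]);
}
```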

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:08 -05:00
Taylor Eernisse
2a52594a60 feat(db): Add migration 010 for chunk config tracking columns
Add chunk_max_bytes and chunk_count columns to embedding_metadata to
support config drift detection and adaptive dedup sizing. Includes a
partial index on sentinel rows (chunk_index=0) to accelerate the drift
detection and max-chunk queries.

Also exports LATEST_SCHEMA_VERSION as a public constant derived from
the MIGRATIONS array length, replacing the previously hardcoded magic
number in the health check.
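Deriving the constant from the migrations list can be sketched as follows; the migration names here are illustrative placeholders, not gitlore's actual files:

```rust
// Sketch of deriving the schema version from the migrations array so the
// health check cannot drift from reality. Entries 003-009 are elided for
// brevity; in the real code the array holds every migration.
const MIGRATIONS: &[&str] = &[
    "001_initial.sql",
    "002_projects.sql",
    "010_chunk_config_tracking.sql",
];

// One derived constant replaces the previously hardcoded magic number.
pub const LATEST_SCHEMA_VERSION: i64 = MIGRATIONS.len() as i64;

fn main() {
    // Adding a migration automatically bumps the reported version.
    assert_eq!(LATEST_SCHEMA_VERSION as usize, MIGRATIONS.len());
    println!("schema version tracks migrations array");
}
```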

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:34:48 -05:00
23 changed files with 5442 additions and 85 deletions

File diff suppressed because one or more lines are too long


@@ -1 +1 @@
bd-35o
bd-1j1


@@ -0,0 +1,59 @@
---
name: test-runner
description: "Use this agent when unit tests need to be run and results analyzed. This includes after writing or modifying code, before committing changes, or when explicitly asked to verify test status.\\n\\nExamples:\\n\\n- User: \"Please refactor the parse_session function to handle edge cases\"\\n Assistant: \"Here is the refactored function with edge case handling: ...\"\\n [code changes applied]\\n Since a significant piece of code was modified, use the Task tool to launch the test-runner agent to verify nothing is broken.\\n Assistant: \"Now let me run the test suite to make sure everything still passes.\"\\n\\n- User: \"Do all tests pass?\"\\n Assistant: \"Let me use the Task tool to launch the test-runner agent to check the current test status.\"\\n\\n- User: \"I just finished implementing the search feature\"\\n Assistant: \"Let me use the Task tool to launch the test-runner agent to validate the implementation.\"\\n\\n- After any logical chunk of code is written or modified, proactively use the Task tool to launch the test-runner agent to run the tests before reporting completion to the user."
tools: Bash
model: haiku
color: orange
---
You are an expert test execution and analysis engineer. Your sole responsibility is to run the project's unit test suite, interpret the results with precision, and deliver a clear, actionable summary.
## Execution Protocol
1. **Discover the test framework**: Examine the project structure to determine how tests are run:
- Look for `Cargo.toml` (Rust: `cargo test`)
- If unclear, check README or CLAUDE.md for test instructions
2. **Run the tests**: Execute the appropriate test command. Capture full output including stdout and stderr. Do NOT run tests interactively or with watch mode. Use flags that produce verbose or detailed output when available (e.g., `cargo test -- --nocapture`, `jest --verbose`).
3. **Analyze results**: Parse the test output carefully and categorize:
- Total tests run
- Tests passed
- Tests failed (with details)
- Tests skipped/ignored
- Compilation errors (if tests couldn't even run)
4. **Report findings**:
**If ALL tests pass:**
Provide a concise success summary:
- Total test count and pass count
- Execution time if available
- Note any skipped/ignored tests and why (if apparent)
- A clear statement: "All tests passed."
**If ANY tests fail:**
Provide a detailed failure report:
- List each failing test by its full name/path
- Include the assertion error or panic message for each failure
- Include relevant expected vs actual values
- Note the file and line number where the failure occurred (if available)
- Group failures by module/file if there are many
- Suggest likely root causes when the error messages make it apparent
- Note if failures appear related (e.g., same underlying issue)
**If tests cannot run (compilation/setup error):**
- Report the exact error preventing test execution
- Identify the file and line causing the issue
- Distinguish between test code errors and source code errors
## Rules
- NEVER modify any source code or test code. You are read-only except for running the test command.
- NEVER skip running tests and guess at results. Always execute the actual test command.
- NEVER run the full application or any destructive commands. Only run test commands.
- If the test suite is extremely large, run it fully anyway. Do not truncate or sample.
- If multiple test targets exist (unit, integration, e2e), run unit tests only unless instructed otherwise.
- Report raw numbers. Do not round or approximate test counts.
- If tests produce warnings (not failures), mention them briefly but clearly separate them from failures.
- Keep the summary structured and scannable. Use bullet points and clear headers.

AGENTS.md

@@ -1,6 +1,570 @@
# AGENTS.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## RULE 0 - THE FUNDAMENTAL OVERRIDE PREROGATIVE
If I tell you to do something, even if it goes against what follows below, YOU MUST LISTEN TO ME. I AM IN CHARGE, NOT YOU.
---
## RULE NUMBER 1: NO FILE DELETION
**YOU ARE NEVER ALLOWED TO DELETE A FILE WITHOUT EXPRESS PERMISSION.** Even a new file that you yourself created, such as a test code file. You have a horrible track record of deleting critically important files or otherwise throwing away tons of expensive work. As a result, you have permanently lost any and all rights to determine that a file or folder should be deleted.
**YOU MUST ALWAYS ASK AND RECEIVE CLEAR, WRITTEN PERMISSION BEFORE EVER DELETING A FILE OR FOLDER OF ANY KIND.**
---
## Irreversible Git & Filesystem Actions — DO NOT EVER BREAK GLASS
> **Note:** Treat destructive commands as break-glass. If there's any doubt, stop and ask.
1. **Absolutely forbidden commands:** `git reset --hard`, `git clean -fd`, `rm -rf`, or any command that can delete or overwrite code/data must never be run unless the user explicitly provides the exact command and states, in the same message, that they understand and want the irreversible consequences.
2. **No guessing:** If there is any uncertainty about what a command might delete or overwrite, stop immediately and ask the user for specific approval. "I think it's safe" is never acceptable.
3. **Safer alternatives first:** When cleanup or rollbacks are needed, request permission to use non-destructive options (`git status`, `git diff`, `git stash`, copying to backups) before ever considering a destructive command.
4. **Mandatory explicit plan:** Even after explicit user authorization, restate the command verbatim, list exactly what will be affected, and wait for a confirmation that your understanding is correct. Only then may you execute it—if anything remains ambiguous, refuse and escalate.
5. **Document the confirmation:** When running any approved destructive command, record (in the session notes / final response) the exact user text that authorized it, the command actually run, and the execution time. If that record is absent, the operation did not happen.
---
## Toolchain: Rust & Cargo
We only use **Cargo** in this project, NEVER any other package manager.
- **Edition/toolchain:** Follow `rust-toolchain.toml` (if present). Do not assume stable vs nightly.
- **Dependencies:** Explicit versions for stability; keep the set minimal.
- **Configuration:** Cargo.toml only
- **Unsafe code:** Forbidden (`#![forbid(unsafe_code)]`)
### Release Profile
Use the release profile defined in `Cargo.toml`. If you need to change it, justify the
performance/size tradeoff and how it impacts determinism and cancellation behavior.
---
## Code Editing Discipline
### No Script-Based Changes
**NEVER** run a script that processes/changes code files in this repo. Brittle regex-based transformations create far more problems than they solve.
- **Always make code changes manually**, even when there are many instances
- For many simple changes: use parallel subagents
- For subtle/complex changes: do them methodically yourself
### No File Proliferation
If you want to change something or add a feature, **revise existing code files in place**.
**NEVER** create variations like:
- `mainV2.rs`
- `main_improved.rs`
- `main_enhanced.rs`
New files are reserved for **genuinely new functionality** that makes zero sense to include in any existing file. The bar for creating new files is **incredibly high**.
---
## Backwards Compatibility
We do not care about backwards compatibility—we're in early development with no users. We want to do things the **RIGHT** way with **NO TECH DEBT**.
- Never create "compatibility shims"
- Never create wrapper functions for deprecated APIs
- Just fix the code directly
---
## Compiler Checks (CRITICAL)
**After any substantive code changes, you MUST verify no errors were introduced:**
```bash
# Check for compiler errors and warnings
cargo check --all-targets
# Check for clippy lints (pedantic + nursery are enabled)
cargo clippy --all-targets -- -D warnings
# Verify formatting
cargo fmt --check
```
If you see errors, **carefully understand and resolve each issue**. Read sufficient context to fix them the RIGHT way.
---
## Testing
### Unit & Property Tests
```bash
# Run all tests
cargo test
# Run with output
cargo test -- --nocapture
```
When adding or changing primitives, add tests that assert the core invariants:
- no task leaks
- no obligation leaks
- losers are drained after races
- region close implies quiescence
Prefer deterministic lab-runtime tests for concurrency-sensitive behavior.
---
## MCP Agent Mail — Multi-Agent Coordination
A mail-like layer that lets coding agents coordinate asynchronously via MCP tools and resources. Provides identities, inbox/outbox, searchable threads, and advisory file reservations with human-auditable artifacts in Git.
### Why It's Useful
- **Prevents conflicts:** Explicit file reservations (leases) for files/globs
- **Token-efficient:** Messages stored in per-project archive, not in context
- **Quick reads:** `resource://inbox/...`, `resource://thread/...`
### Same Repository Workflow
1. **Register identity:**
```
ensure_project(project_key=<abs-path>)
register_agent(project_key, program, model)
```
2. **Reserve files before editing:**
```
file_reservation_paths(project_key, agent_name, ["src/**"], ttl_seconds=3600, exclusive=true)
```
3. **Communicate with threads:**
```
send_message(..., thread_id="FEAT-123")
fetch_inbox(project_key, agent_name)
acknowledge_message(project_key, agent_name, message_id)
```
4. **Quick reads:**
```
resource://inbox/{Agent}?project=<abs-path>&limit=20
resource://thread/{id}?project=<abs-path>&include_bodies=true
```
### Macros vs Granular Tools
- **Prefer macros for speed:** `macro_start_session`, `macro_prepare_thread`, `macro_file_reservation_cycle`, `macro_contact_handshake`
- **Use granular tools for control:** `register_agent`, `file_reservation_paths`, `send_message`, `fetch_inbox`, `acknowledge_message`
### Common Pitfalls
- `"from_agent not registered"`: Always `register_agent` in the correct `project_key` first
- `"FILE_RESERVATION_CONFLICT"`: Adjust patterns, wait for expiry, or use non-exclusive reservation
- **Auth errors:** If JWT+JWKS enabled, include bearer token with matching `kid`
---
## Beads (br) — Dependency-Aware Issue Tracking
Beads provides a lightweight, dependency-aware issue database and CLI (`br` / beads_rust) for selecting "ready work," setting priorities, and tracking status. It complements MCP Agent Mail's messaging and file reservations.
**Note:** `br` is non-invasive—it never executes git commands directly. You must run git commands manually after `br sync --flush-only`.
### Conventions
- **Single source of truth:** Beads for task status/priority/dependencies; Agent Mail for conversation and audit
- **Shared identifiers:** Use Beads issue ID (e.g., `br-123`) as Mail `thread_id` and prefix subjects with `[br-123]`
- **Reservations:** When starting a task, call `file_reservation_paths()` with the issue ID in `reason`
### Typical Agent Flow
1. **Pick ready work (Beads):**
```bash
br ready --json # Choose highest priority, no blockers
```
2. **Reserve edit surface (Mail):**
```
file_reservation_paths(project_key, agent_name, ["src/**"], ttl_seconds=3600, exclusive=true, reason="br-123")
```
3. **Announce start (Mail):**
```
send_message(..., thread_id="br-123", subject="[br-123] Start: <title>", ack_required=true)
```
4. **Work and update:** Reply in-thread with progress
5. **Complete and release:**
```bash
br close br-123 --reason "Completed"
```
```
release_file_reservations(project_key, agent_name, paths=["src/**"])
```
Final Mail reply: `[br-123] Completed` with summary
### Mapping Cheat Sheet
| Concept | Value |
|---------|-------|
| Mail `thread_id` | `br-###` |
| Mail subject | `[br-###] ...` |
| File reservation `reason` | `br-###` |
| Commit messages | Include `br-###` for traceability |
---
## bv — Graph-Aware Triage Engine
bv is a graph-aware triage engine for Beads projects (`.beads/beads.jsonl`). It computes PageRank, betweenness, critical path, cycles, HITS, eigenvector, and k-core metrics deterministically.
**Scope boundary:** bv handles *what to work on* (triage, priority, planning). For agent-to-agent coordination (messaging, work claiming, file reservations), use MCP Agent Mail.
**CRITICAL: Use ONLY `--robot-*` flags. Bare `bv` launches an interactive TUI that blocks your session.**
### The Workflow: Start With Triage
**`bv --robot-triage` is your single entry point.** It returns:
- `quick_ref`: at-a-glance counts + top 3 picks
- `recommendations`: ranked actionable items with scores, reasons, unblock info
- `quick_wins`: low-effort high-impact items
- `blockers_to_clear`: items that unblock the most downstream work
- `project_health`: status/type/priority distributions, graph metrics
- `commands`: copy-paste shell commands for next steps
```bash
bv --robot-triage # THE MEGA-COMMAND: start here
bv --robot-next # Minimal: just the single top pick + claim command
```
### Command Reference
**Planning:**
| Command | Returns |
|---------|---------|
| `--robot-plan` | Parallel execution tracks with `unblocks` lists |
| `--robot-priority` | Priority misalignment detection with confidence |
**Graph Analysis:**
| Command | Returns |
|---------|---------|
| `--robot-insights` | Full metrics: PageRank, betweenness, HITS, eigenvector, critical path, cycles, k-core, articulation points, slack |
| `--robot-label-health` | Per-label health: `health_level`, `velocity_score`, `staleness`, `blocked_count` |
| `--robot-label-flow` | Cross-label dependency: `flow_matrix`, `dependencies`, `bottleneck_labels` |
| `--robot-label-attention [--attention-limit=N]` | Attention-ranked labels |
**History & Change Tracking:**
| Command | Returns |
|---------|---------|
| `--robot-history` | Bead-to-commit correlations |
| `--robot-diff --diff-since <ref>` | Changes since ref: new/closed/modified issues, cycles |
**Other:**
| Command | Returns |
|---------|---------|
| `--robot-burndown <sprint>` | Sprint burndown, scope changes, at-risk items |
| `--robot-forecast <id\|all>` | ETA predictions with dependency-aware scheduling |
| `--robot-alerts` | Stale issues, blocking cascades, priority mismatches |
| `--robot-suggest` | Hygiene: duplicates, missing deps, label suggestions |
| `--robot-graph [--graph-format=json\|dot\|mermaid]` | Dependency graph export |
| `--export-graph <file.html>` | Interactive HTML visualization |
### Scoping & Filtering
```bash
bv --robot-plan --label backend # Scope to label's subgraph
bv --robot-insights --as-of HEAD~30 # Historical point-in-time
bv --recipe actionable --robot-plan # Pre-filter: ready to work
bv --recipe high-impact --robot-triage # Pre-filter: top PageRank
bv --robot-triage --robot-triage-by-track # Group by parallel work streams
bv --robot-triage --robot-triage-by-label # Group by domain
```
### Understanding Robot Output
**All robot JSON includes:**
- `data_hash` — Fingerprint of source beads.jsonl
- `status` — Per-metric state: `computed|approx|timeout|skipped` + elapsed ms
- `as_of` / `as_of_commit` — Present when using `--as-of`
**Two-phase analysis:**
- **Phase 1 (instant):** degree, topo sort, density
- **Phase 2 (async, 500ms timeout):** PageRank, betweenness, HITS, eigenvector, cycles
### jq Quick Reference
```bash
bv --robot-triage | jq '.quick_ref' # At-a-glance summary
bv --robot-triage | jq '.recommendations[0]' # Top recommendation
bv --robot-plan | jq '.plan.summary.highest_impact' # Best unblock target
bv --robot-insights | jq '.status' # Check metric readiness
bv --robot-insights | jq '.Cycles' # Circular deps (must fix!)
```
---
## UBS — Ultimate Bug Scanner
**Golden Rule:** `ubs <changed-files>` before every commit. Exit 0 = safe. Exit >0 = fix & re-run.
### Commands
```bash
ubs file.rs file2.rs # Specific files (< 1s) — USE THIS
ubs $(git diff --name-only --cached) # Staged files — before commit
ubs --only=rust,toml src/ # Language filter (3-5x faster)
ubs --ci --fail-on-warning . # CI mode — before PR
ubs . # Whole project (ignores target/, Cargo.lock)
```
### Output Format
```
⚠️ Category (N errors)
file.rs:42:5 Issue description
💡 Suggested fix
Exit code: 1
```
Parse: `file:line:col` → location | 💡 → how to fix | Exit 0/1 → pass/fail
### Fix Workflow
1. Read finding → category + fix suggestion
2. Navigate `file:line:col` → view context
3. Verify real issue (not false positive)
4. Fix root cause (not symptom)
5. Re-run `ubs <file>` → exit 0
6. Commit
### Bug Severity
- **Critical (always fix):** Memory safety, use-after-free, data races, SQL injection
- **Important (production):** Unwrap panics, resource leaks, overflow checks
- **Contextual (judgment):** TODO/FIXME, println! debugging
---
## ast-grep vs ripgrep
**Use `ast-grep` when structure matters.** It parses code and matches AST nodes, ignoring comments/strings, and can **safely rewrite** code.
- Refactors/codemods: rename APIs, change import forms
- Policy checks: enforce patterns across a repo
- Editor/automation: LSP mode, `--json` output
**Use `ripgrep` when text is enough.** Fastest way to grep literals/regex.
- Recon: find strings, TODOs, log lines, config values
- Pre-filter: narrow candidate files before ast-grep
### Rule of Thumb
- Need correctness or **applying changes** → `ast-grep`
- Need raw speed or **hunting text** → `rg`
- Often combine: `rg` to shortlist files, then `ast-grep` to match/modify
### Rust Examples
```bash
# Find structured code (ignores comments)
ast-grep run -l Rust -p 'fn $NAME($$$ARGS) -> $RET { $$$BODY }'
# Find all unwrap() calls
ast-grep run -l Rust -p '$EXPR.unwrap()'
# Quick textual hunt
rg -n 'println!' -t rust
# Combine speed + precision
rg -l -t rust 'unwrap\(' | xargs ast-grep run -l Rust -p '$X.unwrap()' --json
```
---
## Morph Warp Grep — AI-Powered Code Search
**Use `mcp__morph-mcp__warp_grep` for exploratory "how does X work?" questions.** An AI agent expands your query, greps the codebase, reads relevant files, and returns precise line ranges with full context.
**Use `ripgrep` for targeted searches.** When you know exactly what you're looking for.
**Use `ast-grep` for structural patterns.** When you need AST precision for matching/rewriting.
### When to Use What
| Scenario | Tool | Why |
|----------|------|-----|
| "How is pattern matching implemented?" | `warp_grep` | Exploratory; don't know where to start |
| "Where is the quick reject filter?" | `warp_grep` | Need to understand architecture |
| "Find all uses of `Regex::new`" | `ripgrep` | Targeted literal search |
| "Find files with `println!`" | `ripgrep` | Simple pattern |
| "Replace all `unwrap()` with `expect()`" | `ast-grep` | Structural refactor |
### warp_grep Usage
```
mcp__morph-mcp__warp_grep(
repoPath: "/path/to/dcg",
query: "How does the safe pattern whitelist work?"
)
```
Returns structured results with file paths, line ranges, and extracted code snippets.
### Anti-Patterns
- **Don't** use `warp_grep` to find a specific function name → use `ripgrep`
- **Don't** use `ripgrep` to understand "how does X work" → wastes time with manual reads
- **Don't** use `ripgrep` for codemods → risks collateral edits
<!-- bv-agent-instructions-v1 -->
---
## Beads Workflow Integration
This project uses [beads_viewer](https://github.com/Dicklesworthstone/beads_viewer) for issue tracking. Issues are stored in `.beads/` and tracked in git.
**Note:** `br` is non-invasive—it never executes git commands directly. You must run git commands manually after `br sync --flush-only`.
### Essential Commands
```bash
# View issues (launches TUI - avoid in automated sessions)
bv
# CLI commands for agents (use these instead)
br ready # Show issues ready to work (no blockers)
br list --status=open # All open issues
br show <id> # Full issue details with dependencies
br create --title="..." --type=task --priority=2
br update <id> --status=in_progress
br close <id> --reason="Completed"
br close <id1> <id2> # Close multiple issues at once
br sync --flush-only # Export to JSONL (then manually: git add .beads/ && git commit)
```
### Workflow Pattern
1. **Start**: Run `br ready` to find actionable work
2. **Claim**: Use `br update <id> --status=in_progress`
3. **Work**: Implement the task
4. **Complete**: Use `br close <id>`
5. **Sync**: Run `br sync --flush-only`, then `git add .beads/ && git commit -m "Update beads"`
### Key Concepts
- **Dependencies**: Issues can block other issues. `br ready` shows only unblocked work.
- **Priority**: P0=critical, P1=high, P2=medium, P3=low, P4=backlog (use numbers, not words)
- **Types**: task, bug, feature, epic, question, docs
- **Blocking**: `br dep add <issue> <depends-on>` to add dependencies
### Session Protocol
**Before ending any session, run this checklist:**
```bash
git status # Check what changed
git add <files> # Stage code changes
br sync --flush-only # Export beads to JSONL
git add .beads/ # Stage beads changes
git commit -m "..." # Commit code and beads
git push # Push to remote
```
### Best Practices
- Check `br ready` at session start to find available work
- Update status as you work (in_progress → closed)
- Create new issues with `br create` when you discover tasks
- Use descriptive titles and set appropriate priority/type
- Always run `br sync --flush-only` then commit .beads/ before ending session
<!-- end-bv-agent-instructions -->
## Landing the Plane (Session Completion)
**When ending a work session**, you MUST complete ALL steps below. Work is NOT complete until `git push` succeeds.
**MANDATORY WORKFLOW:**
1. **File issues for remaining work** - Create issues for anything that needs follow-up
2. **Run quality gates** (if code changed) - Tests, linters, builds
3. **Update issue status** - Close finished work, update in-progress items
4. **PUSH TO REMOTE** - This is MANDATORY:
```bash
git pull --rebase
br sync --flush-only
git add .beads/
git commit -m "Update beads"
git push
git status # MUST show "up to date with origin"
```
5. **Clean up** - Clear stashes, prune remote branches
6. **Verify** - All changes committed AND pushed
7. **Hand off** - Provide context for next session
**CRITICAL RULES:**
- Work is NOT complete until `git push` succeeds
- NEVER stop before pushing - that leaves work stranded locally
- NEVER say "ready to push when you are" - YOU must push
- If push fails, resolve and retry until it succeeds
---
## cass — Cross-Agent Session Search
`cass` indexes prior agent conversations (Claude Code, Codex, Cursor, Gemini, ChatGPT, etc.) so we can reuse solved problems.
**Rules:** Never run bare `cass` (TUI). Always use `--robot` or `--json`.
### Examples
```bash
cass health
cass search "async runtime" --robot --limit 5
cass view /path/to/session.jsonl -n 42 --json
cass expand /path/to/session.jsonl -n 42 -C 3 --json
cass capabilities --json
cass robot-docs guide
```
### Tips
- Use `--fields minimal` for lean output
- Filter by agent with `--agent`
- Use `--days N` to limit to recent history
- stdout is data-only; stderr is diagnostics; exit code 0 means success
- Treat cass as a way to avoid re-solving problems other agents already handled
---
## Note for Codex/GPT-5.2
You constantly bother me and stop working with concerned questions that look similar to this:
```
Unexpected changes (need guidance)
- Working tree still shows edits I did not make in Cargo.toml, Cargo.lock, src/runtime.rs, src/scope.rs. Please advise whether to keep/commit/revert these before any further work. I did not touch them.
Next steps (pick one)
1. Decide how to handle the unrelated modified files above so we can resume cleanly.
```
NEVER EVER DO THAT AGAIN. The answer is literally ALWAYS the same: those are changes created by the potentially dozens of other agents working on the project at the same time. This is not only a common occurrence, it happens multiple times PER MINUTE. The way to deal with it is simple: you NEVER, under ANY CIRCUMSTANCE, stash, revert, overwrite, or otherwise disturb in ANY way the work of other agents. Just treat those changes identically to changes that you yourself made. Just fool yourself into thinking YOU made the changes and simply don't recall it for some reason.
---
## Note on Built-in TODO Functionality
Also, if I ask you to explicitly use your built-in TODO functionality, don't complain about this and say you need to use beads. You can use built-in TODOs if I tell you specifically to do so. Always comply with such orders.
## TDD Requirements
@@ -77,6 +641,21 @@ lore --robot doctor
# Document and index statistics
lore --robot stats
# Quick health pre-flight check (exit 0 = healthy, 1 = unhealthy)
lore --robot health
# Generate searchable documents from ingested data
lore --robot generate-docs
# Generate vector embeddings via Ollama
lore --robot embed
# Agent self-discovery manifest (all commands, flags, exit codes)
lore robot-docs
# Version information
lore --robot version
```
### Response Format
@@ -114,6 +693,8 @@ Errors return structured JSON to stderr:
| 14 | Ollama unavailable |
| 15 | Ollama model not found |
| 16 | Embedding failed |
| 17 | Not found (entity does not exist) |
| 18 | Ambiguous match (use `-p` to specify project) |
| 20 | Config not found |
### Configuration Precedence
@@ -129,4 +710,8 @@ Errors return structured JSON to stderr:
- Check exit codes for error handling
- Parse JSON errors from stderr
- Use `-n` / `--limit` to control response size
- Use `-q` / `--quiet` to suppress progress bars and non-essential output
- Use `--color never` in non-TTY automation for ANSI-free output
- TTY detection handles piped commands automatically
- Use `lore --robot health` as a fast pre-flight check before queries
- The `-p` flag supports fuzzy project matching (suffix and substring)


@@ -140,6 +140,8 @@ Create a personal access token with `read_api` scope:
| `LORE_ROBOT` | Enable robot mode globally (set to `true` or `1`) | No |
| `XDG_CONFIG_HOME` | XDG Base Directory for config (fallback: `~/.config`) | No |
| `XDG_DATA_HOME` | XDG Base Directory for data (fallback: `~/.local/share`) | No |
| `NO_COLOR` | Disable color output when set (any value) | No |
| `CLICOLOR` | Standard color control (0 to disable) | No |
| `RUST_LOG` | Logging level filter (e.g., `lore=debug`) | No |
## Commands
@@ -162,6 +164,7 @@ lore issues -l bug -l urgent # Multiple labels
lore issues -m "v1.0" # By milestone title
lore issues --since 7d # Updated in last 7 days
lore issues --since 2w # Updated in last 2 weeks
lore issues --since 1m # Updated in last month
lore issues --since 2024-01-01 # Updated since date
lore issues --due-before 2024-12-31 # Due before date
lore issues --has-due # Only issues with due dates
@@ -174,6 +177,17 @@ When listing, output includes: IID, title, state, author, assignee, labels, and
When showing a single issue (e.g., `lore issues 123`), output includes: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions.
#### Project Resolution
The `-p` / `--project` flag uses cascading match logic across all commands:
1. **Exact match**: `group/project`
2. **Case-insensitive**: `Group/Project`
3. **Suffix match**: `project` matches `group/project` (if unambiguous)
4. **Substring match**: `typescript` matches `vs/typescript-code` (if unambiguous)
If multiple projects match, an error lists the candidates with a hint to use the full path.
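The cascade can be sketched as standalone code (hypothetical illustration only — the function name and error shape are assumptions, not lore's actual implementation):

```rust
/// Resolve a `-p` query against known project paths using the four-tier
/// cascade: exact, case-insensitive, suffix, then substring.
/// Err carries the ambiguous candidate list (empty = not found).
fn resolve_project<'a>(projects: &[&'a str], query: &str) -> Result<&'a str, Vec<&'a str>> {
    // 1. Exact match on the full `group/project` path
    if let Some(p) = projects.iter().copied().find(|&p| p == query) {
        return Ok(p);
    }
    // 2. Case-insensitive full-path match
    let q = query.to_lowercase();
    let ci: Vec<&str> = projects.iter().copied()
        .filter(|p| p.to_lowercase() == q).collect();
    if ci.len() == 1 {
        return Ok(ci[0]);
    } else if ci.len() > 1 {
        return Err(ci);
    }
    // 3. Suffix match: `project` matches `group/project`
    let suffix: Vec<&str> = projects.iter().copied()
        .filter(|p| p.rsplit('/').next() == Some(query)).collect();
    if suffix.len() == 1 {
        return Ok(suffix[0]);
    } else if suffix.len() > 1 {
        return Err(suffix);
    }
    // 4. Substring match anywhere in the path
    let sub: Vec<&str> = projects.iter().copied()
        .filter(|p| p.contains(query)).collect();
    if sub.len() == 1 {
        Ok(sub[0])
    } else {
        Err(sub) // empty = not found, >1 = ambiguous
    }
}
```

Each tier only falls through when it produced no match; ambiguity at any tier short-circuits with the candidate list.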
### `lore mrs`
Query merge requests from local database, or show a specific MR.
@@ -221,14 +235,14 @@ lore search "deploy" --author username # Filter by author
lore search "deploy" -p group/repo # Filter by project
lore search "deploy" --label backend # Filter by label (AND logic)
lore search "deploy" --path src/ # Filter by file path (trailing / for prefix)
lore search "deploy" --after 7d # Created after (7d, 2w, 1m, or YYYY-MM-DD)
lore search "deploy" --updated-after 2w # Updated after
lore search "deploy" -n 50 # Limit results (default 20, max 100)
lore search "deploy" --explain # Show ranking explanation per result
lore search "deploy" --fts-mode raw # Raw FTS5 query syntax (advanced)
```
Requires `lore generate-docs` (or `lore sync`) to have been run at least once. Semantic and hybrid modes require `lore embed` (or `lore sync`) to have generated vector embeddings via Ollama.
### `lore sync`
@@ -359,12 +373,32 @@ Run pending database migrations.
lore migrate
```
### `lore health`
Quick pre-flight check for config, database, and schema version. Exits 0 if healthy, 1 if unhealthy.
```bash
lore health
```
Useful as a fast gate before running queries or syncs. For a more thorough check including authentication and project access, use `lore doctor`.
### `lore robot-docs`
Machine-readable command manifest for agent self-discovery. Returns a JSON schema of all commands, flags, exit codes, and example workflows.
```bash
lore robot-docs # Pretty-printed JSON
lore --robot robot-docs # Compact JSON for parsing
```
### `lore version`
Show version information including the git commit hash.
```bash
lore version
# lore version 0.1.0 (abc1234)
```
## Robot Mode
@@ -422,6 +456,8 @@ Errors return structured JSON to stderr:
| 14 | Ollama unavailable |
| 15 | Ollama model not found |
| 16 | Embedding failed |
| 17 | Not found (entity does not exist) |
| 18 | Ambiguous match (use `-p` to specify project) |
| 20 | Config not found |
## Configuration Precedence
@@ -439,8 +475,13 @@ Settings are resolved in this order (highest to lowest priority):
lore -c /path/to/config.json <command> # Use alternate config
lore --robot <command> # Machine-readable JSON
lore -J <command> # JSON shorthand
lore --color never <command> # Disable color output
lore --color always <command> # Force color output
lore -q <command> # Suppress non-essential output
```
Color output respects `NO_COLOR` and `CLICOLOR` environment variables in `auto` mode (the default).
## Shell Completions
Generate shell completions for tab-completion support:
@@ -480,6 +521,8 @@ Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:
| `documents` | Extracted searchable text for FTS and embedding |
| `documents_fts` | FTS5 full-text search index |
| `embeddings` | Vector embeddings for semantic search |
| `dirty_sources` | Entities needing document regeneration after ingest |
| `pending_discussion_fetches` | Queue for discussion fetch operations |
| `sync_runs` | Audit trail of sync operations |
| `sync_cursors` | Cursor positions for incremental sync |
| `app_locks` | Crash-safe single-flight lock |

api-review.html (new file, 1,654 lines; diff too large to display)


@@ -0,0 +1,308 @@
# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring
> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.
## Problem Statement
Three issues stem from the chunk size reduction:
1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for retry_failed and never passes `options.full` to embed. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.
2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- different granularity vectors in the same KNN space degrade search quality. No mechanism detects this drift.
3. **Static dedup multiplier**: `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to actual data.
## Decision Record
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config" treated as needing re-embed. |
---
## Changes
### Change 1: Wire `--full` flag through to embed
**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)
**main.rs `handle_embed`** (line 1116):
```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;
// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```
Update the import at top of main.rs from `run_embed` to `run_embed_full`.
**sync.rs `run_sync`** (line 105):
```rust
// BEFORE:
match run_embed(config, false).await {
// AFTER:
match run_embed_full(config, options.full, false).await {
```
Update the import at line 11 from `run_embed` to `run_embed_full`.
**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`). All callers should use `run_embed_full` directly. Rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`.
Final signature:
```rust
pub async fn run_embed(
config: &Config,
full: bool,
retry_failed: bool,
) -> Result<EmbedCommandResult>
```
---
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`
**New file:** `migrations/010_chunk_config.sql`
```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".
ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;
UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);
-- Or if using INSERT pattern:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```
Check existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency.
---
### Change 3: Store `chunk_max_bytes` when writing embeddings
**File:** `src/embedding/pipeline.rs`
**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:
```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;
// In store_embedding, update SQL:
conn.execute(
"INSERT OR REPLACE INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error, chunk_max_bytes)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
rusqlite::params![
doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
],
)?;
```
**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:
```rust
conn.execute(
"INSERT INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
ON CONFLICT(document_id, chunk_index) DO UPDATE SET
attempt_count = embedding_metadata.attempt_count + 1,
last_error = ?8,
last_attempt_at = ?7,
chunk_max_bytes = ?9",
rusqlite::params![
doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
],
)?;
```
---
### Change 4: Detect chunk config drift in change detector
**File:** `src/embedding/change_detector.rs`
Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).
```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;
pub fn find_pending_documents(
conn: &Connection,
page_size: usize,
last_id: i64,
) -> Result<Vec<PendingDocument>> {
let sql = r#"
SELECT d.id, d.content_text, d.content_hash
FROM documents d
WHERE d.id > ?1
AND (
-- Case 1: No embedding metadata (new document)
NOT EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
)
-- Case 2: Document content changed
OR EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
AND em.document_hash != d.content_hash
)
-- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
OR EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
)
)
ORDER BY d.id
LIMIT ?2
"#;
let mut stmt = conn.prepare(sql)?;
let rows = stmt
.query_map(
rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
|row| {
Ok(PendingDocument {
document_id: row.get(0)?,
content_text: row.get(1)?,
content_hash: row.get(2)?,
})
},
)?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(rows)
}
```
Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.
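For reference, the count query would take the same shape (a sketch only; mirror the actual `count_pending_documents` SQL, which here binds the chunk size as `?1`):

```sql
SELECT COUNT(*)
FROM documents d
WHERE
    -- Case 1: No embedding metadata (new document)
    NOT EXISTS (
        SELECT 1 FROM embedding_metadata em
        WHERE em.document_id = d.id AND em.chunk_index = 0
    )
    -- Case 2: Document content changed
    OR EXISTS (
        SELECT 1 FROM embedding_metadata em
        WHERE em.document_id = d.id AND em.chunk_index = 0
          AND em.document_hash != d.content_hash
    )
    -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
    OR EXISTS (
        SELECT 1 FROM embedding_metadata em
        WHERE em.document_id = d.id AND em.chunk_index = 0
          AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?1)
    )
```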
---
### Change 5: Adaptive dedup multiplier in vector search
**File:** `src/search/vector.rs`
Replace the static `limit * 8` with an adaptive multiplier based on the actual max chunks-per-document in the database.
```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
conn.query_row(
"SELECT COALESCE(MAX(cnt), 1) FROM (
SELECT COUNT(*) as cnt FROM embedding_metadata
WHERE last_error IS NULL
GROUP BY document_id
)",
[],
|row| row.get(0),
)
.unwrap_or(1)
}
pub fn search_vector(
conn: &Connection,
query_embedding: &[f32],
limit: usize,
) -> Result<Vec<VectorResult>> {
if query_embedding.is_empty() || limit == 0 {
return Ok(Vec::new());
}
let embedding_bytes: Vec<u8> = query_embedding
.iter()
.flat_map(|f| f.to_le_bytes())
.collect();
// Adaptive over-fetch: use actual max chunks per doc, with floor of 8x
// The 1.5x safety margin handles clustering in KNN results
let max_chunks = max_chunks_per_document(conn);
let multiplier = (max_chunks as usize * 3 / 2).max(8);
let k = limit * multiplier;
// ... rest unchanged ...
}
```
**Why `max_chunks * 1.5` with floor of 8**:
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- Floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond
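The multiplier arithmetic can be isolated as a pure function (hypothetical helper, assuming the same 1.5x margin and floor of 8 in integer math):

```rust
/// Over-fetch budget for vector search dedup:
/// k = limit * max(max_chunks * 3 / 2, 8)
fn dedup_k(limit: usize, max_chunks: usize) -> usize {
    let multiplier = (max_chunks * 3 / 2).max(8);
    limit * multiplier
}
```

With single-chunk documents this reduces to the old `limit * 8`; at ~6 chunks per document it grows to `limit * 9`, and it keeps scaling as chunk counts rise.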
---
### Change 6: Update chunk_ids.rs comment
**File:** `src/embedding/chunk_ids.rs` (line 1-3)
Update the comment to reflect current reality:
```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
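The packing scheme the comment describes can be sketched as a pair of helpers (hypothetical names; the real functions live in `chunk_ids.rs`):

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Pack (document_id, chunk_index) into a single rowid.
/// Valid only while chunk_index < CHUNK_ROWID_MULTIPLIER.
fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

/// Recover (document_id, chunk_index) from a packed rowid.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}
```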
---
## Files Modified (Summary)
| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |
## Verification
1. **Compile check**: `cargo build` -- no errors
2. **Unit tests**: `cargo test` -- all existing tests pass
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (embedded count should reset then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without --full) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with actual chunk distribution.
## Decision Record (for future reference)
**Date:** 2026-02-02
**Trigger:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).
**Downstream consequences identified:**
1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for local model
**Rejected alternatives:**
- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over adaptive. sqlite-vec doesn't support OFFSET in KNN queries, requiring full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates principle of least surprise.


@@ -0,0 +1,951 @@
# Phase B: Temporal Intelligence Foundation
> **Status:** Draft
> **Prerequisite:** CP3 Gates B+C complete (working search + sync pipeline)
> **Goal:** Transform gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab and exposing temporal queries that answer "why" and "when" questions about project history.
---
## Motivation
gitlore currently stores **snapshots** — the latest state of each issue, MR, and discussion. But temporal queries need **change history**. When an issue's labels change from `priority::low` to `priority::critical`, the current schema overwrites the label junction. The transition is lost.
GitLab issues, MRs, and discussions contain the raw ingredients for temporal intelligence: state transitions, label mutations, assignee changes, cross-references between entities, and decision rationale in discussions. What's missing is a structured temporal index that makes these ingredients queryable.
### The Problem This Solves
Today, when an AI agent or developer asks "Why did the team switch from REST to GraphQL?" or "What happened with the auth migration?", the answer is scattered across paginated API responses with no temporal index, no cross-referencing, and no semantic layer. Reconstructing a decision timeline manually takes 20+ minutes of clicking through GitLab's UI. This phase makes it take 2 seconds.
### Forcing Function
This phase is designed around one concrete question: **"What happened with X?"** — where X is any keyword, feature name, or initiative. If `lore timeline "auth migration"` can produce a useful, chronologically ordered narrative of all related events across issues, MRs, and discussions, the architecture is validated. If it can't, we learn what's missing before investing in deeper temporal features.
---
## Executive Summary (Gated Milestones)
Five gates, each independently verifiable and shippable:
**Gate 1 (Resource Events Ingestion):** Structured event data from GitLab APIs → local event tables
**Gate 2 (Cross-Reference Extraction):** Entity relationship graph from structured APIs + system note parsing
**Gate 3 (Decision Timeline):** `lore timeline` command — keyword-driven chronological narrative
**Gate 4 (File Decision History):** `lore file-history` command — MR-to-file linking + scoped timelines
**Gate 5 (Code Trace):** `lore trace` command — file:line → commit → MR → issue → rationale chain
### Key Design Decisions
- **Structured APIs over text parsing.** GitLab provides Resource Events APIs (`resource_state_events`, `resource_label_events`, `resource_milestone_events`) that return clean JSON. These are the primary data source for temporal events. System note parsing is a fallback for events without structured APIs (assignee changes, cross-references).
- **Dependent resource pattern.** Resource events are fetched per-entity, triggered by the existing dirty source tracking. Same architecture as discussion fetching — queue-based, resumable, incremental.
- **Opt-out event ingestion.** New config flag `sync.fetchResourceEvents` (default `true`) controls whether the sync pipeline fetches event data. Users who don't need temporal features can disable it to skip the additional API calls.
- **Application-level graph traversal.** Cross-reference expansion uses BFS in Rust, not recursive SQL CTEs. Capped at configurable depth (default 1) for predictable performance.
- **Evolutionary library extraction.** New commands are built with typed return structs from day one. Old commands are not retrofitted until a concrete consumer (MCP server, web UI) requires it.
- **Phase A fields cherry-picked as needed.** `merge_commit_sha` and `squash_commit_sha` are added in this phase's migration. Remaining Phase A fields are handled in their own migration later.
### Scope Boundaries
**In scope:**
- Batch temporal queries over historical data
- Structured event ingestion from GitLab APIs
- Cross-reference graph construction
- CLI commands with robot mode JSON output
**Out of scope (future phases):**
- Real-time monitoring / notifications ("alert me when my code changes")
- MCP server (Phase C — consumes the library API this phase produces)
- Web UI (Phase D — consumes the same library API)
- Pattern evolution / cross-project trend detection (Phase C)
- Library extraction refactor (happens organically as new commands are added)
---
## Gate 1: Resource Events Ingestion
### 1.1 Rationale: Why Not Parse System Notes?
The original approach was to parse system note body text with regex to extract state changes and label mutations. Research revealed this is the wrong approach:
1. **Structured APIs exist.** GitLab's Resource Events APIs return clean JSON with explicit `action`, `state`, and `label` fields. Available on all tiers (Free, Premium, Ultimate).
2. **System notes are localized.** A French GitLab instance says `"ajouté l'étiquette ~bug"` — regex breaks for non-English instances.
3. **Label events aren't in the Notes API.** Per [GitLab Issue #24661](https://gitlab.com/gitlab-org/gitlab/-/issues/24661), label change system notes are not returned by the Notes API. The Resource Label Events API is the only reliable source.
4. **No versioned format spec.** System note text has changed across GitLab 14.x through 17.x with no documentation of format changes.
System note parsing is still used for events without structured APIs (see Gate 2), but with the explicit understanding that it's best-effort and fragile for non-English instances.
### 1.2 Schema (Migration 010)
**File:** `migrations/010_resource_events.sql`
```sql
-- State change events (opened, closed, reopened, merged, locked)
-- Source: GET /projects/:id/issues/:iid/resource_state_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_state_events
CREATE TABLE resource_state_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
state TEXT NOT NULL, -- 'opened' | 'closed' | 'reopened' | 'merged' | 'locked'
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
-- "closed by MR" link: structured by GitLab, not parsed from text
source_merge_request_id INTEGER, -- GitLab's MR iid that caused this state change
source_commit TEXT, -- commit SHA that caused this state change
UNIQUE(gitlab_id, project_id),
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE INDEX idx_state_events_issue ON resource_state_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_state_events_mr ON resource_state_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_state_events_created ON resource_state_events(created_at);
-- Label change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_label_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_label_events
CREATE TABLE resource_label_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
label_name TEXT NOT NULL,
action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
UNIQUE(gitlab_id, project_id),
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE INDEX idx_label_events_issue ON resource_label_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_label_events_mr ON resource_label_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_label_events_created ON resource_label_events(created_at);
CREATE INDEX idx_label_events_label ON resource_label_events(label_name);
-- Milestone change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_milestone_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_milestone_events
CREATE TABLE resource_milestone_events (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
milestone_title TEXT NOT NULL,
milestone_id INTEGER,
action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
actor_gitlab_id INTEGER, -- GitLab user ID (stable; usernames can change)
actor_username TEXT, -- display/search convenience
created_at INTEGER NOT NULL, -- ms epoch UTC
UNIQUE(gitlab_id, project_id),
CHECK (
(issue_id IS NOT NULL AND merge_request_id IS NULL)
OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
)
);
CREATE INDEX idx_milestone_events_issue ON resource_milestone_events(issue_id)
WHERE issue_id IS NOT NULL;
CREATE INDEX idx_milestone_events_mr ON resource_milestone_events(merge_request_id)
WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_milestone_events_created ON resource_milestone_events(created_at);
```
### 1.3 Config Extension
**File:** `src/core/config.rs`
Add to `SyncConfig`:
```rust
/// Fetch resource events (state, label, milestone changes) during sync.
/// Increases API calls but enables temporal queries (lore timeline, etc.).
/// Default: true
#[serde(default = "default_true")]
pub fetch_resource_events: bool,
```
**Config file example:**
```json
{
"sync": {
"fetchResourceEvents": true
}
}
```
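Note that `#[serde(default = "default_true")]` requires a named function; if the codebase doesn't already define one, a helper like this is assumed:

```rust
/// serde calls this to produce the default when the field is
/// absent from the config file.
fn default_true() -> bool {
    true
}
```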
### 1.4 GitLab API Client
**New endpoints in `src/gitlab/client.rs`:**
```
GET /projects/:id/issues/:iid/resource_state_events?per_page=100
GET /projects/:id/issues/:iid/resource_label_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_state_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_label_events?per_page=100
GET /projects/:id/issues/:iid/resource_milestone_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_milestone_events?per_page=100
```
All endpoints use standard pagination. Fetch all pages per entity.
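The "fetch all pages" loop can be sketched generically (hypothetical helper; a production client should prefer GitLab's `x-next-page` response header over the short-page heuristic used here):

```rust
/// Fetch pages until a page shorter than `per_page` signals the end.
/// `fetch_page` is 1-indexed, matching GitLab's `page` query parameter.
fn fetch_all_pages<T>(mut fetch_page: impl FnMut(usize) -> Vec<T>) -> Vec<T> {
    const PER_PAGE: usize = 100;
    let mut all = Vec::new();
    let mut page = 1;
    loop {
        let batch = fetch_page(page);
        let last = batch.len() < PER_PAGE;
        all.extend(batch);
        if last {
            break;
        }
        page += 1;
    }
    all
}
```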
**New serde types in `src/gitlab/types.rs`:**
```rust
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabStateEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String, // "Issue" | "MergeRequest"
pub resource_id: i64,
pub state: String, // "opened" | "closed" | "reopened" | "merged" | "locked"
pub source_commit: Option<String>,
pub source_merge_request: Option<GitLabMergeRequestRef>,
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabLabelEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String,
pub resource_id: i64,
pub label: GitLabLabelRef,
pub action: String, // "add" | "remove"
}
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabMilestoneEvent {
pub id: i64,
pub user: Option<GitLabAuthor>,
pub created_at: String,
pub resource_type: String,
pub resource_id: i64,
pub milestone: GitLabMilestoneRef,
pub action: String, // "add" | "remove"
}
```
### 1.5 Ingestion Pipeline
**Architecture:** Generic dependent-fetch queue, generalizing the `pending_discussion_fetches` pattern. A single queue table serves all dependent resource types across Gates 1, 2, and 4, avoiding schema churn as new fetch types are added.
**New queue table (in migration 010):**
```sql
-- Generic queue for all dependent resource fetches (events, closes_issues, diffs)
-- Replaces per-type queue tables with a unified job model
CREATE TABLE pending_dependent_fetches (
id INTEGER PRIMARY KEY,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
entity_type TEXT NOT NULL CHECK (entity_type IN ('issue', 'merge_request')),
entity_iid INTEGER NOT NULL,
entity_local_id INTEGER NOT NULL,
job_type TEXT NOT NULL CHECK (job_type IN (
'resource_events', -- Gate 1: state + label + milestone events
'mr_closes_issues', -- Gate 2: closes_issues API
'mr_diffs' -- Gate 4: MR file changes
)),
payload_json TEXT, -- job-specific params, e.g. {"event_types":["state","label","milestone"]}
enqueued_at INTEGER NOT NULL,
attempts INTEGER NOT NULL DEFAULT 0,
last_error TEXT,
next_retry_at INTEGER,
locked_at INTEGER, -- crash recovery: NULL = available, non-NULL = in progress
UNIQUE(project_id, entity_type, entity_iid, job_type)
);
```
The `locked_at` column provides crash recovery: if a sync process crashes mid-drain, stale locks (older than 5 minutes) are automatically reclaimed on the next `lore sync` run. This is intentionally minimal — full job leasing with `locked_by` and lease expiration is unnecessary for a single-process CLI tool.
**Flow:**
1. During issue/MR ingestion, when an entity is upserted (new or updated), enqueue jobs in `pending_dependent_fetches`:
- For all entities: `job_type = 'resource_events'` (when `fetchResourceEvents` is true)
- For MRs: `job_type = 'mr_closes_issues'` (always, for Gate 2)
- For MRs: `job_type = 'mr_diffs'` (when `fetchMrFileChanges` is true, for Gate 4)
2. After primary ingestion completes, drain the dependent fetch queue:
- Claim jobs: `UPDATE ... SET locked_at = now WHERE locked_at IS NULL AND (next_retry_at IS NULL OR next_retry_at <= now)`
- For each job, dispatch by `job_type` to the appropriate fetcher
- On success: DELETE the job row
- On transient failure: increment `attempts`, set `next_retry_at` with exponential backoff, clear `locked_at`
3. `lore sync` drains dependent jobs after ingestion + discussion fetch steps.
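The claim step above might look like this in SQL (a sketch; the parameter names and the 5-minute stale-lock window are assumptions consistent with the crash-recovery note, and `RETURNING` requires SQLite 3.35+):

```sql
-- Claim a batch of available jobs, reclaiming stale locks (> 5 min old).
UPDATE pending_dependent_fetches
SET locked_at = :now
WHERE id IN (
    SELECT id FROM pending_dependent_fetches
    WHERE (locked_at IS NULL OR locked_at < :now - 300000)
      AND (next_retry_at IS NULL OR next_retry_at <= :now)
    ORDER BY enqueued_at
    LIMIT :batch_size
)
RETURNING id, project_id, entity_type, entity_iid, job_type, payload_json;
```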
**Incremental behavior:** Only entities that changed since last sync are enqueued. On `--full` sync, all entities are re-enqueued.
### 1.6 API Call Budget
Per entity: 3 API calls (state + label + milestone), for both issues and MRs.
| Scenario | Entities | API Calls | Time at 2k req/min |
|----------|----------|-----------|---------------------|
| Initial sync, 500 issues + 200 MRs | 700 | 2,100 | ~1 min |
| Initial sync, 2,000 issues + 1,000 MRs | 3,000 | 9,000 | ~4.5 min |
| Incremental sync, 20 changed entities | 20 | 60 | <2 sec |
Acceptable for initial sync. Incremental sync adds negligible overhead.
**Optimization (future):** If milestone events prove low-value, make them opt-in to reduce calls by 1/3.
### 1.7 Acceptance Criteria
- [ ] Migration 010 creates all three event tables + generic dependent fetch queue
- [ ] `lore sync` fetches resource events for changed entities when `fetchResourceEvents` is true
- [ ] `lore sync --no-events` skips event fetching
- [ ] Event fetch failures are queued for retry with exponential backoff
- [ ] Stale locks (crashed sync) automatically reclaimed on next run
- [ ] `lore count events` shows event counts by type
- [ ] `lore stats --check` validates event table referential integrity
- [ ] `lore stats --check` validates dependent job queue health (no stuck locks, retryable jobs visible)
- [ ] Robot mode JSON for all new commands
---
## Gate 2: Cross-Reference Extraction
### 2.1 Rationale
Temporal queries need to follow links between entities: "MR !567 closed issue #234", "issue #234 mentioned in MR !567", "#299 was opened as a follow-up to !567". These relationships are captured in two places:
1. **Structured API:** `GET /projects/:id/merge_requests/:iid/closes_issues` returns issues that close when the MR merges. Also, `resource_state_events` includes `source_merge_request_id` for "closed by MR" events.
2. **System notes:** Cross-references like "mentioned in !456" and "closed by !789" appear in system note body text.
### 2.2 Schema (in Migration 010)
```sql
-- Cross-references between entities
-- Populated from: closes_issues API, state events, system note parsing
--
-- Directionality convention:
-- source = the entity where the reference was *observed* (contains the note, or is the MR in closes_issues)
-- target = the entity being *referenced* (the issue closed, the MR mentioned)
-- This is consistent across all source_methods and enables predictable BFS traversal.
--
-- Unresolved references: when a cross-reference points to an entity in a project
-- that isn't synced locally, target_entity_id is NULL but target_project_path and
-- target_entity_iid are populated. This preserves valuable edges rather than
-- silently dropping them. Timeline output marks these as "[external]".
CREATE TABLE entity_references (
id INTEGER PRIMARY KEY,
source_entity_type TEXT NOT NULL CHECK (source_entity_type IN ('issue', 'merge_request')),
source_entity_id INTEGER NOT NULL, -- local DB id
target_entity_type TEXT NOT NULL CHECK (target_entity_type IN ('issue', 'merge_request')),
target_entity_id INTEGER, -- local DB id (NULL when target is unresolved/external)
target_project_path TEXT, -- e.g. "group/other-repo" (populated for cross-project refs)
target_entity_iid INTEGER, -- GitLab iid (populated when target_entity_id is NULL)
reference_type TEXT NOT NULL, -- 'closes' | 'mentioned' | 'related'
source_method TEXT NOT NULL, -- 'api_closes_issues' | 'api_state_event' | 'system_note_parse'
created_at INTEGER -- when the reference was created (if known)
);
-- SQLite table constraints cannot contain expressions, so deduplication
-- across resolved and unresolved targets is enforced with an
-- expression-based unique index instead of a UNIQUE table constraint.
CREATE UNIQUE INDEX idx_refs_dedup ON entity_references(
source_entity_type, source_entity_id, target_entity_type,
COALESCE(target_entity_id, -1), COALESCE(target_project_path, ''),
COALESCE(target_entity_iid, -1), reference_type
);
CREATE INDEX idx_refs_source ON entity_references(source_entity_type, source_entity_id);
CREATE INDEX idx_refs_target ON entity_references(target_entity_type, target_entity_id)
WHERE target_entity_id IS NOT NULL;
CREATE INDEX idx_refs_unresolved ON entity_references(target_project_path, target_entity_iid)
WHERE target_entity_id IS NULL;
```
### 2.3 Population Strategy
**Tier 1 — Structured APIs (reliable):**
1. **`closes_issues` endpoint:** After MR ingestion, fetch `GET /projects/:id/merge_requests/:iid/closes_issues`. Insert `reference_type = 'closes'`, `source_method = 'api_closes_issues'`. Source = MR, target = issue.
2. **State events:** When `resource_state_events` contains `source_merge_request_id`, insert `reference_type = 'closes'`, `source_method = 'api_state_event'`. Source = MR (referenced by iid), target = issue (that received the state change).
**Tier 2 — System note parsing (best-effort):**
Parse system notes where `is_system = 1` for cross-reference patterns.
**Directionality rule:** Source = entity containing the system note. Target = entity referenced by the note text. This is consistent with Tier 1's convention.
```
mentioned in !{iid}
mentioned in #{iid}
mentioned in {group}/{project}!{iid}
mentioned in {group}/{project}#{iid}
closed by !{iid}
closed by #{iid}
```
**Cross-project references:** When a system note references `{group}/{project}#{iid}` and the target project is not synced locally, store with `target_entity_id = NULL`, `target_project_path = '{group}/{project}'`, `target_entity_iid = {iid}`. These unresolved references are still valuable for timeline narratives — they indicate external dependencies and decision context even when we can't traverse further.
Insert with `source_method = 'system_note_parse'`. Accept that:
- This breaks on non-English GitLab instances
- Format may vary across GitLab versions
- Log parse failures at `debug` level for monitoring
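A minimal sketch of the Tier 2 parse under the English-only assumption. `ParsedRef` and `parse_system_note` are hypothetical names; the real parser must also tolerate trailing text and format drift across GitLab versions:

```rust
/// A parsed cross-reference from a system note body (illustrative sketch).
#[derive(Debug, PartialEq)]
struct ParsedRef {
    reference_type: &'static str,   // "mentioned" | "closes"
    target_is_mr: bool,             // '!' => MR, '#' => issue
    target_project: Option<String>, // cross-project path, if any
    target_iid: i64,
}

fn parse_system_note(body: &str) -> Option<ParsedRef> {
    let (reference_type, rest) = if let Some(r) = body.strip_prefix("mentioned in ") {
        ("mentioned", r)
    } else if let Some(r) = body.strip_prefix("closed by ") {
        ("closes", r)
    } else {
        return None;
    };
    // Split "{group}/{project}!{iid}" / "#{iid}" at the sigil.
    let sigil_pos = rest.find(|c: char| c == '!' || c == '#')?;
    let (project, tail) = rest.split_at(sigil_pos);
    let target_is_mr = tail.starts_with('!');
    let target_iid: i64 = tail[1..].trim().parse().ok()?;
    let target_project = if project.is_empty() { None } else { Some(project.to_string()) };
    Some(ParsedRef { reference_type, target_is_mr, target_project, target_iid })
}

fn main() {
    let r = parse_system_note("mentioned in !456").unwrap();
    assert_eq!((r.reference_type, r.target_is_mr, r.target_iid), ("mentioned", true, 456));
    let r = parse_system_note("closed by group/other-repo#42").unwrap();
    assert_eq!(r.target_project.as_deref(), Some("group/other-repo"));
    assert!(parse_system_note("changed the description").is_none()); // not a cross-ref
    println!("ok");
}
```

Cross-project results (where `target_project` is `Some`) are stored as unresolved references when the target project is not synced.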
**Tier 3 — Description/body parsing (deferred):**
Issue and MR descriptions often contain `#123` or `!456` references. Parsing these is lower confidence (mentions != relationships) and is deferred to a future iteration.
### 2.4 Ingestion Flow
The `closes_issues` fetch uses the generic dependent fetch queue (`job_type = 'mr_closes_issues'`):
- After MR ingestion, a `mr_closes_issues` job is enqueued alongside `resource_events` jobs
- One additional API call per MR: `GET /projects/:id/merge_requests/:iid/closes_issues`
- Cross-reference parsing from system notes runs as a local post-processing step (no API calls) after all dependent fetches complete
### 2.5 Acceptance Criteria
- [ ] `entity_references` table populated from `closes_issues` API for all synced MRs
- [ ] `entity_references` table populated from `resource_state_events` where `source_merge_request_id` is present
- [ ] System notes parsed for cross-reference patterns (English instances)
- [ ] Cross-project references stored as unresolved when target project is not synced
- [ ] `source_method` column tracks provenance of each reference
- [ ] References are deduplicated (same relationship from multiple sources stored once)
- [ ] Timeline JSON includes expansion provenance (`via`) for all expanded entities
---
## Gate 3: Decision Timeline (`lore timeline`)
### 3.1 Command Design
```bash
# Basic: keyword-driven timeline
lore timeline "auth migration"
# Scoped to project
lore timeline "auth migration" -p group/repo
# Limit date range
lore timeline "auth migration" --since 6m
lore timeline "auth migration" --since 2024-01-01
# Control cross-reference expansion depth
lore timeline "auth migration" --depth 0 # No expansion (matched entities only)
lore timeline "auth migration" --depth 1 # Follow direct references (default)
lore timeline "auth migration" --depth 2 # Two hops
# Control which edge types are followed during expansion
lore timeline "auth migration" --expand-mentions # Also follow 'mentioned' edges (off by default)
# Default expansion follows 'closes' and 'related' edges only.
# 'mentioned' edges are excluded by default because they have high fan-out
# and often connect tangentially related entities.
# Limit results
lore timeline "auth migration" -n 50
# Robot mode
lore -J timeline "auth migration"
```
### 3.2 Query Flow
```
1. SEED: FTS5 keyword search → matched document IDs (issues, MRs, and notes/discussions)
2. HYDRATE:
- Map document IDs → source entities (issues, MRs)
- Collect top matched notes as evidence candidates (bounded, default top 10)
These are the actual decision-bearing comments that answer "why"
3. EXPAND: Follow entity_references (BFS, depth-limited)
→ Discover related entities not matched by keywords
→ Default: follow 'closes' + 'related' edges; skip 'mentioned' unless --expand-mentions
→ Unresolved (external) references included in output but not traversed further
4. COLLECT EVENTS: For all entities (seed + expanded):
- Entity creation (created_at from issues/merge_requests)
- State changes (resource_state_events)
- Label changes (resource_label_events)
- Milestone changes (resource_milestone_events)
- Evidence notes: top FTS5-matched notes as discrete events (snippet + author + url)
- Merge events (merged_at from merge_requests)
5. INTERLEAVE: Sort all events chronologically
6. RENDER: Format as timeline (human or JSON)
```
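Step 3 (EXPAND) is a depth-limited BFS over `entity_references`. This sketch uses abstract integer ids and string edge kinds rather than the real row types:

```rust
use std::collections::{HashSet, VecDeque};

/// (source, target, reference_type) edge from entity_references.
type Edge = (u32, u32, &'static str);

/// Depth-limited BFS that follows 'closes' and 'related' edges, plus
/// 'mentioned' edges only when --expand-mentions is set. Edges are
/// traversed in both directions since a reference connects both sides.
fn expand(seeds: &[u32], edges: &[Edge], max_depth: u32, follow_mentions: bool) -> HashSet<u32> {
    let mut seen: HashSet<u32> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(u32, u32)> = seeds.iter().map(|&s| (s, 0)).collect();
    while let Some((node, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue;
        }
        for &(src, dst, kind) in edges {
            let allowed = kind == "closes" || kind == "related"
                || (follow_mentions && kind == "mentioned");
            if !allowed {
                continue;
            }
            // Follow the edge from whichever side the current node is on.
            let next = if src == node { dst } else if dst == node { src } else { continue };
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}

fn main() {
    // 1 -closes-> 2, 2 -mentioned-> 3
    let edges = [(1, 2, "closes"), (2, 3, "mentioned")];
    assert!(!expand(&[1], &edges, 2, false).contains(&3)); // mentions skipped by default
    assert!(expand(&[1], &edges, 2, true).contains(&3)); // --expand-mentions
    assert_eq!(expand(&[1], &edges, 0, true).len(), 1); // --depth 0: seeds only
    println!("ok");
}
```

Unresolved (external) targets would be collected for output at the point of discovery but never enqueued for further traversal.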
**Why evidence notes instead of "discussion activity summarized":** The forcing function is "What happened with X?" A timeline entry that says "3 new comments" doesn't answer *why* — it answers *how many*. By including the top FTS5-matched notes as first-class timeline events, the timeline surfaces the actual decision rationale, code review feedback, and architectural reasoning that motivated changes. This uses the existing search infrastructure (CP3) with no new indexing required.
### 3.3 Event Model
The timeline doesn't store a separate unified event table. Instead, it queries across the existing tables at read time and produces a virtual event stream:
```rust
pub struct TimelineEvent {
pub timestamp: i64, // ms epoch
pub entity_type: String, // "issue" | "merge_request" | "discussion"
pub entity_iid: i64,
pub project_path: String,
pub event_type: TimelineEventType,
pub summary: String, // human-readable one-liner
pub actor: Option<String>, // username
pub url: Option<String>,
pub is_seed: bool, // matched by keyword (vs. expanded via reference)
}
pub enum TimelineEventType {
Created, // entity opened/created
StateChanged { state: String }, // closed, reopened, merged, locked
LabelAdded { label: String },
LabelRemoved { label: String },
MilestoneSet { milestone: String },
MilestoneRemoved { milestone: String },
Merged,
NoteEvidence { // FTS5-matched note surfacing decision rationale
note_id: i64,
snippet: String, // first ~200 chars of the matching note body
discussion_id: Option<i64>,
},
CrossReferenced { target: String },
}
```
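Step 5 (INTERLEAVE) reduces to a stable chronological sort. The seed-first tiebreak here is an assumption made for deterministic output, not something the spec above mandates:

```rust
#[derive(Debug)]
struct Ev {
    timestamp: i64, // ms epoch
    is_seed: bool,
    summary: String,
}

/// Chronological interleave of events collected from all source tables.
/// Ties sort seed events first (illustrative tiebreak); sort_by_key is
/// stable, so equal keys keep their collection order.
fn interleave(mut events: Vec<Ev>) -> Vec<Ev> {
    events.sort_by_key(|e| (e.timestamp, !e.is_seed));
    events
}

fn main() {
    let evs = interleave(vec![
        Ev { timestamp: 20, is_seed: false, summary: "expanded".into() },
        Ev { timestamp: 10, is_seed: true, summary: "created".into() },
        Ev { timestamp: 20, is_seed: true, summary: "seed".into() },
    ]);
    let order: Vec<&str> = evs.iter().map(|e| e.summary.as_str()).collect();
    assert_eq!(order, ["created", "seed", "expanded"]);
    println!("ok");
}
```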
### 3.4 Human Output Format
```
lore timeline "auth migration"
Timeline: "auth migration" (12 events across 4 entities)
───────────────────────────────────────────────────────
2024-03-15 CREATED #234 Migrate to OAuth2 @alice
Labels: ~auth, ~breaking-change
2024-03-18 CREATED !567 feat: add OAuth2 provider @bob
References: #234
2024-03-20 NOTE #234 "Should we support SAML too? I think @charlie
we should stick with OAuth2 for now..."
2024-03-22 LABEL !567 added ~security-review @alice
2024-03-24 NOTE !567 [src/auth/oauth.rs:45] @dave
"Consider refresh token rotation to
prevent session fixation attacks"
2024-03-25 MERGED !567 feat: add OAuth2 provider @alice
2024-03-26 CLOSED #234 closed by !567 @alice
2024-03-28 CREATED #299 OAuth2 login fails for SSO users @dave [expanded]
(via !567, closes)
───────────────────────────────────────────────────────
Seed entities: #234, !567 | Expanded: #299 (depth 1, via !567)
```
Entities discovered via cross-reference expansion are marked `[expanded]` with a compact provenance note showing which seed entity and edge type led to their discovery.
Evidence notes (`NOTE` events) show the first ~200 characters of FTS5-matched note bodies. These are the actual decision-bearing comments that answer "why" — not just activity counts.
### 3.5 Robot Mode JSON
```json
{
"ok": true,
"data": {
"query": "auth migration",
"event_count": 12,
"seed_entities": [
{ "type": "issue", "iid": 234, "project": "group/repo" },
{ "type": "merge_request", "iid": 567, "project": "group/repo" }
],
"expanded_entities": [
{
"type": "issue",
"iid": 299,
"project": "group/repo",
"depth": 1,
"via": {
"from": { "type": "merge_request", "iid": 567, "project": "group/repo" },
"reference_type": "closes",
"source_method": "api_closes_issues"
}
}
],
"unresolved_references": [
{
"source": { "type": "merge_request", "iid": 567, "project": "group/repo" },
"target_project": "group/other-repo",
"target_type": "issue",
"target_iid": 42,
"reference_type": "mentioned"
}
],
"events": [
{
"timestamp": "2024-03-15T10:00:00Z",
"entity_type": "issue",
"entity_iid": 234,
"project": "group/repo",
"event_type": "created",
"summary": "Migrate to OAuth2",
"actor": "alice",
"url": "https://gitlab.com/group/repo/-/issues/234",
"is_seed": true,
"details": {
"labels": ["auth", "breaking-change"]
}
},
{
"timestamp": "2024-03-20T14:30:00Z",
"entity_type": "issue",
"entity_iid": 234,
"project": "group/repo",
"event_type": "note_evidence",
"summary": "Should we support SAML too? I think we should stick with OAuth2 for now...",
"actor": "charlie",
"url": "https://gitlab.com/group/repo/-/issues/234#note_12345",
"is_seed": true,
"details": {
"note_id": 12345,
"snippet": "Should we support SAML too? I think we should stick with OAuth2 for now..."
}
}
]
},
"meta": {
"search_mode": "lexical",
"expansion_depth": 1,
"expand_mentions": false,
"total_entities": 3,
"total_events": 12,
"evidence_notes_included": 4,
"unresolved_references": 1
}
}
```
### 3.6 Acceptance Criteria
- [ ] `lore timeline <query>` returns chronologically ordered events
- [ ] Seed entities found via FTS5 keyword search (issues, MRs, and notes)
- [ ] State, label, and milestone events interleaved from resource event tables
- [ ] Entity creation and merge events included
- [ ] Evidence-bearing notes included as `note_evidence` events (top FTS5 matches, bounded default 10)
- [ ] Cross-reference expansion follows `entity_references` to configurable depth
- [ ] Default expansion follows `closes` + `related` edges; `--expand-mentions` adds `mentioned` edges
- [ ] `--depth 0` disables expansion
- [ ] `--since` filters by event timestamp
- [ ] `-p` scopes to project
- [ ] Human output is colored and readable
- [ ] Robot mode returns structured JSON with expansion provenance (`via`) for expanded entities
- [ ] Unresolved (external) references included in JSON output
---
## Gate 4: File Decision History (`lore file-history`)
### 4.1 Schema (Migration 011)
**File:** `migrations/011_file_changes.sql`
```sql
-- Files changed by each merge request
-- Source: GET /projects/:id/merge_requests/:iid/diffs
CREATE TABLE mr_file_changes (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER NOT NULL REFERENCES merge_requests(id) ON DELETE CASCADE,
project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
old_path TEXT, -- NULL for new files
new_path TEXT NOT NULL,
change_type TEXT NOT NULL CHECK (change_type IN ('added', 'modified', 'deleted', 'renamed')),
UNIQUE(merge_request_id, new_path)
);
CREATE INDEX idx_mr_files_new_path ON mr_file_changes(new_path);
CREATE INDEX idx_mr_files_old_path ON mr_file_changes(old_path)
WHERE old_path IS NOT NULL;
CREATE INDEX idx_mr_files_mr ON mr_file_changes(merge_request_id);
-- Add commit SHAs to merge_requests (cherry-picked from Phase A)
-- These link MRs to actual git history
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT;
```
### 4.2 Config Extension
```json
{
"sync": {
"fetchMrFileChanges": true
}
}
```
Opt-in. When enabled, the sync pipeline fetches `GET /projects/:id/merge_requests/:iid/diffs` for each changed MR and extracts file metadata. Diff content is **not stored** — only file paths and change types.
### 4.3 Ingestion
**Uses the generic dependent fetch queue (`job_type = 'mr_diffs'`):**
1. After MR ingestion, if `fetchMrFileChanges` is true, enqueue a `mr_diffs` job in `pending_dependent_fetches`.
2. Parse response: `changes[].{old_path, new_path, new_file, renamed_file, deleted_file}`.
3. Derive `change_type`:
- `new_file == true``'added'`
- `renamed_file == true``'renamed'`
- `deleted_file == true``'deleted'`
- else → `'modified'`
4. Upsert into `mr_file_changes`. On re-sync, DELETE existing rows for the MR and re-insert (diffs can change if MR is rebased).
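The derivation in step 3 is a straightforward flag cascade; a sketch with field names mirroring the GitLab diff payload:

```rust
/// Derive change_type from the GitLab diff entry flags, per the rules
/// above. Precedence matters: new_file wins over renamed_file, which
/// wins over deleted_file; everything else is a plain modification.
fn change_type(new_file: bool, renamed_file: bool, deleted_file: bool) -> &'static str {
    if new_file {
        "added"
    } else if renamed_file {
        "renamed"
    } else if deleted_file {
        "deleted"
    } else {
        "modified"
    }
}

fn main() {
    assert_eq!(change_type(true, false, false), "added");
    assert_eq!(change_type(false, true, false), "renamed");
    assert_eq!(change_type(false, false, true), "deleted");
    assert_eq!(change_type(false, false, false), "modified");
    println!("ok");
}
```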
**API call cost:** 1 additional call per MR. Acceptable for incremental sync (10-50 MRs/day).
### 4.4 Command Design
```bash
# Show decision history for a file
lore file-history src/auth/oauth.rs
# Scoped to project (required if file path exists in multiple projects)
lore file-history src/auth/oauth.rs -p group/repo
# Include discussions on the MRs
lore file-history src/auth/oauth.rs --discussions
# Follow rename chains (default: on)
lore file-history src/auth/oauth.rs # follows renames automatically
lore file-history src/auth/oauth.rs --no-follow-renames # disable rename chain resolution
# Limit results
lore file-history src/auth/oauth.rs -n 10
# Filter to merged MRs only
lore file-history src/auth/oauth.rs --merged
# Robot mode
lore -J file-history src/auth/oauth.rs
```
### 4.5 Query Logic
```sql
SELECT
mr.iid,
mr.title,
mr.state,
mr.author_username,
mr.merged_at,
mr.created_at,
mr.web_url,
mr.merge_commit_sha,
mfc.change_type,
mfc.old_path,
(SELECT COUNT(*) FROM discussions d
WHERE d.merge_request_id = mr.id) AS discussion_count,
(SELECT COUNT(*) FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE d.merge_request_id = mr.id
AND n.position_new_path = ?1) AS file_discussion_count
FROM mr_file_changes mfc
JOIN merge_requests mr ON mr.id = mfc.merge_request_id
WHERE mfc.new_path = ?1 OR mfc.old_path = ?1
ORDER BY COALESCE(mr.merged_at, mr.created_at) DESC;
```
For each MR, optionally fetch related issues via `entity_references` (Gate 2 data).
### 4.6 Rename Handling
File renames are tracked via `old_path` and resolved as bounded chains:
1. Start with the query path in the path set: `{src/auth/oauth.rs}`
2. Search `mr_file_changes` for rows where `change_type = 'renamed'` and either `new_path` or `old_path` is in the path set
3. Add the other side of each rename to the path set
4. Repeat until no new paths are discovered, up to a maximum of 10 hops (configurable)
5. Use the full path set for the file history query
**Safeguards:**
- Hop cap (default 10) prevents runaway expansion
- Cycle detection: if a path is already in the set, skip it
- The unioned path set is used for matching MRs in the main query
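The chain resolution above, with the hop cap and cycle detection folded in, can be sketched as a fixpoint loop (illustrative types; real rename pairs come from `mr_file_changes` rows with `change_type = 'renamed'`):

```rust
use std::collections::HashSet;

/// One recorded rename: (old_path, new_path).
type Rename = (&'static str, &'static str);

/// Resolve the bounded rename chain for a query path: repeatedly add the
/// other side of any rename touching the set. The hop cap bounds the
/// iterations; HashSet::insert returning false is the cycle check.
fn resolve_paths(start: &'static str, renames: &[Rename], max_hops: usize) -> HashSet<&'static str> {
    let mut paths: HashSet<&'static str> = HashSet::from([start]);
    for _ in 0..max_hops {
        let mut added = false;
        for &(old, new) in renames {
            if paths.contains(old) || paths.contains(new) {
                // Union both sides; insert() is a no-op for seen paths.
                added |= paths.insert(old);
                added |= paths.insert(new);
            }
        }
        if !added {
            break; // fixpoint reached before the hop cap
        }
    }
    paths
}

fn main() {
    let renames = [
        ("src/auth.rs", "src/auth/handler.rs"),
        ("src/auth/handler.rs", "src/auth/oauth.rs"),
    ];
    let set = resolve_paths("src/auth/oauth.rs", &renames, 10);
    assert!(set.contains("src/auth.rs") && set.contains("src/auth/handler.rs"));
    assert_eq!(set.len(), 3);
    assert_eq!(resolve_paths("src/auth/oauth.rs", &renames, 0).len(), 1); // no hops
    println!("ok");
}
```

The resulting path set feeds the `WHERE mfc.new_path IN (...) OR mfc.old_path IN (...)` variant of the main query.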
**Output:**
- Human mode annotates the rename chain: `"src/auth/oauth.rs (renamed from src/auth/handler.rs ← src/auth.rs)"`
- Robot mode JSON includes `rename_chain`: `["src/auth.rs", "src/auth/handler.rs", "src/auth/oauth.rs"]`
- `--no-follow-renames` disables chain resolution (matches only the literal path provided)
### 4.7 Acceptance Criteria
- [ ] `mr_file_changes` table populated from GitLab diffs API
- [ ] `merge_commit_sha` and `squash_commit_sha` captured in `merge_requests`
- [ ] `lore file-history <path>` returns MRs ordered by merge/creation date
- [ ] Output includes: MR title, state, author, change type, discussion count
- [ ] `--discussions` shows inline discussion snippets from DiffNotes on the file
- [ ] Rename chains resolved with bounded hop count (default 10) and cycle detection
- [ ] `--no-follow-renames` disables chain resolution
- [ ] Robot mode JSON includes `rename_chain` when renames are detected
- [ ] Robot mode JSON output
- [ ] `-p` required when path exists in multiple projects (Ambiguous error)
---
## Gate 5: Code Trace (`lore trace`)
### 5.1 Overview
`lore trace` answers "Why was this code introduced?" by tracing from a file (and optionally a line number) back through the MR and issue that motivated the change.
### 5.2 Two-Tier Architecture
**Tier 1 — API-only (no local git required):**
Uses `merge_commit_sha` and `squash_commit_sha` from the `merge_requests` table to link MRs to commits. Combined with `mr_file_changes`, this can answer "which MRs touched this file" and link to their motivating issues via `entity_references`.
This is equivalent to `lore file-history` enriched with issue context — effectively a file-scoped decision timeline.
**Tier 2 — Git integration (requires local clone):**
Uses `git blame` to map a specific line to a commit SHA, then resolves the commit to an MR via `merge_commit_sha` lookup. This provides line-level precision.
**Gate 5 ships Tier 1 only.** Tier 2 (git integration via `git2-rs`) is a future enhancement.
### 5.3 Command Design
```bash
# Trace a file's history (Tier 1: API-only)
lore trace src/auth/oauth.rs
# Trace a specific line (Tier 2: requires local git)
lore trace src/auth/oauth.rs:45
# Robot mode
lore -J trace src/auth/oauth.rs
```
### 5.4 Query Flow (Tier 1)
```
1. Find MRs that touched this file (mr_file_changes)
2. For each MR, find related issues (entity_references WHERE reference_type = 'closes')
3. For each issue, fetch discussions with rationale
4. Build trace chain: file → MR → issue → discussions
5. Order by merge date (most recent first)
```
### 5.5 Output Format (Human)
```
lore trace src/auth/oauth.rs
Trace: src/auth/oauth.rs
────────────────────────
!567 feat: add OAuth2 provider MERGED 2024-03-25
→ Closes #234: Migrate to OAuth2
→ 12 discussion comments, 4 on this file
→ Decision: Use rust-oauth2 crate (discussed in #234, comment by @alice)
!612 fix: token refresh race condition MERGED 2024-04-10
→ Closes #299: OAuth2 login fails for SSO users
→ 5 discussion comments, 2 on this file
→ [src/auth/oauth.rs:45] "Add mutex around refresh to prevent double-refresh"
!701 refactor: extract TokenManager MERGED 2024-05-01
→ Related: #312: Reduce auth module complexity
→ 3 discussion comments
→ Note: file was renamed from src/auth/handler.rs
```
### 5.6 Tier 2 Design Notes (Future — Not in This Phase)
When git integration is added:
1. Add `git2-rs` dependency for native git operations
2. Implement `git blame -L <line>,<line> <file>` to get commit SHA for a specific line
3. Look up commit SHA in `merge_requests.merge_commit_sha` or `merge_requests.squash_commit_sha`
4. If no match (commit was squashed), search `merge_commit_sha` for commits in the blame range
5. Optional `blame_cache` table for performance (invalidated by content hash)
**Known limitation:** Squash commits break blame-to-MR mapping for individual commits within an MR. The squash commit SHA maps to the MR, but all lines show the same commit. This is a fundamental Git limitation documented in [GitLab Forum #77146](https://forum.gitlab.com/t/preserve-blame-in-squash-merge/77146).
### 5.7 Acceptance Criteria (Tier 1 Only)
- [ ] `lore trace <file>` shows MRs that touched the file with linked issues and discussion context
- [ ] Output includes the MR → issue → discussion chain
- [ ] Discussion snippets show DiffNote content on the traced file
- [ ] Cross-references from `entity_references` used for MR→issue linking
- [ ] Robot mode JSON output
- [ ] Graceful handling when no MR data found ("Run `lore sync` with `fetchMrFileChanges: true`")
---
## Migration Strategy
### Migration Numbering
Phase B uses migration numbers starting at 010:
| Migration | Content | Gate |
|-----------|---------|------|
| 010 | Resource event tables, generic dependent fetch queue, entity_references | Gates 1, 2 |
| 011 | mr_file_changes, merge_commit_sha, squash_commit_sha | Gate 4 |
Phase A's complete field capture migration should use 012+ when implemented, skipping fields already added by 011 (`merge_commit_sha`, `squash_commit_sha`).
### Backward Compatibility
- All new tables are additive (no ALTER on existing data-bearing columns)
- `lore sync` works without event data — temporal commands gracefully report "No event data. Run `lore sync` to populate."
- Existing search, issues, mrs commands are unaffected
---
## Risks and Mitigations
### Identified During Premortem
| Risk | Severity | Mitigation |
|------|----------|------------|
| API call volume explosion (3 event calls per entity) | Medium | Incremental sync limits to changed entities; opt-in config flag |
| System note parsing fragile for non-English instances | Medium | Used only for assignee changes and cross-refs; `source_method` tracks provenance |
| GitLab diffs API returns large payloads | Low | Extract file metadata only, discard diff content |
| Cross-reference graph traversal unbounded | Medium | BFS depth capped at configurable limit (default 1); `mentioned` edges excluded by default |
| Cross-project references lost when target not synced | Medium | Unresolved references stored with `target_entity_id = NULL`; still appear in timeline output |
| Phase A migration numbering conflict | Low | Phase B uses 010-011; Phase A uses 012+ |
| Timeline output lacks "why" evidence | Medium | Evidence-bearing notes from FTS5 included as first-class timeline events |
| Squash commits break blame-to-MR mapping | Medium | Tier 2 (git integration) deferred; Tier 1 uses file-level MR matching |
### Accepted Limitations
- **No real-time monitoring.** Phase B is batch queries over historical data. "Notify me when my code changes" requires a different architecture (webhooks, polling daemon) and is out of scope.
- **No pattern evolution.** Cross-project trend detection requires all of Phase B's infrastructure plus semantic clustering. Deferred to Phase C.
- **English-only system note parsing.** Cross-reference extraction from system notes works reliably only for English-language GitLab instances. Structured API data works for all languages.
- **Bounded rename chain resolution.** `lore file-history` resolves rename chains up to 10 hops with cycle detection. Pathological rename histories (>10 hops) are truncated.
- **Evidence notes are keyword-matched, not summarized.** Timeline evidence notes are the raw FTS5-matched note text, not AI-generated summaries. This keeps the system deterministic and avoids LLM dependencies.
---
## Success Metrics
| Metric | Target |
|--------|--------|
| `lore timeline` query latency | < 200ms for typical queries (< 50 seed entities) |
| Timeline event coverage | State + label + creation + merge + evidence note events for all synced entities |
| Timeline evidence quality | Top 10 FTS5-matched notes included per query; at least 1 evidence note for queries matching discussion-bearing entities |
| Cross-reference coverage | > 80% of "closed by MR" relationships captured via structured API |
| Unresolved reference capture | Cross-project references stored even when target project is not synced |
| Incremental sync overhead | < 5% increase in sync time for event fetching |
| `lore file-history` coverage | File changes captured for all synced MRs (when opt-in enabled) |
| Rename chain resolution | Multi-hop renames correctly resolved up to 10 hops |
---
## Future Phases (Out of Scope)
### Phase C: Advanced Temporal Features
- Pattern Evolution: cross-project trend detection via embedding clusters
- Git integration (Tier 2): `git blame` → commit → MR resolution
- MCP server: expose `timeline`, `file-history`, `trace` as typed MCP tools
### Phase D: Consumer Applications
- Web UI: separate frontend consuming lore's JSON API via `lore serve`
- Real-time monitoring: webhook listener or polling daemon for change notifications
- IDE integration: editor plugins surfacing temporal context inline

View File

@@ -0,0 +1,14 @@
-- Migration 010: Chunk config tracking + adaptive dedup support
-- Schema version: 10
ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;
ALTER TABLE embedding_metadata ADD COLUMN chunk_count INTEGER;
-- Partial index: accelerates drift detection and adaptive dedup queries on sentinel rows
CREATE INDEX idx_embedding_metadata_sentinel
ON embedding_metadata(document_id, chunk_index)
WHERE chunk_index = 0;
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000,
'Add chunk_max_bytes and chunk_count to embedding_metadata');

phase-a-review.html Normal file

File diff suppressed because it is too large

View File

@@ -0,0 +1,36 @@
---
name: agent-swarm-launcher
description: "Launch a multi-agent “swarm” workflow for a repository: read and follow AGENTS.md/README.md, perform an architecture/codebase reconnaissance, then coordinate work via Agent Mail / beads-style task tracking when those tools are available. Use when you want to quickly bootstrap a coordinated agent workflow, avoid communication deadlocks, and start making progress on prioritized tasks."
---
# Agent Swarm Launcher
## Workflow (do in order)
1. Read *all* `AGENTS.md` and `README.md` files carefully and completely.
- If multiple `AGENTS.md` files exist, treat deeper ones as higher priority within their directory scope.
- Note any required workflows (e.g., TDD), tooling conventions, and “robot mode” flags.
2. Enter “code investigation” mode and understand the project.
- Identify entrypoints, key packages/modules, and how data flows.
- Note build/test commands and any local dev constraints.
- Summarize the technical architecture and purpose of the project.
3. Register with Agent Mail and coordinate, if available.
- If “MCP Agent Mail” exists in this environment, register and introduce yourself to the other agents.
- Check Agent Mail and promptly respond to any messages.
- If “beads” tracking is used by the repo/team, open/continue the current bead(s) and mark progress as you go.
- If Agent Mail/beads are not available, state that plainly and proceed with a lightweight local substitute (a short task checklist in the thread).
4. Start work (do not get stuck waiting).
- Acknowledge incoming requests promptly.
- Do not get stuck in “communication purgatory” where nothing is getting done.
- If you are blocked on prioritization, look for a prioritization tool mentioned in `AGENTS.md` (for example “bv”) and use it; otherwise propose the next best task(s) and proceed.
- If `AGENTS.md` references a task system (e.g., beads), pick the next task you can complete usefully and start.
## Execution rules
- Follow repository instructions over this skill if they conflict.
- Prefer action + short status updates over prolonged coordination.
- If a referenced tool does not exist, do not hallucinate it—fall back and keep moving.
- Do not claim you registered with or heard from other agents unless you actually did via the available tooling.

View File

@@ -0,0 +1,4 @@
interface:
display_name: "Agent Swarm Launcher"
short_description: "Kick off multi-agent repo onboarding"
default_prompt: "Use $agent-swarm-launcher to onboard, coordinate, and start the next prioritized task."

View File

@@ -21,6 +21,7 @@ pub struct EmbedCommandResult {
/// Run the embed command.
pub async fn run_embed(
config: &Config,
full: bool,
retry_failed: bool,
) -> Result<EmbedCommandResult> {
let db_path = get_db_path(config.storage.db_path.as_deref());
@@ -37,8 +38,18 @@ pub async fn run_embed(
// Health check — fail fast if Ollama is down or model missing
client.health_check().await?;
// If retry_failed, clear errors so they become pending again
if retry_failed {
if full {
// Clear ALL embeddings and metadata atomically for a complete re-embed.
// Wrapped in a transaction so a crash between the two DELETEs can't
// leave orphaned data.
conn.execute_batch(
"BEGIN;
DELETE FROM embedding_metadata;
DELETE FROM embeddings;
COMMIT;",
)?;
} else if retry_failed {
// Clear errors so they become pending again
conn.execute(
"UPDATE embedding_metadata SET last_error = NULL, attempt_count = 0
WHERE last_error IS NOT NULL",

View File

@@ -1,6 +1,7 @@
//! Sync command: unified orchestrator for ingest -> generate-docs -> embed.
use console::style;
use indicatif::{ProgressBar, ProgressStyle};
use serde::Serialize;
use tracing::{info, warn};
@@ -31,6 +32,22 @@ pub struct SyncResult {
pub documents_embedded: usize,
}
/// Create a styled spinner for a sync stage.
fn stage_spinner(stage: u8, total: u8, msg: &str, robot_mode: bool) -> ProgressBar {
if robot_mode {
return ProgressBar::hidden();
}
let pb = ProgressBar::new_spinner();
pb.set_style(
ProgressStyle::default_spinner()
.template("{spinner:.blue} {msg}")
.expect("valid template"),
);
pb.enable_steady_tick(std::time::Duration::from_millis(80));
pb.set_message(format!("[{stage}/{total}] {msg}"));
pb
}
/// Run the full sync pipeline: ingest -> generate-docs -> embed.
pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResult> {
let mut result = SyncResult::default();
@@ -41,41 +58,70 @@ pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResul
IngestDisplay::progress_only()
};
let total_stages: u8 = if options.no_docs && options.no_embed {
2
} else if options.no_docs || options.no_embed {
3
} else {
4
};
let mut current_stage: u8 = 0;
// Stage 1: Ingest issues
info!("Sync stage 1/4: ingesting issues");
current_stage += 1;
let spinner = stage_spinner(current_stage, total_stages, "Fetching issues from GitLab...", options.robot_mode);
info!("Sync stage {current_stage}/{total_stages}: ingesting issues");
let issues_result = run_ingest(config, "issues", None, options.force, options.full, ingest_display).await?;
result.issues_updated = issues_result.issues_upserted;
result.discussions_fetched += issues_result.discussions_fetched;
spinner.finish_and_clear();
// Stage 2: Ingest MRs
info!("Sync stage 2/4: ingesting merge requests");
current_stage += 1;
let spinner = stage_spinner(current_stage, total_stages, "Fetching merge requests from GitLab...", options.robot_mode);
info!("Sync stage {current_stage}/{total_stages}: ingesting merge requests");
let mrs_result = run_ingest(config, "mrs", None, options.force, options.full, ingest_display).await?;
result.mrs_updated = mrs_result.mrs_upserted;
result.discussions_fetched += mrs_result.discussions_fetched;
spinner.finish_and_clear();
// Stage 3: Generate documents (unless --no-docs)
if options.no_docs {
info!("Sync stage 3/4: skipping document generation (--no-docs)");
} else {
info!("Sync stage 3/4: generating documents");
if !options.no_docs {
current_stage += 1;
let spinner = stage_spinner(current_stage, total_stages, "Processing documents...", options.robot_mode);
info!("Sync stage {current_stage}/{total_stages}: generating documents");
let docs_result = run_generate_docs(config, false, None)?;
result.documents_regenerated = docs_result.regenerated;
spinner.finish_and_clear();
} else {
info!("Sync: skipping document generation (--no-docs)");
}
// Stage 4: Embed documents (unless --no-embed)
if options.no_embed {
info!("Sync stage 4/4: skipping embedding (--no-embed)");
} else {
info!("Sync stage 4/4: embedding documents");
match run_embed(config, false).await {
if !options.no_embed {
current_stage += 1;
let spinner = stage_spinner(current_stage, total_stages, "Generating embeddings...", options.robot_mode);
info!("Sync stage {current_stage}/{total_stages}: embedding documents");
match run_embed(config, options.full, false).await {
Ok(embed_result) => {
result.documents_embedded = embed_result.embedded;
spinner.finish_and_clear();
}
Err(e) => {
// Graceful degradation: Ollama down is a warning, not an error
spinner.finish_and_clear();
if !options.robot_mode {
eprintln!(
" {} Embedding skipped ({})",
style("warn").yellow(),
e
);
}
warn!(error = %e, "Embedding stage failed (Ollama may be unavailable), continuing");
}
}
} else {
info!("Sync: skipping embedding (--no-embed)");
}
info!(

View File

@@ -483,6 +483,13 @@ pub struct SyncArgs {
/// Arguments for `lore embed`
#[derive(Parser)]
pub struct EmbedArgs {
/// Re-embed all documents (clears existing embeddings first)
#[arg(long, overrides_with = "no_full")]
pub full: bool,
#[arg(long = "no-full", hide = true, overrides_with = "full")]
pub no_full: bool,
/// Retry previously failed embeddings
#[arg(long, overrides_with = "no_retry_failed")]
pub retry_failed: bool,

View File

@@ -10,6 +10,10 @@ use tracing::{debug, info};
use super::error::{LoreError, Result};
/// Latest schema version, derived from the embedded migrations count.
/// Used by the health check to verify databases are up-to-date.
pub const LATEST_SCHEMA_VERSION: i32 = MIGRATIONS.len() as i32;
/// Embedded migrations - compiled into the binary.
const MIGRATIONS: &[(&str, &str)] = &[
("001", include_str!("../../migrations/001_initial.sql")),
@@ -39,6 +43,10 @@ const MIGRATIONS: &[(&str, &str)] = &[
"009",
include_str!("../../migrations/009_embeddings.sql"),
),
(
"010",
include_str!("../../migrations/010_chunk_config.sql"),
),
];
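Deriving `LATEST_SCHEMA_VERSION` from the array length is what lets the health check later in this commit drop its hardcoded `let latest = 9`. A standalone sketch of the pattern, with placeholder SQL strings standing in for the `include_str!` contents:

```rust
// Each entry pairs a version label with its migration SQL. The real code pulls
// the SQL in via include_str!; placeholder strings keep this sketch self-contained.
const MIGRATIONS: &[(&str, &str)] = &[
    ("001", "CREATE TABLE documents (id INTEGER PRIMARY KEY);"),
    ("002", "ALTER TABLE documents ADD COLUMN content_text TEXT;"),
    ("003", "ALTER TABLE documents ADD COLUMN content_hash TEXT;"),
];

// Adding a migration automatically bumps the version the health check expects,
// so the constant can never drift out of sync with the migration list.
const LATEST_SCHEMA_VERSION: i32 = MIGRATIONS.len() as i32;

fn main() {
    assert_eq!(LATEST_SCHEMA_VERSION, 3);
    println!("{LATEST_SCHEMA_VERSION}");
}
```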
/// Create a database connection with production-grade pragmas.

View File

@@ -3,6 +3,7 @@
use rusqlite::Connection;
use crate::core::error::Result;
use crate::embedding::chunking::{CHUNK_MAX_BYTES, EXPECTED_DIMS};
/// A document that needs embedding or re-embedding.
#[derive(Debug)]
@@ -12,17 +13,20 @@ pub struct PendingDocument {
pub content_hash: String,
}
/// Find documents that need embedding: new (no metadata) or changed (hash mismatch).
/// Find documents that need embedding: new (no metadata), changed (hash mismatch),
/// or config-drifted (chunk_max_bytes/model/dims mismatch).
///
/// Uses keyset pagination (WHERE d.id > last_id) and returns up to `page_size` results.
pub fn find_pending_documents(
conn: &Connection,
page_size: usize,
last_id: i64,
model_name: &str,
) -> Result<Vec<PendingDocument>> {
// Documents that either:
// 1. Have no embedding_metadata at all (new)
// 2. Have metadata where document_hash != content_hash (changed)
// 3. Config drift: chunk_max_bytes, model, or dims mismatch (or pre-migration NULL)
let sql = r#"
SELECT d.id, d.content_text, d.content_hash
FROM documents d
@@ -37,6 +41,16 @@ pub fn find_pending_documents(
WHERE em.document_id = d.id AND em.chunk_index = 0
AND em.document_hash != d.content_hash
)
OR EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
AND (
em.chunk_max_bytes IS NULL
OR em.chunk_max_bytes != ?3
OR em.model != ?4
OR em.dims != ?5
)
)
)
ORDER BY d.id
LIMIT ?2
@@ -44,25 +58,35 @@ pub fn find_pending_documents(
let mut stmt = conn.prepare(sql)?;
let rows = stmt
.query_map(rusqlite::params![last_id, page_size as i64], |row| {
.query_map(
rusqlite::params![
last_id,
page_size as i64,
CHUNK_MAX_BYTES as i64,
model_name,
EXPECTED_DIMS as i64,
],
|row| {
Ok(PendingDocument {
document_id: row.get(0)?,
content_text: row.get(1)?,
content_hash: row.get(2)?,
})
})?
},
)?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(rows)
}
/// Count total documents that need embedding.
pub fn count_pending_documents(conn: &Connection) -> Result<i64> {
pub fn count_pending_documents(conn: &Connection, model_name: &str) -> Result<i64> {
let count: i64 = conn.query_row(
r#"
SELECT COUNT(*)
FROM documents d
WHERE NOT EXISTS (
WHERE (
NOT EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
)
@@ -71,8 +95,19 @@ pub fn count_pending_documents(conn: &Connection) -> Result<i64> {
WHERE em.document_id = d.id AND em.chunk_index = 0
AND em.document_hash != d.content_hash
)
OR EXISTS (
SELECT 1 FROM embedding_metadata em
WHERE em.document_id = d.id AND em.chunk_index = 0
AND (
em.chunk_max_bytes IS NULL
OR em.chunk_max_bytes != ?1
OR em.model != ?2
OR em.dims != ?3
)
)
)
"#,
[],
rusqlite::params![CHUNK_MAX_BYTES as i64, model_name, EXPECTED_DIMS as i64],
|row| row.get(0),
)?;
Ok(count)

View File

@@ -1,5 +1,7 @@
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document (32M chars at 32k/chunk).
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
/// The pipeline enforces chunk_count < CHUNK_ROWID_MULTIPLIER at runtime.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
/// Encode (document_id, chunk_index) into a sqlite-vec rowid.

View File

@@ -2,11 +2,19 @@
/// Maximum bytes per chunk.
/// Named `_BYTES` because `str::len()` returns byte count; multi-byte UTF-8
/// sequences mean byte length ≥ char count.
pub const CHUNK_MAX_BYTES: usize = 32_000;
/// sequences mean byte length >= char count.
///
/// nomic-embed-text has an 8,192-token context window. English prose averages
/// ~4 chars/token, but technical content (code, URLs, JSON) can be 1-2
/// chars/token. We use 6,000 bytes as a conservative limit that stays safe
/// even for code-heavy chunks (~6,000 tokens worst-case).
pub const CHUNK_MAX_BYTES: usize = 6_000;
/// Expected embedding dimensions for nomic-embed-text.
pub const EXPECTED_DIMS: usize = 768;
/// Character overlap between adjacent chunks.
pub const CHUNK_OVERLAP_CHARS: usize = 500;
pub const CHUNK_OVERLAP_CHARS: usize = 200;
/// Split document content into chunks suitable for embedding.
///

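The body of `split_into_chunks` is elided from the diff. A minimal sketch of byte-capped splitting with character overlap, consistent with the constants above (the greedy strategy is an assumption; the real implementation may prefer natural boundaries such as paragraph breaks):

```rust
const CHUNK_MAX_BYTES: usize = 6_000;
const CHUNK_OVERLAP_CHARS: usize = 200;

/// Greedy split: each chunk holds at most CHUNK_MAX_BYTES bytes, and adjacent
/// chunks share CHUNK_OVERLAP_CHARS characters of trailing context.
fn split_into_chunks(text: &str) -> Vec<(usize, String)> {
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        // Grow the chunk until adding the next char would exceed the byte cap.
        let mut end = start;
        let mut bytes = 0;
        while end < chars.len() {
            let len = chars[end].len_utf8();
            if bytes + len > CHUNK_MAX_BYTES {
                break;
            }
            bytes += len;
            end += 1;
        }
        chunks.push((chunks.len(), chars[start..end].iter().collect()));
        if end == chars.len() {
            break;
        }
        // Step back by the overlap; max(start + 1) guarantees forward progress.
        start = end.saturating_sub(CHUNK_OVERLAP_CHARS).max(start + 1);
    }
    chunks
}

fn main() {
    let doc = "a".repeat(13_000);
    let chunks = split_into_chunks(&doc);
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].1.len(), 6_000);
    assert_eq!(chunks[1].1.len(), 6_000);
    assert_eq!(chunks[2].1.len(), 1_400); // tail after two 200-char overlaps
    println!("ok");
}
```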
View File

@@ -1,18 +1,19 @@
//! Async embedding pipeline: chunk documents, embed via Ollama, store in sqlite-vec.
use std::collections::HashSet;
use rusqlite::Connection;
use sha2::{Digest, Sha256};
use tracing::{info, warn};
use crate::core::error::Result;
use crate::embedding::change_detector::{count_pending_documents, find_pending_documents};
use crate::embedding::chunk_ids::encode_rowid;
use crate::embedding::chunking::split_into_chunks;
use crate::embedding::chunk_ids::{encode_rowid, CHUNK_ROWID_MULTIPLIER};
use crate::embedding::chunking::{split_into_chunks, CHUNK_MAX_BYTES, EXPECTED_DIMS};
use crate::embedding::ollama::OllamaClient;
const BATCH_SIZE: usize = 32;
const DB_PAGE_SIZE: usize = 500;
const EXPECTED_DIMS: usize = 768;
/// Result of an embedding run.
#[derive(Debug, Default)]
@@ -26,6 +27,7 @@ pub struct EmbedResult {
struct ChunkWork {
doc_id: i64,
chunk_index: usize,
total_chunks: usize,
doc_hash: String,
chunk_hash: String,
text: String,
@@ -41,7 +43,7 @@ pub async fn embed_documents(
model_name: &str,
progress_callback: Option<Box<dyn Fn(usize, usize)>>,
) -> Result<EmbedResult> {
let total = count_pending_documents(conn)? as usize;
let total = count_pending_documents(conn, model_name)? as usize;
let mut result = EmbedResult::default();
let mut last_id: i64 = 0;
let mut processed: usize = 0;
@@ -53,13 +55,21 @@ pub async fn embed_documents(
info!(total, "Starting embedding pipeline");
loop {
let pending = find_pending_documents(conn, DB_PAGE_SIZE, last_id)?;
let pending = find_pending_documents(conn, DB_PAGE_SIZE, last_id, model_name)?;
if pending.is_empty() {
break;
}
// Wrap all DB writes for this page in a savepoint so that
// clear_document_embeddings + store_embedding are atomic. If the
// process crashes mid-page, the savepoint is never released and
// SQLite rolls back — preventing partial document states where old
// embeddings are cleared but new ones haven't been written yet.
conn.execute_batch("SAVEPOINT embed_page")?;
// Build chunk work items for this page
let mut all_chunks: Vec<ChunkWork> = Vec::new();
let mut page_normal_docs: usize = 0;
for doc in &pending {
// Always advance the cursor, even for skipped docs, to avoid re-fetching
@@ -71,27 +81,65 @@ pub async fn embed_documents(
continue;
}
// Clear existing embeddings for this document before re-embedding
clear_document_embeddings(conn, doc.document_id)?;
let chunks = split_into_chunks(&doc.content_text);
let total_chunks = chunks.len();
// Overflow guard: skip documents that produce too many chunks.
// Must run BEFORE clear_document_embeddings so existing embeddings
// are preserved when we skip.
if total_chunks as i64 >= CHUNK_ROWID_MULTIPLIER {
warn!(
doc_id = doc.document_id,
chunk_count = total_chunks,
max = CHUNK_ROWID_MULTIPLIER,
"Document produces too many chunks, skipping to prevent rowid collision"
);
// Record a sentinel error so the document is not re-detected as
// pending on subsequent runs (prevents infinite re-processing).
record_embedding_error(
conn,
doc.document_id,
0, // sentinel chunk_index
&doc.content_hash,
"overflow-sentinel",
model_name,
&format!(
"Document produces {} chunks, exceeding max {}",
total_chunks, CHUNK_ROWID_MULTIPLIER
),
)?;
result.skipped += 1;
processed += 1;
if let Some(ref cb) = progress_callback {
cb(processed, total);
}
continue;
}
// Don't clear existing embeddings here — defer until the first
// successful chunk embedding so that if ALL chunks for a document
// fail, old embeddings survive instead of leaving zero data.
for (chunk_index, text) in chunks {
all_chunks.push(ChunkWork {
doc_id: doc.document_id,
chunk_index,
total_chunks,
doc_hash: doc.content_hash.clone(),
chunk_hash: sha256_hash(&text),
text,
});
}
// Track progress per document (not per chunk) to match `total`
processed += 1;
if let Some(ref cb) = progress_callback {
cb(processed, total);
}
page_normal_docs += 1;
// Don't fire progress here — wait until embedding completes below.
}
// Track documents whose old embeddings have been cleared.
// We defer clearing until the first successful chunk embedding so
// that if ALL chunks for a document fail, old embeddings survive.
let mut cleared_docs: HashSet<i64> = HashSet::new();
// Process chunks in batches of BATCH_SIZE
for batch in all_chunks.chunks(BATCH_SIZE) {
let texts: Vec<String> = batch.iter().map(|c| c.text.clone()).collect();
@@ -129,6 +177,12 @@ pub async fn embed_documents(
continue;
}
// Clear old embeddings on first successful chunk for this document
if !cleared_docs.contains(&chunk.doc_id) {
clear_document_embeddings(conn, chunk.doc_id)?;
cleared_docs.insert(chunk.doc_id);
}
store_embedding(
conn,
chunk.doc_id,
@@ -137,11 +191,71 @@ pub async fn embed_documents(
&chunk.chunk_hash,
model_name,
embedding,
chunk.total_chunks,
)?;
result.embedded += 1;
}
}
Err(e) => {
// Batch failed — retry each chunk individually so one
// oversized chunk doesn't poison the entire batch.
let err_str = e.to_string();
let err_lower = err_str.to_lowercase();
// Ollama error messages vary across versions. Match broadly
// against known patterns to detect context-window overflow.
let is_context_error = err_lower.contains("context length")
|| err_lower.contains("too long")
|| err_lower.contains("maximum context")
|| err_lower.contains("token limit")
|| err_lower.contains("exceeds")
|| (err_lower.contains("413") && err_lower.contains("http"));
if is_context_error && batch.len() > 1 {
warn!("Batch failed with context length error, retrying chunks individually");
for chunk in batch {
match client.embed_batch(vec![chunk.text.clone()]).await {
Ok(embeddings) if !embeddings.is_empty()
&& embeddings[0].len() == EXPECTED_DIMS =>
{
// Clear old embeddings on first successful chunk
if !cleared_docs.contains(&chunk.doc_id) {
clear_document_embeddings(conn, chunk.doc_id)?;
cleared_docs.insert(chunk.doc_id);
}
store_embedding(
conn,
chunk.doc_id,
chunk.chunk_index,
&chunk.doc_hash,
&chunk.chunk_hash,
model_name,
&embeddings[0],
chunk.total_chunks,
)?;
result.embedded += 1;
}
_ => {
warn!(
doc_id = chunk.doc_id,
chunk_index = chunk.chunk_index,
chunk_bytes = chunk.text.len(),
"Chunk too large for model context window"
);
record_embedding_error(
conn,
chunk.doc_id,
chunk.chunk_index,
&chunk.doc_hash,
&chunk.chunk_hash,
model_name,
"Chunk exceeds model context window",
)?;
result.failed += 1;
}
}
}
} else {
warn!(error = %e, "Batch embedding failed");
for chunk in batch {
record_embedding_error(
@@ -157,8 +271,19 @@ pub async fn embed_documents(
}
}
}
}
}
// Fire progress for all normal documents after embedding completes.
// This ensures progress reflects actual embedding work, not just chunking.
processed += page_normal_docs;
if let Some(ref cb) = progress_callback {
cb(processed, total);
}
// Commit all DB writes for this page atomically.
conn.execute_batch("RELEASE embed_page")?;
}
info!(
@@ -197,6 +322,7 @@ fn store_embedding(
chunk_hash: &str,
model_name: &str,
embedding: &[f32],
total_chunks: usize,
) -> Result<()> {
let rowid = encode_rowid(doc_id, chunk_index as i64);
@@ -207,13 +333,23 @@ fn store_embedding(
rusqlite::params![rowid, embedding_bytes],
)?;
// Only store chunk_count on the sentinel row (chunk_index=0)
let chunk_count: Option<i64> = if chunk_index == 0 {
Some(total_chunks as i64)
} else {
None
};
let now = chrono::Utc::now().timestamp_millis();
conn.execute(
"INSERT OR REPLACE INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL)",
rusqlite::params![doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64, doc_hash, chunk_hash, now],
created_at, attempt_count, last_error, chunk_max_bytes, chunk_count)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8, ?9)",
rusqlite::params![
doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64, chunk_count
],
)?;
Ok(())
@@ -233,13 +369,17 @@ fn record_embedding_error(
conn.execute(
"INSERT INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error, last_attempt_at)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7)
created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
ON CONFLICT(document_id, chunk_index) DO UPDATE SET
attempt_count = embedding_metadata.attempt_count + 1,
last_error = ?8,
last_attempt_at = ?7",
rusqlite::params![doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64, doc_hash, chunk_hash, now, error],
last_attempt_at = ?7,
chunk_max_bytes = ?9",
rusqlite::params![
doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
],
)?;
Ok(())
}

View File

@@ -26,7 +26,7 @@ use lore::cli::{
Cli, Commands, CountArgs, EmbedArgs, GenerateDocsArgs, IngestArgs, IssuesArgs, MrsArgs,
SearchArgs, StatsArgs, SyncArgs,
};
use lore::core::db::{create_connection, get_schema_version, run_migrations};
use lore::core::db::{create_connection, get_schema_version, run_migrations, LATEST_SCHEMA_VERSION};
use lore::core::error::{LoreError, RobotErrorOutput};
use lore::core::paths::get_config_path;
use lore::core::paths::get_db_path;
@@ -1112,8 +1112,9 @@ async fn handle_embed(
robot_mode: bool,
) -> Result<(), Box<dyn std::error::Error>> {
let config = Config::load(config_override)?;
let full = args.full && !args.no_full;
let retry_failed = args.retry_failed && !args.no_retry_failed;
let result = run_embed(&config, retry_failed).await?;
let result = run_embed(&config, full, retry_failed).await?;
if robot_mode {
print_embed_json(&result);
} else {
@@ -1183,8 +1184,7 @@ async fn handle_health(
match create_connection(&db_path) {
Ok(conn) => {
let version = get_schema_version(&conn);
let latest = 9; // Number of embedded migrations
(true, version, version >= latest)
(true, version, version >= LATEST_SCHEMA_VERSION)
}
Err(_) => (true, 0, false),
}
@@ -1340,7 +1340,7 @@ fn handle_robot_docs(robot_mode: bool) -> Result<(), Box<dyn std::error::Error>>
},
"embed": {
"description": "Generate vector embeddings for documents via Ollama",
"flags": ["--retry-failed"],
"flags": ["--full", "--retry-failed"],
"example": "lore --robot embed"
},
"migrate": {

View File

@@ -12,10 +12,39 @@ pub struct VectorResult {
pub distance: f64,
}
/// Query the maximum number of chunks per document for adaptive dedup sizing.
fn max_chunks_per_document(conn: &Connection) -> i64 {
// Fast path: stored chunk_count on sentinel rows (post-migration 010)
let stored: Option<i64> = conn
.query_row(
"SELECT MAX(chunk_count) FROM embedding_metadata
WHERE chunk_index = 0 AND chunk_count IS NOT NULL",
[],
|row| row.get(0),
)
.unwrap_or(None);
if let Some(max) = stored {
return max;
}
// Fallback for pre-migration data: count chunks per document
conn.query_row(
"SELECT COALESCE(MAX(cnt), 1) FROM (
SELECT COUNT(*) as cnt FROM embedding_metadata
WHERE last_error IS NULL GROUP BY document_id
)",
[],
|row| row.get(0),
)
.unwrap_or(1)
}
/// Search documents using sqlite-vec KNN query.
///
/// Over-fetches 3x limit to handle chunk deduplication (multiple chunks per
/// document produce multiple KNN results for the same document_id).
/// Over-fetches by an adaptive multiplier based on actual max chunks per document
/// to handle chunk deduplication (multiple chunks per document produce multiple
/// KNN results for the same document_id).
/// Returns deduplicated results with best (lowest) distance per document.
pub fn search_vector(
conn: &Connection,
@@ -32,7 +61,9 @@ pub fn search_vector(
.flat_map(|f| f.to_le_bytes())
.collect();
let k = limit * 3; // Over-fetch for dedup
let max_chunks = max_chunks_per_document(conn);
let multiplier = ((max_chunks as usize * 3 / 2) + 1).max(8);
let k = limit * multiplier;
let mut stmt = conn.prepare(
"SELECT rowid, distance
@@ -69,7 +100,7 @@ pub fn search_vector(
distance,
})
.collect();
results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap_or(std::cmp::Ordering::Equal));
results.sort_by(|a, b| a.distance.total_cmp(&b.distance));
results.truncate(limit);
Ok(results)
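The adaptive over-fetch factor is plain integer arithmetic: 1.5x the maximum observed chunks per document, plus one, floored at 8. Extracted as a hypothetical helper (the real code computes it inline in `search_vector`):

```rust
/// KNN over-fetch factor: enough to survive dedup even when every nearby
/// result chunk belongs to the same heavily-chunked document. The floor of 8
/// keeps behavior reasonable on sparse or freshly migrated databases.
fn dedup_multiplier(max_chunks: i64) -> usize {
    ((max_chunks as usize * 3 / 2) + 1).max(8)
}

fn main() {
    assert_eq!(dedup_multiplier(1), 8);     // floor dominates for small docs
    assert_eq!(dedup_multiplier(10), 16);   // 10 * 3/2 + 1
    assert_eq!(dedup_multiplier(333), 500); // a 2MB doc at 6000-byte chunks
    let limit = 10;
    assert_eq!(limit * dedup_multiplier(333), 5_000); // k passed to sqlite-vec
    println!("ok");
}
```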
@@ -132,7 +163,7 @@ mod tests {
.into_iter()
.map(|(document_id, distance)| VectorResult { document_id, distance })
.collect();
results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap_or(std::cmp::Ordering::Equal));
results.sort_by(|a, b| a.distance.total_cmp(&b.distance));
results.truncate(limit);
results
}

View File

@@ -1,7 +1,7 @@
//! Integration tests for embedding storage and vector search.
//!
//! These tests create an in-memory SQLite database with sqlite-vec loaded,
//! apply all migrations through 009 (embeddings), and verify KNN search
//! apply all migrations through 010 (chunk config), and verify KNN search
//! and metadata operations.
use lore::core::db::create_connection;
@@ -18,7 +18,7 @@ fn create_test_db() -> (TempDir, Connection) {
let migrations_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("migrations");
for version in 1..=9 {
for version in 1..=10 {
let entries: Vec<_> = std::fs::read_dir(&migrations_dir)
.unwrap()
.filter_map(|e| e.ok())
@@ -181,3 +181,122 @@ fn empty_database_returns_no_results() {
let results = lore::search::search_vector(&conn, &axis_vector(0), 10).unwrap();
assert!(results.is_empty(), "Empty DB should return no results");
}
// --- Bug-fix regression tests ---
#[test]
fn overflow_doc_with_error_sentinel_not_re_detected_as_pending() {
// Bug 2: Documents skipped for chunk overflow must record a sentinel error
// in embedding_metadata so they are not re-detected as pending on subsequent
// pipeline runs (which would cause an infinite re-processing loop).
let (_tmp, conn) = create_test_db();
insert_document(&conn, 1, "Overflow doc", "Some content");
// Simulate what the pipeline does when a document exceeds CHUNK_ROWID_MULTIPLIER:
// it records an error sentinel at chunk_index=0.
let now = chrono::Utc::now().timestamp_millis();
conn.execute(
"INSERT INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
VALUES (1, 0, 'nomic-embed-text', 768, 'hash_1', 'overflow-sentinel', ?1, 1, 'Document produces too many chunks', ?1, ?2)",
rusqlite::params![now, lore::embedding::CHUNK_MAX_BYTES as i64],
)
.unwrap();
// Now find_pending_documents should NOT return this document
let pending = lore::embedding::find_pending_documents(&conn, 100, 0, "nomic-embed-text").unwrap();
assert!(
pending.is_empty(),
"Document with overflow error sentinel should not be re-detected as pending, got {} pending",
pending.len()
);
// count_pending_documents should also return 0
let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
assert_eq!(count, 0, "Count should be 0 for document with overflow sentinel");
}
#[test]
fn count_and_find_pending_agree() {
// Bug 1: count_pending_documents and find_pending_documents must use
// logically equivalent WHERE clauses to produce consistent results.
let (_tmp, conn) = create_test_db();
// Case 1: No documents at all
let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
assert_eq!(count as usize, found.len(), "Empty DB: count and find should agree");
// Case 2: New document (no metadata)
insert_document(&conn, 1, "New doc", "Content");
let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
assert_eq!(count as usize, found.len(), "New doc: count and find should agree");
assert_eq!(count, 1);
// Case 3: Document with matching metadata (not pending)
let now = chrono::Utc::now().timestamp_millis();
conn.execute(
"INSERT INTO embedding_metadata
(document_id, chunk_index, model, dims, document_hash, chunk_hash,
created_at, attempt_count, chunk_max_bytes)
VALUES (1, 0, 'nomic-embed-text', 768, 'hash_1', 'ch', ?1, 1, ?2)",
rusqlite::params![now, lore::embedding::CHUNK_MAX_BYTES as i64],
)
.unwrap();
let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
assert_eq!(count as usize, found.len(), "Complete doc: count and find should agree");
assert_eq!(count, 0);
// Case 4: Config drift (chunk_max_bytes mismatch)
conn.execute(
"UPDATE embedding_metadata SET chunk_max_bytes = 999 WHERE document_id = 1",
[],
)
.unwrap();
let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
assert_eq!(count as usize, found.len(), "Config drift: count and find should agree");
assert_eq!(count, 1);
}
#[test]
fn full_embed_delete_is_atomic() {
// Bug 7: The --full flag's two DELETE statements should be atomic.
// This test verifies that both tables are cleared together.
let (_tmp, conn) = create_test_db();
insert_document(&conn, 1, "Doc", "Content");
insert_embedding(&conn, 1, 0, &axis_vector(0));
// Verify data exists
let meta_count: i64 = conn
.query_row("SELECT COUNT(*) FROM embedding_metadata", [], |r| r.get(0))
.unwrap();
let embed_count: i64 = conn
.query_row("SELECT COUNT(*) FROM embeddings", [], |r| r.get(0))
.unwrap();
assert_eq!(meta_count, 1);
assert_eq!(embed_count, 1);
// Execute the atomic delete (same as embed.rs --full)
conn.execute_batch(
"BEGIN;
DELETE FROM embedding_metadata;
DELETE FROM embeddings;
COMMIT;",
)
.unwrap();
let meta_count: i64 = conn
.query_row("SELECT COUNT(*) FROM embedding_metadata", [], |r| r.get(0))
.unwrap();
let embed_count: i64 = conn
.query_row("SELECT COUNT(*) FROM embeddings", [], |r| r.get(0))
.unwrap();
assert_eq!(meta_count, 0, "Metadata should be cleared");
assert_eq!(embed_count, 0, "Embeddings should be cleared");
}