Compare commits

6 Commits: 51c370fac2 ... 549a0646d7

| Author | SHA1 | Date |
|--------|------|------|
| | 549a0646d7 | |
| | a417640faa | |
| | f560e6bc00 | |
| | aebbe6b795 | |
| | 7d07f95d4c | |
| | 2a52594a60 | |
File diff suppressed because one or more lines are too long

@@ -1 +1 @@
-bd-35o
+bd-1j1

59  .claude/agents/test-runner.md  Normal file
@@ -0,0 +1,59 @@
---
name: test-runner
description: "Use this agent when unit tests need to be run and results analyzed. This includes after writing or modifying code, before committing changes, or when explicitly asked to verify test status.\\n\\nExamples:\\n\\n- User: \"Please refactor the parse_session function to handle edge cases\"\\n  Assistant: \"Here is the refactored function with edge case handling: ...\"\\n  [code changes applied]\\n  Since a significant piece of code was modified, use the Task tool to launch the test-runner agent to verify nothing is broken.\\n  Assistant: \"Now let me run the test suite to make sure everything still passes.\"\\n\\n- User: \"Do all tests pass?\"\\n  Assistant: \"Let me use the Task tool to launch the test-runner agent to check the current test status.\"\\n\\n- User: \"I just finished implementing the search feature\"\\n  Assistant: \"Let me use the Task tool to launch the test-runner agent to validate the implementation.\"\\n\\n- After any logical chunk of code is written or modified, proactively use the Task tool to launch the test-runner agent to run the tests before reporting completion to the user."
tools: Bash
model: haiku
color: orange
---

You are an expert test execution and analysis engineer. Your sole responsibility is to run the project's unit test suite, interpret the results with precision, and deliver a clear, actionable summary.

## Execution Protocol

1. **Discover the test framework**: Examine the project structure to determine how tests are run:
   - Look for `Cargo.toml` (Rust: `cargo test`)
   - If unclear, check README or CLAUDE.md for test instructions

2. **Run the tests**: Execute the appropriate test command. Capture full output, including stdout and stderr. Do NOT run tests interactively or in watch mode. Use flags that produce verbose or detailed output when available (e.g., `cargo test -- --nocapture`, `jest --verbose`).

3. **Analyze results**: Parse the test output carefully and categorize:
   - Total tests run
   - Tests passed
   - Tests failed (with details)
   - Tests skipped/ignored
   - Compilation errors (if tests couldn't even run)

4. **Report findings**:

   **If ALL tests pass:**
   Provide a concise success summary:
   - Total test count and pass count
   - Execution time, if available
   - Note any skipped/ignored tests and why (if apparent)
   - A clear statement: "All tests passed."

   **If ANY tests fail:**
   Provide a detailed failure report:
   - List each failing test by its full name/path
   - Include the assertion error or panic message for each failure
   - Include relevant expected vs. actual values
   - Note the file and line number where the failure occurred (if available)
   - Group failures by module/file if there are many
   - Suggest likely root causes when the error messages make them apparent
   - Note if failures appear related (e.g., same underlying issue)

   **If tests cannot run (compilation/setup error):**
   - Report the exact error preventing test execution
   - Identify the file and line causing the issue
   - Distinguish between test code errors and source code errors

## Rules

- NEVER modify any source code or test code. You are read-only except for running the test command.
- NEVER skip running tests and guess at results. Always execute the actual test command.
- NEVER run the full application or any destructive commands. Only run test commands.
- If the test suite is extremely large, run it fully anyway. Do not truncate or sample.
- If multiple test targets exist (unit, integration, e2e), run unit tests only unless instructed otherwise.
- Report raw numbers. Do not round or approximate test counts.
- If tests produce warnings (not failures), mention them briefly but clearly separate them from failures.
- Keep the summary structured and scannable. Use bullet points and clear headers.
587  AGENTS.md
@@ -1,6 +1,570 @@

# AGENTS.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## RULE 0 - THE FUNDAMENTAL OVERRIDE PREROGATIVE

If I tell you to do something, even if it goes against what follows below, YOU MUST LISTEN TO ME. I AM IN CHARGE, NOT YOU.

---

## RULE NUMBER 1: NO FILE DELETION

**YOU ARE NEVER ALLOWED TO DELETE A FILE WITHOUT EXPRESS PERMISSION.** Even a new file that you yourself created, such as a test code file. You have a horrible track record of deleting critically important files or otherwise throwing away tons of expensive work. As a result, you have permanently lost any and all rights to determine that a file or folder should be deleted.

**YOU MUST ALWAYS ASK AND RECEIVE CLEAR, WRITTEN PERMISSION BEFORE EVER DELETING A FILE OR FOLDER OF ANY KIND.**

---

## Irreversible Git & Filesystem Actions — DO NOT EVER BREAK GLASS

> **Note:** Treat destructive commands as break-glass. If there's any doubt, stop and ask.

1. **Absolutely forbidden commands:** `git reset --hard`, `git clean -fd`, `rm -rf`, or any command that can delete or overwrite code/data must never be run unless the user explicitly provides the exact command and states, in the same message, that they understand and want the irreversible consequences.
2. **No guessing:** If there is any uncertainty about what a command might delete or overwrite, stop immediately and ask the user for specific approval. "I think it's safe" is never acceptable.
3. **Safer alternatives first:** When cleanup or rollbacks are needed, request permission to use non-destructive options (`git status`, `git diff`, `git stash`, copying to backups) before ever considering a destructive command.
4. **Mandatory explicit plan:** Even after explicit user authorization, restate the command verbatim, list exactly what will be affected, and wait for a confirmation that your understanding is correct. Only then may you execute it—if anything remains ambiguous, refuse and escalate.
5. **Document the confirmation:** When running any approved destructive command, record (in the session notes / final response) the exact user text that authorized it, the command actually run, and the execution time. If that record is absent, the operation did not happen.

---

## Toolchain: Rust & Cargo

We only use **Cargo** in this project, NEVER any other package manager.

- **Edition/toolchain:** Follow `rust-toolchain.toml` (if present). Do not assume stable vs nightly.
- **Dependencies:** Explicit versions for stability; keep the set minimal.
- **Configuration:** `Cargo.toml` only
- **Unsafe code:** Forbidden (`#![forbid(unsafe_code)]`)

### Release Profile

Use the release profile defined in `Cargo.toml`. If you need to change it, justify the performance/size tradeoff and how it impacts determinism and cancellation behavior.

---

## Code Editing Discipline

### No Script-Based Changes

**NEVER** run a script that processes/changes code files in this repo. Brittle regex-based transformations create far more problems than they solve.

- **Always make code changes manually**, even when there are many instances
- For many simple changes: use parallel subagents
- For subtle/complex changes: do them methodically yourself

### No File Proliferation

If you want to change something or add a feature, **revise existing code files in place**.

**NEVER** create variations like:

- `mainV2.rs`
- `main_improved.rs`
- `main_enhanced.rs`

New files are reserved for **genuinely new functionality** that makes zero sense to include in any existing file. The bar for creating new files is **incredibly high**.

---

## Backwards Compatibility

We do not care about backwards compatibility—we're in early development with no users. We want to do things the **RIGHT** way with **NO TECH DEBT**.

- Never create "compatibility shims"
- Never create wrapper functions for deprecated APIs
- Just fix the code directly

---

## Compiler Checks (CRITICAL)

**After any substantive code changes, you MUST verify no errors were introduced:**

```bash
# Check for compiler errors and warnings
cargo check --all-targets

# Check for clippy lints (pedantic + nursery are enabled)
cargo clippy --all-targets -- -D warnings

# Verify formatting
cargo fmt --check
```

If you see errors, **carefully understand and resolve each issue**. Read sufficient context to fix them the RIGHT way.

---

## Testing

### Unit & Property Tests

```bash
# Run all tests
cargo test

# Run with output
cargo test -- --nocapture
```

When adding or changing primitives, add tests that assert the core invariants:

- no task leaks
- no obligation leaks
- losers are drained after races
- region close implies quiescence

Prefer deterministic lab-runtime tests for concurrency-sensitive behavior.
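As a minimal sketch of what such an invariant test can look like, here is a self-contained toy stand-in for a structured-concurrency region (the real runtime's `Region` API is assumed, not shown here; the names `Region`, `spawn`, and `close` below are illustrative only). The point is the shape of the assertion: after `close`, the live-task count must be zero.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Toy stand-in for a region: tracks live tasks and only reports
/// quiescence once every spawned task has been joined.
struct Region {
    live: Arc<AtomicUsize>,
    handles: Vec<thread::JoinHandle<()>>,
}

impl Region {
    fn new() -> Self {
        Self { live: Arc::new(AtomicUsize::new(0)), handles: Vec::new() }
    }

    fn spawn(&mut self, f: impl FnOnce() + Send + 'static) {
        let live = Arc::clone(&self.live);
        live.fetch_add(1, Ordering::SeqCst);
        self.handles.push(thread::spawn(move || {
            f();
            live.fetch_sub(1, Ordering::SeqCst);
        }));
    }

    /// Close the region: join every task, then return the live-task count.
    fn close(mut self) -> usize {
        for h in self.handles.drain(..) {
            h.join().unwrap();
        }
        self.live.load(Ordering::SeqCst)
    }
}

fn main() {
    let mut region = Region::new();
    for _ in 0..8 {
        region.spawn(|| thread::yield_now());
    }
    // "Region close implies quiescence": no task may outlive the region.
    assert_eq!(region.close(), 0);
}
```

In a real lab-runtime test the same assertion would run against the actual region type under a deterministic scheduler rather than OS threads.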

---

## MCP Agent Mail — Multi-Agent Coordination

A mail-like layer that lets coding agents coordinate asynchronously via MCP tools and resources. It provides identities, inbox/outbox, searchable threads, and advisory file reservations, with human-auditable artifacts in Git.

### Why It's Useful

- **Prevents conflicts:** Explicit file reservations (leases) for files/globs
- **Token-efficient:** Messages stored in a per-project archive, not in context
- **Quick reads:** `resource://inbox/...`, `resource://thread/...`

### Same Repository Workflow

1. **Register identity:**
   ```
   ensure_project(project_key=<abs-path>)
   register_agent(project_key, program, model)
   ```

2. **Reserve files before editing:**
   ```
   file_reservation_paths(project_key, agent_name, ["src/**"], ttl_seconds=3600, exclusive=true)
   ```

3. **Communicate with threads:**
   ```
   send_message(..., thread_id="FEAT-123")
   fetch_inbox(project_key, agent_name)
   acknowledge_message(project_key, agent_name, message_id)
   ```

4. **Quick reads:**
   ```
   resource://inbox/{Agent}?project=<abs-path>&limit=20
   resource://thread/{id}?project=<abs-path>&include_bodies=true
   ```

### Macros vs Granular Tools

- **Prefer macros for speed:** `macro_start_session`, `macro_prepare_thread`, `macro_file_reservation_cycle`, `macro_contact_handshake`
- **Use granular tools for control:** `register_agent`, `file_reservation_paths`, `send_message`, `fetch_inbox`, `acknowledge_message`

### Common Pitfalls

- `"from_agent not registered"`: Always `register_agent` in the correct `project_key` first
- `"FILE_RESERVATION_CONFLICT"`: Adjust patterns, wait for expiry, or use a non-exclusive reservation
- **Auth errors:** If JWT+JWKS is enabled, include a bearer token with a matching `kid`

---

## Beads (br) — Dependency-Aware Issue Tracking

Beads provides a lightweight, dependency-aware issue database and CLI (`br` / beads_rust) for selecting "ready work," setting priorities, and tracking status. It complements MCP Agent Mail's messaging and file reservations.

**Note:** `br` is non-invasive—it never executes git commands directly. You must run git commands manually after `br sync --flush-only`.

### Conventions

- **Single source of truth:** Beads for task status/priority/dependencies; Agent Mail for conversation and audit
- **Shared identifiers:** Use the Beads issue ID (e.g., `br-123`) as the Mail `thread_id` and prefix subjects with `[br-123]`
- **Reservations:** When starting a task, call `file_reservation_paths()` with the issue ID in `reason`

### Typical Agent Flow

1. **Pick ready work (Beads):**
   ```bash
   br ready --json   # Choose highest priority, no blockers
   ```

2. **Reserve edit surface (Mail):**
   ```
   file_reservation_paths(project_key, agent_name, ["src/**"], ttl_seconds=3600, exclusive=true, reason="br-123")
   ```

3. **Announce start (Mail):**
   ```
   send_message(..., thread_id="br-123", subject="[br-123] Start: <title>", ack_required=true)
   ```

4. **Work and update:** Reply in-thread with progress

5. **Complete and release:**
   ```bash
   br close br-123 --reason "Completed"
   ```
   ```
   release_file_reservations(project_key, agent_name, paths=["src/**"])
   ```
   Final Mail reply: `[br-123] Completed` with summary

### Mapping Cheat Sheet

| Concept | Value |
|---------|-------|
| Mail `thread_id` | `br-###` |
| Mail subject | `[br-###] ...` |
| File reservation `reason` | `br-###` |
| Commit messages | Include `br-###` for traceability |

---

## bv — Graph-Aware Triage Engine

bv is a graph-aware triage engine for Beads projects (`.beads/beads.jsonl`). It computes PageRank, betweenness, critical path, cycles, HITS, eigenvector, and k-core metrics deterministically.

**Scope boundary:** bv handles *what to work on* (triage, priority, planning). For agent-to-agent coordination (messaging, work claiming, file reservations), use MCP Agent Mail.

**CRITICAL: Use ONLY `--robot-*` flags. Bare `bv` launches an interactive TUI that blocks your session.**

### The Workflow: Start With Triage

**`bv --robot-triage` is your single entry point.** It returns:

- `quick_ref`: at-a-glance counts + top 3 picks
- `recommendations`: ranked actionable items with scores, reasons, unblock info
- `quick_wins`: low-effort, high-impact items
- `blockers_to_clear`: items that unblock the most downstream work
- `project_health`: status/type/priority distributions, graph metrics
- `commands`: copy-paste shell commands for next steps

```bash
bv --robot-triage   # THE MEGA-COMMAND: start here
bv --robot-next     # Minimal: just the single top pick + claim command
```

### Command Reference

**Planning:**

| Command | Returns |
|---------|---------|
| `--robot-plan` | Parallel execution tracks with `unblocks` lists |
| `--robot-priority` | Priority misalignment detection with confidence |

**Graph Analysis:**

| Command | Returns |
|---------|---------|
| `--robot-insights` | Full metrics: PageRank, betweenness, HITS, eigenvector, critical path, cycles, k-core, articulation points, slack |
| `--robot-label-health` | Per-label health: `health_level`, `velocity_score`, `staleness`, `blocked_count` |
| `--robot-label-flow` | Cross-label dependency: `flow_matrix`, `dependencies`, `bottleneck_labels` |
| `--robot-label-attention [--attention-limit=N]` | Attention-ranked labels |

**History & Change Tracking:**

| Command | Returns |
|---------|---------|
| `--robot-history` | Bead-to-commit correlations |
| `--robot-diff --diff-since <ref>` | Changes since ref: new/closed/modified issues, cycles |

**Other:**

| Command | Returns |
|---------|---------|
| `--robot-burndown <sprint>` | Sprint burndown, scope changes, at-risk items |
| `--robot-forecast <id\|all>` | ETA predictions with dependency-aware scheduling |
| `--robot-alerts` | Stale issues, blocking cascades, priority mismatches |
| `--robot-suggest` | Hygiene: duplicates, missing deps, label suggestions |
| `--robot-graph [--graph-format=json\|dot\|mermaid]` | Dependency graph export |
| `--export-graph <file.html>` | Interactive HTML visualization |

### Scoping & Filtering

```bash
bv --robot-plan --label backend           # Scope to label's subgraph
bv --robot-insights --as-of HEAD~30       # Historical point-in-time
bv --recipe actionable --robot-plan       # Pre-filter: ready to work
bv --recipe high-impact --robot-triage    # Pre-filter: top PageRank
bv --robot-triage --robot-triage-by-track # Group by parallel work streams
bv --robot-triage --robot-triage-by-label # Group by domain
```

### Understanding Robot Output

**All robot JSON includes:**

- `data_hash` — fingerprint of the source beads.jsonl
- `status` — per-metric state: `computed|approx|timeout|skipped` + elapsed ms
- `as_of` / `as_of_commit` — present when using `--as-of`

**Two-phase analysis:**

- **Phase 1 (instant):** degree, topo sort, density
- **Phase 2 (async, 500ms timeout):** PageRank, betweenness, HITS, eigenvector, cycles

### jq Quick Reference

```bash
bv --robot-triage | jq '.quick_ref'                 # At-a-glance summary
bv --robot-triage | jq '.recommendations[0]'        # Top recommendation
bv --robot-plan | jq '.plan.summary.highest_impact' # Best unblock target
bv --robot-insights | jq '.status'                  # Check metric readiness
bv --robot-insights | jq '.Cycles'                  # Circular deps (must fix!)
```

---

## UBS — Ultimate Bug Scanner

**Golden Rule:** `ubs <changed-files>` before every commit. Exit 0 = safe. Exit >0 = fix & re-run.

### Commands

```bash
ubs file.rs file2.rs                 # Specific files (< 1s) — USE THIS
ubs $(git diff --name-only --cached) # Staged files — before commit
ubs --only=rust,toml src/            # Language filter (3-5x faster)
ubs --ci --fail-on-warning .         # CI mode — before PR
ubs .                                # Whole project (ignores target/, Cargo.lock)
```

### Output Format

```
⚠️ Category (N errors)
  file.rs:42:5 – Issue description
  💡 Suggested fix
Exit code: 1
```

Parse: `file:line:col` → location | 💡 → how to fix | Exit 0/1 → pass/fail
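If a tool needs to consume that output programmatically, the location prefix can be pulled apart by splitting the first whitespace-delimited token on its last two colons. This is a sketch based only on the sample output above, not on UBS's documented grammar:

```rust
/// Parse a `file:line:col` location from a UBS finding line such as
/// `file.rs:42:5 – Issue description`. Returns None when the first
/// token does not have the expected `path:line:col` shape.
fn parse_location(line: &str) -> Option<(String, u32, u32)> {
    // First whitespace-delimited token carries the location.
    let head = line.trim_start().split_whitespace().next()?;
    // Split from the right so paths containing ':' would still work.
    let mut parts = head.rsplitn(3, ':');
    let col: u32 = parts.next()?.parse().ok()?;
    let lineno: u32 = parts.next()?.parse().ok()?;
    let file = parts.next()?.to_string();
    Some((file, lineno, col))
}

fn main() {
    let finding = "file.rs:42:5 – Issue description";
    assert_eq!(
        parse_location(finding),
        Some(("file.rs".to_string(), 42, 5))
    );
}
```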

### Fix Workflow

1. Read finding → category + fix suggestion
2. Navigate `file:line:col` → view context
3. Verify real issue (not false positive)
4. Fix root cause (not symptom)
5. Re-run `ubs <file>` → exit 0
6. Commit

### Bug Severity

- **Critical (always fix):** Memory safety, use-after-free, data races, SQL injection
- **Important (production):** Unwrap panics, resource leaks, overflow checks
- **Contextual (judgment):** TODO/FIXME, println! debugging

---

## ast-grep vs ripgrep

**Use `ast-grep` when structure matters.** It parses code and matches AST nodes, ignoring comments/strings, and can **safely rewrite** code.

- Refactors/codemods: rename APIs, change import forms
- Policy checks: enforce patterns across a repo
- Editor/automation: LSP mode, `--json` output

**Use `ripgrep` when text is enough.** It is the fastest way to grep literals/regex.

- Recon: find strings, TODOs, log lines, config values
- Pre-filter: narrow candidate files before ast-grep

### Rule of Thumb

- Need correctness or **applying changes** → `ast-grep`
- Need raw speed or **hunting text** → `rg`
- Often combine: `rg` to shortlist files, then `ast-grep` to match/modify

### Rust Examples

```bash
# Find structured code (ignores comments)
ast-grep run -l Rust -p 'fn $NAME($$$ARGS) -> $RET { $$$BODY }'

# Find all unwrap() calls
ast-grep run -l Rust -p '$EXPR.unwrap()'

# Quick textual hunt
rg -n 'println!' -t rust

# Combine speed + precision
rg -l -t rust 'unwrap\(' | xargs ast-grep run -l Rust -p '$X.unwrap()' --json
```

---

## Morph Warp Grep — AI-Powered Code Search

**Use `mcp__morph-mcp__warp_grep` for exploratory "how does X work?" questions.** An AI agent expands your query, greps the codebase, reads relevant files, and returns precise line ranges with full context.

**Use `ripgrep` for targeted searches,** when you know exactly what you're looking for.

**Use `ast-grep` for structural patterns,** when you need AST precision for matching/rewriting.

### When to Use What

| Scenario | Tool | Why |
|----------|------|-----|
| "How is pattern matching implemented?" | `warp_grep` | Exploratory; don't know where to start |
| "Where is the quick reject filter?" | `warp_grep` | Need to understand architecture |
| "Find all uses of `Regex::new`" | `ripgrep` | Targeted literal search |
| "Find files with `println!`" | `ripgrep` | Simple pattern |
| "Replace all `unwrap()` with `expect()`" | `ast-grep` | Structural refactor |

### warp_grep Usage

```
mcp__morph-mcp__warp_grep(
  repoPath: "/path/to/dcg",
  query: "How does the safe pattern whitelist work?"
)
```

Returns structured results with file paths, line ranges, and extracted code snippets.

### Anti-Patterns

- **Don't** use `warp_grep` to find a specific function name → use `ripgrep`
- **Don't** use `ripgrep` to understand "how does X work" → it wastes time with manual reads
- **Don't** use `ripgrep` for codemods → it risks collateral edits

<!-- bv-agent-instructions-v1 -->

---

## Beads Workflow Integration

This project uses [beads_viewer](https://github.com/Dicklesworthstone/beads_viewer) for issue tracking. Issues are stored in `.beads/` and tracked in git.

**Note:** `br` is non-invasive—it never executes git commands directly. You must run git commands manually after `br sync --flush-only`.

### Essential Commands

```bash
# View issues (launches TUI - avoid in automated sessions)
bv

# CLI commands for agents (use these instead)
br ready                 # Show issues ready to work (no blockers)
br list --status=open    # All open issues
br show <id>             # Full issue details with dependencies
br create --title="..." --type=task --priority=2
br update <id> --status=in_progress
br close <id> --reason="Completed"
br close <id1> <id2>     # Close multiple issues at once
br sync --flush-only     # Export to JSONL (then manually: git add .beads/ && git commit)
```

### Workflow Pattern

1. **Start**: Run `br ready` to find actionable work
2. **Claim**: Use `br update <id> --status=in_progress`
3. **Work**: Implement the task
4. **Complete**: Use `br close <id>`
5. **Sync**: Run `br sync --flush-only`, then `git add .beads/ && git commit -m "Update beads"`

### Key Concepts

- **Dependencies**: Issues can block other issues. `br ready` shows only unblocked work.
- **Priority**: P0=critical, P1=high, P2=medium, P3=low, P4=backlog (use numbers, not words)
- **Types**: task, bug, feature, epic, question, docs
- **Blocking**: `br dep add <issue> <depends-on>` to add dependencies

### Session Protocol

**Before ending any session, run this checklist:**

```bash
git status            # Check what changed
git add <files>       # Stage code changes
br sync --flush-only  # Export beads to JSONL
git add .beads/       # Stage beads changes
git commit -m "..."   # Commit code and beads
git push              # Push to remote
```

### Best Practices

- Check `br ready` at session start to find available work
- Update status as you work (in_progress → closed)
- Create new issues with `br create` when you discover tasks
- Use descriptive titles and set appropriate priority/type
- Always run `br sync --flush-only` and then commit `.beads/` before ending a session

<!-- end-bv-agent-instructions -->

## Landing the Plane (Session Completion)

**When ending a work session**, you MUST complete ALL steps below. Work is NOT complete until `git push` succeeds.

**MANDATORY WORKFLOW:**

1. **File issues for remaining work** - Create issues for anything that needs follow-up
2. **Run quality gates** (if code changed) - Tests, linters, builds
3. **Update issue status** - Close finished work, update in-progress items
4. **PUSH TO REMOTE** - This is MANDATORY:
   ```bash
   git pull --rebase
   br sync --flush-only
   git add .beads/
   git commit -m "Update beads"
   git push
   git status   # MUST show "up to date with origin"
   ```
5. **Clean up** - Clear stashes, prune remote branches
6. **Verify** - All changes committed AND pushed
7. **Hand off** - Provide context for next session

**CRITICAL RULES:**

- Work is NOT complete until `git push` succeeds
- NEVER stop before pushing - that leaves work stranded locally
- NEVER say "ready to push when you are" - YOU must push
- If push fails, resolve and retry until it succeeds

---

## cass — Cross-Agent Session Search

`cass` indexes prior agent conversations (Claude Code, Codex, Cursor, Gemini, ChatGPT, etc.) so we can reuse solved problems.

**Rules:** Never run bare `cass` (it launches a TUI). Always use `--robot` or `--json`.

### Examples

```bash
cass health
cass search "async runtime" --robot --limit 5
cass view /path/to/session.jsonl -n 42 --json
cass expand /path/to/session.jsonl -n 42 -C 3 --json
cass capabilities --json
cass robot-docs guide
```

### Tips

- Use `--fields minimal` for lean output
- Filter by agent with `--agent`
- Use `--days N` to limit to recent history

stdout is data-only, stderr is diagnostics; exit code 0 means success.

Treat cass as a way to avoid re-solving problems other agents already handled.

---

## Note for Codex/GPT-5.2

You constantly bother me and stop working with concerned questions that look similar to this:

```
Unexpected changes (need guidance)

- Working tree still shows edits I did not make in Cargo.toml, Cargo.lock, src/runtime.rs, src/scope.rs. Please advise whether to keep/commit/revert these before any further work. I did not touch them.

Next steps (pick one)

1. Decide how to handle the unrelated modified files above so we can resume cleanly.
```

NEVER EVER DO THAT AGAIN. The answer is literally ALWAYS the same: those are changes created by the potentially dozens of other agents working on the project at the same time. This is not only a common occurrence, it happens multiple times PER MINUTE. The way to deal with it is simple: you NEVER, under ANY CIRCUMSTANCE, stash, revert, overwrite, or otherwise disturb in ANY way the work of other agents. Just treat those changes identically to changes that you yourself made. Just fool yourself into thinking YOU made the changes and simply don't recall it for some reason.

---

## Note on Built-in TODO Functionality

Also, if I explicitly ask you to use your built-in TODO functionality, don't complain and say you need to use beads. You can use built-in TODOs if I tell you specifically to do so. Always comply with such orders.

## TDD Requirements
@@ -77,6 +641,21 @@ lore --robot doctor

# Document and index statistics
lore --robot stats

# Quick health pre-flight check (exit 0 = healthy, 1 = unhealthy)
lore --robot health

# Generate searchable documents from ingested data
lore --robot generate-docs

# Generate vector embeddings via Ollama
lore --robot embed

# Agent self-discovery manifest (all commands, flags, exit codes)
lore robot-docs

# Version information
lore --robot version
```

### Response Format

@@ -114,6 +693,8 @@ Errors return structured JSON to stderr:

| 14 | Ollama unavailable |
| 15 | Ollama model not found |
| 16 | Embedding failed |
| 17 | Not found (entity does not exist) |
| 18 | Ambiguous match (use `-p` to specify project) |
| 20 | Config not found |

### Configuration Precedence

@@ -129,4 +710,8 @@ Errors return structured JSON to stderr:

- Check exit codes for error handling
- Parse JSON errors from stderr
- Use `-n` / `--limit` to control response size
- Use `-q` / `--quiet` to suppress progress bars and non-essential output
- Use `--color never` in non-TTY automation for ANSI-free output
- TTY detection handles piped commands automatically
- Use `lore --robot health` as a fast pre-flight check before queries
- The `-p` flag supports fuzzy project matching (suffix and substring)

49  README.md
@@ -140,6 +140,8 @@ Create a personal access token with `read_api` scope:

| `LORE_ROBOT` | Enable robot mode globally (set to `true` or `1`) | No |
| `XDG_CONFIG_HOME` | XDG Base Directory for config (fallback: `~/.config`) | No |
| `XDG_DATA_HOME` | XDG Base Directory for data (fallback: `~/.local/share`) | No |
| `NO_COLOR` | Disable color output when set (any value) | No |
| `CLICOLOR` | Standard color control (0 to disable) | No |
| `RUST_LOG` | Logging level filter (e.g., `lore=debug`) | No |

## Commands

@@ -162,6 +164,7 @@ lore issues -l bug -l urgent # Multiple labels
lore issues -m "v1.0"               # By milestone title
lore issues --since 7d              # Updated in last 7 days
lore issues --since 2w              # Updated in last 2 weeks
lore issues --since 1m              # Updated in last month
lore issues --since 2024-01-01      # Updated since date
lore issues --due-before 2024-12-31 # Due before date
lore issues --has-due               # Only issues with due dates

@@ -174,6 +177,17 @@ When listing, output includes: IID, title, state, author, assignee, labels, and

When showing a single issue (e.g., `lore issues 123`), output includes: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions.

#### Project Resolution

The `-p` / `--project` flag uses cascading match logic across all commands:

1. **Exact match**: `group/project`
2. **Case-insensitive**: `Group/Project`
3. **Suffix match**: `project` matches `group/project` (if unambiguous)
4. **Substring match**: `typescript` matches `vs/typescript-code` (if unambiguous)

If multiple projects match, an error lists the candidates with a hint to use the full path.
|
||||
|
||||
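The cascade above can be sketched as a small resolver. This is a hypothetical illustration of the tiering logic only; the function name and error shape are not lore's actual internals:

```rust
// Illustrative sketch of cascading project resolution: exact, then
// case-insensitive, then unambiguous suffix, then unambiguous substring.
// On an ambiguous tier, the matching candidates become the error value.
fn resolve_project<'a>(query: &str, paths: &'a [&'a str]) -> Result<&'a str, Vec<&'a str>> {
    // 1. Exact match
    if let Some(p) = paths.iter().copied().find(|p| *p == query) {
        return Ok(p);
    }
    // 2. Case-insensitive match
    let q = query.to_lowercase();
    if let Some(p) = paths.iter().copied().find(|p| p.to_lowercase() == q) {
        return Ok(p);
    }
    // 3. Suffix match, then 4. substring match; each tier wins only if it
    // leaves exactly one candidate.
    let suffix = format!("/{q}");
    for tier in [
        paths.iter().copied().filter(|p| p.to_lowercase().ends_with(&suffix)).collect::<Vec<_>>(),
        paths.iter().copied().filter(|p| p.to_lowercase().contains(&q)).collect::<Vec<_>>(),
    ] {
        match tier.as_slice() {
            [only] => return Ok(*only),
            [] => continue,
            many => return Err(many.to_vec()),
        }
    }
    Err(Vec::new())
}
```

With `["group/project", "vs/typescript-code"]`, the query `typescript` falls through the suffix tier and matches uniquely on the substring tier, while a query matching several paths returns the candidate list for the error message.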
### `lore mrs`

Query merge requests from local database, or show a specific MR.

@@ -221,14 +235,14 @@ lore search "deploy" --author username # Filter by author
lore search "deploy" -p group/repo # Filter by project
lore search "deploy" --label backend # Filter by label (AND logic)
lore search "deploy" --path src/ # Filter by file path (trailing / for prefix)
lore search "deploy" --after 7d # Created after (7d, 2w, or YYYY-MM-DD)
lore search "deploy" --after 7d # Created after (7d, 2w, 1m, or YYYY-MM-DD)
lore search "deploy" --updated-after 2w # Updated after
lore search "deploy" -n 50 # Limit results (default 20, max 100)
lore search "deploy" --explain # Show ranking explanation per result
lore search "deploy" --fts-mode raw # Raw FTS5 query syntax (advanced)
```

Requires `lore generate-docs` (or `lore sync`) to have been run at least once. Semantic mode requires Ollama with the configured embedding model.
Requires `lore generate-docs` (or `lore sync`) to have been run at least once. Semantic and hybrid modes require `lore embed` (or `lore sync`) to have generated vector embeddings via Ollama.

### `lore sync`

@@ -359,12 +373,32 @@ Run pending database migrations.

lore migrate
```

### `lore health`

Quick pre-flight check for config, database, and schema version. Exits 0 if healthy, 1 if unhealthy.

```bash
lore health
```

Useful as a fast gate before running queries or syncs. For a more thorough check including authentication and project access, use `lore doctor`.

### `lore robot-docs`

Machine-readable command manifest for agent self-discovery. Returns a JSON schema of all commands, flags, exit codes, and example workflows.

```bash
lore robot-docs # Pretty-printed JSON
lore --robot robot-docs # Compact JSON for parsing
```

### `lore version`

Show version information.
Show version information including the git commit hash.

```bash
lore version
# lore version 0.1.0 (abc1234)
```

## Robot Mode

@@ -422,6 +456,8 @@ Errors return structured JSON to stderr:

| 14 | Ollama unavailable |
| 15 | Ollama model not found |
| 16 | Embedding failed |
| 17 | Not found (entity does not exist) |
| 18 | Ambiguous match (use `-p` to specify project) |
| 20 | Config not found |

## Configuration Precedence

@@ -439,8 +475,13 @@ Settings are resolved in this order (highest to lowest priority):

lore -c /path/to/config.json <command> # Use alternate config
lore --robot <command> # Machine-readable JSON
lore -J <command> # JSON shorthand
lore --color never <command> # Disable color output
lore --color always <command> # Force color output
lore -q <command> # Suppress non-essential output
```

Color output respects `NO_COLOR` and `CLICOLOR` environment variables in `auto` mode (the default).

## Shell Completions

Generate shell completions for tab-completion support:

@@ -480,6 +521,8 @@ Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:

| `documents` | Extracted searchable text for FTS and embedding |
| `documents_fts` | FTS5 full-text search index |
| `embeddings` | Vector embeddings for semantic search |
| `dirty_sources` | Entities needing document regeneration after ingest |
| `pending_discussion_fetches` | Queue for discussion fetch operations |
| `sync_runs` | Audit trail of sync operations |
| `sync_cursors` | Cursor positions for incremental sync |
| `app_locks` | Crash-safe single-flight lock |
1654 api-review.html Normal file
File diff suppressed because it is too large
308 docs/embedding-pipeline-hardening.md Normal file
@@ -0,0 +1,308 @@
# Embedding Pipeline Hardening: Chunk Config Drift, Adaptive Dedup, Full Flag Wiring

> **Status:** Proposed
> **Date:** 2026-02-02
> **Context:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama context window overflow. This plan addresses the downstream consequences of that change.

## Problem Statement

Three issues stem from the chunk size reduction:

1. **Broken `--full` wiring**: `handle_embed` in main.rs ignores `args.full` (calls `run_embed` instead of `run_embed_full`). `run_sync` hardcodes `false` for retry_failed and never passes `options.full` to embed. Users running `lore sync --full` or `lore embed --full` don't get a full re-embed.

2. **Mixed chunk sizes in vector space**: Existing embeddings (32KB chunks) coexist with new embeddings (6KB chunks). These are semantically incomparable -- different granularity vectors in the same KNN space degrade search quality. No mechanism detects this drift.

3. **Static dedup multiplier**: `search_vector` uses `limit * 8` to over-fetch for dedup. With smaller chunks producing 5-6 chunks per document, clustered search results can exhaust slots before reaching `limit` unique documents. The multiplier should adapt to actual data.

## Decision Record

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Detect chunk config drift | Store `chunk_max_bytes` in `embedding_metadata` | Allows automatic invalidation without user intervention. Self-heals on next sync. |
| Dedup multiplier strategy | Adaptive from DB with static floor | One cheap aggregate query per search. Self-adjusts as data grows. No wasted KNN budget. |
| `--full` propagation | `sync --full` passes full to embed step | Matches user expectation: "start fresh" means everything, not just ingest+docs. |
| Migration strategy | New migration 010 for `chunk_max_bytes` column | Non-breaking additive change. NULL values = "unknown config" treated as needing re-embed. |

---
## Changes

### Change 1: Wire `--full` flag through to embed

**Files:**
- `src/main.rs` (line 1116)
- `src/cli/commands/sync.rs` (line 105)

**main.rs `handle_embed`** (line 1116):

```rust
// BEFORE:
let result = run_embed(&config, retry_failed).await?;

// AFTER:
let result = run_embed_full(&config, args.full, retry_failed).await?;
```

Update the import at top of main.rs from `run_embed` to `run_embed_full`.

**sync.rs `run_sync`** (line 105):

```rust
// BEFORE:
match run_embed(config, false).await {

// AFTER:
match run_embed_full(config, options.full, false).await {
```

Update the import at line 11 from `run_embed` to `run_embed_full`.

**Cleanup `embed.rs`**: Remove `run_embed` (the wrapper that hardcodes `full: false`). All callers should use `run_embed_full` directly. Rename `run_embed_full` to `run_embed` with the 3-arg signature `(config, full, retry_failed)`.

Final signature:

```rust
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult>
```

---
### Change 2: Migration 010 -- add `chunk_max_bytes` to `embedding_metadata`

**New file:** `migrations/010_chunk_config.sql`

```sql
-- Migration 010: Chunk config tracking
-- Schema version: 10
-- Adds chunk_max_bytes to embedding_metadata for drift detection.
-- Existing rows get NULL, which the change detector treats as "needs re-embed".

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;

UPDATE schema_version SET version = 10
WHERE version = (SELECT MAX(version) FROM schema_version);
-- Or if using INSERT pattern:
INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000, 'Add chunk_max_bytes to embedding_metadata for config drift detection');
```

Check existing migration pattern in `src/core/db.rs` for how migrations are applied -- follow that exact pattern for consistency.

---
### Change 3: Store `chunk_max_bytes` when writing embeddings

**File:** `src/embedding/pipeline.rs`

**`store_embedding`** (lines 238-266): Add `chunk_max_bytes` to the INSERT:

```rust
// Add import at top:
use crate::embedding::chunking::CHUNK_MAX_BYTES;

// In store_embedding, update SQL:
conn.execute(
    "INSERT OR REPLACE INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8)",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64
    ],
)?;
```

**`record_embedding_error`** (lines 269-291): Also store `chunk_max_bytes` so error rows track which config they failed under:

```rust
conn.execute(
    "INSERT INTO embedding_metadata
     (document_id, chunk_index, model, dims, document_hash, chunk_hash,
      created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
     VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
     ON CONFLICT(document_id, chunk_index) DO UPDATE SET
         attempt_count = embedding_metadata.attempt_count + 1,
         last_error = ?8,
         last_attempt_at = ?7,
         chunk_max_bytes = ?9",
    rusqlite::params![
        doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
        doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
    ],
)?;
```

---
### Change 4: Detect chunk config drift in change detector

**File:** `src/embedding/change_detector.rs`

Add a third condition to the pending detection: embeddings where `chunk_max_bytes` differs from the current `CHUNK_MAX_BYTES` constant (or is NULL, meaning pre-migration embeddings).

```rust
use crate::embedding::chunking::CHUNK_MAX_BYTES;

pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
) -> Result<Vec<PendingDocument>> {
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d
        WHERE d.id > ?1
          AND (
            -- Case 1: No embedding metadata (new document)
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            -- Case 2: Document content changed
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            -- Case 3: Chunk config drift (different chunk size or pre-migration NULL)
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (em.chunk_max_bytes IS NULL OR em.chunk_max_bytes != ?3)
            )
          )
        ORDER BY d.id
        LIMIT ?2
    "#;

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(
            rusqlite::params![last_id, page_size as i64, CHUNK_MAX_BYTES as i64],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}
```

Apply the same change to `count_pending_documents` -- add the third OR clause and the `?3` parameter.

---
### Change 5: Adaptive dedup multiplier in vector search

**File:** `src/search/vector.rs`

Replace the static `limit * 8` with an adaptive multiplier based on the actual max chunks-per-document in the database.

```rust
/// Query the max chunks any single document has in the embedding table.
/// Returns the max chunk count, or a default floor if no data exists.
fn max_chunks_per_document(conn: &Connection) -> i64 {
    conn.query_row(
        "SELECT COALESCE(MAX(cnt), 1) FROM (
            SELECT COUNT(*) as cnt FROM embedding_metadata
            WHERE last_error IS NULL
            GROUP BY document_id
        )",
        [],
        |row| row.get(0),
    )
    .unwrap_or(1)
}

pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    if query_embedding.is_empty() || limit == 0 {
        return Ok(Vec::new());
    }

    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Adaptive over-fetch: use actual max chunks per doc, with floor of 8x
    // The 1.5x safety margin handles clustering in KNN results
    let max_chunks = max_chunks_per_document(conn);
    let multiplier = (max_chunks as usize * 3 / 2).max(8);
    let k = limit * multiplier;

    // ... rest unchanged ...
}
```

**Why `max_chunks * 1.5` with floor of 8**:
- `max_chunks` is the worst case for a single document dominating results
- `* 1.5` adds margin for multiple clustered documents
- Floor of `8` ensures reasonable over-fetch even with single-chunk documents
- This is a single aggregate query on an indexed column -- sub-millisecond
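The formula is small enough to check in isolation. A minimal standalone sketch mirroring the `(max_chunks as usize * 3 / 2).max(8)` expression from `search_vector`:

```rust
// Adaptive over-fetch multiplier: 1.5x the observed max chunks per
// document, with a static floor of 8 (integer arithmetic, so 1.5x
// is expressed as * 3 / 2).
fn dedup_multiplier(max_chunks: usize) -> usize {
    (max_chunks * 3 / 2).max(8)
}
```

With 6KB chunks a large document might produce around 6 chunks, giving `k = limit * 9` instead of the old static `limit * 8`; single-chunk datasets stay at the floor.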
---

### Change 6: Update chunk_ids.rs comment

**File:** `src/embedding/chunk_ids.rs` (lines 1-3)

Update the comment to reflect current reality:

```rust
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
```
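A quick standalone check of the headroom claim in that comment. The constants are copied from the plan; `chunk_rowid` is a hypothetical illustration of the encoding, not the actual chunk_ids.rs code:

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;
const CHUNK_MAX_BYTES: i64 = 6000;
const MAX_DOCUMENT_BYTES_HARD: i64 = 2_000_000;

// Illustrative encoding of (document_id, chunk_index) into one rowid slot.
fn chunk_rowid(document_id: i64, chunk_index: i64) -> i64 {
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}
```

The worst case is `2_000_000 / 6000 = 333` chunks, comfortably under the 1000-slot multiplier, so no rowid collisions are possible at the hard document cap.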
---

## Files Modified (Summary)

| File | Change |
|------|--------|
| `migrations/010_chunk_config.sql` | **NEW** -- Add `chunk_max_bytes` column |
| `src/embedding/pipeline.rs` | Store `CHUNK_MAX_BYTES` in metadata writes |
| `src/embedding/change_detector.rs` | Detect chunk config drift (3rd OR clause) |
| `src/search/vector.rs` | Adaptive dedup multiplier from DB |
| `src/cli/commands/embed.rs` | Consolidate to single `run_embed(config, full, retry_failed)` |
| `src/cli/commands/sync.rs` | Pass `options.full` to embed, update import |
| `src/main.rs` | Call `run_embed` with `args.full`, update import |
| `src/embedding/chunk_ids.rs` | Comment update only |

## Verification

1. **Compile check**: `cargo build` -- no errors
2. **Unit tests**: `cargo test` -- all existing tests pass
3. **Migration test**: Run `lore doctor` or `lore migrate` -- migration 010 applies cleanly
4. **Full flag wiring**: `lore embed --full` should clear all embeddings and re-embed. Verify by checking `lore --robot stats` before and after (embedded count should reset then rebuild).
5. **Chunk config drift**: After migration, existing embeddings have `chunk_max_bytes = NULL`. Running `lore embed` (without --full) should detect all existing embeddings as stale and re-embed them automatically.
6. **Sync propagation**: `lore sync --full` should produce the same embed behavior as `lore embed --full`
7. **Adaptive dedup**: Run `lore search "some query"` and verify the result count matches the requested limit (default 20). Check with `RUST_LOG=debug` that the computed `k` value scales with actual chunk distribution.

## Decision Record (for future reference)

**Date:** 2026-02-02
**Trigger:** Reduced CHUNK_MAX_BYTES from 32KB to 6KB to prevent Ollama nomic-embed-text context window overflow (8192 tokens).

**Downstream consequences identified:**

1. Chunk ID headroom reduced (1000 slots, now ~333 used for 2MB docs) -- acceptable, no action needed
2. Vector search dedup pressure increased 5x -- fixed with adaptive multiplier
3. Embedding DB grows ~5x -- acceptable at current scale (~7.5MB)
4. Mixed chunk sizes degrade search -- fixed with config drift detection
5. Ollama API call volume increases proportionally -- acceptable for local model

**Rejected alternatives:**

- Two-phase KNN fetch (fetch, check, re-fetch with higher k): adds code complexity for marginal improvement over adaptive. sqlite-vec doesn't support OFFSET in KNN queries, requiring full re-query.
- Generous static multiplier (15x): wastes KNN budget on datasets where documents are small. Over-allocates permanently instead of adapting.
- Manual `--full` as the only drift remedy: requires users to understand chunk config internals. Violates principle of least surprise.
951 docs/phase-b-temporal-intelligence.md Normal file
@@ -0,0 +1,951 @@
# Phase B: Temporal Intelligence Foundation

> **Status:** Draft
> **Prerequisite:** CP3 Gates B+C complete (working search + sync pipeline)
> **Goal:** Transform gitlore from a search engine into a temporal code intelligence system by ingesting structured event data from GitLab and exposing temporal queries that answer "why" and "when" questions about project history.

---

## Motivation

gitlore currently stores **snapshots** — the latest state of each issue, MR, and discussion. But temporal queries need **change history**. When an issue's labels change from `priority::low` to `priority::critical`, the current schema overwrites the label junction. The transition is lost.

GitLab issues, MRs, and discussions contain the raw ingredients for temporal intelligence: state transitions, label mutations, assignee changes, cross-references between entities, and decision rationale in discussions. What's missing is a structured temporal index that makes these ingredients queryable.

### The Problem This Solves

Today, when an AI agent or developer asks "Why did the team switch from REST to GraphQL?" or "What happened with the auth migration?", the answer is scattered across paginated API responses with no temporal index, no cross-referencing, and no semantic layer. Reconstructing a decision timeline manually takes 20+ minutes of clicking through GitLab's UI. This phase makes it take 2 seconds.

### Forcing Function

This phase is designed around one concrete question: **"What happened with X?"** — where X is any keyword, feature name, or initiative. If `lore timeline "auth migration"` can produce a useful, chronologically-ordered narrative of all related events across issues, MRs, and discussions, the architecture is validated. If it can't, we learn what's missing before investing in deeper temporal features.

---

## Executive Summary (Gated Milestones)

Five gates, each independently verifiable and shippable:

**Gate 1 (Resource Events Ingestion):** Structured event data from GitLab APIs → local event tables
**Gate 2 (Cross-Reference Extraction):** Entity relationship graph from structured APIs + system note parsing
**Gate 3 (Decision Timeline):** `lore timeline` command — keyword-driven chronological narrative
**Gate 4 (File Decision History):** `lore file-history` command — MR-to-file linking + scoped timelines
**Gate 5 (Code Trace):** `lore trace` command — file:line → commit → MR → issue → rationale chain

### Key Design Decisions

- **Structured APIs over text parsing.** GitLab provides Resource Events APIs (`resource_state_events`, `resource_label_events`, `resource_milestone_events`) that return clean JSON. These are the primary data source for temporal events. System note parsing is a fallback for events without structured APIs (assignee changes, cross-references).
- **Dependent resource pattern.** Resource events are fetched per-entity, triggered by the existing dirty source tracking. Same architecture as discussion fetching — queue-based, resumable, incremental.
- **Configurable event ingestion.** New config flag `sync.fetchResourceEvents` (default `true`) controls whether the sync pipeline fetches event data. Users who don't need temporal features can disable it to skip the additional API calls.
- **Application-level graph traversal.** Cross-reference expansion uses BFS in Rust, not recursive SQL CTEs. Capped at configurable depth (default 1) for predictable performance.
- **Evolutionary library extraction.** New commands are built with typed return structs from day one. Old commands are not retrofitted until a concrete consumer (MCP server, web UI) requires it.
- **Phase A fields cherry-picked as needed.** `merge_commit_sha` and `squash_commit_sha` are added in this phase's migration. Remaining Phase A fields are handled in their own migration later.
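The depth-capped BFS mentioned in the graph-traversal decision can be sketched as follows. The adjacency-map representation and names are illustrative, not gitlore's actual types:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Expand cross-references from a set of seed entity IDs, capped at
// `max_depth` hops (default 1 in the plan's config). Each edge list is
// the outgoing cross-references of one entity.
fn expand_refs(
    edges: &HashMap<i64, Vec<i64>>,
    seeds: &[i64],
    max_depth: usize,
) -> HashSet<i64> {
    let mut seen: HashSet<i64> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(i64, usize)> = seeds.iter().map(|&s| (s, 0)).collect();
    while let Some((id, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // cap reached: don't expand this node further
        }
        for &next in edges.get(&id).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}
```

Because the cap is enforced per-node rather than by recursion, the traversal cost is bounded by the number of entities within `max_depth` hops of the seeds, which keeps timeline expansion predictable regardless of graph size.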
### Scope Boundaries

**In scope:**
- Batch temporal queries over historical data
- Structured event ingestion from GitLab APIs
- Cross-reference graph construction
- CLI commands with robot mode JSON output

**Out of scope (future phases):**
- Real-time monitoring / notifications ("alert me when my code changes")
- MCP server (Phase C — consumes the library API this phase produces)
- Web UI (Phase D — consumes the same library API)
- Pattern evolution / cross-project trend detection (Phase C)
- Library extraction refactor (happens organically as new commands are added)

---

## Gate 1: Resource Events Ingestion

### 1.1 Rationale: Why Not Parse System Notes?

The original approach was to parse system note body text with regex to extract state changes and label mutations. Research revealed this is the wrong approach:

1. **Structured APIs exist.** GitLab's Resource Events APIs return clean JSON with explicit `action`, `state`, and `label` fields. Available on all tiers (Free, Premium, Ultimate).
2. **System notes are localized.** A French GitLab instance says `"ajouté l'étiquette ~bug"` ("added the ~bug label") — regex breaks for non-English instances.
3. **Label events aren't in the Notes API.** Per [GitLab Issue #24661](https://gitlab.com/gitlab-org/gitlab/-/issues/24661), label change system notes are not returned by the Notes API. The Resource Label Events API is the only reliable source.
4. **No versioned format spec.** System note text has changed across GitLab 14.x–17.x with no documentation of format changes.

System note parsing is still used for events without structured APIs (see Gate 2), but with the explicit understanding that it's best-effort and fragile for non-English instances.
### 1.2 Schema (Migration 010)

**File:** `migrations/010_resource_events.sql`

```sql
-- State change events (opened, closed, reopened, merged, locked)
-- Source: GET /projects/:id/issues/:iid/resource_state_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_state_events
CREATE TABLE resource_state_events (
    id INTEGER PRIMARY KEY,
    gitlab_id INTEGER NOT NULL,
    project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
    merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
    state TEXT NOT NULL,              -- 'opened' | 'closed' | 'reopened' | 'merged' | 'locked'
    actor_gitlab_id INTEGER,          -- GitLab user ID (stable; usernames can change)
    actor_username TEXT,              -- display/search convenience
    created_at INTEGER NOT NULL,      -- ms epoch UTC
    -- "closed by MR" link: structured by GitLab, not parsed from text
    source_merge_request_id INTEGER,  -- GitLab's MR iid that caused this state change
    source_commit TEXT,               -- commit SHA that caused this state change
    UNIQUE(gitlab_id, project_id),
    CHECK (
        (issue_id IS NOT NULL AND merge_request_id IS NULL)
        OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
    )
);

CREATE INDEX idx_state_events_issue ON resource_state_events(issue_id)
    WHERE issue_id IS NOT NULL;
CREATE INDEX idx_state_events_mr ON resource_state_events(merge_request_id)
    WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_state_events_created ON resource_state_events(created_at);

-- Label change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_label_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_label_events
CREATE TABLE resource_label_events (
    id INTEGER PRIMARY KEY,
    gitlab_id INTEGER NOT NULL,
    project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
    merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
    label_name TEXT NOT NULL,
    action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
    actor_gitlab_id INTEGER,          -- GitLab user ID (stable; usernames can change)
    actor_username TEXT,              -- display/search convenience
    created_at INTEGER NOT NULL,      -- ms epoch UTC
    UNIQUE(gitlab_id, project_id),
    CHECK (
        (issue_id IS NOT NULL AND merge_request_id IS NULL)
        OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
    )
);

CREATE INDEX idx_label_events_issue ON resource_label_events(issue_id)
    WHERE issue_id IS NOT NULL;
CREATE INDEX idx_label_events_mr ON resource_label_events(merge_request_id)
    WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_label_events_created ON resource_label_events(created_at);
CREATE INDEX idx_label_events_label ON resource_label_events(label_name);

-- Milestone change events (add, remove)
-- Source: GET /projects/:id/issues/:iid/resource_milestone_events
-- Source: GET /projects/:id/merge_requests/:iid/resource_milestone_events
CREATE TABLE resource_milestone_events (
    id INTEGER PRIMARY KEY,
    gitlab_id INTEGER NOT NULL,
    project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE,
    merge_request_id INTEGER REFERENCES merge_requests(id) ON DELETE CASCADE,
    milestone_title TEXT NOT NULL,
    milestone_id INTEGER,
    action TEXT NOT NULL CHECK (action IN ('add', 'remove')),
    actor_gitlab_id INTEGER,          -- GitLab user ID (stable; usernames can change)
    actor_username TEXT,              -- display/search convenience
    created_at INTEGER NOT NULL,      -- ms epoch UTC
    UNIQUE(gitlab_id, project_id),
    CHECK (
        (issue_id IS NOT NULL AND merge_request_id IS NULL)
        OR (issue_id IS NULL AND merge_request_id IS NOT NULL)
    )
);

CREATE INDEX idx_milestone_events_issue ON resource_milestone_events(issue_id)
    WHERE issue_id IS NOT NULL;
CREATE INDEX idx_milestone_events_mr ON resource_milestone_events(merge_request_id)
    WHERE merge_request_id IS NOT NULL;
CREATE INDEX idx_milestone_events_created ON resource_milestone_events(created_at);
```
### 1.3 Config Extension

**File:** `src/core/config.rs`

Add to `SyncConfig`:

```rust
/// Fetch resource events (state, label, milestone changes) during sync.
/// Increases API calls but enables temporal queries (lore timeline, etc.).
/// Default: true
#[serde(default = "default_true")]
pub fetch_resource_events: bool,
```

**Config file example:**

```json
{
  "sync": {
    "fetchResourceEvents": true
  }
}
```
### 1.4 GitLab API Client

**New endpoints in `src/gitlab/client.rs`:**

```
GET /projects/:id/issues/:iid/resource_state_events?per_page=100
GET /projects/:id/issues/:iid/resource_label_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_state_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_label_events?per_page=100
GET /projects/:id/issues/:iid/resource_milestone_events?per_page=100
GET /projects/:id/merge_requests/:iid/resource_milestone_events?per_page=100
```

All endpoints use standard pagination. Fetch all pages per entity.

**New serde types in `src/gitlab/types.rs`:**

```rust
#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabStateEvent {
    pub id: i64,
    pub user: Option<GitLabAuthor>,
    pub created_at: String,
    pub resource_type: String,  // "Issue" | "MergeRequest"
    pub resource_id: i64,
    pub state: String,          // "opened" | "closed" | "reopened" | "merged" | "locked"
    pub source_commit: Option<String>,
    pub source_merge_request: Option<GitLabMergeRequestRef>,
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabLabelEvent {
    pub id: i64,
    pub user: Option<GitLabAuthor>,
    pub created_at: String,
    pub resource_type: String,
    pub resource_id: i64,
    pub label: GitLabLabelRef,
    pub action: String,  // "add" | "remove"
}

#[derive(Debug, Clone, Deserialize, Serialize)]
pub struct GitLabMilestoneEvent {
    pub id: i64,
    pub user: Option<GitLabAuthor>,
    pub created_at: String,
    pub resource_type: String,
    pub resource_id: i64,
    pub milestone: GitLabMilestoneRef,
    pub action: String,  // "add" | "remove"
}
```
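The serde types above carry `created_at` as an ISO 8601 string, while the Gate 1 tables store ms epoch UTC. A std-only conversion sketch, assuming the `YYYY-MM-DDThh:mm:ss(.sss)Z` shape GitLab returns (a real implementation would more likely use `chrono` or `time`):

```rust
// Days since 1970-01-01 for a civil date (Howard Hinnant's algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;               // [0, 399]
    let mp = (m + 9) % 12;                 // Mar=0 .. Feb=11
    let doy = (153 * mp + 2) / 5 + d - 1;  // [0, 365]
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468
}

// Parse e.g. "2024-01-15T10:30:00.500Z" into ms since the Unix epoch (UTC).
fn iso8601_to_ms_epoch(s: &str) -> Option<i64> {
    let s = s.strip_suffix('Z')?;
    let (date, time) = s.split_once('T')?;
    let mut dp = date.split('-');
    let y = dp.next()?.parse::<i64>().ok()?;
    let m = dp.next()?.parse::<i64>().ok()?;
    let d = dp.next()?.parse::<i64>().ok()?;
    let (hms, frac) = time.split_once('.').unwrap_or((time, ""));
    let mut tp = hms.split(':');
    let hh = tp.next()?.parse::<i64>().ok()?;
    let mm = tp.next()?.parse::<i64>().ok()?;
    let ss = tp.next()?.parse::<i64>().ok()?;
    // Pad/truncate fractional seconds to exactly three digits (milliseconds).
    let millis: i64 = format!("{:0<3.3}", frac).parse().unwrap_or(0);
    Some((days_from_civil(y, m, d) * 86_400 + hh * 3600 + mm * 60 + ss) * 1000 + millis)
}
```

Storing integers rather than the raw strings is what makes the `created_at` indexes above usable for range scans in timeline queries.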
### 1.5 Ingestion Pipeline

**Architecture:** Generic dependent-fetch queue, generalizing the `pending_discussion_fetches` pattern. A single queue table serves all dependent resource types across Gates 1, 2, and 4, avoiding schema churn as new fetch types are added.

**New queue table (in migration 010):**

```sql
-- Generic queue for all dependent resource fetches (events, closes_issues, diffs)
-- Replaces per-type queue tables with a unified job model
CREATE TABLE pending_dependent_fetches (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    entity_type TEXT NOT NULL CHECK (entity_type IN ('issue', 'merge_request')),
    entity_iid INTEGER NOT NULL,
    entity_local_id INTEGER NOT NULL,
    job_type TEXT NOT NULL CHECK (job_type IN (
        'resource_events',   -- Gate 1: state + label + milestone events
        'mr_closes_issues',  -- Gate 2: closes_issues API
        'mr_diffs'           -- Gate 4: MR file changes
    )),
    payload_json TEXT,       -- job-specific params, e.g. {"event_types":["state","label","milestone"]}
    enqueued_at INTEGER NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0,
    last_error TEXT,
    next_retry_at INTEGER,
    locked_at INTEGER,       -- crash recovery: NULL = available, non-NULL = in progress
    UNIQUE(project_id, entity_type, entity_iid, job_type)
);
```

The `locked_at` column provides crash recovery: if a sync process crashes mid-drain, stale locks (older than 5 minutes) are automatically reclaimed on the next `lore sync` run. This is intentionally minimal — full job leasing with `locked_by` and lease expiration is unnecessary for a single-process CLI tool.
**Flow:**

1. During issue/MR ingestion, when an entity is upserted (new or updated), enqueue jobs in `pending_dependent_fetches`:
   - For all entities: `job_type = 'resource_events'` (when `fetchResourceEvents` is true)
   - For MRs: `job_type = 'mr_closes_issues'` (always, for Gate 2)
   - For MRs: `job_type = 'mr_diffs'` (when `fetchMrFileChanges` is true, for Gate 4)
2. After primary ingestion completes, drain the dependent fetch queue:
   - Claim jobs: `UPDATE ... SET locked_at = now WHERE locked_at IS NULL AND (next_retry_at IS NULL OR next_retry_at <= now)`; locks older than the 5-minute stale window are treated as unlocked and reclaimed
   - For each job, dispatch by `job_type` to the appropriate fetcher
   - On success: DELETE the job row
   - On transient failure: increment `attempts`, set `next_retry_at` with exponential backoff, clear `locked_at`
3. `lore sync` drains dependent jobs after the ingestion and discussion fetch steps.

**Incremental behavior:** Only entities that changed since the last sync are enqueued. On `--full` sync, all entities are re-enqueued.
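
The claim and retry rules above reduce to two pure predicates. A sketch, where the 30-second backoff base, the one-hour cap, and the 5-minute stale window are illustrative constants rather than settled values:

```rust
const STALE_LOCK_MS: i64 = 5 * 60 * 1000; // reclaim locks older than 5 minutes

/// Exponential backoff for a failed job: 30s, 60s, 120s, ... capped at 1h.
fn next_retry_at(now_ms: i64, attempts: u32) -> i64 {
    let base_ms: i64 = 30_000;
    let cap_ms: i64 = 3_600_000;
    let delay_ms = base_ms.saturating_mul(1i64 << attempts.min(20));
    now_ms + delay_ms.min(cap_ms)
}

/// A job row is claimable when it is unlocked (or its lock is stale after a
/// crashed sync) and its retry time, if any, has passed.
fn is_claimable(locked_at: Option<i64>, retry_at: Option<i64>, now_ms: i64) -> bool {
    let unlocked = match locked_at {
        None => true,
        Some(t) => now_ms - t > STALE_LOCK_MS, // stale lock: reclaim
    };
    unlocked && retry_at.map_or(true, |t| t <= now_ms)
}

fn main() {
    assert_eq!(next_retry_at(0, 0), 30_000);
    assert_eq!(next_retry_at(0, 1), 60_000);
    assert_eq!(next_retry_at(0, 30), 3_600_000); // capped at 1h
    assert!(is_claimable(None, None, 0));
    assert!(!is_claimable(Some(0), None, 1_000));             // freshly locked
    assert!(is_claimable(Some(0), None, STALE_LOCK_MS + 1));  // stale lock reclaimed
}
```

In the real drain loop these predicates live inside the `UPDATE ... WHERE` claim statement; keeping them as plain functions makes the policy unit-testable without a database.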

### 1.6 API Call Budget

Per entity: 3 API calls (state, label, and milestone events), for issues and MRs alike.

| Scenario | Entities | API Calls | Time at 2k req/min |
|----------|----------|-----------|---------------------|
| Initial sync, 500 issues + 200 MRs | 700 | 2,100 | ~1 min |
| Initial sync, 2,000 issues + 1,000 MRs | 3,000 | 9,000 | ~4.5 min |
| Incremental sync, 20 changed entities | 20 | 60 | <2 sec |

Acceptable for an initial sync; incremental sync adds negligible overhead.
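
The table's figures can be sanity-checked with the per-entity call count and rate limit as parameters (the function name is illustrative):

```rust
/// Estimated API calls and wall-clock minutes for an event sync.
/// calls_per_entity = 3 (state + label + milestone) and rate_per_min = 2_000
/// match the assumptions in the table above.
fn event_sync_budget(entities: u64, calls_per_entity: u64, rate_per_min: u64) -> (u64, f64) {
    let calls = entities * calls_per_entity;
    (calls, calls as f64 / rate_per_min as f64)
}

fn main() {
    let (calls, minutes) = event_sync_budget(700, 3, 2_000);
    assert_eq!(calls, 2_100);
    assert!((minutes - 1.05).abs() < 1e-9); // ~1 min

    let (calls, minutes) = event_sync_budget(3_000, 3, 2_000);
    assert_eq!(calls, 9_000);
    assert!((minutes - 4.5).abs() < 1e-9);  // ~4.5 min
}
```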

**Optimization (future):** If milestone events prove low-value, make them opt-in to cut the call volume by a third.

### 1.7 Acceptance Criteria

- [ ] Migration 010 creates all three event tables + generic dependent fetch queue
- [ ] `lore sync` fetches resource events for changed entities when `fetchResourceEvents` is true
- [ ] `lore sync --no-events` skips event fetching
- [ ] Event fetch failures are queued for retry with exponential backoff
- [ ] Stale locks (crashed sync) automatically reclaimed on next run
- [ ] `lore count events` shows event counts by type
- [ ] `lore stats --check` validates event table referential integrity
- [ ] `lore stats --check` validates dependent job queue health (no stuck locks, retryable jobs visible)
- [ ] Robot mode JSON for all new commands

---

## Gate 2: Cross-Reference Extraction

### 2.1 Rationale

Temporal queries need to follow links between entities: "MR !567 closed issue #234", "issue #234 mentioned in MR !567", "#299 was opened as a follow-up to !567". These relationships are captured in two places:

1. **Structured API:** `GET /projects/:id/merge_requests/:iid/closes_issues` returns issues that close when the MR merges. Also, `resource_state_events` includes `source_merge_request_id` for "closed by MR" events.
2. **System notes:** Cross-references like "mentioned in !456" and "closed by !789" appear in system note body text.

### 2.2 Schema (in Migration 010)

```sql
-- Cross-references between entities
-- Populated from: closes_issues API, state events, system note parsing
--
-- Directionality convention:
--   source = the entity where the reference was *observed* (contains the note, or is the MR in closes_issues)
--   target = the entity being *referenced* (the issue closed, the MR mentioned)
-- This is consistent across all source_methods and enables predictable BFS traversal.
--
-- Unresolved references: when a cross-reference points to an entity in a project
-- that isn't synced locally, target_entity_id is NULL but target_project_path and
-- target_entity_iid are populated. This preserves valuable edges rather than
-- silently dropping them. Timeline output marks these as "[external]".
CREATE TABLE entity_references (
    id INTEGER PRIMARY KEY,
    source_entity_type TEXT NOT NULL CHECK (source_entity_type IN ('issue', 'merge_request')),
    source_entity_id INTEGER NOT NULL,  -- local DB id
    target_entity_type TEXT NOT NULL CHECK (target_entity_type IN ('issue', 'merge_request')),
    target_entity_id INTEGER,           -- local DB id (NULL when target is unresolved/external)
    target_project_path TEXT,           -- e.g. "group/other-repo" (populated for cross-project refs)
    target_entity_iid INTEGER,          -- GitLab iid (populated when target_entity_id is NULL)
    reference_type TEXT NOT NULL,       -- 'closes' | 'mentioned' | 'related'
    source_method TEXT NOT NULL,        -- 'api_closes_issues' | 'api_state_event' | 'system_note_parse'
    created_at INTEGER                  -- when the reference was created (if known)
);

-- SQLite does not allow expressions in a table-level UNIQUE constraint,
-- so deduplication uses a unique index over COALESCE'd expressions instead.
CREATE UNIQUE INDEX idx_refs_dedupe ON entity_references(
    source_entity_type, source_entity_id, target_entity_type,
    COALESCE(target_entity_id, -1), COALESCE(target_project_path, ''),
    COALESCE(target_entity_iid, -1), reference_type
);

CREATE INDEX idx_refs_source ON entity_references(source_entity_type, source_entity_id);
CREATE INDEX idx_refs_target ON entity_references(target_entity_type, target_entity_id)
    WHERE target_entity_id IS NOT NULL;
CREATE INDEX idx_refs_unresolved ON entity_references(target_project_path, target_entity_iid)
    WHERE target_entity_id IS NULL;
```

### 2.3 Population Strategy

**Tier 1 — Structured APIs (reliable):**

1. **`closes_issues` endpoint:** After MR ingestion, fetch `GET /projects/:id/merge_requests/:iid/closes_issues`. Insert `reference_type = 'closes'`, `source_method = 'api_closes_issues'`. Source = MR, target = issue.
2. **State events:** When `resource_state_events` contains `source_merge_request_id`, insert `reference_type = 'closes'`, `source_method = 'api_state_event'`. Source = MR (referenced by iid), target = issue (that received the state change).

**Tier 2 — System note parsing (best-effort):**

Parse system notes where `is_system = 1` for cross-reference patterns.

**Directionality rule:** Source = entity containing the system note. Target = entity referenced by the note text. This is consistent with Tier 1's convention.

```
mentioned in !{iid}
mentioned in #{iid}
mentioned in {group}/{project}!{iid}
mentioned in {group}/{project}#{iid}
closed by !{iid}
closed by #{iid}
```

**Cross-project references:** When a system note references `{group}/{project}#{iid}` and the target project is not synced locally, store with `target_entity_id = NULL`, `target_project_path = '{group}/{project}'`, `target_entity_iid = {iid}`. These unresolved references are still valuable for timeline narratives — they indicate external dependencies and decision context even when we can't traverse further.

Insert with `source_method = 'system_note_parse'`. Accept that:
- This breaks on non-English GitLab instances
- Format may vary across GitLab versions
- Parse failures are logged at `debug` level for monitoring

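A minimal sketch of the Tier 2 parser for the patterns above, using plain string matching where the real implementation would likely use a regex; `ParsedRef` and the function name are illustrative:

```rust
/// Illustrative type; the real parser would live in the ingestion module.
#[derive(Debug, PartialEq)]
struct ParsedRef {
    reference_type: &'static str, // "mentioned" | "closes"
    target_is_mr: bool,           // '!' => merge request, '#' => issue
    project_path: Option<String>, // Some(..) for cross-project references
    iid: i64,
}

fn parse_system_note(body: &str) -> Option<ParsedRef> {
    let (reference_type, rest) = if let Some(r) = body.strip_prefix("mentioned in ") {
        ("mentioned", r)
    } else if let Some(r) = body.strip_prefix("closed by ") {
        ("closes", r)
    } else {
        return None; // not a cross-reference system note
    };
    // Split "{group}/{project}!{iid}" (or "#{iid}") at the last '!' or '#'.
    let sep = rest.rfind(|c| c == '!' || c == '#')?;
    let (path, iid_str) = rest.split_at(sep);
    let target_is_mr = iid_str.starts_with('!');
    let iid: i64 = iid_str[1..].trim().parse().ok()?;
    let project_path = (!path.is_empty()).then(|| path.to_string());
    Some(ParsedRef { reference_type, target_is_mr, project_path, iid })
}

fn main() {
    let r = parse_system_note("mentioned in group/other-repo#42").unwrap();
    assert_eq!(r.reference_type, "mentioned");
    assert_eq!(r.project_path.as_deref(), Some("group/other-repo"));
    assert_eq!(r.iid, 42);
    assert!(!r.target_is_mr);
    assert!(parse_system_note("changed the description").is_none());
}
```

A non-matching note returns `None`, which is where the `debug`-level parse-failure logging would hook in.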
**Tier 3 — Description/body parsing (deferred):**

Issue and MR descriptions often contain `#123` or `!456` references. Parsing these is lower confidence (mentions != relationships) and is deferred to a future iteration.

### 2.4 Ingestion Flow

The `closes_issues` fetch uses the generic dependent fetch queue (`job_type = 'mr_closes_issues'`):
- After MR ingestion, a `mr_closes_issues` job is enqueued alongside `resource_events` jobs
- One additional API call per MR: `GET /projects/:id/merge_requests/:iid/closes_issues`
- Cross-reference parsing from system notes runs as a local post-processing step (no API calls) after all dependent fetches complete

### 2.5 Acceptance Criteria

- [ ] `entity_references` table populated from `closes_issues` API for all synced MRs
- [ ] `entity_references` table populated from `resource_state_events` where `source_merge_request_id` is present
- [ ] System notes parsed for cross-reference patterns (English instances)
- [ ] Cross-project references stored as unresolved when target project is not synced
- [ ] `source_method` column tracks provenance of each reference
- [ ] References are deduplicated (same relationship from multiple sources stored once)
- [ ] Timeline JSON includes expansion provenance (`via`) for all expanded entities

---

## Gate 3: Decision Timeline (`lore timeline`)

### 3.1 Command Design

```bash
# Basic: keyword-driven timeline
lore timeline "auth migration"

# Scoped to project
lore timeline "auth migration" -p group/repo

# Limit date range
lore timeline "auth migration" --since 6m
lore timeline "auth migration" --since 2024-01-01

# Control cross-reference expansion depth
lore timeline "auth migration" --depth 0   # No expansion (matched entities only)
lore timeline "auth migration" --depth 1   # Follow direct references (default)
lore timeline "auth migration" --depth 2   # Two hops

# Control which edge types are followed during expansion
lore timeline "auth migration" --expand-mentions   # Also follow 'mentioned' edges (off by default)
# Default expansion follows 'closes' and 'related' edges only.
# 'mentioned' edges are excluded by default because they have high fan-out
# and often connect tangentially related entities.

# Limit results
lore timeline "auth migration" -n 50

# Robot mode
lore -J timeline "auth migration"
```

### 3.2 Query Flow

```
1. SEED: FTS5 keyword search → matched document IDs (issues, MRs, and notes/discussions)
   ↓
2. HYDRATE:
   - Map document IDs → source entities (issues, MRs)
   - Collect top matched notes as evidence candidates (bounded, default top 10)
     These are the actual decision-bearing comments that answer "why"
   ↓
3. EXPAND: Follow entity_references (BFS, depth-limited)
   → Discover related entities not matched by keywords
   → Default: follow 'closes' + 'related' edges; skip 'mentioned' unless --expand-mentions
   → Unresolved (external) references included in output but not traversed further
   ↓
4. COLLECT EVENTS: For all entities (seed + expanded):
   - Entity creation (created_at from issues/merge_requests)
   - State changes (resource_state_events)
   - Label changes (resource_label_events)
   - Milestone changes (resource_milestone_events)
   - Evidence notes: top FTS5-matched notes as discrete events (snippet + author + url)
   - Merge events (merged_at from merge_requests)
   ↓
5. INTERLEAVE: Sort all events chronologically
   ↓
6. RENDER: Format as timeline (human or JSON)
```

**Why evidence notes instead of "discussion activity summarized":** The forcing function is "What happened with X?" A timeline entry that says "3 new comments" doesn't answer *why* — it answers *how many*. By including the top FTS5-matched notes as first-class timeline events, the timeline surfaces the actual decision rationale, code review feedback, and architectural reasoning that motivated changes. This uses the existing search infrastructure (CP3) with no new indexing required.

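The depth-limited EXPAND step (step 3) can be sketched over an in-memory edge list; the real implementation would query `entity_references`, and the `EntityKey`/`Edge` types here are illustrative stand-ins for the local DB rows:

```rust
use std::collections::{HashSet, VecDeque};

type EntityKey = (&'static str, i64); // (entity_type, iid) — illustrative

struct Edge {
    source: EntityKey,
    target: EntityKey,
    reference_type: &'static str, // "closes" | "related" | "mentioned"
}

/// BFS over cross-references, capped at `max_depth` hops.
/// 'mentioned' edges are skipped unless expand_mentions is set.
fn expand(seeds: &[EntityKey], edges: &[Edge], max_depth: u32, expand_mentions: bool) -> HashSet<EntityKey> {
    let mut seen: HashSet<EntityKey> = seeds.iter().copied().collect();
    let mut queue: VecDeque<(EntityKey, u32)> = seeds.iter().map(|&s| (s, 0)).collect();
    while let Some((node, depth)) = queue.pop_front() {
        if depth == max_depth {
            continue; // hop budget exhausted for this branch
        }
        for e in edges {
            if !expand_mentions && e.reference_type == "mentioned" {
                continue;
            }
            // Follow edges in both directions so issue→MR and MR→issue links are found.
            let next = if e.source == node { e.target } else if e.target == node { e.source } else { continue };
            if seen.insert(next) {
                queue.push_back((next, depth + 1));
            }
        }
    }
    seen
}

fn main() {
    let edges = [
        Edge { source: ("merge_request", 567), target: ("issue", 234), reference_type: "closes" },
        Edge { source: ("merge_request", 567), target: ("issue", 299), reference_type: "closes" },
        Edge { source: ("issue", 299), target: ("issue", 999), reference_type: "mentioned" },
    ];
    let found = expand(&[("issue", 234)], &edges, 1, false);
    // Depth 1 reaches !567 but not #299 (two hops away) or #999 (mentioned edge).
    assert!(found.contains(&("merge_request", 567)));
    assert!(!found.contains(&("issue", 299)));
    assert!(!found.contains(&("issue", 999)));
}
```

Recording which `(from, reference_type)` pair discovered each node during the same traversal is what yields the `via` provenance in the JSON output.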
### 3.3 Event Model

The timeline doesn't store a separate unified event table. Instead, it queries across the existing tables at read time and produces a virtual event stream:

```rust
pub struct TimelineEvent {
    pub timestamp: i64,             // ms epoch
    pub entity_type: String,        // "issue" | "merge_request" | "discussion"
    pub entity_iid: i64,
    pub project_path: String,
    pub event_type: TimelineEventType,
    pub summary: String,            // human-readable one-liner
    pub actor: Option<String>,      // username
    pub url: Option<String>,
    pub is_seed: bool,              // matched by keyword (vs. expanded via reference)
}

pub enum TimelineEventType {
    Created,                        // entity opened/created
    StateChanged { state: String }, // closed, reopened, merged, locked
    LabelAdded { label: String },
    LabelRemoved { label: String },
    MilestoneSet { milestone: String },
    MilestoneRemoved { milestone: String },
    Merged,
    NoteEvidence {                  // FTS5-matched note surfacing decision rationale
        note_id: i64,
        snippet: String,            // first ~200 chars of the matching note body
        discussion_id: Option<i64>,
    },
    CrossReferenced { target: String },
}
```

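The read-time INTERLEAVE step is then just a sort of the assembled stream. A compact sketch with a trimmed-down stand-in for `TimelineEvent`:

```rust
// Trimmed-down stand-in for TimelineEvent, enough to show the interleave.
#[derive(Debug)]
struct Ev {
    timestamp: i64, // ms epoch
    summary: String,
}

/// Merge the per-table event lists (creation, state, label, milestone, notes,
/// merges) into one chronological stream.
fn interleave(sources: Vec<Vec<Ev>>) -> Vec<Ev> {
    let mut all: Vec<Ev> = sources.into_iter().flatten().collect();
    all.sort_by_key(|e| e.timestamp); // stable sort keeps same-timestamp order deterministic
    all
}

fn main() {
    let state = vec![Ev { timestamp: 300, summary: "closed #234".into() }];
    let created = vec![
        Ev { timestamp: 100, summary: "created #234".into() },
        Ev { timestamp: 200, summary: "created !567".into() },
    ];
    let merged = interleave(vec![state, created]);
    let order: Vec<i64> = merged.iter().map(|e| e.timestamp).collect();
    assert_eq!(order, vec![100, 200, 300]);
}
```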
### 3.4 Human Output Format

```
lore timeline "auth migration"

Timeline: "auth migration" (12 events across 4 entities)
───────────────────────────────────────────────────────

2024-03-15  CREATED  #234  Migrate to OAuth2                       @alice
            Labels: ~auth, ~breaking-change
2024-03-18  CREATED  !567  feat: add OAuth2 provider               @bob
            References: #234
2024-03-20  NOTE     #234  "Should we support SAML too? I think    @charlie
                           we should stick with OAuth2 for now..."
2024-03-22  LABEL    !567  added ~security-review                  @alice
2024-03-24  NOTE     !567  [src/auth/oauth.rs:45]                  @dave
                           "Consider refresh token rotation to
                           prevent session fixation attacks"
2024-03-25  MERGED   !567  feat: add OAuth2 provider               @alice
2024-03-26  CLOSED   #234  closed by !567                          @alice
2024-03-28  CREATED  #299  OAuth2 login fails for SSO users        @dave  [expanded]
            (via !567, closes)

───────────────────────────────────────────────────────
Seed entities: #234, !567 | Expanded: #299 (depth 1, via !567)
```

Entities discovered via cross-reference expansion are marked `[expanded]` with a compact provenance note showing which seed entity and edge type led to their discovery.

Evidence notes (`NOTE` events) show the first ~200 characters of FTS5-matched note bodies. These are the actual decision-bearing comments that answer "why" — not just activity counts.

### 3.5 Robot Mode JSON

```json
{
  "ok": true,
  "data": {
    "query": "auth migration",
    "event_count": 12,
    "seed_entities": [
      { "type": "issue", "iid": 234, "project": "group/repo" },
      { "type": "merge_request", "iid": 567, "project": "group/repo" }
    ],
    "expanded_entities": [
      {
        "type": "issue",
        "iid": 299,
        "project": "group/repo",
        "depth": 1,
        "via": {
          "from": { "type": "merge_request", "iid": 567, "project": "group/repo" },
          "reference_type": "closes",
          "source_method": "api_closes_issues"
        }
      }
    ],
    "unresolved_references": [
      {
        "source": { "type": "merge_request", "iid": 567, "project": "group/repo" },
        "target_project": "group/other-repo",
        "target_type": "issue",
        "target_iid": 42,
        "reference_type": "mentioned"
      }
    ],
    "events": [
      {
        "timestamp": "2024-03-15T10:00:00Z",
        "entity_type": "issue",
        "entity_iid": 234,
        "project": "group/repo",
        "event_type": "created",
        "summary": "Migrate to OAuth2",
        "actor": "alice",
        "url": "https://gitlab.com/group/repo/-/issues/234",
        "is_seed": true,
        "details": {
          "labels": ["auth", "breaking-change"]
        }
      },
      {
        "timestamp": "2024-03-20T14:30:00Z",
        "entity_type": "issue",
        "entity_iid": 234,
        "project": "group/repo",
        "event_type": "note_evidence",
        "summary": "Should we support SAML too? I think we should stick with OAuth2 for now...",
        "actor": "charlie",
        "url": "https://gitlab.com/group/repo/-/issues/234#note_12345",
        "is_seed": true,
        "details": {
          "note_id": 12345,
          "snippet": "Should we support SAML too? I think we should stick with OAuth2 for now..."
        }
      }
    ]
  },
  "meta": {
    "search_mode": "lexical",
    "expansion_depth": 1,
    "expand_mentions": false,
    "total_entities": 3,
    "total_events": 12,
    "evidence_notes_included": 4,
    "unresolved_references": 1
  }
}
```

### 3.6 Acceptance Criteria

- [ ] `lore timeline <query>` returns chronologically ordered events
- [ ] Seed entities found via FTS5 keyword search (issues, MRs, and notes)
- [ ] State, label, and milestone events interleaved from resource event tables
- [ ] Entity creation and merge events included
- [ ] Evidence-bearing notes included as `note_evidence` events (top FTS5 matches, bounded default 10)
- [ ] Cross-reference expansion follows `entity_references` to configurable depth
- [ ] Default expansion follows `closes` + `related` edges; `--expand-mentions` adds `mentioned` edges
- [ ] `--depth 0` disables expansion
- [ ] `--since` filters by event timestamp
- [ ] `-p` scopes to project
- [ ] Human output is colored and readable
- [ ] Robot mode returns structured JSON with expansion provenance (`via`) for expanded entities
- [ ] Unresolved (external) references included in JSON output

---

## Gate 4: File Decision History (`lore file-history`)

### 4.1 Schema (Migration 011)

**File:** `migrations/011_file_changes.sql`

```sql
-- Files changed by each merge request
-- Source: GET /projects/:id/merge_requests/:iid/diffs
CREATE TABLE mr_file_changes (
    id INTEGER PRIMARY KEY,
    merge_request_id INTEGER NOT NULL REFERENCES merge_requests(id) ON DELETE CASCADE,
    project_id INTEGER NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
    old_path TEXT,          -- NULL for new files
    new_path TEXT NOT NULL,
    change_type TEXT NOT NULL CHECK (change_type IN ('added', 'modified', 'deleted', 'renamed')),
    UNIQUE(merge_request_id, new_path)
);

CREATE INDEX idx_mr_files_new_path ON mr_file_changes(new_path);
CREATE INDEX idx_mr_files_old_path ON mr_file_changes(old_path)
    WHERE old_path IS NOT NULL;
CREATE INDEX idx_mr_files_mr ON mr_file_changes(merge_request_id);

-- Add commit SHAs to merge_requests (cherry-picked from Phase A)
-- These link MRs to actual git history
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT;
```

### 4.2 Config Extension

```json
{
  "sync": {
    "fetchMrFileChanges": true
  }
}
```

Opt-in. When enabled, the sync pipeline fetches `GET /projects/:id/merge_requests/:iid/diffs` for each changed MR and extracts file metadata. Diff content is **not stored** — only file paths and change types.

### 4.3 Ingestion

**Uses the generic dependent fetch queue (`job_type = 'mr_diffs'`):**

1. After MR ingestion, if `fetchMrFileChanges` is true, enqueue a `mr_diffs` job in `pending_dependent_fetches`.
2. When the job runs, fetch the MR's diffs and parse each entry's `{old_path, new_path, new_file, renamed_file, deleted_file}` fields.
3. Derive `change_type`:
   - `new_file == true` → `'added'`
   - `renamed_file == true` → `'renamed'`
   - `deleted_file == true` → `'deleted'`
   - else → `'modified'`
4. Upsert into `mr_file_changes`. On re-sync, DELETE existing rows for the MR and re-insert (diffs can change if the MR is rebased).

**API call cost:** 1 additional call per MR. Acceptable for incremental sync (10–50 MRs/day).

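The `change_type` derivation in step 3 is a flag cascade with a fixed precedence. A sketch, where `DiffEntry` mirrors the API fields named above:

```rust
/// Mirrors the per-file boolean flags returned by the GitLab diffs API.
struct DiffEntry {
    new_file: bool,
    renamed_file: bool,
    deleted_file: bool,
}

/// Flag precedence follows step 3: added > renamed > deleted > modified.
fn change_type(d: &DiffEntry) -> &'static str {
    if d.new_file {
        "added"
    } else if d.renamed_file {
        "renamed"
    } else if d.deleted_file {
        "deleted"
    } else {
        "modified"
    }
}

fn main() {
    assert_eq!(change_type(&DiffEntry { new_file: true, renamed_file: false, deleted_file: false }), "added");
    assert_eq!(change_type(&DiffEntry { new_file: false, renamed_file: true, deleted_file: false }), "renamed");
    assert_eq!(change_type(&DiffEntry { new_file: false, renamed_file: false, deleted_file: false }), "modified");
}
```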
### 4.4 Command Design

```bash
# Show decision history for a file
lore file-history src/auth/oauth.rs

# Scoped to project (required if file path exists in multiple projects)
lore file-history src/auth/oauth.rs -p group/repo

# Include discussions on the MRs
lore file-history src/auth/oauth.rs --discussions

# Follow rename chains (default: on)
lore file-history src/auth/oauth.rs                      # follows renames automatically
lore file-history src/auth/oauth.rs --no-follow-renames  # disable rename chain resolution

# Limit results
lore file-history src/auth/oauth.rs -n 10

# Filter to merged MRs only
lore file-history src/auth/oauth.rs --merged

# Robot mode
lore -J file-history src/auth/oauth.rs
```

### 4.5 Query Logic

```sql
SELECT
    mr.iid,
    mr.title,
    mr.state,
    mr.author_username,
    mr.merged_at,
    mr.created_at,
    mr.web_url,
    mr.merge_commit_sha,
    mfc.change_type,
    mfc.old_path,
    (SELECT COUNT(*) FROM discussions d
     WHERE d.merge_request_id = mr.id) AS discussion_count,
    (SELECT COUNT(*) FROM notes n
     JOIN discussions d ON n.discussion_id = d.id
     WHERE d.merge_request_id = mr.id
       AND n.position_new_path = ?1) AS file_discussion_count
FROM mr_file_changes mfc
JOIN merge_requests mr ON mr.id = mfc.merge_request_id
WHERE mfc.new_path = ?1 OR mfc.old_path = ?1
ORDER BY COALESCE(mr.merged_at, mr.created_at) DESC;
```

For each MR, optionally fetch related issues via `entity_references` (Gate 2 data).

### 4.6 Rename Handling

File renames are tracked via `old_path` and resolved as bounded chains:

1. Start with the query path in the path set: `{src/auth/oauth.rs}`
2. Search `mr_file_changes` for rows where `change_type = 'renamed'` and either `new_path` or `old_path` is in the path set
3. Add the other side of each rename to the path set
4. Repeat until no new paths are discovered, up to a maximum of 10 hops (configurable)
5. Use the full path set for the file history query

**Safeguards:**
- Hop cap (default 10) prevents runaway expansion
- Cycle detection: if a path is already in the set, skip it
- The unioned path set is used for matching MRs in the main query

**Output:**
- Human mode annotates the rename chain: `"src/auth/oauth.rs (renamed from src/auth/handler.rs ← src/auth.rs)"`
- Robot mode JSON includes `rename_chain`: `["src/auth.rs", "src/auth/handler.rs", "src/auth/oauth.rs"]`
- `--no-follow-renames` disables chain resolution (matches only the literal path provided)

### 4.7 Acceptance Criteria

- [ ] `mr_file_changes` table populated from GitLab diffs API
- [ ] `merge_commit_sha` and `squash_commit_sha` captured in `merge_requests`
- [ ] `lore file-history <path>` returns MRs ordered by merge/creation date
- [ ] Output includes: MR title, state, author, change type, discussion count
- [ ] `--discussions` shows inline discussion snippets from DiffNotes on the file
- [ ] Rename chains resolved with bounded hop count (default 10) and cycle detection
- [ ] `--no-follow-renames` disables chain resolution
- [ ] Robot mode JSON includes `rename_chain` when renames are detected
- [ ] Robot mode JSON output
- [ ] `-p` required when path exists in multiple projects (Ambiguous error)

---

## Gate 5: Code Trace (`lore trace`)

### 5.1 Overview

`lore trace` answers "Why was this code introduced?" by tracing from a file (and optionally a line number) back through the MR and issue that motivated the change.

### 5.2 Two-Tier Architecture

**Tier 1 — API-only (no local git required):**

Uses `merge_commit_sha` and `squash_commit_sha` from the `merge_requests` table to link MRs to commits. Combined with `mr_file_changes`, this can answer "which MRs touched this file" and link to their motivating issues via `entity_references`.

This is equivalent to `lore file-history` enriched with issue context — effectively a file-scoped decision timeline.

**Tier 2 — Git integration (requires local clone):**

Uses `git blame` to map a specific line to a commit SHA, then resolves the commit to an MR via `merge_commit_sha` lookup. This provides line-level precision.

**Gate 5 ships Tier 1 only.** Tier 2 (git integration via `git2-rs`) is a future enhancement.

### 5.3 Command Design

```bash
# Trace a file's history (Tier 1: API-only)
lore trace src/auth/oauth.rs

# Trace a specific line (Tier 2: requires local git)
lore trace src/auth/oauth.rs:45

# Robot mode
lore -J trace src/auth/oauth.rs
```

### 5.4 Query Flow (Tier 1)

```
1. Find MRs that touched this file (mr_file_changes)
   ↓
2. For each MR, find related issues (entity_references WHERE reference_type = 'closes')
   ↓
3. For each issue, fetch discussions with rationale
   ↓
4. Build trace chain: file → MR → issue → discussions
   ↓
5. Order by merge date (most recent first)
```

### 5.5 Output Format (Human)

```
lore trace src/auth/oauth.rs

Trace: src/auth/oauth.rs
────────────────────────

!567  feat: add OAuth2 provider            MERGED  2024-03-25
  → Closes #234: Migrate to OAuth2
  → 12 discussion comments, 4 on this file
  → Decision: Use rust-oauth2 crate (discussed in #234, comment by @alice)

!612  fix: token refresh race condition    MERGED  2024-04-10
  → Closes #299: OAuth2 login fails for SSO users
  → 5 discussion comments, 2 on this file
  → [src/auth/oauth.rs:45] "Add mutex around refresh to prevent double-refresh"

!701  refactor: extract TokenManager       MERGED  2024-05-01
  → Related: #312: Reduce auth module complexity
  → 3 discussion comments
  → Note: file was renamed from src/auth/handler.rs
```

### 5.6 Tier 2 Design Notes (Future — Not in This Phase)

When git integration is added:

1. Add `git2-rs` dependency for native git operations
2. Implement `git blame -L <line>,<line> <file>` to get the commit SHA for a specific line
3. Look up the commit SHA in `merge_requests.merge_commit_sha` or `merge_requests.squash_commit_sha`
4. If there is no match (the commit was squashed), search `merge_commit_sha` for commits in the blame range
5. Optional `blame_cache` table for performance (invalidated by content hash)

**Known limitation:** Squash commits break blame-to-MR mapping for individual commits within an MR. The squash commit SHA maps to the MR, but all lines show the same commit. This is a fundamental Git limitation documented in [GitLab Forum #77146](https://forum.gitlab.com/t/preserve-blame-in-squash-merge/77146).

### 5.7 Acceptance Criteria (Tier 1 Only)

- [ ] `lore trace <file>` shows MRs that touched the file with linked issues and discussion context
- [ ] Output includes the MR → issue → discussion chain
- [ ] Discussion snippets show DiffNote content on the traced file
- [ ] Cross-references from `entity_references` used for MR→issue linking
- [ ] Robot mode JSON output
- [ ] Graceful handling when no MR data found ("Run `lore sync` with `fetchMrFileChanges: true`")

---

## Migration Strategy

### Migration Numbering

Phase B uses migration numbers starting at 010:

| Migration | Content | Gate |
|-----------|---------|------|
| 010 | Resource event tables, generic dependent fetch queue, entity_references | Gates 1, 2 |
| 011 | mr_file_changes, merge_commit_sha, squash_commit_sha | Gate 4 |

Phase A's complete field capture migration should use 012+ when implemented, skipping fields already added by 011 (`merge_commit_sha`, `squash_commit_sha`).

### Backward Compatibility

- All schema changes are additive (new tables and `ADD COLUMN` only; no existing columns are altered or dropped)
- `lore sync` works without event data — temporal commands gracefully report "No event data. Run `lore sync` to populate."
- Existing search, issues, and mrs commands are unaffected

---

## Risks and Mitigations
|
||||
|
||||
### Identified During Premortem
|
||||
|
||||
| Risk | Severity | Mitigation |
|
||||
|------|----------|------------|
|
||||
| API call volume explosion (3 event calls per entity) | Medium | Incremental sync limits to changed entities; opt-in config flag |
|
||||
| System note parsing fragile for non-English instances | Medium | Used only for assignee changes and cross-refs; `source_method` tracks provenance |
|
||||
| GitLab diffs API returns large payloads | Low | Extract file metadata only, discard diff content |
|
||||
| Cross-reference graph traversal unbounded | Medium | BFS depth capped at configurable limit (default 1); `mentioned` edges excluded by default |
|
||||
| Cross-project references lost when target not synced | Medium | Unresolved references stored with `target_entity_id = NULL`; still appear in timeline output |
|
||||
| Phase A migration numbering conflict | Low | Phase B uses 010-011; Phase A uses 012+ |
|
||||
| Timeline output lacks "why" evidence | Medium | Evidence-bearing notes from FTS5 included as first-class timeline events |
|
||||
| Squash commits break blame-to-MR mapping | Medium | Tier 2 (git integration) deferred; Tier 1 uses file-level MR matching |
|
||||
|
||||
### Accepted Limitations

- **No real-time monitoring.** Phase B is batch queries over historical data. "Notify me when my code changes" requires a different architecture (webhooks, polling daemon) and is out of scope.
- **No pattern evolution.** Cross-project trend detection requires all of Phase B's infrastructure plus semantic clustering. Deferred to Phase C.
- **English-only system note parsing.** Cross-reference extraction from system notes works reliably only for English-language GitLab instances. Structured API data works for all languages.
- **Bounded rename chain resolution.** `lore file-history` resolves rename chains up to 10 hops with cycle detection. Pathological rename histories (>10 hops) are truncated.
- **Evidence notes are keyword-matched, not summarized.** Timeline evidence notes are the raw FTS5-matched note text, not AI-generated summaries. This keeps the system deterministic and avoids LLM dependencies.

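The bounded rename-chain resolution described above can be sketched roughly like this. The `resolve_renames` helper and its in-memory rename map are hypothetical stand-ins for the rename records `lore file-history` would read from SQLite; the 10-hop cap, cycle detection, and truncation behaviour are the documented limits.

```rust
use std::collections::{HashMap, HashSet};

const MAX_RENAME_HOPS: usize = 10;

/// Follow rename records (old_path -> new_path) to a file's final path.
/// Returns (resolved_path, truncated): `truncated` is true when the chain
/// was cut off by a cycle or by the hop limit.
fn resolve_renames(start: &str, renames: &HashMap<String, String>) -> (String, bool) {
    let mut current = start.to_string();
    let mut seen: HashSet<String> = HashSet::new();
    seen.insert(current.clone());
    for _ in 0..MAX_RENAME_HOPS {
        let Some(next) = renames.get(&current) else {
            return (current, false); // no further rename: fully resolved
        };
        if !seen.insert(next.clone()) {
            return (current, true); // cycle detected: stop at last good path
        }
        current = next.clone();
    }
    (current, true) // truncated at the hop limit
}

fn main() {
    let renames: HashMap<String, String> = [
        ("src/old.rs".to_string(), "src/mid.rs".to_string()),
        ("src/mid.rs".to_string(), "src/new.rs".to_string()),
    ]
    .into();
    assert_eq!(
        resolve_renames("src/old.rs", &renames),
        ("src/new.rs".to_string(), false)
    );
}
```
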

---

## Success Metrics

| Metric | Target |
|--------|--------|
| `lore timeline` query latency | < 200ms for typical queries (< 50 seed entities) |
| Timeline event coverage | State + label + creation + merge + evidence note events for all synced entities |
| Timeline evidence quality | Top 10 FTS5-matched notes included per query; at least 1 evidence note for queries matching discussion-bearing entities |
| Cross-reference coverage | > 80% of "closed by MR" relationships captured via structured API |
| Unresolved reference capture | Cross-project references stored even when target project is not synced |
| Incremental sync overhead | < 5% increase in sync time for event fetching |
| `lore file-history` coverage | File changes captured for all synced MRs (when opt-in enabled) |
| Rename chain resolution | Multi-hop renames correctly resolved up to 10 hops |

---

## Future Phases (Out of Scope)

### Phase C: Advanced Temporal Features

- Pattern Evolution: cross-project trend detection via embedding clusters
- Git integration (Tier 2): `git blame` → commit → MR resolution
- MCP server: expose `timeline`, `file-history`, `trace` as typed MCP tools

### Phase D: Consumer Applications

- Web UI: separate frontend consuming lore's JSON API via `lore serve`
- Real-time monitoring: webhook listener or polling daemon for change notifications
- IDE integration: editor plugins surfacing temporal context inline
14
migrations/010_chunk_config.sql
Normal file
@@ -0,0 +1,14 @@
-- Migration 010: Chunk config tracking + adaptive dedup support
-- Schema version: 10

ALTER TABLE embedding_metadata ADD COLUMN chunk_max_bytes INTEGER;
ALTER TABLE embedding_metadata ADD COLUMN chunk_count INTEGER;

-- Partial index: accelerates drift detection and adaptive dedup queries on sentinel rows
CREATE INDEX idx_embedding_metadata_sentinel
    ON embedding_metadata(document_id, chunk_index)
    WHERE chunk_index = 0;

INSERT INTO schema_version (version, applied_at, description)
VALUES (10, strftime('%s', 'now') * 1000,
        'Add chunk_max_bytes and chunk_count to embedding_metadata');
1260
phase-a-review.html
Normal file
File diff suppressed because it is too large
36
skills/agent-swarm-launcher/SKILL.md
Normal file
@@ -0,0 +1,36 @@
---
name: agent-swarm-launcher
description: "Launch a multi-agent “swarm” workflow for a repository: read and follow AGENTS.md/README.md, perform an architecture/codebase reconnaissance, then coordinate work via Agent Mail / beads-style task tracking when those tools are available. Use when you want to quickly bootstrap a coordinated agent workflow, avoid communication deadlocks, and start making progress on prioritized tasks."
---

# Agent Swarm Launcher

## Workflow (do in order)

1. Read *all* `AGENTS.md` and `README.md` files carefully and completely.
   - If multiple `AGENTS.md` files exist, treat deeper ones as higher priority within their directory scope.
   - Note any required workflows (e.g., TDD), tooling conventions, and “robot mode” flags.

2. Enter “code investigation” mode and understand the project.
   - Identify entrypoints, key packages/modules, and how data flows.
   - Note build/test commands and any local dev constraints.
   - Summarize the technical architecture and purpose of the project.

3. Register with Agent Mail and coordinate, if available.
   - If “MCP Agent Mail” exists in this environment, register and introduce yourself to the other agents.
   - Check Agent Mail and promptly respond to any messages.
   - If “beads” tracking is used by the repo/team, open/continue the current bead(s) and mark progress as you go.
   - If Agent Mail/beads are not available, state that plainly and proceed with a lightweight local substitute (a short task checklist in the thread).

4. Start work (do not get stuck waiting).
   - Acknowledge incoming requests promptly.
   - Do not get stuck in “communication purgatory” where nothing is getting done.
   - If you are blocked on prioritization, look for a prioritization tool mentioned in `AGENTS.md` (for example “bv”) and use it; otherwise propose the next best task(s) and proceed.
   - If `AGENTS.md` references a task system (e.g., beads), pick the next task you can complete usefully and start.

## Execution rules

- Follow repository instructions over this skill if they conflict.
- Prefer action + short status updates over prolonged coordination.
- If a referenced tool does not exist, do not hallucinate it—fall back and keep moving.
- Do not claim you registered with or heard from other agents unless you actually did via the available tooling.
4
skills/agent-swarm-launcher/agents/openai.yaml
Normal file
@@ -0,0 +1,4 @@
interface:
  display_name: "Agent Swarm Launcher"
  short_description: "Kick off multi-agent repo onboarding"
  default_prompt: "Use $agent-swarm-launcher to onboard, coordinate, and start the next prioritized task."
@@ -21,6 +21,7 @@ pub struct EmbedCommandResult {
/// Run the embed command.
pub async fn run_embed(
    config: &Config,
    full: bool,
    retry_failed: bool,
) -> Result<EmbedCommandResult> {
    let db_path = get_db_path(config.storage.db_path.as_deref());

@@ -37,8 +38,18 @@ pub async fn run_embed(
    // Health check — fail fast if Ollama is down or model missing
    client.health_check().await?;

    // If retry_failed, clear errors so they become pending again
    if retry_failed {
    if full {
        // Clear ALL embeddings and metadata atomically for a complete re-embed.
        // Wrapped in a transaction so a crash between the two DELETEs can't
        // leave orphaned data.
        conn.execute_batch(
            "BEGIN;
             DELETE FROM embedding_metadata;
             DELETE FROM embeddings;
             COMMIT;",
        )?;
    } else if retry_failed {
        // Clear errors so they become pending again
        conn.execute(
            "UPDATE embedding_metadata SET last_error = NULL, attempt_count = 0
             WHERE last_error IS NOT NULL",
@@ -1,6 +1,7 @@
//! Sync command: unified orchestrator for ingest -> generate-docs -> embed.

use console::style;
use indicatif::{ProgressBar, ProgressStyle};
use serde::Serialize;
use tracing::{info, warn};

@@ -31,6 +32,22 @@ pub struct SyncResult {
    pub documents_embedded: usize,
}

/// Create a styled spinner for a sync stage.
fn stage_spinner(stage: u8, total: u8, msg: &str, robot_mode: bool) -> ProgressBar {
    if robot_mode {
        return ProgressBar::hidden();
    }
    let pb = ProgressBar::new_spinner();
    pb.set_style(
        ProgressStyle::default_spinner()
            .template("{spinner:.blue} {msg}")
            .expect("valid template"),
    );
    pb.enable_steady_tick(std::time::Duration::from_millis(80));
    pb.set_message(format!("[{stage}/{total}] {msg}"));
    pb
}

/// Run the full sync pipeline: ingest -> generate-docs -> embed.
pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResult> {
    let mut result = SyncResult::default();

@@ -41,41 +58,70 @@ pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResul
        IngestDisplay::progress_only()
    };

    let total_stages: u8 = if options.no_docs && options.no_embed {
        2
    } else if options.no_docs || options.no_embed {
        3
    } else {
        4
    };
    let mut current_stage: u8 = 0;

    // Stage 1: Ingest issues
    info!("Sync stage 1/4: ingesting issues");
    current_stage += 1;
    let spinner = stage_spinner(current_stage, total_stages, "Fetching issues from GitLab...", options.robot_mode);
    info!("Sync stage {current_stage}/{total_stages}: ingesting issues");
    let issues_result = run_ingest(config, "issues", None, options.force, options.full, ingest_display).await?;
    result.issues_updated = issues_result.issues_upserted;
    result.discussions_fetched += issues_result.discussions_fetched;
    spinner.finish_and_clear();

    // Stage 2: Ingest MRs
    info!("Sync stage 2/4: ingesting merge requests");
    current_stage += 1;
    let spinner = stage_spinner(current_stage, total_stages, "Fetching merge requests from GitLab...", options.robot_mode);
    info!("Sync stage {current_stage}/{total_stages}: ingesting merge requests");
    let mrs_result = run_ingest(config, "mrs", None, options.force, options.full, ingest_display).await?;
    result.mrs_updated = mrs_result.mrs_upserted;
    result.discussions_fetched += mrs_result.discussions_fetched;
    spinner.finish_and_clear();

    // Stage 3: Generate documents (unless --no-docs)
    if options.no_docs {
        info!("Sync stage 3/4: skipping document generation (--no-docs)");
    } else {
        info!("Sync stage 3/4: generating documents");
    if !options.no_docs {
        current_stage += 1;
        let spinner = stage_spinner(current_stage, total_stages, "Processing documents...", options.robot_mode);
        info!("Sync stage {current_stage}/{total_stages}: generating documents");
        let docs_result = run_generate_docs(config, false, None)?;
        result.documents_regenerated = docs_result.regenerated;
        spinner.finish_and_clear();
    } else {
        info!("Sync: skipping document generation (--no-docs)");
    }

    // Stage 4: Embed documents (unless --no-embed)
    if options.no_embed {
        info!("Sync stage 4/4: skipping embedding (--no-embed)");
    } else {
        info!("Sync stage 4/4: embedding documents");
        match run_embed(config, false).await {
    if !options.no_embed {
        current_stage += 1;
        let spinner = stage_spinner(current_stage, total_stages, "Generating embeddings...", options.robot_mode);
        info!("Sync stage {current_stage}/{total_stages}: embedding documents");
        match run_embed(config, options.full, false).await {
            Ok(embed_result) => {
                result.documents_embedded = embed_result.embedded;
                spinner.finish_and_clear();
            }
            Err(e) => {
                // Graceful degradation: Ollama down is a warning, not an error
                spinner.finish_and_clear();
                if !options.robot_mode {
                    eprintln!(
                        " {} Embedding skipped ({})",
                        style("warn").yellow(),
                        e
                    );
                }
                warn!(error = %e, "Embedding stage failed (Ollama may be unavailable), continuing");
            }
        }
    } else {
        info!("Sync: skipping embedding (--no-embed)");
    }

    info!(
@@ -483,6 +483,13 @@ pub struct SyncArgs {
/// Arguments for `lore embed`
#[derive(Parser)]
pub struct EmbedArgs {
    /// Re-embed all documents (clears existing embeddings first)
    #[arg(long, overrides_with = "no_full")]
    pub full: bool,

    #[arg(long = "no-full", hide = true, overrides_with = "full")]
    pub no_full: bool,

    /// Retry previously failed embeddings
    #[arg(long, overrides_with = "no_retry_failed")]
    pub retry_failed: bool,
@@ -10,6 +10,10 @@ use tracing::{debug, info};

use super::error::{LoreError, Result};

/// Latest schema version, derived from the embedded migrations count.
/// Used by the health check to verify databases are up-to-date.
pub const LATEST_SCHEMA_VERSION: i32 = MIGRATIONS.len() as i32;

/// Embedded migrations - compiled into the binary.
const MIGRATIONS: &[(&str, &str)] = &[
    ("001", include_str!("../../migrations/001_initial.sql")),

@@ -39,6 +43,10 @@ const MIGRATIONS: &[(&str, &str)] = &[
        "009",
        include_str!("../../migrations/009_embeddings.sql"),
    ),
    (
        "010",
        include_str!("../../migrations/010_chunk_config.sql"),
    ),
];

/// Create a database connection with production-grade pragmas.
@@ -3,6 +3,7 @@
use rusqlite::Connection;

use crate::core::error::Result;
use crate::embedding::chunking::{CHUNK_MAX_BYTES, EXPECTED_DIMS};

/// A document that needs embedding or re-embedding.
#[derive(Debug)]

@@ -12,17 +13,20 @@ pub struct PendingDocument {
    pub content_hash: String,
}

/// Find documents that need embedding: new (no metadata) or changed (hash mismatch).
/// Find documents that need embedding: new (no metadata), changed (hash mismatch),
/// or config-drifted (chunk_max_bytes/model/dims mismatch).
///
/// Uses keyset pagination (WHERE d.id > last_id) and returns up to `page_size` results.
pub fn find_pending_documents(
    conn: &Connection,
    page_size: usize,
    last_id: i64,
    model_name: &str,
) -> Result<Vec<PendingDocument>> {
    // Documents that either:
    // 1. Have no embedding_metadata at all (new)
    // 2. Have metadata where document_hash != content_hash (changed)
    // 3. Config drift: chunk_max_bytes, model, or dims mismatch (or pre-migration NULL)
    let sql = r#"
        SELECT d.id, d.content_text, d.content_hash
        FROM documents d

@@ -37,6 +41,16 @@ pub fn find_pending_documents(
            WHERE em.document_id = d.id AND em.chunk_index = 0
              AND em.document_hash != d.content_hash
        )
        OR EXISTS (
            SELECT 1 FROM embedding_metadata em
            WHERE em.document_id = d.id AND em.chunk_index = 0
              AND (
                em.chunk_max_bytes IS NULL
                OR em.chunk_max_bytes != ?3
                OR em.model != ?4
                OR em.dims != ?5
              )
        )
    )
    ORDER BY d.id
    LIMIT ?2

@@ -44,35 +58,56 @@ pub fn find_pending_documents(

    let mut stmt = conn.prepare(sql)?;
    let rows = stmt
        .query_map(rusqlite::params![last_id, page_size as i64], |row| {
            Ok(PendingDocument {
                document_id: row.get(0)?,
                content_text: row.get(1)?,
                content_hash: row.get(2)?,
            })
        })?
        .query_map(
            rusqlite::params![
                last_id,
                page_size as i64,
                CHUNK_MAX_BYTES as i64,
                model_name,
                EXPECTED_DIMS as i64,
            ],
            |row| {
                Ok(PendingDocument {
                    document_id: row.get(0)?,
                    content_text: row.get(1)?,
                    content_hash: row.get(2)?,
                })
            },
        )?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(rows)
}

/// Count total documents that need embedding.
pub fn count_pending_documents(conn: &Connection) -> Result<i64> {
pub fn count_pending_documents(conn: &Connection, model_name: &str) -> Result<i64> {
    let count: i64 = conn.query_row(
        r#"
        SELECT COUNT(*)
        FROM documents d
        WHERE NOT EXISTS (
            SELECT 1 FROM embedding_metadata em
            WHERE em.document_id = d.id AND em.chunk_index = 0
        )
        OR EXISTS (
            SELECT 1 FROM embedding_metadata em
            WHERE em.document_id = d.id AND em.chunk_index = 0
              AND em.document_hash != d.content_hash
        WHERE (
            NOT EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
            )
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND em.document_hash != d.content_hash
            )
            OR EXISTS (
                SELECT 1 FROM embedding_metadata em
                WHERE em.document_id = d.id AND em.chunk_index = 0
                  AND (
                    em.chunk_max_bytes IS NULL
                    OR em.chunk_max_bytes != ?1
                    OR em.model != ?2
                    OR em.dims != ?3
                  )
            )
        )
        "#,
        [],
        rusqlite::params![CHUNK_MAX_BYTES as i64, model_name, EXPECTED_DIMS as i64],
        |row| row.get(0),
    )?;
    Ok(count)
@@ -1,5 +1,7 @@
/// Multiplier for encoding (document_id, chunk_index) into a single rowid.
/// Supports up to 1000 chunks per document (32M chars at 32k/chunk).
/// Supports up to 1000 chunks per document. At CHUNK_MAX_BYTES=6000,
/// a 2MB document (MAX_DOCUMENT_BYTES_HARD) produces ~333 chunks.
/// The pipeline enforces chunk_count < CHUNK_ROWID_MULTIPLIER at runtime.
pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Encode (document_id, chunk_index) into a sqlite-vec rowid.
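The encoding the hunk above describes packs both ids into a single sqlite-vec rowid via the multiplier. A minimal sketch: `encode_rowid` restates the scheme from the source, while the `decode_rowid` helper is an assumed inverse added here for illustration.

```rust
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

/// Pack (document_id, chunk_index) into one rowid. Safe as long as the
/// pipeline enforces chunk_index < CHUNK_ROWID_MULTIPLIER.
fn encode_rowid(doc_id: i64, chunk_index: i64) -> i64 {
    doc_id * CHUNK_ROWID_MULTIPLIER + chunk_index
}

/// Recover (document_id, chunk_index) from a packed rowid.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}

fn main() {
    let rowid = encode_rowid(42, 7);
    assert_eq!(rowid, 42_007);
    assert_eq!(decode_rowid(rowid), (42, 7));
}
```

This is also why the overflow guard later in the diff must reject documents with 1000 or more chunks: chunk indices at or beyond the multiplier would collide with the next document's rowid range.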
@@ -2,11 +2,19 @@

/// Maximum bytes per chunk.
/// Named `_BYTES` because `str::len()` returns byte count; multi-byte UTF-8
/// sequences mean byte length ≥ char count.
pub const CHUNK_MAX_BYTES: usize = 32_000;
/// sequences mean byte length >= char count.
///
/// nomic-embed-text has an 8,192-token context window. English prose averages
/// ~4 chars/token, but technical content (code, URLs, JSON) can be 1-2
/// chars/token. We use 6,000 bytes as a conservative limit that stays safe
/// even for code-heavy chunks (~6,000 tokens worst-case).
pub const CHUNK_MAX_BYTES: usize = 6_000;

/// Expected embedding dimensions for nomic-embed-text.
pub const EXPECTED_DIMS: usize = 768;

/// Character overlap between adjacent chunks.
pub const CHUNK_OVERLAP_CHARS: usize = 500;
pub const CHUNK_OVERLAP_CHARS: usize = 200;

/// Split document content into chunks suitable for embedding.
///
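The arithmetic behind the new 6,000-byte limit can be checked directly. The constants below restate figures from the comment in the hunk above; the chars-per-token ratios are the stated rough estimates, not measurements, and the snippet is illustrative rather than part of the crate.

```rust
const CONTEXT_TOKENS: usize = 8_192; // nomic-embed-text context window
const CHUNK_MAX_BYTES: usize = 6_000;

fn main() {
    // Prose at ~4 chars/token vs. worst-case technical text at ~1 char/token.
    let prose_tokens = CHUNK_MAX_BYTES / 4;
    let worst_case_tokens = CHUNK_MAX_BYTES / 1;
    // 6,000 worst-case tokens still fit in the 8,192-token window.
    assert!(worst_case_tokens < CONTEXT_TOKENS);
    // The old 32,000-byte limit would overflow the window in the worst case.
    assert!(32_000 / 1 > CONTEXT_TOKENS);
    println!("prose ~{prose_tokens} tokens, worst case ~{worst_case_tokens} tokens");
}
```
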
@@ -1,18 +1,19 @@
//! Async embedding pipeline: chunk documents, embed via Ollama, store in sqlite-vec.

use std::collections::HashSet;

use rusqlite::Connection;
use sha2::{Digest, Sha256};
use tracing::{info, warn};

use crate::core::error::Result;
use crate::embedding::change_detector::{count_pending_documents, find_pending_documents};
use crate::embedding::chunk_ids::encode_rowid;
use crate::embedding::chunking::split_into_chunks;
use crate::embedding::chunk_ids::{encode_rowid, CHUNK_ROWID_MULTIPLIER};
use crate::embedding::chunking::{split_into_chunks, CHUNK_MAX_BYTES, EXPECTED_DIMS};
use crate::embedding::ollama::OllamaClient;

const BATCH_SIZE: usize = 32;
const DB_PAGE_SIZE: usize = 500;
const EXPECTED_DIMS: usize = 768;

/// Result of an embedding run.
#[derive(Debug, Default)]

@@ -26,6 +27,7 @@ pub struct EmbedResult {
struct ChunkWork {
    doc_id: i64,
    chunk_index: usize,
    total_chunks: usize,
    doc_hash: String,
    chunk_hash: String,
    text: String,

@@ -41,7 +43,7 @@ pub async fn embed_documents(
    model_name: &str,
    progress_callback: Option<Box<dyn Fn(usize, usize)>>,
) -> Result<EmbedResult> {
    let total = count_pending_documents(conn)? as usize;
    let total = count_pending_documents(conn, model_name)? as usize;
    let mut result = EmbedResult::default();
    let mut last_id: i64 = 0;
    let mut processed: usize = 0;

@@ -53,13 +55,21 @@ pub async fn embed_documents(
    info!(total, "Starting embedding pipeline");

    loop {
        let pending = find_pending_documents(conn, DB_PAGE_SIZE, last_id)?;
        let pending = find_pending_documents(conn, DB_PAGE_SIZE, last_id, model_name)?;
        if pending.is_empty() {
            break;
        }

        // Wrap all DB writes for this page in a savepoint so that
        // clear_document_embeddings + store_embedding are atomic. If the
        // process crashes mid-page, the savepoint is never released and
        // SQLite rolls back — preventing partial document states where old
        // embeddings are cleared but new ones haven't been written yet.
        conn.execute_batch("SAVEPOINT embed_page")?;

        // Build chunk work items for this page
        let mut all_chunks: Vec<ChunkWork> = Vec::new();
        let mut page_normal_docs: usize = 0;

        for doc in &pending {
            // Always advance the cursor, even for skipped docs, to avoid re-fetching
@@ -71,27 +81,65 @@ pub async fn embed_documents(
                continue;
            }

            // Clear existing embeddings for this document before re-embedding
            clear_document_embeddings(conn, doc.document_id)?;

            let chunks = split_into_chunks(&doc.content_text);
            let total_chunks = chunks.len();

            // Overflow guard: skip documents that produce too many chunks.
            // Must run BEFORE clear_document_embeddings so existing embeddings
            // are preserved when we skip.
            if total_chunks as i64 >= CHUNK_ROWID_MULTIPLIER {
                warn!(
                    doc_id = doc.document_id,
                    chunk_count = total_chunks,
                    max = CHUNK_ROWID_MULTIPLIER,
                    "Document produces too many chunks, skipping to prevent rowid collision"
                );
                // Record a sentinel error so the document is not re-detected as
                // pending on subsequent runs (prevents infinite re-processing).
                record_embedding_error(
                    conn,
                    doc.document_id,
                    0, // sentinel chunk_index
                    &doc.content_hash,
                    "overflow-sentinel",
                    model_name,
                    &format!(
                        "Document produces {} chunks, exceeding max {}",
                        total_chunks, CHUNK_ROWID_MULTIPLIER
                    ),
                )?;
                result.skipped += 1;
                processed += 1;
                if let Some(ref cb) = progress_callback {
                    cb(processed, total);
                }
                continue;
            }

            // Don't clear existing embeddings here — defer until the first
            // successful chunk embedding so that if ALL chunks for a document
            // fail, old embeddings survive instead of leaving zero data.

            for (chunk_index, text) in chunks {
                all_chunks.push(ChunkWork {
                    doc_id: doc.document_id,
                    chunk_index,
                    total_chunks,
                    doc_hash: doc.content_hash.clone(),
                    chunk_hash: sha256_hash(&text),
                    text,
                });
            }

            // Track progress per document (not per chunk) to match `total`
            processed += 1;
            if let Some(ref cb) = progress_callback {
                cb(processed, total);
            }
            page_normal_docs += 1;
            // Don't fire progress here — wait until embedding completes below.
        }

        // Track documents whose old embeddings have been cleared.
        // We defer clearing until the first successful chunk embedding so
        // that if ALL chunks for a document fail, old embeddings survive.
        let mut cleared_docs: HashSet<i64> = HashSet::new();

        // Process chunks in batches of BATCH_SIZE
        for batch in all_chunks.chunks(BATCH_SIZE) {
            let texts: Vec<String> = batch.iter().map(|c| c.text.clone()).collect();
@@ -129,6 +177,12 @@ pub async fn embed_documents(
                    continue;
                }

                // Clear old embeddings on first successful chunk for this document
                if !cleared_docs.contains(&chunk.doc_id) {
                    clear_document_embeddings(conn, chunk.doc_id)?;
                    cleared_docs.insert(chunk.doc_id);
                }

                store_embedding(
                    conn,
                    chunk.doc_id,

@@ -137,28 +191,99 @@ pub async fn embed_documents(
                    &chunk.chunk_hash,
                    model_name,
                    embedding,
                    chunk.total_chunks,
                )?;
                result.embedded += 1;
            }
        }
        Err(e) => {
            warn!(error = %e, "Batch embedding failed");
            for chunk in batch {
                record_embedding_error(
                    conn,
                    chunk.doc_id,
                    chunk.chunk_index,
                    &chunk.doc_hash,
                    &chunk.chunk_hash,
                    model_name,
                    &e.to_string(),
                )?;
                result.failed += 1;
            // Batch failed — retry each chunk individually so one
            // oversized chunk doesn't poison the entire batch.
            let err_str = e.to_string();
            let err_lower = err_str.to_lowercase();
            // Ollama error messages vary across versions. Match broadly
            // against known patterns to detect context-window overflow.
            let is_context_error = err_lower.contains("context length")
                || err_lower.contains("too long")
                || err_lower.contains("maximum context")
                || err_lower.contains("token limit")
                || err_lower.contains("exceeds")
                || (err_lower.contains("413") && err_lower.contains("http"));

            if is_context_error && batch.len() > 1 {
                warn!("Batch failed with context length error, retrying chunks individually");
                for chunk in batch {
                    match client.embed_batch(vec![chunk.text.clone()]).await {
                        Ok(embeddings) if !embeddings.is_empty()
                            && embeddings[0].len() == EXPECTED_DIMS =>
                        {
                            // Clear old embeddings on first successful chunk
                            if !cleared_docs.contains(&chunk.doc_id) {
                                clear_document_embeddings(conn, chunk.doc_id)?;
                                cleared_docs.insert(chunk.doc_id);
                            }

                            store_embedding(
                                conn,
                                chunk.doc_id,
                                chunk.chunk_index,
                                &chunk.doc_hash,
                                &chunk.chunk_hash,
                                model_name,
                                &embeddings[0],
                                chunk.total_chunks,
                            )?;
                            result.embedded += 1;
                        }
                        _ => {
                            warn!(
                                doc_id = chunk.doc_id,
                                chunk_index = chunk.chunk_index,
                                chunk_bytes = chunk.text.len(),
                                "Chunk too large for model context window"
                            );
                            record_embedding_error(
                                conn,
                                chunk.doc_id,
                                chunk.chunk_index,
                                &chunk.doc_hash,
                                &chunk.chunk_hash,
                                model_name,
                                "Chunk exceeds model context window",
                            )?;
                            result.failed += 1;
                        }
                    }
                }
            } else {
                warn!(error = %e, "Batch embedding failed");
                for chunk in batch {
                    record_embedding_error(
                        conn,
                        chunk.doc_id,
                        chunk.chunk_index,
                        &chunk.doc_hash,
                        &chunk.chunk_hash,
                        model_name,
                        &e.to_string(),
                    )?;
                    result.failed += 1;
                }
            }
        }
    }

        }

        // Fire progress for all normal documents after embedding completes.
        // This ensures progress reflects actual embedding work, not just chunking.
        processed += page_normal_docs;
        if let Some(ref cb) = progress_callback {
            cb(processed, total);
        }

        // Commit all DB writes for this page atomically.
        conn.execute_batch("RELEASE embed_page")?;
    }

    info!(
@@ -197,6 +322,7 @@ fn store_embedding(
    chunk_hash: &str,
    model_name: &str,
    embedding: &[f32],
    total_chunks: usize,
) -> Result<()> {
    let rowid = encode_rowid(doc_id, chunk_index as i64);

@@ -207,13 +333,23 @@ fn store_embedding(
        rusqlite::params![rowid, embedding_bytes],
    )?;

    // Only store chunk_count on the sentinel row (chunk_index=0)
    let chunk_count: Option<i64> = if chunk_index == 0 {
        Some(total_chunks as i64)
    } else {
        None
    };

    let now = chrono::Utc::now().timestamp_millis();
    conn.execute(
        "INSERT OR REPLACE INTO embedding_metadata
         (document_id, chunk_index, model, dims, document_hash, chunk_hash,
          created_at, attempt_count, last_error)
         VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL)",
        rusqlite::params![doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64, doc_hash, chunk_hash, now],
          created_at, attempt_count, last_error, chunk_max_bytes, chunk_count)
         VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, NULL, ?8, ?9)",
        rusqlite::params![
            doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
            doc_hash, chunk_hash, now, CHUNK_MAX_BYTES as i64, chunk_count
        ],
    )?;

    Ok(())

@@ -233,13 +369,17 @@ fn record_embedding_error(
    conn.execute(
        "INSERT INTO embedding_metadata
         (document_id, chunk_index, model, dims, document_hash, chunk_hash,
          created_at, attempt_count, last_error, last_attempt_at)
         VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7)
          created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
         VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, 1, ?8, ?7, ?9)
         ON CONFLICT(document_id, chunk_index) DO UPDATE SET
             attempt_count = embedding_metadata.attempt_count + 1,
             last_error = ?8,
             last_attempt_at = ?7",
        rusqlite::params![doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64, doc_hash, chunk_hash, now, error],
             last_attempt_at = ?7,
             chunk_max_bytes = ?9",
        rusqlite::params![
            doc_id, chunk_index as i64, model_name, EXPECTED_DIMS as i64,
            doc_hash, chunk_hash, now, error, CHUNK_MAX_BYTES as i64
        ],
    )?;
    Ok(())
}
10
src/main.rs
@@ -26,7 +26,7 @@ use lore::cli::{
    Cli, Commands, CountArgs, EmbedArgs, GenerateDocsArgs, IngestArgs, IssuesArgs, MrsArgs,
    SearchArgs, StatsArgs, SyncArgs,
};
use lore::core::db::{create_connection, get_schema_version, run_migrations};
use lore::core::db::{create_connection, get_schema_version, run_migrations, LATEST_SCHEMA_VERSION};
use lore::core::error::{LoreError, RobotErrorOutput};
use lore::core::paths::get_config_path;
use lore::core::paths::get_db_path;

@@ -1112,8 +1112,9 @@ async fn handle_embed(
    robot_mode: bool,
) -> Result<(), Box<dyn std::error::Error>> {
    let config = Config::load(config_override)?;
    let full = args.full && !args.no_full;
    let retry_failed = args.retry_failed && !args.no_retry_failed;
    let result = run_embed(&config, retry_failed).await?;
    let result = run_embed(&config, full, retry_failed).await?;
    if robot_mode {
        print_embed_json(&result);
    } else {

@@ -1183,8 +1184,7 @@ async fn handle_health(
    match create_connection(&db_path) {
        Ok(conn) => {
            let version = get_schema_version(&conn);
            let latest = 9; // Number of embedded migrations
            (true, version, version >= latest)
            (true, version, version >= LATEST_SCHEMA_VERSION)
        }
        Err(_) => (true, 0, false),
    }

@@ -1340,7 +1340,7 @@ fn handle_robot_docs(robot_mode: bool) -> Result<(), Box<dyn std::error::Error>>
        },
        "embed": {
            "description": "Generate vector embeddings for documents via Ollama",
            "flags": ["--retry-failed"],
            "flags": ["--full", "--retry-failed"],
            "example": "lore --robot embed"
        },
        "migrate": {
@@ -12,10 +12,39 @@ pub struct VectorResult {
     pub distance: f64,
 }
 
+/// Query the maximum number of chunks per document for adaptive dedup sizing.
+fn max_chunks_per_document(conn: &Connection) -> i64 {
+    // Fast path: stored chunk_count on sentinel rows (post-migration 010)
+    let stored: Option<i64> = conn
+        .query_row(
+            "SELECT MAX(chunk_count) FROM embedding_metadata
+             WHERE chunk_index = 0 AND chunk_count IS NOT NULL",
+            [],
+            |row| row.get(0),
+        )
+        .unwrap_or(None);
+
+    if let Some(max) = stored {
+        return max;
+    }
+
+    // Fallback for pre-migration data: count chunks per document
+    conn.query_row(
+        "SELECT COALESCE(MAX(cnt), 1) FROM (
+            SELECT COUNT(*) as cnt FROM embedding_metadata
+            WHERE last_error IS NULL GROUP BY document_id
+        )",
+        [],
+        |row| row.get(0),
+    )
+    .unwrap_or(1)
+}
+
 /// Search documents using sqlite-vec KNN query.
 ///
-/// Over-fetches 3x limit to handle chunk deduplication (multiple chunks per
-/// document produce multiple KNN results for the same document_id).
+/// Over-fetches by an adaptive multiplier based on actual max chunks per document
+/// to handle chunk deduplication (multiple chunks per document produce multiple
+/// KNN results for the same document_id).
 /// Returns deduplicated results with best (lowest) distance per document.
 pub fn search_vector(
     conn: &Connection,
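The new `max_chunks_per_document` follows a fast-path / fallback shape: prefer a cheap precomputed value, fall back to a full scan, default to 1. Reduced to a pure-Rust sketch (names are illustrative; the real version runs two SQL queries):

```rust
// Fast-path / fallback sketch: use the stored value when present, otherwise
// run the (expensive) scan, and default to 1 if the table is empty.
fn max_chunks(stored: Option<i64>, scan: impl FnOnce() -> Option<i64>) -> i64 {
    stored.or_else(scan).unwrap_or(1)
}

fn main() {
    assert_eq!(max_chunks(Some(7), || None), 7); // fast path wins
    assert_eq!(max_chunks(None, || Some(3)), 3); // fallback scan
    assert_eq!(max_chunks(None, || None), 1);    // empty-table default
    println!("ok");
}
```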
@@ -32,7 +61,9 @@ pub fn search_vector(
         .flat_map(|f| f.to_le_bytes())
         .collect();
 
-    let k = limit * 3; // Over-fetch for dedup
+    let max_chunks = max_chunks_per_document(conn);
+    let multiplier = ((max_chunks as usize * 3 / 2) + 1).max(8);
+    let k = limit * multiplier;
 
     let mut stmt = conn.prepare(
         "SELECT rowid, distance
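The sizing arithmetic in this hunk can be modeled standalone: instead of a fixed `limit * 3`, k scales with the observed maximum chunks per document (1.5x plus 1, with a floor of 8 so small corpora still over-fetch enough):

```rust
// Minimal model of the adaptive over-fetch sizing introduced above.
fn overfetch_k(limit: usize, max_chunks: i64) -> usize {
    // 1.5x the max chunk count, plus one, floored at 8.
    let multiplier = ((max_chunks as usize * 3 / 2) + 1).max(8);
    limit * multiplier
}

fn main() {
    assert_eq!(overfetch_k(10, 1), 80);   // floor of 8 applies
    assert_eq!(overfetch_k(10, 20), 310); // 20 * 3/2 + 1 = 31
    println!("ok");
}
```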
@@ -69,7 +100,7 @@ pub fn search_vector(
             distance,
         })
         .collect();
-    results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap_or(std::cmp::Ordering::Equal));
+    results.sort_by(|a, b| a.distance.total_cmp(&b.distance));
     results.truncate(limit);
 
     Ok(results)
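The `partial_cmp` to `total_cmp` switch in this hunk matters because `partial_cmp` returns `None` for NaN, which the old code papered over with `unwrap_or(Equal)`, producing an inconsistent comparator. `f64::total_cmp` implements the IEEE 754 totalOrder predicate, so NaN distances sort deterministically (after all real values, in ascending order):

```rust
// Demonstrate the deterministic NaN handling that motivates total_cmp.
fn main() {
    let mut v = vec![2.0_f64, f64::NAN, 1.0];
    v.sort_by(|a, b| a.total_cmp(b));
    assert_eq!(v[0], 1.0);
    assert_eq!(v[1], 2.0);
    assert!(v[2].is_nan()); // NaN lands last, every run
    println!("ok");
}
```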
@@ -132,7 +163,7 @@ mod tests {
         .into_iter()
         .map(|(document_id, distance)| VectorResult { document_id, distance })
         .collect();
-    results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap_or(std::cmp::Ordering::Equal));
+    results.sort_by(|a, b| a.distance.total_cmp(&b.distance));
     results.truncate(limit);
     results
 }
@@ -1,7 +1,7 @@
 //! Integration tests for embedding storage and vector search.
 //!
 //! These tests create an in-memory SQLite database with sqlite-vec loaded,
-//! apply all migrations through 009 (embeddings), and verify KNN search
+//! apply all migrations through 010 (chunk config), and verify KNN search
 //! and metadata operations.
 
 use lore::core::db::create_connection;
@@ -18,7 +18,7 @@ fn create_test_db() -> (TempDir, Connection) {
 
     let migrations_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("migrations");
 
-    for version in 1..=9 {
+    for version in 1..=10 {
         let entries: Vec<_> = std::fs::read_dir(&migrations_dir)
             .unwrap()
             .filter_map(|e| e.ok())
@@ -181,3 +181,122 @@ fn empty_database_returns_no_results() {
     let results = lore::search::search_vector(&conn, &axis_vector(0), 10).unwrap();
     assert!(results.is_empty(), "Empty DB should return no results");
 }
+
+// --- Bug-fix regression tests ---
+
+#[test]
+fn overflow_doc_with_error_sentinel_not_re_detected_as_pending() {
+    // Bug 2: Documents skipped for chunk overflow must record a sentinel error
+    // in embedding_metadata so they are not re-detected as pending on subsequent
+    // pipeline runs (which would cause an infinite re-processing loop).
+    let (_tmp, conn) = create_test_db();
+
+    insert_document(&conn, 1, "Overflow doc", "Some content");
+
+    // Simulate what the pipeline does when a document exceeds CHUNK_ROWID_MULTIPLIER:
+    // it records an error sentinel at chunk_index=0.
+    let now = chrono::Utc::now().timestamp_millis();
+    conn.execute(
+        "INSERT INTO embedding_metadata
+         (document_id, chunk_index, model, dims, document_hash, chunk_hash,
+          created_at, attempt_count, last_error, last_attempt_at, chunk_max_bytes)
+         VALUES (1, 0, 'nomic-embed-text', 768, 'hash_1', 'overflow-sentinel', ?1, 1, 'Document produces too many chunks', ?1, ?2)",
+        rusqlite::params![now, lore::embedding::CHUNK_MAX_BYTES as i64],
+    )
+    .unwrap();
+
+    // Now find_pending_documents should NOT return this document
+    let pending = lore::embedding::find_pending_documents(&conn, 100, 0, "nomic-embed-text").unwrap();
+    assert!(
+        pending.is_empty(),
+        "Document with overflow error sentinel should not be re-detected as pending, got {} pending",
+        pending.len()
+    );
+
+    // count_pending_documents should also return 0
+    let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
+    assert_eq!(count, 0, "Count should be 0 for document with overflow sentinel");
+}
+
+#[test]
+fn count_and_find_pending_agree() {
+    // Bug 1: count_pending_documents and find_pending_documents must use
+    // logically equivalent WHERE clauses to produce consistent results.
+    let (_tmp, conn) = create_test_db();
+
+    // Case 1: No documents at all
+    let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
+    let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
+    assert_eq!(count as usize, found.len(), "Empty DB: count and find should agree");
+
+    // Case 2: New document (no metadata)
+    insert_document(&conn, 1, "New doc", "Content");
+    let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
+    let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
+    assert_eq!(count as usize, found.len(), "New doc: count and find should agree");
+    assert_eq!(count, 1);
+
+    // Case 3: Document with matching metadata (not pending)
+    let now = chrono::Utc::now().timestamp_millis();
+    conn.execute(
+        "INSERT INTO embedding_metadata
+         (document_id, chunk_index, model, dims, document_hash, chunk_hash,
+          created_at, attempt_count, chunk_max_bytes)
+         VALUES (1, 0, 'nomic-embed-text', 768, 'hash_1', 'ch', ?1, 1, ?2)",
+        rusqlite::params![now, lore::embedding::CHUNK_MAX_BYTES as i64],
+    )
+    .unwrap();
+    let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
+    let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
+    assert_eq!(count as usize, found.len(), "Complete doc: count and find should agree");
+    assert_eq!(count, 0);
+
+    // Case 4: Config drift (chunk_max_bytes mismatch)
+    conn.execute(
+        "UPDATE embedding_metadata SET chunk_max_bytes = 999 WHERE document_id = 1",
+        [],
+    )
+    .unwrap();
+    let count = lore::embedding::count_pending_documents(&conn, "nomic-embed-text").unwrap();
+    let found = lore::embedding::find_pending_documents(&conn, 1000, 0, "nomic-embed-text").unwrap();
+    assert_eq!(count as usize, found.len(), "Config drift: count and find should agree");
+    assert_eq!(count, 1);
+}
+
+#[test]
+fn full_embed_delete_is_atomic() {
+    // Bug 7: The --full flag's two DELETE statements should be atomic.
+    // This test verifies that both tables are cleared together.
+    let (_tmp, conn) = create_test_db();
+
+    insert_document(&conn, 1, "Doc", "Content");
+    insert_embedding(&conn, 1, 0, &axis_vector(0));
+
+    // Verify data exists
+    let meta_count: i64 = conn
+        .query_row("SELECT COUNT(*) FROM embedding_metadata", [], |r| r.get(0))
+        .unwrap();
+    let embed_count: i64 = conn
+        .query_row("SELECT COUNT(*) FROM embeddings", [], |r| r.get(0))
+        .unwrap();
+    assert_eq!(meta_count, 1);
+    assert_eq!(embed_count, 1);
+
+    // Execute the atomic delete (same as embed.rs --full)
+    conn.execute_batch(
+        "BEGIN;
+         DELETE FROM embedding_metadata;
+         DELETE FROM embeddings;
+         COMMIT;",
+    )
+    .unwrap();
+
+    let meta_count: i64 = conn
+        .query_row("SELECT COUNT(*) FROM embedding_metadata", [], |r| r.get(0))
+        .unwrap();
+    let embed_count: i64 = conn
+        .query_row("SELECT COUNT(*) FROM embeddings", [], |r| r.get(0))
+        .unwrap();
+    assert_eq!(meta_count, 0, "Metadata should be cleared");
+    assert_eq!(embed_count, 0, "Embeddings should be cleared");
+}
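The pending/sentinel semantics these regression tests pin down can be reduced to a pure-Rust predicate. This is a sketch only: the field names and the `CHUNK_MAX_BYTES` value are illustrative, and the real checks live in SQL WHERE clauses shared by `count_pending_documents` and `find_pending_documents`.

```rust
const CHUNK_MAX_BYTES: i64 = 8192; // assumed value, not the crate's constant

struct Meta {
    hash_matches: bool,   // document_hash unchanged since embedding
    chunk_max_bytes: i64, // chunking config the embedding was made under
    has_error: bool,      // error sentinel recorded (e.g. chunk overflow)
}

/// Pending = no metadata, hash drift, or config drift. A row carrying an
/// error sentinel counts as attempted, so it is NOT pending (avoiding the
/// infinite re-processing loop the first test guards against).
fn is_pending(meta: Option<&Meta>) -> bool {
    match meta {
        None => true,
        Some(m) if m.has_error => false, // sentinel: skip, don't loop
        Some(m) => !m.hash_matches || m.chunk_max_bytes != CHUNK_MAX_BYTES,
    }
}

fn main() {
    assert!(is_pending(None)); // new document
    assert!(!is_pending(Some(&Meta { hash_matches: true, chunk_max_bytes: CHUNK_MAX_BYTES, has_error: false })));
    assert!(is_pending(Some(&Meta { hash_matches: true, chunk_max_bytes: 999, has_error: false }))); // config drift
    assert!(!is_pending(Some(&Meta { hash_matches: true, chunk_max_bytes: CHUNK_MAX_BYTES, has_error: true }))); // overflow sentinel
    println!("ok");
}
```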