perf: force partial index for DiffNote queries (26-75x), batch stats counts (1.7x)

who.rs: Add INDEXED BY idx_notes_diffnote_path_created to all DiffNote
query paths (expert, expert_details, reviews, path probes, suffix_probe).
SQLite planner was choosing idx_notes_system (106K rows, 38%) over the
partial index (26K rows, 9.3%) when LIKE predicates are present.
Measured: expert 1561ms->59ms (26x), reviews ~1200ms->16ms (75x).

stats.rs: Replace 12+ sequential COUNT(*) queries with conditional
aggregates (SUM(CASE WHEN...)) and use FTS5 shadow table
(documents_fts_docsize) instead of virtual table for counting.
Measured: warm 109ms->65ms (1.68x).
teernisse
2026-02-11 16:00:34 -05:00
parent 039ab1c2a3
commit 740607e06d
13 changed files with 4728 additions and 116 deletions


@@ -0,0 +1,179 @@
# Deep Performance Audit Report
**Date:** 2026-02-12
**Branch:** `perf-audit` (e9bacc94)
**Parent:** `039ab1c2` (master, v0.6.1)
---
## Methodology
1. **Baseline** — measured p50/p95 latency for all major commands with warm cache
2. **Profile** — used macOS `sample` profiler and `EXPLAIN QUERY PLAN` to identify hotspots
3. **Golden output** — captured exact numeric outputs before changes as equivalence oracle
4. **One lever per change** — each optimization isolated and independently benchmarked
5. **Revert threshold** — any optimization <1.1x speedup reverted per audit rules
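The revert rule in step 5 reduces to a ratio check. A minimal sketch, assuming latencies are compared as warm medians in milliseconds; `speedup` and `keep_optimization` are illustrative names, not functions from the audited codebase:

```rust
/// Speedup factor of `after_ms` relative to `before_ms` (2.0 means twice as fast).
fn speedup(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Audit rule: keep a change only if it reaches the minimum speedup
/// (1.1x in this report); anything below that is reverted.
fn keep_optimization(before_ms: f64, after_ms: f64, min_speedup: f64) -> bool {
    speedup(before_ms, after_ms) >= min_speedup
}
```

Applied to figures reported in this audit, `keep_optimization(112.0, 66.0, 1.1)` holds for the stats change, while `keep_optimization(1030.0, 995.0, 1.1)` does not, which matches the decision to revert the two-phase search.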
---
## Baseline Measurements (warm cache, release build)
| Command | Latency | Notes |
|---------|---------|-------|
| `who --path src/core/db.rs` (expert) | 2200ms | **Hotspot** |
| `who --active` | 83-93ms | Acceptable |
| `who workload` | 22ms | Fast |
| `stats` | 107-112ms | **Hotspot** |
| `search "authentication"` | 1030ms | **Hotspot** (library-level) |
| `list issues -n 50` | ~40ms | Fast |
---
## Optimization 1: INDEXED BY for DiffNote Queries
**Target:** `src/cli/commands/who.rs` — expert and reviews query paths
**Problem:** SQLite query planner chose `idx_notes_system` (38% selectivity, 106K rows) over `idx_notes_diffnote_path_created` (9.3% selectivity, 26K rows) for path-filtered DiffNote queries. The partial index `WHERE noteable_type = 'MergeRequest' AND type = 'DiffNote'` is far more selective but the planner's cost model didn't pick it.
**Change:** Added `INDEXED BY idx_notes_diffnote_path_created` to all 8 SQL queries across `query_expert`, `query_expert_details`, `query_reviews`, `build_path_query` (probes 1 & 2), and `suffix_probe`.
**Results:**
| Query | Before | After | Speedup |
|-------|--------|-------|---------|
| expert (specific path) | 2200ms | 56-58ms | **38x** |
| expert (broad path) | 2200ms | 83ms | **26x** |
| reviews | 1800ms | 24ms | **75x** |
**Isomorphism proof:** `INDEXED BY` only changes which index the planner uses, not the query semantics. Same rows matched, same ordering, same output. Verified by golden output comparison across 5+ runs.
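To make the shape of the hint concrete, here is a sketch of a query builder that can emit the statement with or without `INDEXED BY`. The selected columns, the `n` alias, and the `LIKE ... ESCAPE` predicate are assumptions for illustration; only the index name and the partial-index conditions come from this report.

```rust
/// Name of the partial index the report forces for DiffNote lookups.
const DIFFNOTE_INDEX: &str = "idx_notes_diffnote_path_created";

/// Builds a path-filtered DiffNote query, optionally pinning the index.
/// Column list and LIKE predicate are hypothetical, not copied from who.rs.
fn diffnote_path_query(force_index: bool) -> String {
    let hint = if force_index {
        format!(" INDEXED BY {}", DIFFNOTE_INDEX)
    } else {
        String::new()
    };
    // The WHERE terms repeat the partial index conditions so the index is usable.
    format!(
        "SELECT n.id, n.position_new_path, n.created_at \
         FROM notes n{} \
         WHERE n.noteable_type = 'MergeRequest' \
         AND n.type = 'DiffNote' \
         AND n.position_new_path LIKE ?1 ESCAPE '\\' \
         ORDER BY n.created_at DESC",
        hint
    )
}
```

One property worth keeping in mind: if the named index does not exist or cannot serve the query, SQLite fails to prepare the statement instead of falling back to another plan, so dropping or renaming the index breaks these queries loudly rather than silently slowing them down.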
---
## Optimization 2: Conditional Aggregates in Stats
**Target:** `src/cli/commands/stats.rs`
**Problem:** 12+ sequential `COUNT(*)` queries each requiring a full table scan of `documents` (61K rows). Each scan touched the same pages but couldn't share work.
**Changes:**
- Documents: 5 sequential COUNTs -> 1 query with `SUM(CASE WHEN ... THEN 1 END)`
- FTS count: `SELECT COUNT(*) FROM documents_fts` (virtual table, slow) -> `SELECT COUNT(*) FROM documents_fts_docsize` (shadow B-tree table, 19x faster)
- Embeddings: 2 queries -> 1 with `COUNT(DISTINCT document_id), COUNT(*)`
- Dirty sources: 2 queries -> 1 with conditional aggregates
- Pending fetches: 2 queries -> 1 each (discussions, dependents)
**Results:**
| Metric | Before | After | Speedup |
|--------|--------|-------|---------|
| Warm median | 112ms | 66ms | **1.70x** |
| Cold | 1220ms | ~700ms | ~1.7x |
**Golden output verified:**
```
total:61652, issues:8241, mrs:10018, discussions:43393, truncated:63
fts:61652, embedded:61652, chunks:88161
```
All values match exactly across before/after runs.
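**Isomorphism proof:** `SUM(CASE WHEN x THEN 1 END)` returns the same value as `SELECT COUNT(*) ... WHERE x` whenever at least one row satisfies `x`. When no row does, `SUM` yields `NULL` where `COUNT(*)` yields 0, so any count that can legitimately be zero needs `ELSE 0` or `COALESCE(..., 0)`. The FTS5 shadow table `documents_fts_docsize` has exactly one row per FTS document by SQLite specification, so `COUNT(*)` on it equals the virtual table count.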
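The equivalence can be modelled in plain Rust. This is a sketch over an in-memory slice, not the code in `stats.rs`: `Doc`, `DocCounts`, and the helper names are hypothetical, and `None` stands in for SQL `NULL`, which `SUM` returns when no row matches while `COUNT(*)` returns 0.

```rust
/// Minimal stand-in for a row of the `documents` table (fields are assumptions).
struct Doc {
    source_type: &'static str,
    truncated: bool,
}

#[derive(Debug, Default, PartialEq)]
struct DocCounts {
    total: i64,
    issues: i64,
    mrs: i64,
    discussions: i64,
    truncated: i64,
}

/// One scan producing every count, like a single SELECT with conditional aggregates.
fn count_in_one_pass(docs: &[Doc]) -> DocCounts {
    let mut c = DocCounts::default();
    for d in docs {
        c.total += 1;
        match d.source_type {
            "issue" => c.issues += 1,
            "merge_request" => c.mrs += 1,
            "discussion" => c.discussions += 1,
            _ => {}
        }
        if d.truncated {
            c.truncated += 1;
        }
    }
    c
}

/// Mirrors `SUM(CASE WHEN pred THEN 1 END)`: `None` when no row matches.
fn sum_case(docs: &[Doc], pred: impl Fn(&Doc) -> bool) -> Option<i64> {
    docs.iter()
        .filter(|d| pred(*d))
        .fold(None, |acc: Option<i64>, _| Some(acc.unwrap_or(0) + 1))
}

/// Mirrors `SELECT COUNT(*) ... WHERE pred`: 0 when no row matches.
fn count_where(docs: &[Doc], pred: impl Fn(&Doc) -> bool) -> i64 {
    docs.iter().filter(|d| pred(*d)).count() as i64
}
```

The two helpers agree whenever the predicate matches at least one row; they differ only in the empty case, which is where a `COALESCE` or `ELSE 0` matters in the SQL version.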
---
## Investigation: Two-Phase FTS Search (REVERTED)
**Target:** `src/search/fts.rs`, `src/cli/commands/search.rs`
**Hypothesis:** FTS5 `snippet()` generation is expensive. Splitting search into Phase 1 (score-only MATCH+bm25) and Phase 2 (snippet for filtered results only) should reduce work.
**Implementation:** Created `fetch_fts_snippets()` that retrieves snippets only for post-filter document IDs via `json_each()` join.
**Results:**
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| search (limit 20) | 1030ms | 995ms | 3.5% |
**Decision:** Reverted. Per audit rules, <1.1x speedup does not justify added code complexity.
**Root cause:** The bottleneck is not snippet generation but `MATCH` + `bm25()` scoring itself. Profiling showed `strspn` (FTS5 tokenizer) and `memmove` as the top CPU consumers. The same query runs in 30ms on system sqlite3 (8ms without `snippet()`) but 1030ms in rusqlite's bundled SQLite: a ~34x gap against the 30ms run and ~125x against the 8ms run, despite both being SQLite 3.51.x compiled at -O3.
---
## Library-Level Finding: Bundled SQLite FTS5 Performance
**Observation:** FTS5 MATCH+bm25 queries are up to ~125x slower in rusqlite's bundled SQLite vs system sqlite3 (1030ms vs 8ms without snippets; ~34x vs the 30ms run with snippets).
| Environment | Query Time | Notes |
|-------------|-----------|-------|
| System sqlite3 (macOS) | 30ms (with snippet), 8ms (without) | Same .db file |
| rusqlite bundled | 1030ms | `features = ["bundled"]`, OPT_LEVEL=3 |
**Profiler data (macOS `sample`):**
- Top hotspot: `strspn` in FTS5 tokenizer
- Secondary: `memmove` in FTS5 internals
- Scaling: ~33ms per additional result (limit 5 = 497ms, limit 20 = 995ms)
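The two limit samples in the profiler notes can be turned into a rough linear cost model. A sketch, assuming latency grows linearly with the result limit (two points cannot confirm that); the function names are illustrative:

```rust
/// Marginal latency per extra result between two (limit, latency_ms) samples.
/// Assumes `b` was measured at a larger limit than `a`.
fn ms_per_extra_result(a: (u32, f64), b: (u32, f64)) -> f64 {
    (b.1 - a.1) / f64::from(b.0 - a.0)
}

/// Fixed cost implied by extrapolating the line through `a` and `b` to limit 0.
fn fixed_cost_ms(a: (u32, f64), b: (u32, f64)) -> f64 {
    a.1 - ms_per_extra_result(a, b) * f64::from(a.0)
}
```

For the samples `(5, 497.0)` and `(20, 995.0)` this gives about 33ms per additional result on top of roughly 330ms of fixed cost, which suggests the per-query overhead is as significant as the per-row work.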
**Possible causes:**
- Bundled SQLite compiled without platform-specific optimizations (SIMD, etc.)
- Different memory allocator behavior
- Missing compile-time tuning flags
**Recommendation for future:** Investigate switching from `features = ["bundled"]` to system SQLite linkage, or audit the bundled compile flags in the `libsqlite3-sys` build script.
---
## Exploration Agent Findings (Informational)
Four parallel exploration agents surveyed the entire codebase. Key findings beyond what was already addressed:
### Ingestion Pipeline
- Serial DB writes in async context (acceptable — rusqlite is synchronous)
- Label ingestion uses individual inserts (potential batch optimization, low priority)
### CLI / GitLab Client
- GraphQL client recreated per call (`client.rs:98-100`) — caches connection pool, minor
- Double JSON deserialization in GraphQL responses — medium priority
- N+1 subqueries in `list` command (`list.rs:408-423`) — 4 correlated subqueries per row
### Search / Embedding
- No N+1 patterns, no O(n^2) algorithms
- Chunking is O(n) single-pass with proper UTF-8 safety
- Ollama concurrency model is sound (parallel HTTP, serial DB writes)
### Database / Documents
- O(n^2) prefix sum in `truncation.rs` — low traffic path
- String allocation patterns in extractors — micro-optimization territory
---
## Opportunity Matrix
| Candidate | Impact | Confidence | Effort | Score | Status |
|-----------|--------|------------|--------|-------|--------|
| INDEXED BY for DiffNote | Very High | High | Low | **9.0** | Shipped |
| Stats conditional aggregates | Medium | High | Low | **7.0** | Shipped |
| Bundled SQLite FTS5 | Very High | Medium | High | 5.0 | Documented |
| List N+1 subqueries | Medium | Medium | Medium | 4.0 | Backlog |
| GraphQL double deser | Low | Medium | Low | 3.5 | Backlog |
| Truncation O(n^2) | Low | High | Low | 3.0 | Backlog |
---
## Files Modified
| File | Change |
|------|--------|
| `src/cli/commands/who.rs` | INDEXED BY hints on 8 SQL queries |
| `src/cli/commands/stats.rs` | Conditional aggregates, FTS5 shadow table, merged queries |
---
## Quality Gates
- All 603 tests pass
- `cargo clippy --all-targets -- -D warnings` clean
- `cargo fmt --check` clean
- Golden output verified for both optimizations


@@ -0,0 +1,174 @@
Highest-impact gaps I see in the current plan:
1. `for-issue` / `for-mr` filtering is ambiguous across projects and can return incorrect rows.
2. `lore notes` has no pagination contract, so large exports and deterministic resumption are weak.
3. Migration `022` is high-risk (table rebuild + FTS + junction tables) without explicit integrity gates.
4. Note-doc freshness is incomplete for upstream note deletions and parent metadata changes (labels/title).
Below are my best revisions, each with rationale and a git-diff-style plan edit.
---
1. **Add gated rollout + rollback controls**
Rationale: You can still “ship together” while reducing blast radius. This makes recovery fast if note-doc generation causes DB/embedding pressure.
```diff
@@ ## Design
-Two phases, shipped together as one feature:
+Two phases, shipped together as one feature, but with runtime gates:
+
+- `feature.notes_cli` (Phase 1 surface)
+- `feature.note_documents` (Phase 2 indexing/extraction path)
+
+Rollout order:
+1) Enable `notes_cli`
+2) Run note-doc backfill in bounded batches
+3) Enable `note_documents` for continuous updates
+
+Rollback:
+- Disabling `feature.note_documents` stops new note-doc generation without affecting issue/MR/discussion docs.
```
2. **Add keyset pagination + deterministic ordering**
Rationale: Needed for year-long reviewer analysis and reliable “continue where I left off” behavior under concurrent updates.
```diff
@@ pub struct NoteListFilters<'a> {
pub limit: usize,
+ pub cursor: Option<&'a str>, // keyset token "<sort_ms>:<id>"
+ pub include_total_count: bool, // avoid COUNT(*) in hot paths
@@
- pub sort: &'a str, // "created" (default) | "updated"
+ pub sort: &'a str, // "created" | "updated"
@@ query_notes SQL
-ORDER BY {sort_column} {order}
+ORDER BY {sort_column} {order}, n.id {order}
LIMIT ?
```
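A sketch of how the `"<sort_ms>:<id>"` token could be handled, assuming the cursor carries the sort key and row id of the last row already returned. `Cursor` and `keyset_predicate` are hypothetical names, and the predicate relies on SQLite row-value comparison:

```rust
/// Keyset cursor in the "<sort_ms>:<id>" form proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Cursor {
    sort_ms: i64,
    id: i64,
}

impl Cursor {
    fn encode(&self) -> String {
        format!("{}:{}", self.sort_ms, self.id)
    }

    /// Returns `None` for malformed tokens instead of guessing.
    fn parse(token: &str) -> Option<Cursor> {
        let (ms, id) = token.split_once(':')?;
        Some(Cursor {
            sort_ms: ms.parse().ok()?,
            id: id.parse().ok()?,
        })
    }
}

/// WHERE fragment that resumes strictly after the cursor row.
/// `sort_column` must come from a fixed allow-list (never user input),
/// because it is spliced into the SQL text.
fn keyset_predicate(sort_column: &str, descending: bool) -> String {
    let op = if descending { "<" } else { ">" };
    format!("({sort_column}, n.id) {op} (?, ?)")
}
```

Because the tie-breaker `n.id` appears in both the `ORDER BY` and the predicate, rows that share a timestamp are neither skipped nor repeated across pages.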
3. **Make `for-issue` / `for-mr` project-scoped**
Rationale: IIDs are not globally unique. Requiring project avoids false positives and hard-to-debug cross-project leakage.
```diff
@@ pub struct NotesArgs {
- #[arg(long = "for-issue", help_heading = "Filters", conflicts_with = "for_mr")]
+ #[arg(long = "for-issue", help_heading = "Filters", conflicts_with = "for_mr", requires = "project")]
pub for_issue: Option<i64>,
@@
- #[arg(long = "for-mr", help_heading = "Filters", conflicts_with = "for_issue")]
+ #[arg(long = "for-mr", help_heading = "Filters", conflicts_with = "for_issue", requires = "project")]
pub for_mr: Option<i64>,
```
4. **Upgrade path filtering semantics**
Rationale: Review comments often reference renames/moves. Restricting to `position_new_path` misses relevant notes.
```diff
@@ pub struct NotesArgs {
- /// Filter by file path (trailing / for prefix match)
+ /// Filter by file path
#[arg(long, help_heading = "Filters")]
pub path: Option<String>,
+ /// Path mode: exact|prefix|glob
+ #[arg(long = "path-mode", value_parser = ["exact","prefix","glob"], default_value = "exact", help_heading = "Filters")]
+ pub path_mode: String,
+ /// Match against old path as well as new path
+ #[arg(long = "match-old-path", help_heading = "Filters")]
+ pub match_old_path: bool,
@@ query_notes filter mappings
-- `path` ... n.position_new_path ...
+- `path` applies to `n.position_new_path` and optionally `n.position_old_path`.
+- `glob` mode translates `*`/`?` to SQL LIKE with escaping.
```
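The glob translation mentioned in the last line of the diff might look like the following. This is a sketch under the assumption that backslash is the escape character, so the generated pattern has to be used with `LIKE ? ESCAPE '\'`; `glob_to_like` is a hypothetical helper name.

```rust
/// Translates a simple glob (`*`, `?`) into a SQL LIKE pattern.
/// Literal `%`, `_` and `\` are escaped with a backslash so they match
/// themselves rather than acting as LIKE wildcards.
fn glob_to_like(glob: &str) -> String {
    let mut out = String::with_capacity(glob.len());
    for ch in glob.chars() {
        match ch {
            '*' => out.push('%'),
            '?' => out.push('_'),
            '%' | '_' | '\\' => {
                out.push('\\');
                out.push(ch);
            }
            other => out.push(other),
        }
    }
    out
}
```

Escaping matters for paths in particular, since underscores are common in file names and would otherwise match any single character.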
5. **Add explicit performance indexes (new migration)**
Rationale: `notes` becomes a first-class query surface; without indexes, filters degrade quickly at 10k+ note scale.
```diff
@@ ## Phase 1: `lore notes` Command
+### Work Chunk 1E: Query Performance Indexes
+**Files:** `migrations/023_notes_query_indexes.sql`, `src/core/db.rs`
+
+Add indexes:
+- `notes(project_id, created_at DESC, id DESC)`
+- `notes(author_username, created_at DESC, id DESC) WHERE is_system = 0`
+- `notes(discussion_id)`
+- `notes(position_new_path)`
+- `notes(position_old_path)`
+- `discussions(issue_id)`
+- `discussions(merge_request_id)`
```
6. **Harden migration 022 with transactional integrity checks**
Rationale: This is the riskiest part of the plan. Add hard fail-fast checks so corruption cannot silently pass.
```diff
@@ ### Work Chunk 2A: Schema Migration (022)
+Migration safety requirements:
+- Execute in a single `BEGIN IMMEDIATE ... COMMIT` transaction.
+- Capture and compare pre/post row counts for `documents`, `document_labels`, `document_paths`, `dirty_sources`.
+- Run `PRAGMA foreign_key_check` and abort on any violation.
+- Run `PRAGMA integrity_check` and abort on non-`ok`.
+- Rebuild FTS and assert `documents_fts` rowcount equals `documents` rowcount.
```
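The pre/post row-count comparison can be kept independent of the database layer. A sketch, assuming the caller has already collected `COUNT(*)` per table before and after the rebuild; `verify_row_counts` is a hypothetical name:

```rust
use std::collections::BTreeMap;

/// Compares per-table row counts captured before and after the migration.
/// Returns the first mismatch as an error so the caller can roll back.
fn verify_row_counts(
    before: &BTreeMap<&str, u64>,
    after: &BTreeMap<&str, u64>,
) -> Result<(), String> {
    for (table, expected) in before {
        match after.get(*table) {
            Some(actual) if actual == expected => {}
            Some(actual) => {
                return Err(format!("{table}: expected {expected} rows, found {actual}"));
            }
            None => return Err(format!("{table}: missing after migration")),
        }
    }
    Ok(())
}
```

Returning an error rather than logging keeps the check fail-fast: the migration runner can abort the enclosing transaction on the first `Err`.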
7. **Add note deletion + parent-change propagation**
Rationale: Current plan handles create/update ingestion but not all staleness paths. Without this, note documents drift.
```diff
@@ ## Phase 2: Per-Note Documents
+### Work Chunk 2G: Freshness Propagation
+**Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/documents/regenerator.rs`
+
+Rules:
+- If a previously stored note is missing from upstream payload, delete local note row and enqueue `(note, id)` for document deletion.
+- When parent issue/MR title or labels change, enqueue descendant note docs dirty (notes inherit parent metadata).
+- Keep idempotent behavior for repeated syncs.
```
8. **Separate FTS coverage from embedding coverage**
Rationale: Biggest cost/perf risk is embeddings. Index all notes in FTS, but embed selectively with policy knobs.
```diff
@@ ## Estimated Document Volume Impact
-FTS5 handles this comfortably. Embedding generation time scales linearly (~4x increase).
+FTS5 handles this comfortably. Embedding generation is policy-controlled:
+- FTS: index all non-system note docs
+- Embeddings default: only notes with body length >= 40 chars (configurable)
+- Add config: `documents.note_embeddings.min_chars`, `documents.note_embeddings.enabled`
+- Prioritize unresolved DiffNotes before other notes during embedding backfill
```
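As a sketch of the split between FTS coverage and embedding coverage: the struct below mirrors the proposed `documents.note_embeddings.*` keys, but its name, the `enabled: true` default, and the choice to trim whitespace before counting are assumptions.

```rust
/// Hypothetical mirror of the proposed `documents.note_embeddings.*` config keys.
struct NoteEmbeddingPolicy {
    enabled: bool,
    min_chars: usize,
}

impl Default for NoteEmbeddingPolicy {
    fn default() -> Self {
        // 40 is the threshold suggested in the plan edit above;
        // `enabled: true` is an assumption of this sketch.
        Self { enabled: true, min_chars: 40 }
    }
}

/// FTS covers every non-system note.
fn should_index_fts(is_system: bool) -> bool {
    !is_system
}

/// Embeddings additionally require the policy to be on and the body long enough.
fn should_embed(policy: &NoteEmbeddingPolicy, is_system: bool, body: &str) -> bool {
    // Count characters, not bytes, so multi-byte text is not over-counted.
    policy.enabled && !is_system && body.trim().chars().count() >= policy.min_chars
}
```

With the default policy a short approval such as "LGTM" stays searchable through FTS but is not embedded.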
9. **Bring structured reviewer profiling into scope (not narrative reporting)**
Rationale: This directly serves the stated use case and makes the feature compelling immediately.
```diff
@@ ## Non-Goals
-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
+- Generating free-form narrative reviewer reports.
+ A structured profiling command is in scope.
+
+## Phase 3: Structured Reviewer Profiling
+Add `lore notes profile --author <user> --since <window>` returning:
+- top commented paths
+- top parent labels
+- unresolved-comment ratio
+- note-type distribution
+- median comment length
```
10. **Add operational SLOs + robot-mode status for note pipeline**
Rationale: Reliability improves when regressions are observable, not inferred from failures.
```diff
@@ ## Verification Checklist
+Operational checks:
+- `lore -J stats` includes per-`source_type` document counts (including `note`)
+- Add queue lag metrics: oldest dirty note age, retry backlog size
+- Add extraction error breakdown by `source_type`
+- Add smoke assertion: disabling `feature.note_documents` leaves other source regeneration unaffected
```
---
If you want, I can produce a single consolidated revised PRD draft (fully merged text, not just diffs) as the next step.


@@ -0,0 +1,200 @@
Below are the strongest revisions I'd make, excluding everything in your `## Rejected Recommendations` list.
1. **Add a Phase 0 for stable note identity before any note-doc generation**
Rationale: your current plan still allows note document churn because Issue discussion ingestion is delete/reinsert-based. That makes local `notes.id` unstable, causing unnecessary dirtying/regeneration and potential stale-doc edge cases. Stabilizing identity first (upsert-by-GitLab-ID + sweep stale) improves correctness and cuts repeated work.
```diff
@@ ## Design
-Two phases, shipped together as one feature:
+Three phases, shipped together as one feature:
+- **Phase 0 (Foundation):** Stable note identity in local DB (upsert + sweep, no delete/reinsert churn)
- **Phase 1 (Option A):** `lore notes` command — direct SQL query over the `notes` table with rich filtering
- **Phase 2 (Option B):** Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline
@@
+## Phase 0: Stable Note Identity
+
+### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
+**Files:** `src/ingestion/discussions.rs`, `migrations/022_notes_identity_index.sql`, `src/core/db.rs`
+**Implementation:**
+- Add unique index: `UNIQUE(project_id, gitlab_id)` on `notes`
+- Replace delete/reinsert issue-note flow with upsert + `last_seen_at` sweep (same durability model as MR note sweep)
+- Ensure `insert_note/upsert_note` returns the stable local row id for both insert and update paths
```
2. **Replace `source_type` CHECK constraints with a registry table + FK in migration**
Rationale: table CHECKs force full table rebuild for every new source type forever. A `source_types` table with FK keeps DB-level integrity and future extensibility without rebuilding `documents`/`dirty_sources` every time. This is a major architecture hardening win.
```diff
@@ ### Work Chunk 2A: Schema Migration (023)
-Current migration ... CHECK constraints limiting `source_type` ...
+Current migration ... CHECK constraints limiting `source_type` ...
+Revision: migrate to `source_types` registry table + FK constraints.
@@
-1. `dirty_sources` — add `'note'` to source_type CHECK
-2. `documents` — add `'note'` to source_type CHECK
+1. Create `source_types(name TEXT PRIMARY KEY)` and seed: `issue, merge_request, discussion, note`
+2. Rebuild `dirty_sources` and `documents` to replace CHECK with `REFERENCES source_types(name)`
+3. Future source-type additions become `INSERT INTO source_types(name) VALUES (?)` (no table rebuild)
@@
+#### Additional integrity tests
+#[test]
+fn test_source_types_registry_contains_note() { ... }
+#[test]
+fn test_documents_source_type_fk_enforced() { ... }
+#[test]
+fn test_dirty_sources_source_type_fk_enforced() { ... }
```
3. **Mark note documents dirty only when note semantics actually changed**
Rationale: current loops mark every non-system note dirty every sync. With 8k+ notes this creates avoidable queue pressure and regeneration time. Change-aware dirtying (inserted/changed only) gives major performance and stability improvements.
```diff
@@ ### Work Chunk 2D: Regenerator & Dirty Tracking Integration
-for note in notes {
- let local_note_id = insert_note(&tx, local_discussion_id, &note, None)?;
- if !note.is_system {
- dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, local_note_id)?;
- }
-}
+for note in notes {
+ let outcome = upsert_note(&tx, local_discussion_id, &note, None)?;
+ if !note.is_system && outcome.changed_semantics {
+ dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
+ }
+}
@@
+// changed_semantics should include: body, note_type, path/line positions, resolvable/resolved/resolved_by, updated_at
```
4. **Expand filters to support real analysis windows and resolution state**
Rationale: reviewer profiling usually needs bounded windows and both resolved/unresolved views. Current `unresolved: bool` is too narrow and one-sided. Add `--until` and tri-state resolution filtering for better analytical power.
```diff
@@ pub struct NoteListFilters<'a> {
- pub since: Option<&'a str>,
+ pub since: Option<&'a str>,
+ pub until: Option<&'a str>,
@@
- pub unresolved: bool,
+ pub resolution: &'a str, // "any" (default) | "unresolved" | "resolved"
@@
- pub author: Option<&'a str>,
+ pub author: Option<&'a str>, // case-insensitive match
@@
- // Filter by time (7d, 2w, 1m, or YYYY-MM-DD)
+ // Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
pub since: Option<String>,
+ /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
+ #[arg(long, help_heading = "Filters")]
+ pub until: Option<String>,
@@
- /// Only show unresolved review comments
- pub unresolved: bool,
+ /// Resolution filter: any, unresolved, resolved
+ #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
+ pub resolution: String,
```
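Internally, the string-valued `resolution` argument could be mapped to an enum before it reaches the query layer. A sketch, assuming the `resolvable` and `resolved` columns are stored as 0/1 integers and that the notes table is aliased as `n`; the `Resolution` type is hypothetical:

```rust
/// Tri-state resolution filter matching the proposed `--resolution` values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Resolution {
    Any,
    Unresolved,
    Resolved,
}

impl Resolution {
    fn parse(s: &str) -> Option<Self> {
        match s {
            "any" => Some(Resolution::Any),
            "unresolved" => Some(Resolution::Unresolved),
            "resolved" => Some(Resolution::Resolved),
            _ => None,
        }
    }

    /// SQL fragment to AND into the WHERE clause, or `None` when no filter applies.
    fn where_clause(self) -> Option<&'static str> {
        match self {
            Resolution::Any => None,
            Resolution::Unresolved => Some("n.resolvable = 1 AND n.resolved = 0"),
            Resolution::Resolved => Some("n.resolvable = 1 AND n.resolved = 1"),
        }
    }
}
```

Keeping the SQL fragments as static strings means no user input is ever interpolated into the query text for this filter.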
5. **Broaden index strategy to match actual query shapes, not just author queries**
Rationale: `idx_notes_user_created` helps one path, but common usage also includes project+time scans and unresolved filters. Add two more partial composites for high-selectivity paths.
```diff
@@ ### Work Chunk 1E: Composite Query Index
CREATE INDEX IF NOT EXISTS idx_notes_user_created
ON notes(project_id, author_username, created_at DESC, id DESC)
WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_unresolved_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0 AND resolvable = 1 AND resolved = 0;
@@
+#[test]
+fn test_notes_query_plan_uses_project_created_index_for_default_listing() { ... }
+#[test]
+fn test_notes_query_plan_uses_unresolved_index_when_resolution_unresolved() { ... }
```
6. **Improve per-note document payload with structured metadata header + minimal thread context**
Rationale: isolated single-note docs can lose meaning. A small structured header plus lightweight context (parent + one preceding note excerpt) improves semantic retrieval quality substantially without re-bundling full threads.
```diff
@@ ### Work Chunk 2C: Note Document Extractor
-// 6. Format content:
-// [[Note]] {note_type or "Comment"} on {parent_type_prefix}: {parent_title}
-// Project: {path_with_namespace}
-// URL: {url}
-// Author: @{author}
-// Date: {format_date(created_at)}
-// Labels: {labels_json}
-// File: {position_new_path}:{position_new_line} (if DiffNote)
-//
-// --- Body ---
-//
-// {body}
+// 6. Format content with machine-readable header:
+// [[Note]]
+// source_type: note
+// note_gitlab_id: {gitlab_id}
+// project: {path_with_namespace}
+// parent_type: {Issue|MergeRequest}
+// parent_iid: {iid}
+// note_type: {DiffNote|DiscussionNote|Comment}
+// author: @{author}
+// created_at: {iso8601}
+// resolved: {true|false}
+// path: {position_new_path}:{position_new_line}
+// url: {url}
+//
+// --- Context ---
+// parent_title: {title}
+// previous_note_excerpt: {optional, max 200 chars}
+//
+// --- Body ---
+// {body}
```
7. **Add first-class export modes for downstream profiling pipelines**
Rationale: this makes the feature much more useful immediately (LLM prompts, notebook analysis, external scripts) without adding a profiling command. It stays within your non-goals and increases adoption.
```diff
@@ pub struct NotesArgs {
+ /// Output format
+ #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
+ pub format: String,
@@
- if robot_mode {
+ if robot_mode || args.format == "json" || args.format == "jsonl" || args.format == "csv" {
print_list_notes_json(...)
} else {
print_list_notes(&result);
}
@@ ### Work Chunk 1C: Human & Robot Output Formatting
+Add `print_list_notes_csv()` and `print_list_notes_jsonl()`:
+- CSV columns mirror `NoteListRowJson` field names
+- JSONL emits one note object per line for streaming pipelines
```
8. **Strengthen verification with idempotence + migration data-preservation checks**
Rationale: this feature touches ingestion, migrations, indexing, and regeneration. Add explicit idempotence/perf checks so regressions surface early.
```diff
@@ ## Verification Checklist
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
+cargo test test_note_ingestion_idempotent_across_two_syncs
+cargo test test_note_document_count_stable_after_second_generate_docs_full
@@
+lore sync
+lore generate-docs --full
+lore -J stats > /tmp/stats1.json
+lore generate-docs --full
+lore -J stats > /tmp/stats2.json
+# assert note doc count unchanged and dirty queue drains to zero
```
If you want, I can turn this into a fully rewritten PRD v2 draft with these changes merged in-place and renumbered work chunks end-to-end.


@@ -0,0 +1,162 @@
These are the highest-impact revisions I'd make. They avoid everything in your `## Rejected Recommendations` list.
1. Add immediate note-document deletion propagation (don't wait for `generate-docs --full`)
Why: right now, deleted notes can leave stale `source_type='note'` documents until a full rebuild. That creates incorrect search/reporting results and weakens trust in the dataset.
```diff
@@ Phase 0: Stable Note Identity
+### Work Chunk 0B: Immediate Deletion Propagation
+
+When sweep deletes stale notes, propagate deletion to documents in the same transaction.
+Do not rely on eventual cleanup via `generate-docs --full`.
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_sweep_deletes_note_documents_immediately() { ... }
+#[test]
+fn test_mr_note_sweep_deletes_note_documents_immediately() { ... }
+
+#### Implementation
+Use `DELETE ... RETURNING id, is_system` in note sweep functions.
+For returned non-system note ids:
+1) `DELETE FROM documents WHERE source_type='note' AND source_id=?`
+2) `DELETE FROM dirty_sources WHERE source_type='note' AND source_id=?`
```
2. Add one-time upgrade backfill for existing notes (migration 024)
Why: existing DBs will otherwise only get note-documents for changed/new notes. Historical notes remain invisible unless users manually run full rebuild.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)
+
+Create migration `024_note_dirty_backfill.sql`:
+INSERT INTO dirty_sources (source_type, source_id, queued_at)
+SELECT 'note', n.id, unixepoch('now') * 1000
+FROM notes n
+LEFT JOIN documents d
+ ON d.source_type='note' AND d.source_id=n.id
+WHERE n.is_system=0 AND d.id IS NULL
+ON CONFLICT(source_type, source_id) DO NOTHING;
+
+Add migration test asserting idempotence and expected queue size.
```
3. Fix `--since/--until` semantics and validation
Why: reusing `parse_since` for `until` creates ambiguous windows and off-by-boundary behavior; your own example `--since 90d --until 180d` is chronologically reversed.
```diff
@@ Work Chunk 1A: Data Types & Query Layer
- since: parse_since(since_str) then n.created_at >= ?
- until: parse_since(until_str) then n.created_at <= ?
+ since: parse_since_start_bound(since_str) then n.created_at >= ?
+ until: parse_until_end_bound(until_str) then n.created_at <= ?
+ Validate since <= until; otherwise return a clear user error.
+
+#### Tests to Write First
+#[test] fn test_query_notes_invalid_time_window_rejected() { ... }
+#[test] fn test_query_notes_until_date_is_end_of_day_inclusive() { ... }
```
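The window validation and the inclusive end-of-day bound are small enough to sketch without a date library. This assumes both bounds have already been resolved to epoch milliseconds in UTC; the function names are illustrative, not the `parse_since_start_bound` / `parse_until_end_bound` helpers named in the diff.

```rust
const DAY_MS: i64 = 86_400_000;

/// Inclusive end-of-day bound for a date given as midnight UTC in epoch ms.
fn end_of_day_inclusive(midnight_ms: i64) -> i64 {
    midnight_ms + DAY_MS - 1
}

/// Rejects windows whose lower bound lies after the upper bound.
/// A missing bound on either side leaves the window open and is accepted.
fn validate_window(since_ms: Option<i64>, until_ms: Option<i64>) -> Result<(), String> {
    match (since_ms, until_ms) {
        (Some(s), Some(u)) if s > u => Err(format!(
            "--since ({s}) is later than --until ({u}); swap the values"
        )),
        _ => Ok(()),
    }
}
```

With relative inputs resolved against the current time, `--since 90d --until 180d` produces a lower bound that is later than the upper bound, so it is rejected, while `--since 180d --until 90d` passes.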
4. Separate semantic-change detection from housekeeping updates
Why: current proposed `WHERE` includes `updated_at`, which will cause unnecessary dirty churn. You want `last_seen_at` to always refresh, but regeneration only when searchable semantics changed.
```diff
@@ Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
- OR notes.updated_at IS NOT excluded.updated_at
+ -- updated_at-only changes should not mark semantic dirty
+
+Perform two-step logic:
+1) Upsert always updates persistence/housekeeping fields (`updated_at`, `last_seen_at`).
+2) `changed_semantics` is computed only from fields used by note documents/search filters
+ (body, note_type, resolved flags, paths, author, parent linkage).
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() { ... }
```
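The two-step logic can be illustrated with an in-memory model. This is a sketch only: the structs and the `upsert` function are hypothetical, the semantic field list is a subset of the one named in the diff, and a real implementation would do the comparison in SQL or against the stored row.

```rust
/// Fields that feed note documents and search filters (subset, for illustration).
#[derive(Debug, Clone, PartialEq, Eq)]
struct NoteSemantics {
    body: String,
    note_type: Option<String>,
    resolved: bool,
    position_new_path: Option<String>,
    author_username: String,
}

/// Full stored row: semantic fields plus housekeeping timestamps.
#[derive(Debug, Clone)]
struct NoteRow {
    semantics: NoteSemantics,
    updated_at: i64,
    last_seen_at: i64,
}

struct UpsertOutcome {
    changed_semantics: bool,
}

/// Step 1: always persist the incoming row, including housekeeping fields.
/// Step 2: report a semantic change only when a document-relevant field differs.
fn upsert(existing: &mut Option<NoteRow>, incoming: NoteRow) -> UpsertOutcome {
    let changed_semantics = match existing.as_ref() {
        Some(old) => old.semantics != incoming.semantics,
        None => true, // a brand-new note always needs a document
    };
    *existing = Some(incoming);
    UpsertOutcome { changed_semantics }
}
```

A sync that only bumps `updated_at` and `last_seen_at` therefore refreshes the row without enqueueing the note for regeneration.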
5. Make indexes align with actual query collation and join strategy
Why: `author` uses `COLLATE NOCASE`; without collation-aware index, SQLite can skip index use. Also, IID filters via scalar subqueries are harder for planner than direct join predicates.
```diff
@@ Work Chunk 1E: Composite Query Index
-CREATE INDEX ... ON notes(project_id, author_username, created_at DESC, id DESC) WHERE is_system = 0;
+CREATE INDEX ... ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC) WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_discussions_issue_id ON discussions(issue_id);
+CREATE INDEX IF NOT EXISTS idx_discussions_mr_id ON discussions(merge_request_id);
```
```diff
@@ Work Chunk 1A: query_notes()
- d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)
+ i.iid = ? AND i.project_id = ?
- d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?)
+ m.iid = ? AND m.project_id = ?
```
6. Replace manual CSV escaping with `csv` crate
Why: manual RFC4180 escaping is fragile (quotes/newlines/multi-byte edge cases). This is exactly where a mature library reduces long-term bug risk.
```diff
@@ Work Chunk 1C: Human & Robot Output Formatting
- Uses a minimal CSV writer (no external dependency — the format is simple enough for manual escaping).
+ Uses `csv::Writer` for RFC4180-compliant escaping and stable output across edge cases.
+
+#### Tests to Write First
+#[test] fn test_csv_output_multiline_and_quotes_roundtrip() { ... }
```
7. Add `--contains` lexical body filter to `lore notes`
Why: useful middle ground between metadata filtering and semantic search; great for reviewer-pattern mining without requiring FTS query syntax.
```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
+/// Filter by case-insensitive substring in note body
+#[arg(long, help_heading = "Filters")]
+pub contains: Option<String>,
```
```diff
@@ Work Chunk 1A: NoteListFilters
+ pub contains: Option<&'a str>,
@@ query_notes dynamic filters
+ if let Some(contains) = contains {
+     where_clauses.push("n.body LIKE ? COLLATE NOCASE");
+     params.push(format!("%{}%", escape_like(contains)));
+ }
```
8. Reduce note-document embedding noise by slimming metadata header
Why: current verbose key-value header repeats low-signal tokens and consumes embedding budget. Keep context, but bias tokens toward actual review text.
```diff
@@ Work Chunk 2C: Note Document Extractor
- Build content with structured metadata header:
- [[Note]]
- source_type: note
- note_gitlab_id: ...
- project: ...
- ...
- --- Body ---
- {body}
+ Build content with compact, high-signal layout:
+ [[Note]]
+ @{author} on {Issue#|MR!}{iid} in {project_path}
+ path: {path:line} (only when available)
+ state: {resolved|unresolved} (only when resolvable)
+
+ {body}
+
+Keep detailed metadata in structured document columns/labels/paths/url,
+not repeated in verbose text.
```
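A sketch of a renderer for the compact layout above. The input struct and its field names are assumptions; the optional `path:` and `state:` lines are emitted only when the corresponding data is present, as the layout specifies.

```rust
/// Inputs for one note document; field names are assumptions for this sketch.
struct NoteDocInput<'a> {
    author: &'a str,
    parent_is_mr: bool,
    parent_iid: i64,
    project_path: &'a str,
    path_line: Option<(&'a str, u32)>,
    resolved: Option<bool>, // None when the note is not resolvable
    body: &'a str,
}

/// Renders the compact header followed by a blank line and the note body.
fn render_note_doc(n: &NoteDocInput<'_>) -> String {
    let sigil = if n.parent_is_mr { "MR!" } else { "Issue#" };
    let mut out = format!(
        "[[Note]]\n@{} on {}{} in {}\n",
        n.author, sigil, n.parent_iid, n.project_path
    );
    if let Some((path, line)) = n.path_line {
        out.push_str(&format!("path: {path}:{line}\n"));
    }
    if let Some(resolved) = n.resolved {
        let state = if resolved { "resolved" } else { "unresolved" };
        out.push_str(&format!("state: {state}\n"));
    }
    out.push('\n');
    out.push_str(n.body);
    out
}
```

For a plain issue comment the header collapses to two lines, so nearly all of the embedded text is the review comment itself.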
9. Add explicit performance regression checks for the new hot paths
Why: this feature increases document volume ~4x; you should pin acceptable query behavior now so future changes don't silently degrade.
```diff
@@ Verification Checklist
+Performance/plan checks:
+1) `EXPLAIN QUERY PLAN` for:
+ - author+since query
+ - project+date query
+ - for-mr / for-issue query
+2) Seed 50k-note synthetic fixture and assert:
+ - `lore notes --author ... --limit 100` stays under agreed local threshold
+ - `lore search --type note ...` remains deterministic and completes successfully
```
If you want, I can also provide a fully merged “iteration 3” PRD text with these edits applied end-to-end so you can drop it in directly.


@@ -0,0 +1,187 @@
1. **Canonical note identity for documents: use `notes.gitlab_id` as `source_id`**
Why this is better: the current plan still couples document identity to local row IDs. Even with upsert+sweep, local IDs are a storage artifact and can be reused in edge cases. Using GitLab note IDs as canonical document IDs makes regeneration, backfill, and deletion propagation more stable and portable.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Phase 0: Stable Note Identity
-Phase 2 depends on `notes.id` as the `source_id` for note documents.
+Phase 2 uses `notes.gitlab_id` as the `source_id` for note documents.
+`notes.id` remains an internal relational key only.
@@ Work Chunk 0A
pub struct NoteUpsertOutcome {
pub local_note_id: i64,
+ pub document_source_id: i64, // notes.gitlab_id
pub changed_semantics: bool,
}
@@ Work Chunk 2D
-if !note.is_system && outcome.changed_semantics {
- dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
+if !note.is_system && outcome.changed_semantics {
+ dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.document_source_id)?;
}
@@ Work Chunk 2E
-SELECT 'note', n.id, ?1
+SELECT 'note', n.gitlab_id, ?1
@@ Work Chunk 2H
-ON d.source_type = 'note' AND d.source_id = n.id
+ON d.source_type = 'note' AND d.source_id = n.gitlab_id
```
2. **Prevent false deletions on partial/incomplete syncs**
Why this is better: sweep-based deletion is correct only when a discussion's notes were fully fetched. If a page fails mid-fetch, current logic can incorrectly delete valid notes. Add an explicit “fetch complete” guard before sweep.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Phase 0
+### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
+
+Only run stale-note sweep when note pagination completed successfully for that discussion.
+If fetch is partial/interrupted, skip sweep and keep prior notes intact.
+#### Tests to Write First
+#[test]
+fn test_partial_fetch_does_not_sweep_notes() { /* ... */ }
+
+#[test]
+fn test_complete_fetch_runs_sweep_notes() { /* ... */ }
+#### Implementation
+if discussion_fetch_complete {
+ sweep_stale_issue_notes(...)?;
+} else {
+ tracing::warn!("Skipping stale sweep for discussion {} due to partial fetch", discussion_gitlab_id);
+}
```
3. **Make deletion propagation set-based (not per-note loop)**
Why this is better: the current per-note DELETE loop is O(N) statements and gets slow on large threads. A temp-table/CTE set-based delete is faster, simpler to reason about, and remains atomic.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Work Chunk 0B Implementation
- for note_id in stale_note_ids {
- conn.execute("DELETE FROM documents WHERE source_type = 'note' AND source_id = ?", [note_id])?;
- conn.execute("DELETE FROM dirty_sources WHERE source_type = 'note' AND source_id = ?", [note_id])?;
- }
+ CREATE TEMP TABLE _stale_note_source_ids(source_id INTEGER PRIMARY KEY) WITHOUT ROWID;
+ INSERT INTO _stale_note_source_ids
+ SELECT gitlab_id
+ FROM notes
+ WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0;
+
+ DELETE FROM notes
+ WHERE discussion_id = ? AND last_seen_at < ?;
+
+ DELETE FROM documents
+ WHERE source_type = 'note'
+ AND source_id IN (SELECT source_id FROM _stale_note_source_ids);
+
+ DELETE FROM dirty_sources
+ WHERE source_type = 'note'
+ AND source_id IN (SELECT source_id FROM _stale_note_source_ids);
+
+ DROP TABLE _stale_note_source_ids;
```
4. **Fix project-scoping and time-window semantics in `lore notes`**
Why this is better: the plan currently has a contradiction: clap `requires = "project"` blocks use of `defaultProject`, while query layer says default fallback is allowed. Also, `since/until` parsing should use one shared “now” to avoid subtle drift and inverted windows.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Work Chunk 1B NotesArgs
-#[arg(long = "for-issue", ..., requires = "project")]
+#[arg(long = "for-issue", ...)]
pub for_issue: Option<i64>;
-#[arg(long = "for-mr", ..., requires = "project")]
+#[arg(long = "for-mr", ...)]
pub for_mr: Option<i64>;
@@ Work Chunk 1A Query Notes
-- `since`: `parse_since(since_str)` then `n.created_at >= ?`
-- `until`: `parse_since(until_str)` then `n.created_at <= ?`
+- Parse `since` and `until` with a single anchored `now_ms` captured once per command.
+- If user supplies `YYYY-MM-DD` for `--until`, interpret as end-of-day (23:59:59.999 UTC).
+- Validate `since <= until` after both parse with same anchor.
```
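The time-window rules above can be sketched as follows. This is an illustration under stated assumptions: the `Bound` enum and `resolve_window` are hypothetical, date strings are assumed to have been converted to midnight-UTC epoch milliseconds already, and relative durations are limited to whole days for brevity.

```rust
pub const DAY_MS: i64 = 86_400_000;

/// Hypothetical parsed bound (not the project's `parse_since` output).
#[derive(Clone, Copy)]
pub enum Bound {
    /// A relative duration such as `7d`, resolved against the shared anchor.
    DaysAgo(i64),
    /// A `YYYY-MM-DD` date, already converted to midnight UTC in epoch ms.
    DateUtcMidnightMs(i64),
}

fn resolve_since(b: Bound, now_ms: i64) -> i64 {
    match b {
        Bound::DaysAgo(d) => now_ms - d * DAY_MS,
        Bound::DateUtcMidnightMs(ms) => ms,
    }
}

fn resolve_until(b: Bound, now_ms: i64) -> i64 {
    match b {
        Bound::DaysAgo(d) => now_ms - d * DAY_MS,
        // A bare date for --until means end of that day: 23:59:59.999 UTC.
        Bound::DateUtcMidnightMs(ms) => ms + DAY_MS - 1,
    }
}

/// Resolves both bounds against one `now_ms` captured once per command,
/// then rejects inverted windows.
pub fn resolve_window(
    since: Option<Bound>,
    until: Option<Bound>,
    now_ms: i64,
) -> Result<(Option<i64>, Option<i64>), String> {
    let s = since.map(|b| resolve_since(b, now_ms));
    let u = until.map(|b| resolve_until(b, now_ms));
    if let (Some(s), Some(u)) = (s, u) {
        if s > u {
            return Err(format!("--since ({}) is after --until ({})", s, u));
        }
    }
    Ok((s, u))
}
```

Because both bounds share `now_ms`, a pair like `--since 1d --until 1d` always resolves to the same instant instead of drifting by the time spent between two clock reads.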
5. **Add an analytics mode (not a profile command): `lore notes --aggregate`**
Why this is better: this directly supports the stated use case (review patterns) without introducing the rejected “profile report” command. It keeps scope narrow and reuses existing filters.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Phase 1
+### Work Chunk 1F: Aggregation Mode for Notes Listing
+
+Add optional aggregation on top of `lore notes`:
+- `--aggregate author|note_type|path|resolution`
+- `--top N` (default 20)
+
+Behavior:
+- Reuses all existing filters (`--since`, `--project`, `--for-mr`, etc.)
+- Returns grouped counts (+ percentage of filtered corpus)
+- Works in table/json/jsonl/csv
+
+Non-goal alignment:
+- This is not a narrative “reviewer profile” command.
+- It is a query primitive for downstream analysis.
```
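A rough sketch of the grouped-count behavior, assuming the filters have already been applied and each remaining note contributes one grouping key (for example its author). `Bucket` and `aggregate_top` are hypothetical names; tie-breaking by key is an assumption added to keep output deterministic.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub struct Bucket {
    pub key: String,
    pub count: usize,
    /// Share of the filtered corpus, in percent.
    pub percent: f64,
}

/// Groups keys, then returns the `top` largest groups with their share
/// of all filtered rows (not just of the rows shown).
pub fn aggregate_top<'a, I>(keys: I, top: usize) -> Vec<Bucket>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    let mut total = 0usize;
    for k in keys {
        *counts.entry(k).or_insert(0) += 1;
        total += 1;
    }
    let mut rows: Vec<(&str, usize)> = counts.into_iter().collect();
    // Count descending, then key ascending so ties are ordered stably.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    rows.into_iter()
        .take(top)
        .map(|(key, count)| Bucket {
            key: key.to_string(),
            count,
            percent: count as f64 * 100.0 / total as f64,
        })
        .collect()
}
```

In a real implementation the grouping would more likely be a SQL `GROUP BY` over the filtered query; the in-memory version only illustrates the output contract.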
6. **Prevent note backfill from starving other document regeneration**
Why this is better: after migration/backfill, note dirty entries can dominate the queue and delay issue/MR/discussion updates. Add source-type fairness in regenerator scheduling.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Work Chunk 2D
+#### Scheduling Revision
+Process dirty sources with weighted fairness instead of strict FIFO:
+- issue: 3
+- merge_request: 3
+- discussion: 2
+- note: 1
+
+Implementation sketch:
+- fetch next batch by source_type buckets
+- interleave according to weights
+- preserve retry semantics per source
+#### Tests to Write First
+#[test]
+fn test_note_backfill_does_not_starve_issue_and_mr_regeneration() { /* ... */ }
```
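The weighted interleave in the implementation sketch above might look like this. It is one possible reading, assuming each source type has its own FIFO queue and that a round takes up to `weight` items from each queue in turn; `interleave_weighted` is a hypothetical name and retry handling is left out.

```rust
use std::collections::VecDeque;

/// Drains up to `batch` items from per-source queues, taking at most
/// `weight` items from each queue per round.
pub fn interleave_weighted<T>(mut buckets: Vec<(usize, VecDeque<T>)>, batch: usize) -> Vec<T> {
    let mut out = Vec::new();
    loop {
        let mut progressed = false;
        for (weight, queue) in buckets.iter_mut() {
            for _ in 0..*weight {
                if out.len() >= batch {
                    return out;
                }
                match queue.pop_front() {
                    Some(item) => {
                        out.push(item);
                        progressed = true;
                    }
                    // This source is exhausted; move on to the next one.
                    None => break,
                }
            }
        }
        // Every queue is empty (or every weight is zero): nothing left to do.
        if !progressed {
            return out;
        }
    }
}
```

With weights 3 for issues and 1 for notes, a large note backlog can take at most one slot in four while issue work is pending, and it gets the full batch once the other queues are empty.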
7. **Harden migration 023: remove invalid SQL assertions and move integrity checks to tests**
Why this is better: SQLite only accepts `RAISE(ABORT, ...)` inside a trigger program, so using it in a standalone `SELECT` is invalid. Keep migration SQL minimal/portable and enforce invariants in migration tests.
```diff
--- a/PRD.md
+++ b/PRD.md
@@ Work Chunk 2A Migration SQL
--- Step 10: Integrity verification
-SELECT CASE
- WHEN ... THEN RAISE(ABORT, '...')
-END;
+-- Step 10 removed from SQL migration.
+-- Integrity verification is enforced in migration tests:
+-- 1) pre/post row-count equality
+-- 2) `PRAGMA foreign_key_check` is empty
+-- 3) documents_fts row count matches documents row count after rebuild
@@ Work Chunk 2A Tests
+#[test]
+fn test_migration_023_integrity_checks_pass() {
+ // pre/post counts, foreign_key_check empty, fts parity
+}
```
These 7 revisions improve correctness under failure, reduce churn risk, improve large-sync performance, and make the feature materially more useful for reviewer-analysis workflows without reintroducing any rejected recommendations.


@@ -0,0 +1,190 @@
Here are the highest-impact revisions I'd make. None of these repeat anything in your `## Rejected Recommendations`.
1. **Add immutable reviewer identity (`author_id`) as a first-class key**
Why this improves the plan: the PRD's core use case is year-scale reviewer profiling. Usernames are mutable in GitLab, so username-only filtering will fragment one reviewer into multiple identities over time. Adding `author_id` closes that correctness hole and makes historical analysis reliable.
```diff
@@ Problem Statement
-1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread
+1. **Query individual notes by reviewer identity** — support both mutable username and immutable GitLab `author_id` for stable longitudinal analysis
@@ Phase 0: Stable Note Identity
+### Work Chunk 0D: Immutable Author Identity Capture
+**Files:** `migrations/025_notes_author_id.sql`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/cli/commands/list.rs`
+
+#### Implementation
+- Add nullable `notes.author_id INTEGER` and backfill from future syncs.
+- Populate `author_id` from GitLab note payload (`note.author.id`) on both issue and MR note ingestion paths.
+- Add `--author-id <int>` filter to `lore notes`.
+- Keep `--author` for ergonomics; when both provided, require both to match.
+
+#### Indexing
+- Add `idx_notes_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0;`
+
+#### Tests
+- `test_query_notes_filter_author_id_survives_username_change`
+- `test_query_notes_author_and_author_id_intersection`
```
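The "require both to match" rule can be sketched as a predicate. `NoteRow` and `matches_author` are hypothetical; `author_id` is modeled as optional because the column is nullable until a later sync fills it in, and exact (case-sensitive) username comparison is an assumption.

```rust
/// Hypothetical row shape for illustration.
pub struct NoteRow<'a> {
    pub author_username: &'a str,
    /// None for rows synced before the `author_id` column was populated.
    pub author_id: Option<i64>,
}

/// Applies `--author` and `--author-id` as an intersection:
/// each supplied filter must match, absent filters match everything.
pub fn matches_author(note: &NoteRow, username: Option<&str>, author_id: Option<i64>) -> bool {
    let name_ok = username.map_or(true, |u| note.author_username == u);
    let id_ok = author_id.map_or(true, |id| note.author_id == Some(id));
    name_ok && id_ok
}
```

A note written under an old username still matches `--author-id`, which is the longitudinal case the first test name above describes.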
2. **Strengthen partial-fetch safety from a boolean to an explicit fetch state contract**
Why this improves the plan: `fetch_complete: bool` is easy to misuse and fragile under retries/crashes. A run-scoped state model makes sweep correctness auditable and prevents accidental deletions when ingestion aborts midway.
```diff
@@ Phase 0: Stable Note Identity
-### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
+### Work Chunk 0C: Sweep Safety Guard with Run-Scoped Fetch State
@@ Implementation
-Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
+Add a run-scoped fetch state:
+- `FetchState::Complete`
+- `FetchState::Partial`
+- `FetchState::Failed`
+
+Only run sweep on `FetchState::Complete`.
+Persist `run_seen_at` once per sync run and pass unchanged through all discussion/note upserts.
+Require `run_seen_at` monotonicity per discussion before sweep (skip and warn otherwise).
@@ Tests to Write First
+#[test]
+fn test_failed_fetch_never_sweeps_even_after_partial_upserts() { ... }
+#[test]
+fn test_non_monotonic_run_seen_at_skips_sweep() { ... }
+#[test]
+fn test_retry_after_failed_fetch_then_complete_sweeps_correctly() { ... }
```
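A minimal sketch of the fetch-state contract, assuming the sweep decision is a pure function of the fetch state and two timestamps. `SweepDecision` and `decide_sweep` are hypothetical, and treating an equal `run_seen_at` as acceptable (so a retry within the same run can still sweep) is an assumption the plan does not settle.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchState {
    Complete,
    Partial,
    Failed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SweepDecision {
    Sweep,
    /// Sweep skipped; the string is a reason suitable for a warning log.
    Skip(&'static str),
}

/// Decides whether the stale-note sweep may run for one discussion.
/// `last_seen_at` is the newest `run_seen_at` previously recorded for it.
pub fn decide_sweep(state: FetchState, run_seen_at: i64, last_seen_at: Option<i64>) -> SweepDecision {
    match state {
        FetchState::Partial => SweepDecision::Skip("partial fetch"),
        FetchState::Failed => SweepDecision::Skip("failed fetch"),
        FetchState::Complete => match last_seen_at {
            // A run timestamp older than what is stored would sweep fresh rows.
            Some(prev) if run_seen_at < prev => SweepDecision::Skip("non-monotonic run_seen_at"),
            _ => SweepDecision::Sweep,
        },
    }
}
```

Making the three states explicit forces every call site to handle `Partial` and `Failed`, which a bare `bool` does not.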
3. **Add DB-level cleanup triggers for note-document referential integrity**
Why this improves the plan: Work Chunk 0B handles the sweep path, but not every possible delete path. DB triggers give defense-in-depth so stale note docs cannot survive even if a future code path deletes notes differently.
```diff
@@ Work Chunk 0B: Immediate Deletion Propagation
-Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL
+Keep set-based SQL in sweep functions, and add DB-level cleanup triggers as a safety net.
@@ Work Chunk 2A: Schema Migration (023)
+-- Cleanup trigger: deleting a non-system note must delete note document + dirty queue row
+CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
+WHEN old.is_system = 0
+BEGIN
+ DELETE FROM documents
+ WHERE source_type = 'note' AND source_id = old.id;
+ DELETE FROM dirty_sources
+ WHERE source_type = 'note' AND source_id = old.id;
+END;
+
+-- Cleanup trigger: if note flips to system, remove its document artifacts
+CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
+WHEN old.is_system = 0 AND new.is_system = 1
+BEGIN
+ DELETE FROM documents
+ WHERE source_type = 'note' AND source_id = new.id;
+ DELETE FROM dirty_sources
+ WHERE source_type = 'note' AND source_id = new.id;
+END;
```
4. **Eliminate N+1 extraction cost with parent metadata caching in regeneration**
Why this improves the plan: backfilling ~8k notes with per-note parent/label lookups creates avoidable query amplification. Batch caching turns repeated joins into one-time lookups per parent entity and materially reduces rebuild time.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
+**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
+
+#### Implementation
+- Add `NoteExtractionContext` cache keyed by `(noteable_type, parent_id)` containing:
+ - parent iid/title/url
+ - parent labels
+ - project path
+- In batch regeneration, prefetch parent metadata for note IDs in the current chunk.
+- Use cached metadata in `extract_note_document()` to avoid repeated parent/label queries.
+
+#### Tests
+- `test_note_regeneration_uses_parent_cache_consistently`
+- `test_note_regeneration_cache_hit_preserves_hash_determinism`
```
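The cache keyed by `(noteable_type, parent_id)` could be sketched as below. Note this version memoizes lazily on first use rather than prefetching a whole chunk, which is a simplification of the plan; `NoteableType`, `ParentMeta`, and `get_or_load` are hypothetical shapes, and the `loads` counter exists only so the behavior is observable.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum NoteableType {
    Issue,
    MergeRequest,
}

/// Hypothetical parent metadata bundle (iid, title, labels, project path).
#[derive(Debug, Clone, PartialEq)]
pub struct ParentMeta {
    pub iid: i64,
    pub title: String,
    pub labels: Vec<String>,
    pub project_path: String,
}

pub struct NoteExtractionContext {
    cache: HashMap<(NoteableType, i64), ParentMeta>,
    /// Number of times the loader actually ran (cache misses).
    pub loads: usize,
}

impl NoteExtractionContext {
    pub fn new() -> Self {
        NoteExtractionContext { cache: HashMap::new(), loads: 0 }
    }

    /// Returns cached metadata, running `load` (e.g. the DB queries)
    /// only the first time a given parent is requested.
    pub fn get_or_load<F>(&mut self, kind: NoteableType, parent_id: i64, load: F) -> &ParentMeta
    where
        F: FnOnce() -> ParentMeta,
    {
        let loads = &mut self.loads;
        self.cache.entry((kind, parent_id)).or_insert_with(|| {
            *loads += 1;
            load()
        })
    }
}
```

For a thread with many notes under one MR, the parent and label queries run once instead of once per note.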
5. **Add embedding dedup cache keyed by semantic text hash**
Why this improves the plan: note docs will contain repeated short comments (“LGTM”, “nit: …”). Current doc-level hashing includes metadata, so identical semantic comments still re-embed many times. A semantic embedding hash cache cuts cost and speeds full rebuild/backfill without changing search behavior.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2J: Semantic Embedding Dedup for Notes
+**Files:** `migrations/026_embedding_cache.sql`, embedding pipeline module(s), `src/documents/extractor.rs`
+
+#### Implementation
+- Compute `embedding_text` for notes as: normalized note body + compact stable context (`parent_type`, `path`, `resolution`), excluding volatile fields.
+- Compute `embedding_hash = sha256(embedding_text)`.
+- Before embedding generation, lookup existing vector by `(model, embedding_hash)`.
+- Reuse cached vector when present; only call embedding model on misses.
+
+#### Tests
+- `test_identical_note_bodies_reuse_embedding_vector`
+- `test_embedding_hash_changes_when_semantic_context_changes`
```
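The lookup-before-embed flow can be sketched in memory. Two things here are stand-ins, not the plan: the standard library's `DefaultHasher` replaces `sha256` so the example needs no external crate (a 64-bit hash can collide, so a real cache should use the stronger digest), and whitespace collapsing is only an assumed form of "normalized". `EmbeddingCache` and `embed` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Builds the text that gets embedded: stable context plus the body
/// with whitespace collapsed. Volatile fields are deliberately absent.
pub fn embedding_text(body: &str, parent_type: &str, path: Option<&str>, resolution: Option<&str>) -> String {
    let normalized = body.split_whitespace().collect::<Vec<_>>().join(" ");
    format!("{}|{}|{}|{}", parent_type, path.unwrap_or(""), resolution.unwrap_or(""), normalized)
}

/// Stand-in for sha256(embedding_text); see the caveat in the lead-in.
fn text_hash(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

pub struct EmbeddingCache {
    vectors: HashMap<(String, u64), Vec<f32>>,
    /// Number of times the embedding model was actually invoked.
    pub model_calls: usize,
}

impl EmbeddingCache {
    pub fn new() -> Self {
        EmbeddingCache { vectors: HashMap::new(), model_calls: 0 }
    }

    /// Returns the cached vector for `(model, hash(text))`, calling
    /// `embed_fn` only on a miss.
    pub fn embed<F>(&mut self, model: &str, text: &str, embed_fn: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        let key = (model.to_string(), text_hash(text));
        if let Some(v) = self.vectors.get(&key) {
            return v.clone();
        }
        self.model_calls += 1;
        let v = embed_fn(text);
        self.vectors.insert(key, v.clone());
        v
    }
}
```

Including the model name in the key keeps vectors from different embedding models from being mixed after a model change.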
6. **Add deterministic review-signal tags as derived labels**
Why this improves the plan: this makes output immediately more useful for reviewer-pattern analysis without adding a profile command (which is explicitly out of scope). It increases practical value of both `lore notes` and `lore search --type note` with low complexity.
```diff
@@ Non-Goals
-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
+- Adding a "reviewer profile" report command (downstream), while allowing low-level derived signal tags as indexing primitives
@@ Phase 2: Per-Note Documents
+### Work Chunk 2K: Derived Review Signal Labels
+**Files:** `src/documents/extractor.rs`
+
+#### Implementation
+- Derive deterministic labels from note text + metadata:
+ - `signal:nit`
+ - `signal:blocking`
+ - `signal:security`
+ - `signal:performance`
+ - `signal:testing`
+- Attach via existing `document_labels` flow for note documents.
+- No new CLI mode required; existing label filters can consume these labels.
+
+#### Tests
+- `test_note_document_derives_signal_labels_nit`
+- `test_note_document_derives_signal_labels_security`
+- `test_signal_label_derivation_is_deterministic`
```
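Deterministic label derivation could be as simple as a fixed keyword table. The label names come from the list above; the keyword lists and the function name `derive_signal_labels` are illustrative assumptions, and plain substring matching will produce some false positives that a real rule set would need to tune.

```rust
/// Derives `signal:*` labels from note text using a fixed rule table.
/// Output order follows the table, so equal input always yields equal output.
pub fn derive_signal_labels(body: &str) -> Vec<&'static str> {
    // (label, keywords): the keyword lists are illustrative assumptions.
    const RULES: &[(&str, &[&str])] = &[
        ("signal:blocking", &["blocking", "must fix"]),
        ("signal:nit", &["nit:", "nitpick"]),
        ("signal:performance", &["performance", "slow", "n+1"]),
        ("signal:security", &["security", "injection", "xss"]),
        ("signal:testing", &["add a test", "missing test", "test coverage"]),
    ];
    let text = body.to_lowercase();
    RULES
        .iter()
        .filter(|(_, keywords)| keywords.iter().any(|k| text.contains(*k)))
        .map(|(label, _)| *label)
        .collect()
}
```

Because the rules are static and order-preserving, regenerating a document cannot change its labels unless the note text changes, which keeps document hashes stable.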
7. **Add high-precision note targeting filters (`--note-id`, `--gitlab-note-id`, `--discussion-id`)**
Why this improves the plan: debugging, incident response, and reproducibility all benefit from exact addressing. This is especially useful when validating sync correctness and cross-checking a specific note/document lifecycle.
```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
pub struct NotesArgs {
+ /// Filter by local note row id
+ #[arg(long = "note-id", help_heading = "Filters")]
+ pub note_id: Option<i64>,
+
+ /// Filter by GitLab note id
+ #[arg(long = "gitlab-note-id", help_heading = "Filters")]
+ pub gitlab_note_id: Option<i64>,
+
+ /// Filter by local discussion id
+ #[arg(long = "discussion-id", help_heading = "Filters")]
+ pub discussion_id: Option<i64>,
}
@@ Work Chunk 1A: Filter struct
pub struct NoteListFilters<'a> {
+ pub note_id: Option<i64>,
+ pub gitlab_note_id: Option<i64>,
+ pub discussion_id: Option<i64>,
}
@@ Tests to Write First
+#[test]
+fn test_query_notes_filter_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_gitlab_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_discussion_id_exact() { ... }
```
If you want, I can produce a single consolidated “iteration 5” PRD diff that merges these into your exact section ordering and updates the dependency graph/migration numbering end-to-end.

2411
docs/prd-per-note-search.md Normal file

File diff suppressed because it is too large

541
docs/user-journeys.md Normal file

@@ -0,0 +1,541 @@
# Lore CLI User Journeys
## Purpose
Map realistic workflows for both human users and AI agents to identify gaps in the command surface and optimization opportunities. Each journey starts with a **problem** and traces the commands needed to reach a **resolution**.
---
## Part 1: Human User Flows
### H1. Morning Standup Prep
**Problem:** "What happened since yesterday? I need to know what moved before standup."
**Flow:**
```
lore sync -q # Refresh data (quiet, no noise)
lore issues -s opened --since 1d # Issues that changed overnight
lore mrs -s opened --since 1d # MRs that moved
lore who @me # My current workload snapshot
```
**Gap identified:** No single "activity feed" command. User runs 3 queries to get what should be one view. No `--since 1d` shorthand for "since yesterday." No `@me` alias for the authenticated user.
---
### H2. Sprint Planning: What's Ready to Pick Up?
**Problem:** "We're planning the next sprint. What's open, unassigned, and actionable?"
**Flow:**
```
lore issues -s opened -p myproject # All open issues
lore issues -s opened -l "ready" # Issues labeled ready
lore issues -s opened --has-due # Issues with deadlines approaching
lore count issues -p myproject # How many total?
```
**Gap identified:** No way to filter by "unassigned" issues (missing `--no-assignee` flag). No way to sort by due date. No way to see priority/weight. Can't combine filters like "opened AND no assignee AND has due date."
---
### H3. Investigating a Production Incident
**Problem:** "Deploy broke prod. I need the full timeline of what changed around the deploy."
**Flow:**
```
lore sync -q # Get latest
lore timeline "deploy" --since 7d # What happened around deploys
lore search "deploy" --type mr # MRs mentioning deploy
lore mrs 456 # Inspect the suspicious MR
lore who --overlap src/deploy/ # Who else touches deploy code
```
**Gap identified:** Timeline is keyword-based, not event-based. Can't filter by "MRs merged in the last 24 hours" directly. No way to see which MRs were merged between two dates (release diff). Would benefit from `lore mrs -s merged --since 1d`.
---
### H4. Preparing to Review Someone's MR
**Problem:** "I was assigned to review MR !789. I need context before diving in."
**Flow:**
```
lore mrs 789 # Read the MR description + discussions
lore mrs 789 -o # Open in browser for the actual diff
lore who src/features/auth/ # Who are the experts in this area?
lore search "auth refactor" --type issue # Related issues for background
lore timeline "authentication" # History of auth changes
```
**Gap identified:** No way to see the file list touched by an MR from the CLI (data is stored in `mr_file_changes` but not surfaced). No way to link an MR back to its closing issue(s) from the MR detail view. The cross-reference data exists in `entity_references` but isn't shown in `mrs <iid>` output.
---
### H5. Onboarding to an Unfamiliar Code Area
**Problem:** "I'm new to the team and need to understand how the billing module works."
**Flow:**
```
lore search "billing" -n 20 # What exists about billing?
lore who src/billing/ # Who knows billing best?
lore timeline "billing" --depth 2 # History of billing changes
lore mrs -s merged -l billing --since 6m # Recent merged billing work
lore issues -s opened -l billing # Outstanding billing issues
```
**Gap identified:** No way to get a "module overview" in one command. The search spans issues, MRs, and discussions but doesn't summarize by category. No way to see the most-discussed or most-referenced entities (high-signal items for understanding).
---
### H6. Finding the Right Reviewer for My PR
**Problem:** "I'm about to submit a PR touching auth and payments. Who should review?"
**Flow:**
```
lore who src/features/auth/ # Auth experts
lore who src/features/payments/ # Payment experts
lore who @candidate1 # Check candidate1's workload
lore who @candidate2 # Check candidate2's workload
```
**Gap identified:** No way to query multiple paths at once (`lore who src/auth/ src/payments/`). No way to find the intersection of expertise. No workload-aware recommendation ("who knows this AND has bandwidth"). Four separate commands for what should be one decision.
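Until such a command exists, the intersection and workload ranking would have to happen client-side. The sketch below assumes the caller has already collected an expert list per path and an open-item count per user from separate `lore who` runs; `shared_experts` and both input shapes are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns users who appear as experts for every queried path,
/// ordered by open workload (lightest first), then by name.
pub fn shared_experts<'a>(
    experts_by_path: &[Vec<&'a str>],
    open_items: &BTreeMap<&'a str, usize>,
) -> Vec<String> {
    let mut lists = experts_by_path.iter();
    let first: BTreeSet<&'a str> = match lists.next() {
        Some(list) => list.iter().copied().collect(),
        None => return Vec::new(),
    };
    // Intersect the expert sets of all remaining paths.
    let common = lists.fold(first, |acc, list| {
        let set: BTreeSet<&'a str> = list.iter().copied().collect();
        acc.intersection(&set).copied().collect()
    });
    let mut out: Vec<&'a str> = common.into_iter().collect();
    // Users with no recorded workload are treated as having zero open items.
    out.sort_by_key(|user| (open_items.get(user).copied().unwrap_or(0), *user));
    out.into_iter().map(String::from).collect()
}
```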
---
### H7. Understanding Why a Feature Was Built This Way
**Problem:** "This code is weird. Why was it implemented like this? What was the original discussion?"
**Flow:**
```
lore search "feature-name rationale" # Search for decision context
lore timeline "feature-name" --depth 2 # Full history with cross-refs
lore issues 234 # Read the original issue
lore mrs 567 # Read the implementation MR
```
**Gap identified:** No way to search within a specific issue's or MR's discussion notes. The search covers documents (titles + descriptions) but per-note search isn't available yet (PRD exists). No way to navigate "issue 234 was closed by MR 567" without manually knowing both IDs.
---
### H8. Checking Team Workload Before Assigning Work
**Problem:** "I need to assign this urgent bug. Who has the least on their plate?"
**Flow:**
```
lore who @alice # Alice's workload
lore who @bob # Bob's workload
lore who @carol # Carol's workload
lore who @dave # Dave's workload
```
**Gap identified:** No team-level workload view. Must query each person individually. No way to list "all assignees and their open issue counts." No concept of a team roster. Would benefit from `lore who --team` or `lore workload`.
---
### H9. Preparing Release Notes
**Problem:** "We're cutting a release. I need to summarize what's in this version."
**Flow:**
```
lore mrs -s merged --since 2w -p myproject # MRs merged since last release
lore issues -s closed --since 2w -p myproject # Issues closed since last release
lore mrs -s merged -l feature --since 2w # Feature MRs specifically
lore mrs -s merged -l bugfix --since 2w # Bugfix MRs
```
**Gap identified:** No way to filter MRs by milestone for version-based releases: `issues` has `-m` for milestone but `mrs` does not. No changelog generation. No "what closed between tag A and tag B." No grouping by label for release note categories.
---
### H10. Finding and Closing Stale Issues
**Problem:** "Our backlog is bloated. Which issues haven't been touched in months?"
**Flow:**
```
lore issues -s opened --sort updated --asc -n 50 # Oldest-updated first
# Then manually inspect each one...
lore issues 42 # Is this still relevant?
```
**Gap identified:** No `--before` or `--updated-before` filter (only `--since` exists). Can sort ascending but can't filter "not updated in 90 days." No staleness indicator. No bulk operations concept.
---
### H11. Understanding a Bug's Full History
**Problem:** "Bug #321 keeps getting reopened. I need to understand its entire lifecycle."
**Flow:**
```
lore issues 321 # Read the issue
lore timeline "bug-keyword" -p myproject # Try to find timeline events
# But timeline is keyword-based, not entity-based...
```
**Gap identified:** No way to get a timeline for a specific entity by IID. `lore timeline` requires a keyword query, not an entity reference. Would benefit from `lore timeline --issue 321` or `lore timeline --mr 456` to get the event history of a specific entity directly.
---
### H12. Identifying Who to Ask About Failing Tests
**Problem:** "CI tests are failing in `src/lib/parser.rs`. Who last touched this?"
**Flow:**
```
lore who src/lib/parser.rs # Expert lookup
lore who --overlap src/lib/parser.rs # Who else has touched it
lore search "parser" --type mr --since 2w # Recent MRs touching parser
```
**Gap identified:** Expert mode uses DiffNote analysis (code review comments), not actual file change tracking. The `mr_file_changes` table has the real data but `who` doesn't use it for attribution. Could be much more accurate with file-change-based expertise.
---
### H13. Tracking a Feature Across Multiple MRs
**Problem:** "The 'dark mode' feature spans 5 MRs. I need to see them all together."
**Flow:**
```
lore mrs -l dark-mode # MRs with the label
lore issues -l dark-mode # Related issues
lore timeline "dark mode" --depth 2 # Cross-referenced events
```
**Gap identified:** Works reasonably well with labels as the grouping mechanism. But if the team didn't label consistently, there's no way to discover related MRs by content similarity. No "related items" view that combines issues + MRs + discussions for a topic.
---
### H14. Checking if a Similar Fix Was Already Attempted
**Problem:** "Before I implement this fix, was something similar tried before?"
**Flow:**
```
lore search "memory leak connection pool" # Semantic search
lore search "connection pool" --type mr -s all # Wait, no state filter on search
lore mrs -s closed -l bugfix # Closed bugfix MRs (coarse)
lore timeline "connection pool" # Historical context
```
**Gap identified:** Search doesn't have a `--state` filter. Can't search only closed/merged items. The semantic search is powerful but can't be combined with entity state. Would benefit from `--state merged` on search to find past attempts.
---
### H15. Reviewing Discussions That Need My Attention
**Problem:** "Which discussion threads am I involved in that are still unresolved?"
**Flow:**
```
lore who --active # All active unresolved discussions
lore who --active --since 30d # Wider window
# But can't filter to "discussions I'm in"...
```
**Gap identified:** `--active` shows all unresolved discussions, not filtered by participant. No way to say "show me discussions where @me participated." No notification/mention tracking. No "my unresolved threads" view.
---
## Part 2: AI Agent Flows
### A1. Context Gathering Before Code Modification
**Problem:** Agent is about to modify `src/features/auth/session.rs` and needs full context.
**Flow:**
```
lore -J health # Pre-flight check
lore -J who src/features/auth/ # Who knows this area
lore -J search "auth session" -n 10 # Related issues/MRs
lore -J mrs -s merged --since 3m -l auth # Recent auth changes
lore -J who --overlap src/features/auth/session.rs # Concurrent work risk
```
**Gap identified:** No way to check "are there open MRs touching this file right now?" The overlap mode shows historical touches, not active branches. An agent needs to know about in-flight changes to avoid conflicts.
---
### A2. Auto-Triaging an Incoming Issue
**Problem:** Agent receives a new issue and needs to categorize it, find related work, and suggest assignees.
**Flow:**
```
lore -J issues 999 # Read the new issue
lore -J search "$(extract_keywords)" --explain # Find similar past issues
lore -J who src/affected/path/ # Suggest experts as assignees
lore -J issues -s opened -l same-label # Check for duplicates
```
**Gap identified:** No way to get just the description text for programmatic keyword extraction. `issues <iid>` returns full detail including discussions. Agent must parse the full response to extract the description for a secondary search. Would benefit from `--fields description` on detail view. No duplicate detection built in.
---
### A3. Generating Sprint Status Report
**Problem:** Agent needs to produce a weekly status report for the team.
**Flow:**
```
lore -J issues -s closed --since 1w --fields minimal # Completed work
lore -J issues -s opened --status "In progress" # In-flight work
lore -J mrs -s merged --since 1w --fields minimal # Merged PRs
lore -J mrs -s opened -D --fields minimal # Open non-draft MRs
lore -J count issues # Totals
lore -J count mrs # MR totals
lore -J who --active --since 1w # Discussions needing attention
```
**Gap identified:** Seven separate queries for one report. No `lore summary` or `lore report` command. No way to get "issues transitioned from X to Y this week" (state change history exists in events but isn't queryable). No velocity metric (issues closed per week trend).
---
### A4. Finding Relevant Prior Art Before Implementing
**Problem:** Agent is implementing a caching layer and wants to find if similar patterns exist in the codebase's GitLab history.
**Flow:**
```
lore -J search "caching" --mode hybrid -n 20 --explain
lore -J search "cache invalidation" --mode hybrid -n 10
lore -J search "redis" --mode lexical --type discussion # Exact term in discussions
lore -J timeline "cache" --since 1y # Wait, max is 1y? Let's try 12m
```
**Gap identified:** No way to search discussion notes individually (per-note search). Discussions are aggregated into documents, so individual note-level matches are lost. The `--explain` flag helps but doesn't show which specific note matched. No `--since 1y` or `--since 12m` duration format.
---
### A5. Building Context for PR Description
**Problem:** Agent wrote code and needs to generate a PR description that references relevant issues.
**Flow:**
```
lore -J search "feature description keywords" --type issue
lore -J issues -s opened -l feature-label --fields iid,title,web_url
# Cross-reference: which issues does this MR close?
# No command for this -- must manually scan search results
```
**Gap identified:** No way to query the `entity_references` table directly. Agent can't ask "which issues reference MR !456" or "which issues contain 'closes #123' in their text." The data exists but isn't exposed as a query surface. Would benefit from `lore refs --mr 456` or `lore refs --issue 123`.
---
### A6. Identifying Affected Experts for Review Assignment
**Problem:** Agent needs to automatically assign reviewers based on the files changed in an MR.
**Flow:**
```
lore -J mrs 456 # Get MR details
# Parse file paths from response... but file changes aren't in the output
lore -J who src/path/from/mr/ # Query each path
lore -J who src/another/path/ # One at a time...
lore -J who @candidate --fields minimal # Check workload
```
**Gap identified:** MR detail view (`mrs <iid>`) doesn't include the file change list from `mr_file_changes`. Agent can't programmatically extract which files an MR touches. Must fall back to GitLab API or guess from description. The `who` command doesn't accept multiple paths. No "auto-reviewer" suggestion combining expertise + availability.
---
### A7. Incident Investigation and Timeline Reconstruction
**Problem:** Agent needs to reconstruct what happened during an outage for a postmortem.
**Flow:**
```
lore -J timeline "outage" --since 3d --depth 2 --expand-mentions
lore -J search "error 500" --since 3d
lore -J mrs -s merged --since 3d -p production-service
lore -J issues --status "In progress" -p production-service
```
**Gap identified:** Timeline is keyword-seeded, which means if the outage wasn't described with that exact term, seeds may miss it. No way to seed a timeline from an entity ID (e.g., "start from issue #321 and expand outward"). No severity/priority filter. No way to correlate with merge times.
---
### A8. Cross-Project Impact Assessment
**Problem:** Agent needs to understand how a breaking API change in project A affects projects B and C.
**Flow:**
```
lore -J search "api-endpoint-name" -p project-a
lore -J search "api-endpoint-name" -p project-b
lore -J search "api-endpoint-name" -p project-c
# Or without project filter to search everywhere:
lore -J search "api-endpoint-name" -n 50
lore -J timeline "api-endpoint-name" --depth 2
```
**Gap identified:** Cross-project references in entity_references are tracked but the timeline shows unresolved references for entities not synced locally. No way to see a cross-project dependency map. Search works across projects but doesn't group results by project.
---
### A9. Automated Stale Issue Recommendations
**Problem:** Agent runs weekly to identify issues that should be closed or re-prioritized.
**Flow:**
```
lore -J issues -s opened --sort updated --asc -n 100 # Oldest first
# For each issue, check:
lore -J issues <iid> # Read details
lore -J search "<issue title keywords>" # Any recent activity?
```
**Gap identified:** No `--updated-before` filter, so agent must fetch all and filter client-side. No way to detect "issue has no assignee AND no activity in 90 days." The 100-issue limit means pagination is needed for large backlogs, but there's no cursor/offset pagination -- only `--limit`. Agent must do N+1 queries to inspect each candidate.
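The client-side filtering this gap forces on the agent might look like the following. `IssueRow` and its fields are hypothetical stand-ins for whatever the agent parses out of the JSON output; the actual field names are not specified here.

```rust
pub const DAY_MS: i64 = 86_400_000;

/// Hypothetical parsed row; the real JSON field names may differ.
pub struct IssueRow {
    pub iid: i64,
    pub updated_at_ms: i64,
    pub has_assignee: bool,
}

/// Returns IIDs of unassigned issues with no update in `max_idle_days`.
pub fn stale_unassigned(rows: &[IssueRow], now_ms: i64, max_idle_days: i64) -> Vec<i64> {
    let cutoff = now_ms - max_idle_days * DAY_MS;
    rows.iter()
        .filter(|r| !r.has_assignee && r.updated_at_ms < cutoff)
        .map(|r| r.iid)
        .collect()
}
```

An `--updated-before` filter would push the cutoff comparison into the query and remove the need to fetch rows that are immediately discarded.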
---
### A10. Code Review Preparation (File-Level Context)
**Problem:** Agent is reviewing MR !789 and needs to understand the history of each changed file.
**Flow:**
```
lore -J mrs 789 # Get MR details
# Can't get file list from output...
# Fall back to search by MR title keywords
lore -J search "feature-from-mr" --type mr
lore -J who src/guessed/path/ # Expertise for each file
lore -J who --overlap src/guessed/path/ # Concurrent changes
```
**Gap identified:** Same as A6 -- `mr_file_changes` data isn't exposed. Agent is blind to the actual files in the MR unless it parses the description or uses the GitLab API directly. This is the single biggest gap for automated code review workflows.
---
### A11. Building a Knowledge Graph of Entity Relationships
**Problem:** Agent wants to map how issues, MRs, and discussions are connected for a feature.
**Flow:**
```
lore -J search "feature-name" -n 30
lore -J timeline "feature-name" --depth 2 --max-entities 100
# Timeline shows expanded entities and cross-refs, but...
# No way to query entity_references directly
# No way to get "all entities that reference issue #123"
```
**Gap identified:** The `entity_references` table (closes, related, mentioned) is used internally by timeline but isn't queryable as a standalone command. Agent can't ask "what closes issue #123?" or "what does MR !456 reference?" No graph export. Exposing these references would enable dependency mapping.
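To make the missing reverse lookup concrete, here is a minimal Python sketch against an in-memory SQLite table. The column names (`source_type`, `source_iid`, `target_type`, `target_iid`, `reference_type`) and the `referencing` helper are hypothetical; only the table name and the three reference types come from the text above, and the real schema may differ (for example, internal row ids instead of iids).

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_references (
        source_type    TEXT,     -- 'issue' or 'merge_request'
        source_iid     INTEGER,
        target_type    TEXT,
        target_iid     INTEGER,
        reference_type TEXT      -- 'closes', 'related', or 'mentioned'
    )
""")
conn.executemany(
    "INSERT INTO entity_references VALUES (?, ?, ?, ?, ?)",
    [
        ("merge_request", 456, "issue", 123, "closes"),
        ("merge_request", 460, "issue", 123, "mentioned"),
        ("issue", 130, "issue", 123, "related"),
        ("merge_request", 456, "issue", 200, "mentioned"),
    ],
)


def referencing(conn, target_type, target_iid, reference_type=None):
    """Entities that point at the target, optionally filtered by reference type."""
    sql = (
        "SELECT source_type, source_iid, reference_type "
        "FROM entity_references WHERE target_type = ? AND target_iid = ?"
    )
    params = [target_type, target_iid]
    if reference_type is not None:
        sql += " AND reference_type = ?"
        params.append(reference_type)
    sql += " ORDER BY source_type, source_iid"
    return conn.execute(sql, params).fetchall()


# "What closes issue #123?" and "what references issue #123 at all?"
closers = referencing(conn, "issue", 123, "closes")
all_refs = referencing(conn, "issue", 123)
```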
---
### A12. Release Readiness Assessment
**Problem:** Agent needs to verify all issues in milestone "v2.0" are closed and MRs are merged.
**Flow:**
```
lore -J issues -m "v2.0" -s opened # Any open issues in milestone?
lore -J issues -m "v2.0" -s closed # Closed issues
# MRs don't have milestone filter...
lore -J mrs -s opened -l "v2.0" # Try label as proxy
lore -J who --active -p myproject # Unresolved discussions
```
**Gap identified:** MRs don't have a `--milestone` filter (issues do). No way to check "all MRs linked to issues in milestone v2.0" -- would require joining `entity_references` with issue milestone. No release checklist concept. No way to verify "every issue in this milestone has a closing MR."
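The "every issue in this milestone has a closing MR" check amounts to an anti-join between issues and closing references. The Python sketch below shows one way it could be expressed, using in-memory SQLite tables with hypothetical columns (`iid`, `milestone`, `state`, and the `source_*`/`target_*` reference columns); the real schema is not shown in this document and may differ.

```python
import sqlite3

# Hypothetical, simplified tables and made-up rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues (iid INTEGER, milestone TEXT, state TEXT);
    CREATE TABLE entity_references (
        source_type TEXT, source_iid INTEGER,
        target_type TEXT, target_iid INTEGER,
        reference_type TEXT
    );
    INSERT INTO issues VALUES
        (1, 'v2.0', 'closed'), (2, 'v2.0', 'opened'),
        (3, 'v2.0', 'closed'), (4, 'v1.9', 'opened');
    INSERT INTO entity_references VALUES
        ('merge_request', 10, 'issue', 1, 'closes'),
        ('merge_request', 11, 'issue', 2, 'mentioned');
""")


def issues_without_closing_mr(conn, milestone):
    """Issue iids in the milestone that no MR references with 'closes'."""
    rows = conn.execute(
        """
        SELECT i.iid
        FROM issues AS i
        LEFT JOIN entity_references AS r
               ON r.target_type = 'issue'
              AND r.target_iid = i.iid
              AND r.source_type = 'merge_request'
              AND r.reference_type = 'closes'
        WHERE i.milestone = ? AND r.source_iid IS NULL
        ORDER BY i.iid
        """,
        (milestone,),
    ).fetchall()
    return [iid for (iid,) in rows]


# Issue 2 is only 'mentioned' and issue 3 has no references at all.
missing = issues_without_closing_mr(conn, "v2.0")
```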
---
### A13. Answering "What Changed?" Between Two Points
**Problem:** Agent needs to diff project state between two dates for a stakeholder report.
**Flow:**
```
lore -J issues -s closed --since 2w --fields minimal # Recently closed
lore -J issues -s opened --since 2w --fields minimal # Recently opened
lore -J mrs -s merged --since 2w --fields minimal # Recently merged
# But no way to get "issues that CHANGED STATE" in a window
# An issue opened 3 months ago but closed yesterday won't appear in --since 2w for issues -s opened
```
**Gap identified:** `--since` filters by `updated_at`, not by "state changed at." An issue created 6 months ago but closed yesterday does appear in `issues -s closed --since 1d` (because closing it changed `updated_at`), but so would any closed issue whose `updated_at` changed for another reason, so the result is not a list of state transitions. No explicit "state transitions in time window" query. The `resource_state_events` table has this data but it's not exposed as a filter.
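A "state transitions in a time window" query could be as simple as a range scan over the events table. The Python sketch below runs against an in-memory SQLite table whose columns (`issue_iid`, `state`, `created_at` as ISO 8601 UTC text) are assumptions made for illustration; the real `resource_state_events` layout and timestamp encoding may differ, and the rows are made up.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource_state_events (
        issue_iid INTEGER, state TEXT, created_at TEXT
    );
    INSERT INTO resource_state_events VALUES
        (101, 'closed',   '2026-02-10T12:00:00Z'),
        (102, 'closed',   '2025-11-01T08:00:00Z'),
        (103, 'reopened', '2026-02-05T09:30:00Z'),
        (101, 'reopened', '2025-12-20T10:00:00Z');
""")


def transitions_between(conn, start, end):
    """State events with start <= created_at < end, oldest first.

    Relies on same-format ISO 8601 UTC strings sorting chronologically.
    """
    return conn.execute(
        """
        SELECT issue_iid, state, created_at
        FROM resource_state_events
        WHERE created_at >= ? AND created_at < ?
        ORDER BY created_at
        """,
        (start, end),
    ).fetchall()


# Two-week window: only events inside it are returned, regardless of
# when the issue itself was created or last updated.
recent = transitions_between(conn, "2026-01-29T00:00:00Z", "2026-02-12T00:00:00Z")
```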
---
### A14. Meeting Prep: Summarize Recent Activity for a Stakeholder
**Problem:** Agent needs to prepare a 2-minute summary for a project sponsor meeting.
**Flow:**
```
lore -J count issues -p project # Current totals
lore -J count mrs -p project # MR totals
lore -J issues -s closed --since 1w -p project --fields minimal
lore -J mrs -s merged --since 1w -p project --fields minimal
lore -J issues -s opened --status "In progress" -p project
lore -J who --active -p project --since 1w
```
**Gap identified:** Six queries, same as A3. No summary/dashboard command. Agent must synthesize all responses. No trend data (is the open issue count growing or shrinking?). No "highlights" extraction.
---
### A15. Determining If Work Is Safe to Start (Conflict Detection)
**Problem:** Agent is about to start work on an issue and needs to check nobody else is already working on it.
**Flow:**
```
lore -J issues 123 # Read the issue
# Check assignees from response
lore -J mrs -s opened -A other-person # Are they working on related MRs?
lore -J who --overlap src/target/path/ # Anyone actively touching these files?
lore -J search "issue-123-keywords" --type mr -s opened # Wait, search has no --state
```
**Gap identified:** No way to check "is there an open MR that closes issue #123?" -- the entity_references data exists but isn't queryable. Search doesn't support `--state` filter. No "conflict detection" or "in-flight work" check. Agent must do multiple queries and manually correlate.
---
## Part 3: Gap Summary
### Critical Gaps (high impact, blocks common workflows)
| # | Gap | Affected Flows | Suggested Command/Flag |
|---|-----|----------------|----------------------|
| 1 | **MR file changes not surfaced** | H4, A6, A10 | `lore mrs <iid> --files` or include in detail view |
| 2 | **Entity references not queryable** | H7, A5, A11, A15 | `lore refs --issue 123` / `lore refs --mr 456` |
| 3 | **Per-note search missing** | H7, A4 | `lore search --granularity note` (PRD exists) |
| 4 | **No entity-based timeline** | H11, A7 | `lore timeline --issue 321` / `lore timeline --mr 456` |
| 5 | **No @me / current-user alias** | H1, H15 | Resolve from auth token automatically |
### Important Gaps (significant friction, multiple workarounds needed)
| # | Gap | Affected Flows | Suggested Command/Flag |
|---|-----|----------------|----------------------|
| 6 | **No activity feed / summary** | H1, A3, A14 | `lore activity --since 1d` or `lore summary` |
| 7 | **No multi-path who query** | H6, A6 | `lore who src/path1/ src/path2/` |
| 8 | **No --state filter on search** | H14, A15 | `lore search --state merged` |
| 9 | **MRs missing --milestone filter** | H9, A12 | `lore mrs -m "v2.0"` |
| 10 | **No --no-assignee / --unassigned** | H2 | `lore issues --no-assignee` |
| 11 | **No --updated-before filter** | H10, A9 | `lore issues --before 90d` or `--stale 90d` |
| 12 | **No team workload view** | H8 | `lore who --team` or `lore workload` |
### Nice-to-Have Gaps (would improve agent efficiency)
| # | Gap | Affected Flows | Suggested Command/Flag |
|---|-----|----------------|----------------------|
| 13 | **No pagination/offset** | A9 | `--offset 100` for large result sets |
| 14 | **No detail --fields on show** | A2 | `lore issues 999 --fields description` |
| 15 | **No cross-project grouping** | A8 | `lore search --group-by project` |
| 16 | **No trend/velocity metrics** | A3, A14 | `lore trends issues --period week` |
| 17 | **No --for-issue on mrs** | A12, A15 | `lore mrs --closes 123` (query `entity_references`) |
| 18 | **1y/12m duration not supported** | A4 | Support `1y`, `12m`, `365d` in --since |
| 19 | **No discussion participant filter** | H15 | `lore who --active --participant @me` |
| 20 | **No sort by due date** | H2 | `lore issues --sort due` |