gitlore/docs/prd-per-note-search.feedback-3.md
teernisse e9bacc94e1 perf: force partial index for DiffNote queries (26-75x), batch stats counts (1.7x)
who.rs: Add INDEXED BY idx_notes_diffnote_path_created to all DiffNote
query paths (expert, expert_details, reviews, path probes, suffix_probe).
SQLite planner was choosing idx_notes_system (106K rows, 38%) over the
partial index (26K rows, 9.3%) when LIKE predicates are present.
Measured: expert 1561ms->59ms (26x), reviews ~1200ms->16ms (75x).

stats.rs: Replace 12+ sequential COUNT(*) queries with conditional
aggregates (SUM(CASE WHEN...)) and use FTS5 shadow table
(documents_fts_docsize) instead of virtual table for counting.
Measured: warm 109ms->65ms (1.68x).
2026-02-12 09:59:46 -05:00

These are the highest-impact revisions I'd make. They avoid everything in your `## Rejected Recommendations` list.
1. Add immediate note-document deletion propagation (don't wait for `generate-docs --full`)
Why: right now, deleted notes can leave stale `source_type='note'` documents until a full rebuild. That creates incorrect search/reporting results and weakens trust in the dataset.
```diff
@@ Phase 0: Stable Note Identity
+### Work Chunk 0B: Immediate Deletion Propagation
+
+When sweep deletes stale notes, propagate deletion to documents in the same transaction.
+Do not rely on eventual cleanup via `generate-docs --full`.
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_sweep_deletes_note_documents_immediately() { ... }
+#[test]
+fn test_mr_note_sweep_deletes_note_documents_immediately() { ... }
+
+#### Implementation
+Use `DELETE ... RETURNING id, is_system` in note sweep functions.
+For returned non-system note ids:
+1) `DELETE FROM documents WHERE source_type='note' AND source_id=?`
+2) `DELETE FROM dirty_sources WHERE source_type='note' AND source_id=?`
```
2. Add one-time upgrade backfill for existing notes (migration 024)
Why: existing DBs will otherwise only get note documents for changed or new notes. Historical notes remain invisible unless users manually run a full rebuild.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)
+
+Create migration `024_note_dirty_backfill.sql`:
+INSERT INTO dirty_sources (source_type, source_id, queued_at)
+SELECT 'note', n.id, unixepoch('now') * 1000
+FROM notes n
+LEFT JOIN documents d
+ ON d.source_type='note' AND d.source_id=n.id
+WHERE n.is_system=0 AND d.id IS NULL
+ON CONFLICT(source_type, source_id) DO NOTHING;
+
+Add migration test asserting idempotence and expected queue size.
```
3. Fix `--since/--until` semantics and validation
Why: reusing `parse_since` for `until` creates ambiguous windows and off-by-boundary behavior; your own example `--since 90d --until 180d` is chronologically reversed.
```diff
@@ Work Chunk 1A: Data Types & Query Layer
- since: parse_since(since_str) then n.created_at >= ?
- until: parse_since(until_str) then n.created_at <= ?
+ since: parse_since_start_bound(since_str) then n.created_at >= ?
+ until: parse_until_end_bound(until_str) then n.created_at <= ?
+ Validate since <= until; otherwise return a clear user error.
+
+#### Tests to Write First
+#[test] fn test_query_notes_invalid_time_window_rejected() { ... }
+#[test] fn test_query_notes_until_date_is_end_of_day_inclusive() { ... }
```
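To make the boundary semantics concrete, here is a minimal std-only sketch of the two parsers and the window check. The function names `parse_since_start_bound` and `parse_until_end_bound` come from the diff above; everything else (`resolve_window`, the `now_ms` parameter, UTC day boundaries, and the reading of `Nd` as "N days before now" for both bounds) is an assumption for illustration, not the project's actual behavior.

```rust
/// Milliseconds in one UTC day.
const DAY_MS: i64 = 86_400_000;

/// Days since 1970-01-01 for a proleptic Gregorian date
/// (the well-known "days from civil" algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let mp = if m > 2 { m - 3 } else { m + 9 };
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parses `YYYY-MM-DD` into the first millisecond of that UTC day.
/// Note: day-of-month is only range-checked (1..=31), not checked per month.
fn parse_date_to_day_start_ms(s: &str) -> Option<i64> {
    let mut parts = s.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d) * DAY_MS)
}

/// Parses a relative duration such as `90d` into a day count.
fn parse_relative_days(s: &str) -> Option<i64> {
    s.strip_suffix('d')?.parse::<i64>().ok().filter(|n| *n >= 0)
}

/// Lower bound: a date maps to the start of that day.
pub fn parse_since_start_bound(s: &str, now_ms: i64) -> Option<i64> {
    match parse_relative_days(s) {
        Some(n) => Some(now_ms - n * DAY_MS),
        None => parse_date_to_day_start_ms(s),
    }
}

/// Upper bound: a date maps to the last millisecond of that day (inclusive).
pub fn parse_until_end_bound(s: &str, now_ms: i64) -> Option<i64> {
    match parse_relative_days(s) {
        Some(n) => Some(now_ms - n * DAY_MS),
        None => parse_date_to_day_start_ms(s).map(|start| start + DAY_MS - 1),
    }
}

/// Resolves both bounds and rejects reversed windows with a clear error.
pub fn resolve_window(
    since: Option<&str>,
    until: Option<&str>,
    now_ms: i64,
) -> Result<(Option<i64>, Option<i64>), String> {
    let lo = since
        .map(|s| parse_since_start_bound(s, now_ms)
            .ok_or_else(|| format!("invalid --since value: {}", s)))
        .transpose()?;
    let hi = until
        .map(|s| parse_until_end_bound(s, now_ms)
            .ok_or_else(|| format!("invalid --until value: {}", s)))
        .transpose()?;
    if let (Some(a), Some(b)) = (lo, hi) {
        if a > b {
            return Err("--since must not be later than --until".to_string());
        }
    }
    Ok((lo, hi))
}
```

Under these assumptions, `--since 90d --until 180d` resolves to a lower bound 90 days ago and an upper bound 180 days ago, so it is rejected rather than silently returning nothing.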
4. Separate semantic-change detection from housekeeping updates
Why: the currently proposed `WHERE` clause includes `updated_at`, which will cause unnecessary dirty churn. You want `last_seen_at` to always refresh, but regeneration should happen only when searchable semantics change.
```diff
@@ Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
- OR notes.updated_at IS NOT excluded.updated_at
+ -- updated_at-only changes should not mark semantic dirty
+
+Perform two-step logic:
+1) Upsert always updates persistence/housekeeping fields (`updated_at`, `last_seen_at`).
+2) `changed_semantics` is computed only from fields used by note documents/search filters
+ (body, note_type, resolved flags, paths, author, parent linkage).
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() { ... }
```
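The two-step logic can be sketched in memory as follows. This is an illustration only: the `NoteRow` struct, its field names, and `apply_upsert` are hypothetical stand-ins for whatever the real upsert touches, and the real implementation would most likely express the comparison in SQL rather than in Rust.

```rust
/// Hypothetical in-memory view of a note row (field names are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct NoteRow {
    // Fields that feed note documents and search filters.
    pub body: String,
    pub note_type: Option<String>,
    pub resolved: bool,
    pub position_path: Option<String>,
    pub author_username: String,
    pub discussion_id: i64,
    // Housekeeping fields: always refreshed, never a reason to regenerate.
    pub updated_at: i64,
    pub last_seen_at: i64,
}

impl NoteRow {
    /// Projection containing only the semantic fields.
    fn semantic_key(&self) -> (&str, Option<&str>, bool, Option<&str>, &str, i64) {
        (
            self.body.as_str(),
            self.note_type.as_deref(),
            self.resolved,
            self.position_path.as_deref(),
            self.author_username.as_str(),
            self.discussion_id,
        )
    }
}

/// True only when a field used by documents or filters differs.
pub fn changed_semantics(old: &NoteRow, new: &NoteRow) -> bool {
    old.semantic_key() != new.semantic_key()
}

/// Step 1: always persist the incoming row (including housekeeping fields).
/// Step 2: report whether the note should be marked dirty for regeneration.
pub fn apply_upsert(stored: &mut NoteRow, incoming: NoteRow) -> bool {
    let dirty = changed_semantics(stored, &incoming);
    *stored = incoming;
    dirty
}
```

Keeping the semantic projection in one place means a new searchable field only has to be added to `semantic_key` for dirty tracking to pick it up.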
5. Make indexes align with actual query collation and join strategy
Why: `author` uses `COLLATE NOCASE`; without a collation-aware index, SQLite can skip index use. Also, IID filters via scalar subqueries are harder for the planner to optimize than direct join predicates.
```diff
@@ Work Chunk 1E: Composite Query Index
-CREATE INDEX ... ON notes(project_id, author_username, created_at DESC, id DESC) WHERE is_system = 0;
+CREATE INDEX ... ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC) WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_discussions_issue_id ON discussions(issue_id);
+CREATE INDEX IF NOT EXISTS idx_discussions_mr_id ON discussions(merge_request_id);
```
```diff
@@ Work Chunk 1A: query_notes()
- d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)
+ i.iid = ? AND i.project_id = ?
- d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?)
+ m.iid = ? AND m.project_id = ?
```
6. Replace manual CSV escaping with `csv` crate
Why: manual RFC4180 escaping is fragile (quotes/newlines/multi-byte edge cases). This is exactly where a mature library reduces long-term bug risk.
```diff
@@ Work Chunk 1C: Human & Robot Output Formatting
- Uses a minimal CSV writer (no external dependency — the format is simple enough for manual escaping).
+ Uses `csv::Writer` for RFC4180-compliant escaping and stable output across edge cases.
+
+#### Tests to Write First
+#[test] fn test_csv_output_multiline_and_quotes_roundtrip() { ... }
```
7. Add `--contains` lexical body filter to `lore notes`
Why: useful middle ground between metadata filtering and semantic search; great for reviewer-pattern mining without requiring FTS query syntax.
```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
+/// Filter by case-insensitive substring in note body
+#[arg(long, help_heading = "Filters")]
+pub contains: Option<String>,
```
```diff
@@ Work Chunk 1A: NoteListFilters
+ pub contains: Option<&'a str>,
@@ query_notes dynamic filters
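+ if let Some(needle) = contains {
+ // SQLite LIKE is already case-insensitive for ASCII; ESCAPE makes the backslashes from escape_like effective
+ where_clauses.push("n.body LIKE ? ESCAPE '\\'");
+ params.push(format!("%{}%", escape_like(needle)));
+ }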
```
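The diff above calls `escape_like` without defining it. A minimal sketch, assuming backslash as the escape character: SQLite has no default escape character for `LIKE`, so the SQL fragment has to carry an explicit `ESCAPE '\'` clause for the escaping to take effect. The helper `contains_clause` is hypothetical and only shows how the fragment and the bound parameter pair up.

```rust
/// Escapes LIKE wildcards so user input matches literally.
/// Assumes the query uses `ESCAPE '\'`.
pub fn escape_like(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        // `%` and `_` are LIKE wildcards; `\` is the escape character itself.
        if matches!(ch, '\\' | '%' | '_') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Hypothetical helper: returns the WHERE fragment and its bound parameter.
pub fn contains_clause(needle: &str) -> (&'static str, String) {
    (
        "n.body LIKE ? ESCAPE '\\'",
        format!("%{}%", escape_like(needle)),
    )
}
```

Note that SQLite's built-in `LIKE` is case-insensitive only for ASCII characters by default, so non-ASCII text in note bodies would still match case-sensitively.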
8. Reduce note-document embedding noise by slimming metadata header
Why: current verbose key-value header repeats low-signal tokens and consumes embedding budget. Keep context, but bias tokens toward actual review text.
```diff
@@ Work Chunk 2C: Note Document Extractor
- Build content with structured metadata header:
- [[Note]]
- source_type: note
- note_gitlab_id: ...
- project: ...
- ...
- --- Body ---
- {body}
+ Build content with compact, high-signal layout:
+ [[Note]]
+ @{author} on {Issue#|MR!}{iid} in {project_path}
+ path: {path:line} (only when available)
+ state: {resolved|unresolved} (only when resolvable)
+
+ {body}
+
+Keep detailed metadata in structured document columns/labels/paths/url,
+not repeated in verbose text.
```
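As a rough illustration of the compact layout, the builder below emits the optional `path:` and `state:` lines only when the data exists. The types `Parent` and `NoteDoc` and the function `build_note_content` are hypothetical names chosen for this sketch; the real extractor will have its own inputs.

```rust
/// Hypothetical parent reference: issue or merge request IID.
pub enum Parent {
    Issue(u64),
    MergeRequest(u64),
}

/// Hypothetical input for building note document text.
pub struct NoteDoc<'a> {
    pub author: &'a str,
    pub parent: Parent,
    pub project_path: &'a str,
    /// File path with an optional line number (diff notes only).
    pub path: Option<(&'a str, Option<u32>)>,
    /// `None` when the note is not resolvable.
    pub resolved: Option<bool>,
    pub body: &'a str,
}

/// Builds the compact layout: one context line, optional path/state lines,
/// a blank line, then the body.
pub fn build_note_content(n: &NoteDoc<'_>) -> String {
    let parent = match n.parent {
        Parent::Issue(iid) => format!("Issue#{}", iid),
        Parent::MergeRequest(iid) => format!("MR!{}", iid),
    };
    let mut out = format!("[[Note]]\n@{} on {} in {}\n", n.author, parent, n.project_path);
    if let Some((path, line)) = n.path {
        match line {
            Some(l) => out.push_str(&format!("path: {}:{}\n", path, l)),
            None => out.push_str(&format!("path: {}\n", path)),
        }
    }
    if let Some(resolved) = n.resolved {
        out.push_str(if resolved { "state: resolved\n" } else { "state: unresolved\n" });
    }
    out.push('\n');
    out.push_str(n.body);
    out
}
```

For a plain issue comment this yields only the `[[Note]]` marker, the context line, a blank line, and the body, so nearly all of the embedded tokens come from the review text itself.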
9. Add explicit performance regression checks for the new hot paths
Why: this feature increases document volume ~4x; you should pin acceptable query behavior now so future changes don't silently degrade it.
```diff
@@ Verification Checklist
+Performance/plan checks:
+1) `EXPLAIN QUERY PLAN` for:
+ - author+since query
+ - project+date query
+ - for-mr / for-issue query
+2) Seed 50k-note synthetic fixture and assert:
+ - `lore notes --author ... --limit 100` stays under agreed local threshold
+ - `lore search --type note ...` remains deterministic and completes successfully
```
If you want, I can also provide a fully merged “iteration 3” PRD text with these edits applied end-to-end so you can drop it in directly.