who.rs: Add INDEXED BY idx_notes_diffnote_path_created to all DiffNote query paths (expert, expert_details, reviews, path probes, suffix_probe). SQLite planner was choosing idx_notes_system (106K rows, 38%) over the partial index (26K rows, 9.3%) when LIKE predicates are present. Measured: expert 1561ms->59ms (26x), reviews ~1200ms->16ms (75x). stats.rs: Replace 12+ sequential COUNT(*) queries with conditional aggregates (SUM(CASE WHEN...)) and use FTS5 shadow table (documents_fts_docsize) instead of virtual table for counting. Measured: warm 109ms->65ms (1.68x).
These are the highest-impact revisions I’d make. They avoid everything in your `## Rejected Recommendations` list.

1. Add immediate note-document deletion propagation (don’t wait for `generate-docs --full`)
Why: right now, deleted notes can leave stale `source_type='note'` documents until a full rebuild. That creates incorrect search/reporting results and weakens trust in the dataset.
```diff
@@ Phase 0: Stable Note Identity
+### Work Chunk 0B: Immediate Deletion Propagation
+
+When sweep deletes stale notes, propagate deletion to documents in the same transaction.
+Do not rely on eventual cleanup via `generate-docs --full`.
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_sweep_deletes_note_documents_immediately() { ... }
+#[test]
+fn test_mr_note_sweep_deletes_note_documents_immediately() { ... }
+
+#### Implementation
+Use `DELETE ... RETURNING id, is_system` in note sweep functions.
+For returned non-system note ids:
+1) `DELETE FROM documents WHERE source_type='note' AND source_id=?`
+2) `DELETE FROM dirty_sources WHERE source_type='note' AND source_id=?`
```

2. Add one-time upgrade backfill for existing notes (migration 024)
Why: existing DBs will otherwise only get note-documents for changed/new notes. Historical notes remain invisible unless users manually run full rebuild.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)
+
+Create migration `024_note_dirty_backfill.sql`:
+INSERT INTO dirty_sources (source_type, source_id, queued_at)
+SELECT 'note', n.id, unixepoch('now') * 1000
+FROM notes n
+LEFT JOIN documents d
+ ON d.source_type='note' AND d.source_id=n.id
+WHERE n.is_system=0 AND d.id IS NULL
+ON CONFLICT(source_type, source_id) DO NOTHING;
+
+Add migration test asserting idempotence and expected queue size.
```

3. Fix `--since/--until` semantics and validation
Why: reusing `parse_since` for `until` creates ambiguous windows and off-by-boundary behavior; your own example `--since 90d --until 180d` is chronologically reversed.
```diff
@@ Work Chunk 1A: Data Types & Query Layer
- since: parse_since(since_str) then n.created_at >= ?
- until: parse_since(until_str) then n.created_at <= ?
+ since: parse_since_start_bound(since_str) then n.created_at >= ?
+ until: parse_until_end_bound(until_str) then n.created_at <= ?
+ Validate since <= until; otherwise return a clear user error.
+
+#### Tests to Write First
+#[test] fn test_query_notes_invalid_time_window_rejected() { ... }
+#[test] fn test_query_notes_until_date_is_end_of_day_inclusive() { ... }
```

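
To make the boundary semantics concrete, here is a minimal sketch of the two parsers plus the window check. It rests on assumptions that are not in your PRD: inputs are either a relative `Nd` offset or an absolute `YYYY-MM-DD` date, timestamps are epoch milliseconds in UTC, and `now_ms` is passed in so tests stay deterministic. The signatures and the `resolve_window` helper are hypothetical; only the two function names come from the diff above.

```rust
/// Milliseconds per day; timestamps are assumed to be epoch milliseconds (UTC).
const DAY_MS: i64 = 86_400_000;

/// Days since 1970-01-01 for a proleptic Gregorian date (standard civil-date arithmetic).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y - era * 400;
    let mp = (m + 9) % 12; // March = 0, so the leap day falls at the end of the cycle
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parses `YYYY-MM-DD` into days since the epoch (no per-month day validation in this sketch).
fn parse_ymd(s: &str) -> Option<i64> {
    let mut parts = s.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d))
}

/// Parses a relative offset such as `90d` into milliseconds.
fn parse_offset_ms(s: &str) -> Option<i64> {
    let n: i64 = s.strip_suffix('d')?.parse().ok()?;
    if n < 0 {
        return None;
    }
    n.checked_mul(DAY_MS)
}

/// Lower bound: a date means the start of that day.
pub fn parse_since_start_bound(s: &str, now_ms: i64) -> Result<i64, String> {
    if let Some(offset) = parse_offset_ms(s) {
        return Ok(now_ms - offset);
    }
    parse_ymd(s)
        .map(|day| day * DAY_MS)
        .ok_or_else(|| format!("invalid --since value: {s}"))
}

/// Upper bound: a date means the last millisecond of that day (inclusive).
pub fn parse_until_end_bound(s: &str, now_ms: i64) -> Result<i64, String> {
    if let Some(offset) = parse_offset_ms(s) {
        return Ok(now_ms - offset);
    }
    parse_ymd(s)
        .map(|day| (day + 1) * DAY_MS - 1)
        .ok_or_else(|| format!("invalid --until value: {s}"))
}

/// Resolves both bounds and rejects reversed windows with a user-facing error.
pub fn resolve_window(
    since: Option<&str>,
    until: Option<&str>,
    now_ms: i64,
) -> Result<(Option<i64>, Option<i64>), String> {
    let since_ms = since.map(|s| parse_since_start_bound(s, now_ms)).transpose()?;
    let until_ms = until.map(|s| parse_until_end_bound(s, now_ms)).transpose()?;
    if let (Some(lo), Some(hi)) = (since_ms, until_ms) {
        if lo > hi {
            return Err("--since must not be later than --until".to_string());
        }
    }
    Ok((since_ms, until_ms))
}
```

With this shape, `--since 90d --until 180d` fails validation, while `--since 180d --until 90d` resolves to a well-ordered window.
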
4. Separate semantic-change detection from housekeeping updates
Why: current proposed `WHERE` includes `updated_at`, which will cause unnecessary dirty churn. You want `last_seen_at` to always refresh, but regeneration only when searchable semantics changed.
```diff
@@ Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
- OR notes.updated_at IS NOT excluded.updated_at
+ -- updated_at-only changes should not mark semantic dirty
+
+Perform two-step logic:
+1) Upsert always updates persistence/housekeeping fields (`updated_at`, `last_seen_at`).
+2) `changed_semantics` is computed only from fields used by note documents/search filters
+ (body, note_type, resolved flags, paths, author, parent linkage).
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() { ... }
```

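
The two-step logic can be sketched in memory as follows. This is only an illustration: the `NoteRow` struct, its field names, and the `HashMap` standing in for the `notes` table are hypothetical, and the real upsert would express the same comparison in SQL.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a row in the notes table.
#[derive(Debug, Clone, PartialEq)]
struct NoteRow {
    // Fields that feed note documents and search filters.
    body: String,
    note_type: Option<String>,
    resolved: bool,
    position_new_path: Option<String>,
    author_username: String,
    discussion_id: i64,
    // Housekeeping fields that must never trigger regeneration on their own.
    updated_at: i64,
    last_seen_at: i64,
}

/// Projects a row onto the fields that define its searchable semantics.
fn semantic_key(n: &NoteRow) -> (&str, Option<&str>, bool, Option<&str>, &str, i64) {
    (
        n.body.as_str(),
        n.note_type.as_deref(),
        n.resolved,
        n.position_new_path.as_deref(),
        n.author_username.as_str(),
        n.discussion_id,
    )
}

/// Step 2 is computed first (against the stored row), then step 1 always
/// overwrites the row so housekeeping fields stay fresh.
/// Returns `changed_semantics`: true for new notes or semantic edits.
fn upsert_note(table: &mut HashMap<i64, NoteRow>, gitlab_id: i64, incoming: NoteRow) -> bool {
    let changed_semantics = match table.get(&gitlab_id) {
        Some(existing) => semantic_key(existing) != semantic_key(&incoming),
        None => true,
    };
    table.insert(gitlab_id, incoming);
    changed_semantics
}
```

Only callers that see `true` would enqueue the note in `dirty_sources`; an `updated_at`-only refresh still updates the stored row but returns `false`.
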
5. Make indexes align with actual query collation and join strategy
Why: `author` uses `COLLATE NOCASE`; without a collation-aware index, SQLite can skip index use. Also, IID filters via scalar subqueries are harder for the planner than direct join predicates.
```diff
@@ Work Chunk 1E: Composite Query Index
-CREATE INDEX ... ON notes(project_id, author_username, created_at DESC, id DESC) WHERE is_system = 0;
+CREATE INDEX ... ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC) WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_discussions_issue_id ON discussions(issue_id);
+CREATE INDEX IF NOT EXISTS idx_discussions_mr_id ON discussions(merge_request_id);
```

```diff
@@ Work Chunk 1A: query_notes()
- d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)
+ i.iid = ? AND i.project_id = ?
- d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?)
+ m.iid = ? AND m.project_id = ?
```

6. Replace manual CSV escaping with `csv` crate
Why: manual RFC4180 escaping is fragile (quotes/newlines/multi-byte edge cases). This is exactly where a mature library reduces long-term bug risk.
```diff
@@ Work Chunk 1C: Human & Robot Output Formatting
- Uses a minimal CSV writer (no external dependency — the format is simple enough for manual escaping).
+ Uses `csv::Writer` for RFC4180-compliant escaping and stable output across edge cases.
+
+#### Tests to Write First
+#[test] fn test_csv_output_multiline_and_quotes_roundtrip() { ... }
```

7. Add `--contains` lexical body filter to `lore notes`
Why: useful middle ground between metadata filtering and semantic search; great for reviewer-pattern mining without requiring FTS query syntax.
```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
+/// Filter by case-insensitive substring in note body
+#[arg(long, help_heading = "Filters")]
+pub contains: Option<String>,
```

```diff
@@ Work Chunk 1A: NoteListFilters
+    pub contains: Option<&'a str>,
@@ query_notes dynamic filters
+    if let Some(needle) = contains {
+        // Assumes escape_like() escapes `%`, `_`, and `\` with a backslash.
+        where_clauses.push("n.body LIKE ? COLLATE NOCASE ESCAPE '\\'");
+        params.push(format!("%{}%", escape_like(needle)));
+    }
```

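
For reference, here is a self-contained sketch of the filter construction. The `escape_like` below is a hypothetical stand-in that assumes a backslash escape character (your existing helper may differ), and `push_contains_filter` is an illustrative name, not something from the PRD. Escaping only takes effect when the SQL also carries a matching `ESCAPE` clause.

```rust
/// Hypothetical stand-in for the project's LIKE-escaping helper.
/// Prefixes `\`, `%`, and `_` with a backslash so they match literally.
fn escape_like(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        if matches!(ch, '\\' | '%' | '_') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Appends the `--contains` predicate and its bound parameter when the filter is set.
fn push_contains_filter(
    contains: Option<&str>,
    where_clauses: &mut Vec<&'static str>,
    params: &mut Vec<String>,
) {
    if let Some(needle) = contains {
        // The ESCAPE clause must name the same character escape_like() uses.
        where_clauses.push("n.body LIKE ? ESCAPE '\\'");
        params.push(format!("%{}%", escape_like(needle)));
    }
}
```

SQLite's `LIKE` is already case-insensitive for ASCII characters by default, so the sketch leaves the collation out; keep it if you want the intent to be explicit.
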
8. Reduce note-document embedding noise by slimming metadata header
Why: current verbose key-value header repeats low-signal tokens and consumes embedding budget. Keep context, but bias tokens toward actual review text.
```diff
@@ Work Chunk 2C: Note Document Extractor
- Build content with structured metadata header:
- [[Note]]
- source_type: note
- note_gitlab_id: ...
- project: ...
- ...
- --- Body ---
- {body}
+ Build content with compact, high-signal layout:
+ [[Note]]
+ @{author} on {Issue#|MR!}{iid} in {project_path}
+ path: {path:line} (only when available)
+ state: {resolved|unresolved} (only when resolvable)
+
+ {body}
+
+Keep detailed metadata in structured document columns/labels/paths/url,
+not repeated in verbose text.
```

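
A minimal sketch of a renderer for the compact layout, to show how the optional lines drop out. The `NoteDocInput` struct, the `Parent` enum, and `render_note_document` are hypothetical names chosen for illustration; the extractor in your codebase would pull these values from its own row types.

```rust
/// Hypothetical parent reference for a note.
enum Parent {
    Issue(i64),
    MergeRequest(i64),
}

/// Hypothetical input for rendering one note document.
struct NoteDocInput<'a> {
    author: &'a str,
    parent: Parent,
    project_path: &'a str,
    /// File path and optional line, present only for diff notes.
    path: Option<(&'a str, Option<u32>)>,
    /// `None` when the note is not resolvable.
    resolved: Option<bool>,
    body: &'a str,
}

/// Renders the compact header followed by a blank line and the note body.
fn render_note_document(n: &NoteDocInput<'_>) -> String {
    let parent = match &n.parent {
        Parent::Issue(iid) => format!("Issue#{iid}"),
        Parent::MergeRequest(iid) => format!("MR!{iid}"),
    };
    let mut out = String::from("[[Note]]\n");
    out.push_str(&format!("@{} on {} in {}\n", n.author, parent, n.project_path));
    if let Some((path, line)) = n.path {
        match line {
            Some(l) => out.push_str(&format!("path: {path}:{l}\n")),
            None => out.push_str(&format!("path: {path}\n")),
        }
    }
    if let Some(resolved) = n.resolved {
        out.push_str(if resolved { "state: resolved\n" } else { "state: unresolved\n" });
    }
    out.push('\n');
    out.push_str(n.body);
    out
}
```

A plain issue comment renders as three header-plus-body lines, while a diff note on an MR adds only the `path:` and `state:` lines, so most of the embedded tokens come from the review text itself.
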
9. Add explicit performance regression checks for the new hot paths
Why: this feature increases document volume ~4x; you should pin acceptable query behavior now so future changes don’t silently degrade.
```diff
@@ Verification Checklist
+Performance/plan checks:
+1) `EXPLAIN QUERY PLAN` for:
+   - author+since query
+   - project+date query
+   - for-mr / for-issue query
+2) Seed 50k-note synthetic fixture and assert:
+   - `lore notes --author ... --limit 100` stays under agreed local threshold
+   - `lore search --type note ...` remains deterministic and completes successfully
```

If you want, I can also provide a fully merged “iteration 3” PRD text with these edits applied end-to-end so you can drop it in directly.