These are the highest-impact revisions I’d make. They avoid everything in your `## Rejected Recommendations` list.

1. Add immediate note-document deletion propagation (don’t wait for `generate-docs --full`)

Why: right now, deleted notes can leave stale `source_type='note'` documents until a full rebuild. That creates incorrect search/reporting results and weakens trust in the dataset.

```diff
@@ Phase 0: Stable Note Identity
+### Work Chunk 0B: Immediate Deletion Propagation
+
+When sweep deletes stale notes, propagate the deletion to documents in the same transaction.
+Do not rely on eventual cleanup via `generate-docs --full`.
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_sweep_deletes_note_documents_immediately() { ... }
+#[test]
+fn test_mr_note_sweep_deletes_note_documents_immediately() { ... }
+
+#### Implementation
+Use `DELETE ... RETURNING id, is_system` in the note sweep functions.
+For each returned non-system note id:
+1) `DELETE FROM documents WHERE source_type='note' AND source_id=?`
+2) `DELETE FROM dirty_sources WHERE source_type='note' AND source_id=?`
```

2. Add a one-time upgrade backfill for existing notes (migration 024)

Why: existing DBs will otherwise only get note-documents for changed/new notes. Historical notes remain invisible unless users manually run a full rebuild.

```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)
+
+Create migration `024_note_dirty_backfill.sql`:
+INSERT INTO dirty_sources (source_type, source_id, queued_at)
+SELECT 'note', n.id, unixepoch('now') * 1000
+FROM notes n
+LEFT JOIN documents d
+  ON d.source_type='note' AND d.source_id=n.id
+WHERE n.is_system=0 AND d.id IS NULL
+ON CONFLICT(source_type, source_id) DO NOTHING;
+
+Add a migration test asserting idempotence and the expected queue size.
```
3. Fix `--since/--until` semantics and validation

Why: reusing `parse_since` for `until` creates ambiguous windows and off-by-boundary behavior; your own example `--since 90d --until 180d` is chronologically reversed.

```diff
@@ Work Chunk 1A: Data Types & Query Layer
- since: parse_since(since_str) then n.created_at >= ?
- until: parse_since(until_str) then n.created_at <= ?
+ since: parse_since_start_bound(since_str) then n.created_at >= ?
+ until: parse_until_end_bound(until_str) then n.created_at <= ?
+ Validate since <= until; otherwise return a clear user error.
+
+#### Tests to Write First
+#[test] fn test_query_notes_invalid_time_window_rejected() { ... }
+#[test] fn test_query_notes_until_date_is_end_of_day_inclusive() { ... }
```

4. Separate semantic-change detection from housekeeping updates

Why: the currently proposed `WHERE` clause includes `updated_at`, which will cause unnecessary dirty churn. You want `last_seen_at` to always refresh, but regeneration only when the searchable semantics changed.

```diff
@@ Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
- OR notes.updated_at IS NOT excluded.updated_at
+ -- updated_at-only changes should not mark semantic dirty
+
+Perform two-step logic:
+1) The upsert always updates persistence/housekeeping fields (`updated_at`, `last_seen_at`).
+2) `changed_semantics` is computed only from fields used by note documents/search filters
+   (body, note_type, resolved flags, paths, author, parent linkage).
+
+#### Tests to Write First
+#[test]
+fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() { ... }
```

5. Make indexes align with actual query collation and join strategy

Why: `author` is filtered with `COLLATE NOCASE`; without a collation-aware index, SQLite can skip index use. Also, IID filters via scalar subqueries are harder for the planner to optimize than direct join predicates.

```diff
@@ Work Chunk 1E: Composite Query Index
-CREATE INDEX ... ON notes(project_id, author_username, created_at DESC, id DESC) WHERE is_system = 0;
+CREATE INDEX ... ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC) WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_discussions_issue_id ON discussions(issue_id);
+CREATE INDEX IF NOT EXISTS idx_discussions_mr_id ON discussions(merge_request_id);
```

```diff
@@ Work Chunk 1A: query_notes()
- d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)
+ i.iid = ? AND i.project_id = ?
- d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?)
+ m.iid = ? AND m.project_id = ?
```

6. Replace manual CSV escaping with the `csv` crate

Why: manual RFC 4180 escaping is fragile (quotes/newlines/multi-byte edge cases). This is exactly where a mature library reduces long-term bug risk.

```diff
@@ Work Chunk 1C: Human & Robot Output Formatting
- Uses a minimal CSV writer (no external dependency — the format is simple enough for manual escaping).
+ Uses `csv::Writer` for RFC 4180-compliant escaping and stable output across edge cases.
+
+#### Tests to Write First
+#[test] fn test_csv_output_multiline_and_quotes_roundtrip() { ... }
```

7. Add a `--contains` lexical body filter to `lore notes`

Why: a useful middle ground between metadata filtering and semantic search; great for reviewer-pattern mining without requiring FTS query syntax.

```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
+/// Filter by case-insensitive substring in note body
+#[arg(long, help_heading = "Filters")]
+pub contains: Option<String>,
```

```diff
@@ Work Chunk 1A: NoteListFilters
+ pub contains: Option<&'a str>,
@@ query_notes dynamic filters
+ // SQLite LIKE is already case-insensitive for ASCII; declare ESCAPE so the
+ // backslashes produced by escape_like() are actually honored as escapes.
+ if let Some(contains) = filters.contains {
+     where_clauses.push("n.body LIKE ? ESCAPE '\\'");
+     params.push(format!("%{}%", escape_like(contains)));
+ }
```

8. Reduce note-document embedding noise by slimming the metadata header

Why: the current verbose key-value header repeats low-signal tokens and consumes embedding budget.
Keep context, but bias tokens toward actual review text.

```diff
@@ Work Chunk 2C: Note Document Extractor
- Build content with structured metadata header:
- [[Note]]
- source_type: note
- note_gitlab_id: ...
- project: ...
- ...
- --- Body ---
- {body}
+ Build content with a compact, high-signal layout:
+ [[Note]]
+ @{author} on {Issue#|MR!}{iid} in {project_path}
+ path: {path:line} (only when available)
+ state: {resolved|unresolved} (only when resolvable)
+
+ {body}
+
+Keep detailed metadata in structured document columns/labels/paths/url,
+not repeated in the verbose text.
```

9. Add explicit performance regression checks for the new hot paths

Why: this feature increases document volume roughly 4x; you should pin acceptable query behavior now so future changes don’t silently degrade it.

```diff
@@ Verification Checklist
+Performance/plan checks:
+1) `EXPLAIN QUERY PLAN` for:
+   - the author+since query
+   - the project+date query
+   - the for-mr / for-issue query
+2) Seed a 50k-note synthetic fixture and assert:
+   - `lore notes --author ... --limit 100` stays under the agreed local threshold
+   - `lore search --type note ...` remains deterministic and completes successfully
```

If you want, I can also provide a fully merged “iteration 3” PRD text with these edits applied end-to-end so you can drop it in directly.
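One loose end worth pinning down: recommendation 7 calls `escape_like()`, which the PRD would still need to define. A minimal sketch of what I have in mind (the helper name and the backslash convention are assumptions; it only works if the query declares `ESCAPE '\'` on the LIKE clause):

```rust
/// Hypothetical helper: escape SQL LIKE wildcards so user input is matched
/// literally. Pair with `... LIKE ? ESCAPE '\'` in the SQL, since SQLite has
/// no default escape character for LIKE.
fn escape_like(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        // '%' and '_' are LIKE wildcards; '\' is our chosen escape character,
        // so it must be escaped as well.
        if matches!(c, '%' | '_' | '\\') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

fn main() {
    // Wildcards become literals; ordinary text passes through unchanged.
    assert_eq!(escape_like("100%_done"), r"100\%\_done");
    assert_eq!(escape_like(r"a\b"), r"a\\b");
    assert_eq!(escape_like("plain"), "plain");
    println!("{}", escape_like("50% off"));
}
```

Without the `ESCAPE` clause, the backslashes this emits are matched as literal characters rather than escapes, which is exactly the kind of silent filter bug the tests in chunk 1A should catch.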