docs: add per-note search PRD and user journey documentation

Per-note search PRD: Comprehensive product requirements for evolving the search system from document-level to note-level granularity. Includes 6 rounds of iterative feedback refining scope, ranking strategy, migration path, and robot mode integration. User journeys: Detailed walkthrough of 8 primary user workflows covering issue triage, MR review lookup, code archaeology, expert discovery, sync pipeline operation, and agent integration patterns.
2026-02-11 16:00:34 -05:00
parent cd25cf61ca
commit 125938fba6
8 changed files with 4103 additions and 0 deletions
--- a/docs/prd-per-note-search.feedback-5.md
+++ b/docs/prd-per-note-search.feedback-5.md
@@ -0,0 +1,190 @@
+Here are the highest-impact revisions I’d make. None of these repeat anything in your `## Rejected Recommendations`.
+
+1. **Add immutable reviewer identity (`author_id`) as a first-class key**
+Why this improves the plan: the PRD’s core use case is year-scale reviewer profiling. Usernames are mutable in GitLab, so username-only filtering will fragment one reviewer into multiple identities over time. Adding `author_id` closes that correctness hole and makes historical analysis reliable.
+
+```diff
+@@ Problem Statement
+-1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread
+1. **Query individual notes by reviewer identity** — support both mutable username and immutable GitLab `author_id` for stable longitudinal analysis
+
+@@ Phase 0: Stable Note Identity
+### Work Chunk 0D: Immutable Author Identity Capture
+**Files:** `migrations/025_notes_author_id.sql`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/cli/commands/list.rs`
+
+#### Implementation
+- Add nullable `notes.author_id INTEGER` and backfill from future syncs.
+- Populate `author_id` from GitLab note payload (`note.author.id`) on both issue and MR note ingestion paths.
+- Add `--author-id <int>` filter to `lore notes`.
+- Keep `--author` for ergonomics; when both provided, require both to match.
+
+#### Indexing
+- Add `idx_notes_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0;`
+
+#### Tests
+- `test_query_notes_filter_author_id_survives_username_change`
+- `test_query_notes_author_and_author_id_intersection`
+```
+
+2. **Strengthen partial-fetch safety from a boolean to an explicit fetch state contract**
+Why this improves the plan: `fetch_complete: bool` is easy to misuse and fragile under retries/crashes. A run-scoped state model makes sweep correctness auditable and prevents accidental deletions when ingestion aborts midway.
+
+```diff
+@@ Phase 0: Stable Note Identity
+-### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
+### Work Chunk 0C: Sweep Safety Guard with Run-Scoped Fetch State
+
+@@ Implementation
+-Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
+Add a run-scoped fetch state:
+- `FetchState::Complete`
+- `FetchState::Partial`
+- `FetchState::Failed`
+
+Only run sweep on `FetchState::Complete`.
+Persist `run_seen_at` once per sync run and pass unchanged through all discussion/note upserts.
+Require `run_seen_at` monotonicity per discussion before sweep (skip and warn otherwise).
+
+@@ Tests to Write First
+#[test]
+fn test_failed_fetch_never_sweeps_even_after_partial_upserts() { ... }
+#[test]
+fn test_non_monotonic_run_seen_at_skips_sweep() { ... }
+#[test]
+fn test_retry_after_failed_fetch_then_complete_sweeps_correctly() { ... }
+```
+
+3. **Add DB-level cleanup triggers for note-document referential integrity**
+Why this improves the plan: Work Chunk 0B handles the sweep path, but not every possible delete path. DB triggers give defense-in-depth so stale note docs cannot survive even if a future code path deletes notes differently.
+
+```diff
+@@ Work Chunk 0B: Immediate Deletion Propagation
+-Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL
+Keep set-based SQL in sweep functions, and add DB-level cleanup triggers as a safety net.
+
+@@ Work Chunk 2A: Schema Migration (023)
+-- Cleanup trigger: deleting a non-system note must delete note document + dirty queue row
+CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
+WHEN old.is_system = 0
+BEGIN
+  DELETE FROM documents
+   WHERE source_type = 'note' AND source_id = old.id;
+  DELETE FROM dirty_sources
+   WHERE source_type = 'note' AND source_id = old.id;
+END;
+
+-- Cleanup trigger: if note flips to system, remove its document artifacts
+CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
+WHEN old.is_system = 0 AND new.is_system = 1
+BEGIN
+  DELETE FROM documents
+   WHERE source_type = 'note' AND source_id = new.id;
+  DELETE FROM dirty_sources
+   WHERE source_type = 'note' AND source_id = new.id;
+END;
+```
+
+4. **Eliminate N+1 extraction cost with parent metadata caching in regeneration**
+Why this improves the plan: backfilling ~8k notes with per-note parent/label lookups creates avoidable query amplification. Batch caching turns repeated joins into one-time lookups per parent entity and materially reduces rebuild time.
+
+```diff
+@@ Phase 2: Per-Note Documents
+### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
+**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
+
+#### Implementation
+- Add `NoteExtractionContext` cache keyed by `(noteable_type, parent_id)` containing:
+  - parent iid/title/url
+  - parent labels
+  - project path
+- In batch regeneration, prefetch parent metadata for note IDs in the current chunk.
+- Use cached metadata in `extract_note_document()` to avoid repeated parent/label queries.
+
+#### Tests
+- `test_note_regeneration_uses_parent_cache_consistently`
+- `test_note_regeneration_cache_hit_preserves_hash_determinism`
+```
+
+5. **Add embedding dedup cache keyed by semantic text hash**
+Why this improves the plan: note docs will contain repeated short comments (“LGTM”, “nit: …”). Current doc-level hashing includes metadata, so identical semantic comments still re-embed many times. A semantic embedding hash cache cuts cost and speeds full rebuild/backfill without changing search behavior.
+
+```diff
+@@ Phase 2: Per-Note Documents
+### Work Chunk 2J: Semantic Embedding Dedup for Notes
+**Files:** `migrations/026_embedding_cache.sql`, embedding pipeline module(s), `src/documents/extractor.rs`
+
+#### Implementation
+- Compute `embedding_text` for notes as: normalized note body + compact stable context (`parent_type`, `path`, `resolution`), excluding volatile fields.
+- Compute `embedding_hash = sha256(embedding_text)`.
+- Before embedding generation, lookup existing vector by `(model, embedding_hash)`.
+- Reuse cached vector when present; only call embedding model on misses.
+
+#### Tests
+- `test_identical_note_bodies_reuse_embedding_vector`
+- `test_embedding_hash_changes_when_semantic_context_changes`
+```
+
+6. **Add deterministic review-signal tags as derived labels**
+Why this improves the plan: this makes output immediately more useful for reviewer-pattern analysis without adding a profile command (which is explicitly out of scope). It increases practical value of both `lore notes` and `lore search --type note` with low complexity.
+
+```diff
+@@ Non-Goals
+-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
+- Adding a "reviewer profile" report command (downstream), while allowing low-level derived signal tags as indexing primitives
+
+@@ Phase 2: Per-Note Documents
+### Work Chunk 2K: Derived Review Signal Labels
+**Files:** `src/documents/extractor.rs`
+
+#### Implementation
+- Derive deterministic labels from note text + metadata:
+  - `signal:nit`
+  - `signal:blocking`
+  - `signal:security`
+  - `signal:performance`
+  - `signal:testing`
+- Attach via existing `document_labels` flow for note documents.
+- No new CLI mode required; existing label filters can consume these labels.
+
+#### Tests
+- `test_note_document_derives_signal_labels_nit`
+- `test_note_document_derives_signal_labels_security`
+- `test_signal_label_derivation_is_deterministic`
+```
+
+7. **Add high-precision note targeting filters (`--note-id`, `--gitlab-note-id`, `--discussion-id`)**
+Why this improves the plan: debugging, incident response, and reproducibility all benefit from exact addressing. This is especially useful when validating sync correctness and cross-checking a specific note/document lifecycle.
+
+```diff
+@@ Work Chunk 1B: CLI Arguments & Command Wiring
+ pub struct NotesArgs {
+    /// Filter by local note row id
+    #[arg(long = "note-id", help_heading = "Filters")]
+    pub note_id: Option<i64>,
+
+    /// Filter by GitLab note id
+    #[arg(long = "gitlab-note-id", help_heading = "Filters")]
+    pub gitlab_note_id: Option<i64>,
+
+    /// Filter by local discussion id
+    #[arg(long = "discussion-id", help_heading = "Filters")]
+    pub discussion_id: Option<i64>,
+ }
+
+@@ Work Chunk 1A: Filter struct
+ pub struct NoteListFilters<'a> {
+    pub note_id: Option<i64>,
+    pub gitlab_note_id: Option<i64>,
+    pub discussion_id: Option<i64>,
+ }
+
+@@ Tests to Write First
+#[test]
+fn test_query_notes_filter_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_gitlab_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_discussion_id_exact() { ... }
+```
+
+If you want, I can produce a single consolidated “iteration 5” PRD diff that merges these into your exact section ordering and updates the dependency graph/migration numbering end-to-end.