docs: add per-note search PRD and user journey documentation

Per-note search PRD: Comprehensive product requirements for evolving the search system from document-level to note-level granularity. Includes 6 rounds of iterative feedback refining scope, ranking strategy, migration path, and robot mode integration. User journeys: Detailed walkthrough of 8 primary user workflows covering issue triage, MR review lookup, code archaeology, expert discovery, sync pipeline operation, and agent integration patterns.
2026-02-11 16:00:34 -05:00
parent cd25cf61ca
commit 125938fba6
8 changed files with 4103 additions and 0 deletions
--- a/docs/prd-per-note-search.feedback-2.md
+++ b/docs/prd-per-note-search.feedback-2.md
@@ -0,0 +1,200 @@
+Below are the strongest revisions I’d make, excluding everything in your `## Rejected Recommendations` list.
+
+1. **Add a Phase 0 for stable note identity before any note-doc generation**
+Rationale: your current plan still allows note document churn because Issue discussion ingestion is delete/reinsert-based. That makes local `notes.id` unstable, causing unnecessary dirtying/regeneration and potential stale-doc edge cases. Stabilizing identity first (upsert-by-GitLab-ID + sweep stale) improves correctness and cuts repeated work.
+
+```diff
+@@ ## Design
+-Two phases, shipped together as one feature:
+Three phases, shipped together as one feature:
+- **Phase 0 (Foundation):** Stable note identity in local DB (upsert + sweep, no delete/reinsert churn)
+ - **Phase 1 (Option A):** `lore notes` command — direct SQL query over the `notes` table with rich filtering
+ - **Phase 2 (Option B):** Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline
+@@
+## Phase 0: Stable Note Identity
+
+### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
+**Files:** `src/ingestion/discussions.rs`, `migrations/022_notes_identity_index.sql`, `src/core/db.rs`
+**Implementation:**
+- Add unique index: `UNIQUE(project_id, gitlab_id)` on `notes`
+- Replace delete/reinsert issue-note flow with upsert + `last_seen_at` sweep (same durability model as MR note sweep)
+- Ensure `insert_note/upsert_note` returns the stable local row id for both insert and update paths
+```
+
+2. **Replace `source_type` CHECK constraints with a registry table + FK in migration**
+Rationale: table CHECKs force full table rebuild for every new source type forever. A `source_types` table with FK keeps DB-level integrity and future extensibility without rebuilding `documents`/`dirty_sources` every time. This is a major architecture hardening win.
+
+```diff
+@@ ### Work Chunk 2A: Schema Migration (023)
+-Current migration ... CHECK constraints limiting `source_type` ...
+Current migration ... CHECK constraints limiting `source_type` ...
+Revision: migrate to `source_types` registry table + FK constraints.
+@@
+-1. `dirty_sources` — add `'note'` to source_type CHECK
+-2. `documents` — add `'note'` to source_type CHECK
+1. Create `source_types(name TEXT PRIMARY KEY)` and seed: `issue, merge_request, discussion, note`
+2. Rebuild `dirty_sources` and `documents` to replace CHECK with `REFERENCES source_types(name)`
+3. Future source-type additions become `INSERT INTO source_types(name) VALUES (?)` (no table rebuild)
+@@
+#### Additional integrity tests
+#[test]
+fn test_source_types_registry_contains_note() { ... }
+#[test]
+fn test_documents_source_type_fk_enforced() { ... }
+#[test]
+fn test_dirty_sources_source_type_fk_enforced() { ... }
+```
+
+3. **Mark note documents dirty only when note semantics actually changed**
+Rationale: current loops mark every non-system note dirty every sync. With 8k+ notes this creates avoidable queue pressure and regeneration time. Change-aware dirtying (inserted/changed only) gives major performance and stability improvements.
+
+```diff
+@@ ### Work Chunk 2D: Regenerator & Dirty Tracking Integration
+-for note in notes {
+-    let local_note_id = insert_note(&tx, local_discussion_id, &note, None)?;
+-    if !note.is_system {
+-        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, local_note_id)?;
+-    }
+-}
+for note in notes {
+    let outcome = upsert_note(&tx, local_discussion_id, &note, None)?;
+    if !note.is_system && outcome.changed_semantics {
+        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
+    }
+}
+@@
+// changed_semantics should include: body, note_type, path/line positions, resolvable/resolved/resolved_by, updated_at
+```
+
+4. **Expand filters to support real analysis windows and resolution state**
+Rationale: reviewer profiling usually needs bounded windows and both resolved/unresolved views. Current `unresolved: bool` is too narrow and one-sided. Add `--until` and tri-state resolution filtering for better analytical power.
+
+```diff
+@@ pub struct NoteListFilters<'a> {
+-    pub since: Option<&'a str>,
+    pub since: Option<&'a str>,
+    pub until: Option<&'a str>,
+@@
+-    pub unresolved: bool,
+    pub resolution: &'a str, // "any" (default) | "unresolved" | "resolved"
+@@
+-    pub author: Option<&'a str>,
+    pub author: Option<&'a str>, // case-insensitive match
+@@
+-    // Filter by time (7d, 2w, 1m, or YYYY-MM-DD)
+    // Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
+     pub since: Option<String>,
+    /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
+    #[arg(long, help_heading = "Filters")]
+    pub until: Option<String>,
+@@
+-    /// Only show unresolved review comments
+-    pub unresolved: bool,
+    /// Resolution filter: any, unresolved, resolved
+    #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
+    pub resolution: String,
+```
+
+5. **Broaden index strategy to match actual query shapes, not just author queries**
+Rationale: `idx_notes_user_created` helps one path, but common usage also includes project+time scans and unresolved filters. Add two more partial composites for high-selectivity paths.
+
+```diff
+@@ ### Work Chunk 1E: Composite Query Index
+ CREATE INDEX IF NOT EXISTS idx_notes_user_created
+ ON notes(project_id, author_username, created_at DESC, id DESC)
+ WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_unresolved_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0 AND resolvable = 1 AND resolved = 0;
+@@
+#[test]
+fn test_notes_query_plan_uses_project_created_index_for_default_listing() { ... }
+#[test]
+fn test_notes_query_plan_uses_unresolved_index_when_resolution_unresolved() { ... }
+```
+
+6. **Improve per-note document payload with structured metadata header + minimal thread context**
+Rationale: isolated single-note docs can lose meaning. A small structured header plus lightweight context (parent + one preceding note excerpt) improves semantic retrieval quality substantially without re-bundling full threads.
+
+```diff
+@@ ### Work Chunk 2C: Note Document Extractor
+-// 6. Format content:
+-//    [[Note]] {note_type or "Comment"} on {parent_type_prefix}: {parent_title}
+-//    Project: {path_with_namespace}
+-//    URL: {url}
+-//    Author: @{author}
+-//    Date: {format_date(created_at)}
+-//    Labels: {labels_json}
+-//    File: {position_new_path}:{position_new_line}  (if DiffNote)
+-//
+-//    --- Body ---
+-//
+-//    {body}
+// 6. Format content with machine-readable header:
+//    [[Note]]
+//    source_type: note
+//    note_gitlab_id: {gitlab_id}
+//    project: {path_with_namespace}
+//    parent_type: {Issue|MergeRequest}
+//    parent_iid: {iid}
+//    note_type: {DiffNote|DiscussionNote|Comment}
+//    author: @{author}
+//    created_at: {iso8601}
+//    resolved: {true|false}
+//    path: {position_new_path}:{position_new_line}
+//    url: {url}
+//
+//    --- Context ---
+//    parent_title: {title}
+//    previous_note_excerpt: {optional, max 200 chars}
+//
+//    --- Body ---
+//    {body}
+```
+
+7. **Add first-class export modes for downstream profiling pipelines**
+Rationale: this makes the feature much more useful immediately (LLM prompts, notebook analysis, external scripts) without adding a profiling command. It stays within your non-goals and increases adoption.
+
+```diff
+@@ pub struct NotesArgs {
+    /// Output format
+    #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
+    pub format: String,
+@@
+-    if robot_mode {
+    if robot_mode || args.format == "json" || args.format == "jsonl" || args.format == "csv" {
+         print_list_notes_json(...)
+     } else {
+         print_list_notes(&result);
+     }
+@@ ### Work Chunk 1C: Human & Robot Output Formatting
+Add `print_list_notes_csv()` and `print_list_notes_jsonl()`:
+- CSV columns mirror `NoteListRowJson` field names
+- JSONL emits one note object per line for streaming pipelines
+```
+
+8. **Strengthen verification with idempotence + migration data-preservation checks**
+Rationale: this feature touches ingestion, migrations, indexing, and regeneration. Add explicit idempotence/perf checks so regressions surface early.
+
+```diff
+@@ ## Verification Checklist
+ cargo test
+ cargo clippy --all-targets -- -D warnings
+ cargo fmt --check
+cargo test test_note_ingestion_idempotent_across_two_syncs
+cargo test test_note_document_count_stable_after_second_generate_docs_full
+@@
+lore sync
+lore generate-docs --full
+lore -J stats > /tmp/stats1.json
+lore generate-docs --full
+lore -J stats > /tmp/stats2.json
+# assert note doc count unchanged and dirty queue drains to zero
+```
+
+If you want, I can turn this into a fully rewritten PRD v2 draft with these changes merged in-place and renumbered work chunks end-to-end.