docs: add per-note search PRD and user journey documentation

Per-note search PRD: Comprehensive product requirements for evolving
the search system from document-level to note-level granularity.
Includes 6 rounds of iterative feedback refining scope, ranking
strategy, migration path, and robot mode integration.

User journeys: Detailed walkthrough of 8 primary user workflows
covering issue triage, MR review lookup, code archaeology, expert
discovery, sync pipeline operation, and agent integration patterns.
This commit is contained in:
teernisse
2026-02-11 16:00:34 -05:00
parent cd25cf61ca
commit 125938fba6
8 changed files with 4103 additions and 0 deletions

View File

@@ -0,0 +1,200 @@
Below are the strongest revisions Id make, excluding everything in your `## Rejected Recommendations` list.
1. **Add a Phase 0 for stable note identity before any note-doc generation**
Rationale: your current plan still allows note document churn because Issue discussion ingestion is delete/reinsert-based. That makes local `notes.id` unstable, causing unnecessary dirtying/regeneration and potential stale-doc edge cases. Stabilizing identity first (upsert-by-GitLab-ID + sweep stale) improves correctness and cuts repeated work.
```diff
@@ ## Design
-Two phases, shipped together as one feature:
+Three phases, shipped together as one feature:
+- **Phase 0 (Foundation):** Stable note identity in local DB (upsert + sweep, no delete/reinsert churn)
- **Phase 1 (Option A):** `lore notes` command — direct SQL query over the `notes` table with rich filtering
- **Phase 2 (Option B):** Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline
@@
+## Phase 0: Stable Note Identity
+
+### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
+**Files:** `src/ingestion/discussions.rs`, `migrations/022_notes_identity_index.sql`, `src/core/db.rs`
+**Implementation:**
+- Add unique index: `UNIQUE(project_id, gitlab_id)` on `notes`
+- Replace delete/reinsert issue-note flow with upsert + `last_seen_at` sweep (same durability model as MR note sweep)
+- Ensure `insert_note/upsert_note` returns the stable local row id for both insert and update paths
```
2. **Replace `source_type` CHECK constraints with a registry table + FK in migration**
Rationale: table CHECKs force full table rebuild for every new source type forever. A `source_types` table with FK keeps DB-level integrity and future extensibility without rebuilding `documents`/`dirty_sources` every time. This is a major architecture hardening win.
```diff
@@ ### Work Chunk 2A: Schema Migration (023)
-Current migration ... CHECK constraints limiting `source_type` ...
+Current migration ... CHECK constraints limiting `source_type` ...
+Revision: migrate to `source_types` registry table + FK constraints.
@@
-1. `dirty_sources` — add `'note'` to source_type CHECK
-2. `documents` — add `'note'` to source_type CHECK
+1. Create `source_types(name TEXT PRIMARY KEY)` and seed: `issue, merge_request, discussion, note`
+2. Rebuild `dirty_sources` and `documents` to replace CHECK with `REFERENCES source_types(name)`
+3. Future source-type additions become `INSERT INTO source_types(name) VALUES (?)` (no table rebuild)
@@
+#### Additional integrity tests
+#[test]
+fn test_source_types_registry_contains_note() { ... }
+#[test]
+fn test_documents_source_type_fk_enforced() { ... }
+#[test]
+fn test_dirty_sources_source_type_fk_enforced() { ... }
```
3. **Mark note documents dirty only when note semantics actually changed**
Rationale: current loops mark every non-system note dirty every sync. With 8k+ notes this creates avoidable queue pressure and regeneration time. Change-aware dirtying (inserted/changed only) gives major performance and stability improvements.
```diff
@@ ### Work Chunk 2D: Regenerator & Dirty Tracking Integration
-for note in notes {
- let local_note_id = insert_note(&tx, local_discussion_id, &note, None)?;
- if !note.is_system {
- dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, local_note_id)?;
- }
-}
+for note in notes {
+ let outcome = upsert_note(&tx, local_discussion_id, &note, None)?;
+ if !note.is_system && outcome.changed_semantics {
+ dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
+ }
+}
@@
+// changed_semantics should include: body, note_type, path/line positions, resolvable/resolved/resolved_by, updated_at
```
4. **Expand filters to support real analysis windows and resolution state**
Rationale: reviewer profiling usually needs bounded windows and both resolved/unresolved views. Current `unresolved: bool` is too narrow and one-sided. Add `--until` and tri-state resolution filtering for better analytical power.
```diff
@@ pub struct NoteListFilters<'a> {
- pub since: Option<&'a str>,
+ pub since: Option<&'a str>,
+ pub until: Option<&'a str>,
@@
- pub unresolved: bool,
+ pub resolution: &'a str, // "any" (default) | "unresolved" | "resolved"
@@
- pub author: Option<&'a str>,
+ pub author: Option<&'a str>, // case-insensitive match
@@
- // Filter by time (7d, 2w, 1m, or YYYY-MM-DD)
+ // Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
pub since: Option<String>,
+ /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
+ #[arg(long, help_heading = "Filters")]
+ pub until: Option<String>,
@@
- /// Only show unresolved review comments
- pub unresolved: bool,
+ /// Resolution filter: any, unresolved, resolved
+ #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
+ pub resolution: String,
```
5. **Broaden index strategy to match actual query shapes, not just author queries**
Rationale: `idx_notes_user_created` helps one path, but common usage also includes project+time scans and unresolved filters. Add two more partial composites for high-selectivity paths.
```diff
@@ ### Work Chunk 1E: Composite Query Index
CREATE INDEX IF NOT EXISTS idx_notes_user_created
ON notes(project_id, author_username, created_at DESC, id DESC)
WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_unresolved_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0 AND resolvable = 1 AND resolved = 0;
@@
+#[test]
+fn test_notes_query_plan_uses_project_created_index_for_default_listing() { ... }
+#[test]
+fn test_notes_query_plan_uses_unresolved_index_when_resolution_unresolved() { ... }
```
6. **Improve per-note document payload with structured metadata header + minimal thread context**
Rationale: isolated single-note docs can lose meaning. A small structured header plus lightweight context (parent + one preceding note excerpt) improves semantic retrieval quality substantially without re-bundling full threads.
```diff
@@ ### Work Chunk 2C: Note Document Extractor
-// 6. Format content:
-// [[Note]] {note_type or "Comment"} on {parent_type_prefix}: {parent_title}
-// Project: {path_with_namespace}
-// URL: {url}
-// Author: @{author}
-// Date: {format_date(created_at)}
-// Labels: {labels_json}
-// File: {position_new_path}:{position_new_line} (if DiffNote)
-//
-// --- Body ---
-//
-// {body}
+// 6. Format content with machine-readable header:
+// [[Note]]
+// source_type: note
+// note_gitlab_id: {gitlab_id}
+// project: {path_with_namespace}
+// parent_type: {Issue|MergeRequest}
+// parent_iid: {iid}
+// note_type: {DiffNote|DiscussionNote|Comment}
+// author: @{author}
+// created_at: {iso8601}
+// resolved: {true|false}
+// path: {position_new_path}:{position_new_line}
+// url: {url}
+//
+// --- Context ---
+// parent_title: {title}
+// previous_note_excerpt: {optional, max 200 chars}
+//
+// --- Body ---
+// {body}
```
7. **Add first-class export modes for downstream profiling pipelines**
Rationale: this makes the feature much more useful immediately (LLM prompts, notebook analysis, external scripts) without adding a profiling command. It stays within your non-goals and increases adoption.
```diff
@@ pub struct NotesArgs {
+ /// Output format
+ #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
+ pub format: String,
@@
- if robot_mode {
+ if robot_mode || args.format == "json" || args.format == "jsonl" || args.format == "csv" {
print_list_notes_json(...)
} else {
print_list_notes(&result);
}
@@ ### Work Chunk 1C: Human & Robot Output Formatting
+Add `print_list_notes_csv()` and `print_list_notes_jsonl()`:
+- CSV columns mirror `NoteListRowJson` field names
+- JSONL emits one note object per line for streaming pipelines
```
8. **Strengthen verification with idempotence + migration data-preservation checks**
Rationale: this feature touches ingestion, migrations, indexing, and regeneration. Add explicit idempotence/perf checks so regressions surface early.
```diff
@@ ## Verification Checklist
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
+cargo test test_note_ingestion_idempotent_across_two_syncs
+cargo test test_note_document_count_stable_after_second_generate_docs_full
@@
+lore sync
+lore generate-docs --full
+lore -J stats > /tmp/stats1.json
+lore generate-docs --full
+lore -J stats > /tmp/stats2.json
+# assert note doc count unchanged and dirty queue drains to zero
```
If you want, I can turn this into a fully rewritten PRD v2 draft with these changes merged in-place and renumbered work chunks end-to-end.