docs: add per-note search PRD and user journey documentation
Per-note search PRD: Comprehensive product requirements for evolving the search system from document-level to note-level granularity. Includes 6 rounds of iterative feedback refining scope, ranking strategy, migration path, and robot mode integration. User journeys: Detailed walkthrough of 8 primary user workflows covering issue triage, MR review lookup, code archaeology, expert discovery, sync pipeline operation, and agent integration patterns.
This commit is contained in:
190
docs/prd-per-note-search.feedback-5.md
Normal file
190
docs/prd-per-note-search.feedback-5.md
Normal file
@@ -0,0 +1,190 @@
|
||||
Here are the highest-impact revisions I’d make. None of these repeat anything in your `## Rejected Recommendations`.
|
||||
|
||||
1. **Add immutable reviewer identity (`author_id`) as a first-class key**
|
||||
Why this improves the plan: the PRD’s core use case is year-scale reviewer profiling. Usernames are mutable in GitLab, so username-only filtering will fragment one reviewer into multiple identities over time. Adding `author_id` closes that correctness hole and makes historical analysis reliable.
|
||||
|
||||
```diff
|
||||
@@ Problem Statement
|
||||
-1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread
|
||||
+1. **Query individual notes by reviewer identity** — support both mutable username and immutable GitLab `author_id` for stable longitudinal analysis
|
||||
|
||||
@@ Phase 0: Stable Note Identity
|
||||
+### Work Chunk 0D: Immutable Author Identity Capture
|
||||
+**Files:** `migrations/025_notes_author_id.sql`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/cli/commands/list.rs`
|
||||
+
|
||||
+#### Implementation
|
||||
+- Add nullable `notes.author_id INTEGER` and backfill from future syncs.
|
||||
+- Populate `author_id` from GitLab note payload (`note.author.id`) on both issue and MR note ingestion paths.
|
||||
+- Add `--author-id <int>` filter to `lore notes`.
|
||||
+- Keep `--author` for ergonomics; when both provided, require both to match.
|
||||
+
|
||||
+#### Indexing
|
||||
+- Add `idx_notes_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0;`
|
||||
+
|
||||
+#### Tests
|
||||
+- `test_query_notes_filter_author_id_survives_username_change`
|
||||
+- `test_query_notes_author_and_author_id_intersection`
|
||||
```
|
||||
|
||||
2. **Strengthen partial-fetch safety from a boolean to an explicit fetch state contract**
|
||||
Why this improves the plan: `fetch_complete: bool` is easy to misuse and fragile under retries/crashes. A run-scoped state model makes sweep correctness auditable and prevents accidental deletions when ingestion aborts midway.
|
||||
|
||||
```diff
|
||||
@@ Phase 0: Stable Note Identity
|
||||
-### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
|
||||
+### Work Chunk 0C: Sweep Safety Guard with Run-Scoped Fetch State
|
||||
|
||||
@@ Implementation
|
||||
-Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
|
||||
+Add a run-scoped fetch state:
|
||||
+- `FetchState::Complete`
|
||||
+- `FetchState::Partial`
|
||||
+- `FetchState::Failed`
|
||||
+
|
||||
+Only run sweep on `FetchState::Complete`.
|
||||
+Persist `run_seen_at` once per sync run and pass unchanged through all discussion/note upserts.
|
||||
+Require `run_seen_at` monotonicity per discussion before sweep (skip and warn otherwise).
|
||||
|
||||
@@ Tests to Write First
|
||||
+#[test]
|
||||
+fn test_failed_fetch_never_sweeps_even_after_partial_upserts() { ... }
|
||||
+#[test]
|
||||
+fn test_non_monotonic_run_seen_at_skips_sweep() { ... }
|
||||
+#[test]
|
||||
+fn test_retry_after_failed_fetch_then_complete_sweeps_correctly() { ... }
|
||||
```
|
||||
|
||||
3. **Add DB-level cleanup triggers for note-document referential integrity**
|
||||
Why this improves the plan: Work Chunk 0B handles the sweep path, but not every possible delete path. DB triggers give defense-in-depth so stale note docs cannot survive even if a future code path deletes notes differently.
|
||||
|
||||
```diff
|
||||
@@ Work Chunk 0B: Immediate Deletion Propagation
|
||||
-Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL
|
||||
+Keep set-based SQL in sweep functions, and add DB-level cleanup triggers as a safety net.
|
||||
|
||||
@@ Work Chunk 2A: Schema Migration (023)
|
||||
+-- Cleanup trigger: deleting a non-system note must delete note document + dirty queue row
|
||||
+CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
|
||||
+WHEN old.is_system = 0
|
||||
+BEGIN
|
||||
+ DELETE FROM documents
|
||||
+ WHERE source_type = 'note' AND source_id = old.id;
|
||||
+ DELETE FROM dirty_sources
|
||||
+ WHERE source_type = 'note' AND source_id = old.id;
|
||||
+END;
|
||||
+
|
||||
+-- Cleanup trigger: if note flips to system, remove its document artifacts
|
||||
+CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
|
||||
+WHEN old.is_system = 0 AND new.is_system = 1
|
||||
+BEGIN
|
||||
+ DELETE FROM documents
|
||||
+ WHERE source_type = 'note' AND source_id = new.id;
|
||||
+ DELETE FROM dirty_sources
|
||||
+ WHERE source_type = 'note' AND source_id = new.id;
|
||||
+END;
|
||||
```
|
||||
|
||||
4. **Eliminate N+1 extraction cost with parent metadata caching in regeneration**
|
||||
Why this improves the plan: backfilling ~8k notes with per-note parent/label lookups creates avoidable query amplification. Batch caching turns repeated joins into one-time lookups per parent entity and materially reduces rebuild time.
|
||||
|
||||
```diff
|
||||
@@ Phase 2: Per-Note Documents
|
||||
+### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
|
||||
+**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
|
||||
+
|
||||
+#### Implementation
|
||||
+- Add `NoteExtractionContext` cache keyed by `(noteable_type, parent_id)` containing:
|
||||
+ - parent iid/title/url
|
||||
+ - parent labels
|
||||
+ - project path
|
||||
+- In batch regeneration, prefetch parent metadata for note IDs in the current chunk.
|
||||
+- Use cached metadata in `extract_note_document()` to avoid repeated parent/label queries.
|
||||
+
|
||||
+#### Tests
|
||||
+- `test_note_regeneration_uses_parent_cache_consistently`
|
||||
+- `test_note_regeneration_cache_hit_preserves_hash_determinism`
|
||||
```
|
||||
|
||||
5. **Add embedding dedup cache keyed by semantic text hash**
|
||||
Why this improves the plan: note docs will contain repeated short comments (“LGTM”, “nit: …”). Current doc-level hashing includes metadata, so identical semantic comments still re-embed many times. A semantic embedding hash cache cuts cost and speeds full rebuild/backfill without changing search behavior.
|
||||
|
||||
```diff
|
||||
@@ Phase 2: Per-Note Documents
|
||||
+### Work Chunk 2J: Semantic Embedding Dedup for Notes
|
||||
+**Files:** `migrations/026_embedding_cache.sql`, embedding pipeline module(s), `src/documents/extractor.rs`
|
||||
+
|
||||
+#### Implementation
|
||||
+- Compute `embedding_text` for notes as: normalized note body + compact stable context (`parent_type`, `path`, `resolution`), excluding volatile fields.
|
||||
+- Compute `embedding_hash = sha256(embedding_text)`.
|
||||
+- Before embedding generation, lookup existing vector by `(model, embedding_hash)`.
|
||||
+- Reuse cached vector when present; only call embedding model on misses.
|
||||
+
|
||||
+#### Tests
|
||||
+- `test_identical_note_bodies_reuse_embedding_vector`
|
||||
+- `test_embedding_hash_changes_when_semantic_context_changes`
|
||||
```
|
||||
|
||||
6. **Add deterministic review-signal tags as derived labels**
|
||||
Why this improves the plan: this makes output immediately more useful for reviewer-pattern analysis without adding a profile command (which is explicitly out of scope). It increases practical value of both `lore notes` and `lore search --type note` with low complexity.
|
||||
|
||||
```diff
|
||||
@@ Non-Goals
|
||||
-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
|
||||
+- Adding a "reviewer profile" report command (downstream), while allowing low-level derived signal tags as indexing primitives
|
||||
|
||||
@@ Phase 2: Per-Note Documents
|
||||
+### Work Chunk 2K: Derived Review Signal Labels
|
||||
+**Files:** `src/documents/extractor.rs`
|
||||
+
|
||||
+#### Implementation
|
||||
+- Derive deterministic labels from note text + metadata:
|
||||
+ - `signal:nit`
|
||||
+ - `signal:blocking`
|
||||
+ - `signal:security`
|
||||
+ - `signal:performance`
|
||||
+ - `signal:testing`
|
||||
+- Attach via existing `document_labels` flow for note documents.
|
||||
+- No new CLI mode required; existing label filters can consume these labels.
|
||||
+
|
||||
+#### Tests
|
||||
+- `test_note_document_derives_signal_labels_nit`
|
||||
+- `test_note_document_derives_signal_labels_security`
|
||||
+- `test_signal_label_derivation_is_deterministic`
|
||||
```
|
||||
|
||||
7. **Add high-precision note targeting filters (`--note-id`, `--gitlab-note-id`, `--discussion-id`)**
|
||||
Why this improves the plan: debugging, incident response, and reproducibility all benefit from exact addressing. This is especially useful when validating sync correctness and cross-checking a specific note/document lifecycle.
|
||||
|
||||
```diff
|
||||
@@ Work Chunk 1B: CLI Arguments & Command Wiring
|
||||
pub struct NotesArgs {
|
||||
+ /// Filter by local note row id
|
||||
+ #[arg(long = "note-id", help_heading = "Filters")]
|
||||
+ pub note_id: Option<i64>,
|
||||
+
|
||||
+ /// Filter by GitLab note id
|
||||
+ #[arg(long = "gitlab-note-id", help_heading = "Filters")]
|
||||
+ pub gitlab_note_id: Option<i64>,
|
||||
+
|
||||
+ /// Filter by local discussion id
|
||||
+ #[arg(long = "discussion-id", help_heading = "Filters")]
|
||||
+ pub discussion_id: Option<i64>,
|
||||
}
|
||||
|
||||
@@ Work Chunk 1A: Filter struct
|
||||
pub struct NoteListFilters<'a> {
|
||||
+ pub note_id: Option<i64>,
|
||||
+ pub gitlab_note_id: Option<i64>,
|
||||
+ pub discussion_id: Option<i64>,
|
||||
}
|
||||
|
||||
@@ Tests to Write First
|
||||
+#[test]
|
||||
+fn test_query_notes_filter_note_id_exact() { ... }
|
||||
+#[test]
|
||||
+fn test_query_notes_filter_gitlab_note_id_exact() { ... }
|
||||
+#[test]
|
||||
+fn test_query_notes_filter_discussion_id_exact() { ... }
|
||||
```
|
||||
|
||||
If you want, I can produce a single consolidated “iteration 5” PRD diff that merges these into your exact section ordering and updates the dependency graph/migration numbering end-to-end.
|
||||
Reference in New Issue
Block a user