Per-note search PRD: Comprehensive product requirements for evolving the search system from document-level to note-level granularity. Includes 6 rounds of iterative feedback refining scope, ranking strategy, migration path, and robot mode integration. User journeys: Detailed walkthrough of 8 primary user workflows covering issue triage, MR review lookup, code archaeology, expert discovery, sync pipeline operation, and agent integration patterns.
190 lines
8.8 KiB
Markdown
Here are the highest-impact revisions I’d make. None of these repeat anything in your `## Rejected Recommendations`.

1. **Add immutable reviewer identity (`author_id`) as a first-class key**

Why this improves the plan: the PRD’s core use case is year-scale reviewer profiling. Usernames are mutable in GitLab, so username-only filtering will fragment one reviewer into multiple identities over time. Adding `author_id` closes that correctness hole and makes historical analysis reliable.
```diff
@@ Problem Statement
-1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread
+1. **Query individual notes by reviewer identity** — support both mutable username and immutable GitLab `author_id` for stable longitudinal analysis

@@ Phase 0: Stable Note Identity
+### Work Chunk 0D: Immutable Author Identity Capture
+**Files:** `migrations/025_notes_author_id.sql`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/cli/commands/list.rs`
+
+#### Implementation
+- Add nullable `notes.author_id INTEGER` and backfill from future syncs.
+- Populate `author_id` from the GitLab note payload (`note.author.id`) on both issue and MR note ingestion paths.
+- Add `--author-id <int>` filter to `lore notes`.
+- Keep `--author` for ergonomics; when both are provided, require both to match.
+
+#### Indexing
+- Add `CREATE INDEX idx_notes_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0;`
+
+#### Tests
+- `test_query_notes_filter_author_id_survives_username_change`
+- `test_query_notes_author_and_author_id_intersection`
```
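The fragmentation risk is easy to see in miniature. A dependency-free Rust sketch (the `Note` struct and helper names here are illustrative, not the PRD's actual schema): grouping by username splits one renamed reviewer into two identities, while grouping by `author_id` does not.

```rust
use std::collections::HashSet;

// Simplified note record: `author_id` is GitLab's immutable numeric id;
// `username` is the mutable handle captured at sync time.
#[derive(Debug)]
pub struct Note {
    pub author_id: i64,
    pub username: String,
}

// Count distinct reviewer "identities" under username-keyed grouping.
pub fn identities_by_username(notes: &[Note]) -> usize {
    notes.iter().map(|n| n.username.as_str()).collect::<HashSet<_>>().len()
}

// Count distinct reviewer identities under author_id-keyed grouping.
pub fn identities_by_author_id(notes: &[Note]) -> usize {
    notes.iter().map(|n| n.author_id).collect::<HashSet<_>>().len()
}
```

A reviewer who renamed `alice` to `alice-renamed` mid-history appears twice under the first grouping and once under the second, which is exactly the correctness gap the work chunk closes.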
2. **Strengthen partial-fetch safety from a boolean to an explicit fetch state contract**

Why this improves the plan: `fetch_complete: bool` is easy to misuse and fragile under retries/crashes. A run-scoped state model makes sweep correctness auditable and prevents accidental deletions when ingestion aborts midway.
```diff
@@ Phase 0: Stable Note Identity
-### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
+### Work Chunk 0C: Sweep Safety Guard with Run-Scoped Fetch State

@@ Implementation
-Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
+Add a run-scoped fetch state:
+- `FetchState::Complete`
+- `FetchState::Partial`
+- `FetchState::Failed`
+
+Only run the sweep on `FetchState::Complete`.
+Persist `run_seen_at` once per sync run and pass it unchanged through all discussion/note upserts.
+Require `run_seen_at` monotonicity per discussion before sweeping (skip and warn otherwise).

@@ Tests to Write First
+#[test]
+fn test_failed_fetch_never_sweeps_even_after_partial_upserts() { ... }
+#[test]
+fn test_non_monotonic_run_seen_at_skips_sweep() { ... }
+#[test]
+fn test_retry_after_failed_fetch_then_complete_sweeps_correctly() { ... }
```
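The gating contract reduces to a pure predicate, which is what makes it auditable. A minimal sketch, assuming epoch-second timestamps and a `may_sweep` helper name that is not in the PRD:

```rust
// Run-scoped fetch outcome for one sync run of a discussion's notes.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum FetchState {
    Complete,
    Partial,
    Failed,
}

/// Decide whether the stale-note sweep may run for one discussion.
/// `last_swept_at` is the previous run's timestamp, `run_seen_at` the
/// current one (both epoch seconds). Two conditions must hold: the
/// fetch finished completely, and time moved forward (monotonicity).
pub fn may_sweep(state: FetchState, last_swept_at: i64, run_seen_at: i64) -> bool {
    state == FetchState::Complete && run_seen_at > last_swept_at
}
```

Because the decision is a pure function of `(state, timestamps)`, the three proposed tests become one-line assertions rather than fixture-heavy integration checks.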
3. **Add DB-level cleanup triggers for note-document referential integrity**

Why this improves the plan: Work Chunk 0B handles the sweep path, but not every possible delete path. DB triggers give defense-in-depth so stale note docs cannot survive even if a future code path deletes notes differently.
```diff
@@ Work Chunk 0B: Immediate Deletion Propagation
-Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL
+Keep set-based SQL in sweep functions, and add DB-level cleanup triggers as a safety net.

@@ Work Chunk 2A: Schema Migration (023)
+-- Cleanup trigger: deleting a non-system note must delete note document + dirty queue row
+CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
+WHEN old.is_system = 0
+BEGIN
+  DELETE FROM documents
+  WHERE source_type = 'note' AND source_id = old.id;
+  DELETE FROM dirty_sources
+  WHERE source_type = 'note' AND source_id = old.id;
+END;
+
+-- Cleanup trigger: if note flips to system, remove its document artifacts
+CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
+WHEN old.is_system = 0 AND new.is_system = 1
+BEGIN
+  DELETE FROM documents
+  WHERE source_type = 'note' AND source_id = new.id;
+  DELETE FROM dirty_sources
+  WHERE source_type = 'note' AND source_id = new.id;
+END;
```
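The invariant those triggers enforce can be stated independently of SQL: whatever path removes a note row must also drop its document and dirty-queue rows. An in-memory Rust model of that invariant, with illustrative names (this is a property sketch, not the project's storage code):

```rust
use std::collections::{HashMap, HashSet};

// Toy model of the three tables involved in note-document integrity.
pub struct Store {
    pub notes: HashMap<i64, bool>,   // note id -> is_system flag
    pub documents: HashSet<i64>,     // note ids that have a 'note' document
    pub dirty_sources: HashSet<i64>, // note ids queued for reindexing
}

impl Store {
    // Mirrors the AFTER DELETE trigger: every delete path, not just the
    // sweep, cascades cleanup to documents and dirty_sources.
    pub fn delete_note(&mut self, id: i64) {
        if self.notes.remove(&id).is_some() {
            self.documents.remove(&id);
            self.dirty_sources.remove(&id);
        }
    }
}
```

The value of putting this in a trigger rather than application code is exactly that the cascade fires even for delete paths written later, which the in-memory model cannot guarantee.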
4. **Eliminate N+1 extraction cost with parent metadata caching in regeneration**

Why this improves the plan: backfilling ~8k notes with per-note parent/label lookups creates avoidable query amplification. Batch caching turns repeated joins into one-time lookups per parent entity and materially reduces rebuild time.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
+**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
+
+#### Implementation
+- Add `NoteExtractionContext` cache keyed by `(noteable_type, parent_id)` containing:
+  - parent iid/title/url
+  - parent labels
+  - project path
+- In batch regeneration, prefetch parent metadata for note IDs in the current chunk.
+- Use cached metadata in `extract_note_document()` to avoid repeated parent/label queries.
+
+#### Tests
+- `test_note_regeneration_uses_parent_cache_consistently`
+- `test_note_regeneration_cache_hit_preserves_hash_determinism`
```
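The cache shape can be sketched as follows. `ParentMeta`, the closure-based loader, and the `fetches` counter are illustrative stand-ins for the real queries, but they show the key property: one hundred notes under one issue cost one metadata lookup, not one hundred.

```rust
use std::collections::HashMap;

// Subset of parent metadata a note document needs; fields are illustrative.
#[derive(Clone)]
pub struct ParentMeta {
    pub title: String,
    pub labels: Vec<String>,
}

pub struct NoteExtractionContext {
    cache: HashMap<(String, i64), ParentMeta>, // keyed by (noteable_type, parent_id)
    pub fetches: usize,                        // counts simulated DB round-trips
}

impl NoteExtractionContext {
    pub fn new() -> Self {
        Self { cache: HashMap::new(), fetches: 0 }
    }

    // Return cached metadata for a parent, invoking the loader (one DB
    // round-trip in the real system) only on a cache miss.
    pub fn parent_meta(
        &mut self,
        noteable_type: &str,
        parent_id: i64,
        load: impl FnOnce() -> ParentMeta,
    ) -> ParentMeta {
        let key = (noteable_type.to_string(), parent_id);
        if !self.cache.contains_key(&key) {
            self.fetches += 1;
            let meta = load();
            self.cache.insert(key.clone(), meta);
        }
        self.cache[&key].clone()
    }
}
```

Scoping the cache to one regeneration chunk (rather than process-global) also keeps memory bounded and sidesteps invalidation questions during long backfills.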
5. **Add embedding dedup cache keyed by semantic text hash**

Why this improves the plan: note docs will contain repeated short comments (“LGTM”, “nit: …”). Current doc-level hashing includes metadata, so identical semantic comments still re-embed many times. A semantic embedding hash cache cuts cost and speeds full rebuild/backfill without changing search behavior.
```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2J: Semantic Embedding Dedup for Notes
+**Files:** `migrations/026_embedding_cache.sql`, embedding pipeline module(s), `src/documents/extractor.rs`
+
+#### Implementation
+- Compute `embedding_text` for notes as: normalized note body + compact stable context (`parent_type`, `path`, `resolution`), excluding volatile fields.
+- Compute `embedding_hash = sha256(embedding_text)`.
+- Before embedding generation, look up an existing vector by `(model, embedding_hash)`.
+- Reuse the cached vector when present; only call the embedding model on misses.
+
+#### Tests
+- `test_identical_note_bodies_reuse_embedding_vector`
+- `test_embedding_hash_changes_when_semantic_context_changes`
```
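A minimal sketch of the dedup flow. One deliberate substitution: the work chunk specifies sha256, but this example uses the standard library's `DefaultHasher` purely to stay dependency-free; the cache logic is identical either way.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for sha256(embedding_text); swap in a real sha256 in production.
fn semantic_hash(embedding_text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    embedding_text.hash(&mut h);
    h.finish()
}

pub struct EmbeddingCache {
    vectors: HashMap<(String, u64), Vec<f32>>, // (model, semantic hash) -> vector
    pub model_calls: usize,                    // counts simulated model invocations
}

impl EmbeddingCache {
    pub fn new() -> Self {
        Self { vectors: HashMap::new(), model_calls: 0 }
    }

    // Reuse the cached vector when this (model, hash) pair was seen before;
    // call the (simulated) embedding model only on a miss.
    pub fn embed(
        &mut self,
        model: &str,
        embedding_text: &str,
        run_model: impl FnOnce(&str) -> Vec<f32>,
    ) -> Vec<f32> {
        let key = (model.to_string(), semantic_hash(embedding_text));
        if let Some(v) = self.vectors.get(&key) {
            return v.clone();
        }
        self.model_calls += 1;
        let v = run_model(embedding_text);
        self.vectors.insert(key, v.clone());
        v
    }
}
```

Keying by `(model, hash)` rather than hash alone matters: a model upgrade must not serve vectors produced by the previous model.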
6. **Add deterministic review-signal tags as derived labels**

Why this improves the plan: this makes output immediately more useful for reviewer-pattern analysis without adding a profile command (which is explicitly out of scope). It increases the practical value of both `lore notes` and `lore search --type note` with low complexity.
```diff
@@ Non-Goals
-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
+- Adding a "reviewer profile" report command (downstream), while allowing low-level derived signal tags as indexing primitives

@@ Phase 2: Per-Note Documents
+### Work Chunk 2K: Derived Review Signal Labels
+**Files:** `src/documents/extractor.rs`
+
+#### Implementation
+- Derive deterministic labels from note text + metadata:
+  - `signal:nit`
+  - `signal:blocking`
+  - `signal:security`
+  - `signal:performance`
+  - `signal:testing`
+- Attach via existing `document_labels` flow for note documents.
+- No new CLI mode required; existing label filters can consume these labels.
+
+#### Tests
+- `test_note_document_derives_signal_labels_nit`
+- `test_note_document_derives_signal_labels_security`
+- `test_signal_label_derivation_is_deterministic`
```
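Determinism is the load-bearing property: the labels must be a pure function of note content so regeneration never flaps. A sketch with illustrative trigger phrases (the real phrase lists would need tuning against actual review corpora):

```rust
// Derive review-signal labels from a note body. Pure function: same input,
// same labels, every time. Phrase lists below are example heuristics only.
pub fn derive_signal_labels(body: &str) -> Vec<&'static str> {
    let text = body.to_lowercase();
    let mut labels = Vec::new();
    if text.starts_with("nit") || text.contains("nitpick") {
        labels.push("signal:nit");
    }
    if text.contains("blocking") || text.contains("must fix") {
        labels.push("signal:blocking");
    }
    if text.contains("security") || text.contains("injection") || text.contains("xss") {
        labels.push("signal:security");
    }
    if text.contains("performance") || text.contains("n+1") || text.contains("slow") {
        labels.push("signal:performance");
    }
    if text.contains("test") || text.contains("coverage") {
        labels.push("signal:testing");
    }
    labels
}
```

Because derivation is deterministic, it can run inside document extraction without invalidating content hashes unpredictably, and the determinism test is a simple double-invocation equality check.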
7. **Add high-precision note targeting filters (`--note-id`, `--gitlab-note-id`, `--discussion-id`)**

Why this improves the plan: debugging, incident response, and reproducibility all benefit from exact addressing. This is especially useful when validating sync correctness and cross-checking a specific note/document lifecycle.
```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
 pub struct NotesArgs {
+    /// Filter by local note row id
+    #[arg(long = "note-id", help_heading = "Filters")]
+    pub note_id: Option<i64>,
+
+    /// Filter by GitLab note id
+    #[arg(long = "gitlab-note-id", help_heading = "Filters")]
+    pub gitlab_note_id: Option<i64>,
+
+    /// Filter by local discussion id
+    #[arg(long = "discussion-id", help_heading = "Filters")]
+    pub discussion_id: Option<i64>,
 }

@@ Work Chunk 1A: Filter struct
 pub struct NoteListFilters<'a> {
+    pub note_id: Option<i64>,
+    pub gitlab_note_id: Option<i64>,
+    pub discussion_id: Option<i64>,
 }

@@ Tests to Write First
+#[test]
+fn test_query_notes_filter_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_gitlab_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_discussion_id_exact() { ... }
```
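The filters compose as a conjunction: every provided filter must match, and absent filters match everything. A sketch of that predicate; the field names mirror the proposed structs but are simplified (no lifetime, no existing fields):

```rust
// Simplified note row with the three addressable identifiers.
pub struct NoteRow {
    pub id: i64,             // local row id
    pub gitlab_note_id: i64, // upstream GitLab note id
    pub discussion_id: i64,  // local discussion id
}

// Simplified exact-addressing filters; None means "don't filter on this".
pub struct NoteFilters {
    pub note_id: Option<i64>,
    pub gitlab_note_id: Option<i64>,
    pub discussion_id: Option<i64>,
}

// Conjunction semantics: each Some(v) must equal the row's field.
pub fn note_matches(row: &NoteRow, f: &NoteFilters) -> bool {
    f.note_id.map_or(true, |v| v == row.id)
        && f.gitlab_note_id.map_or(true, |v| v == row.gitlab_note_id)
        && f.discussion_id.map_or(true, |v| v == row.discussion_id)
}
```

In the real query builder this maps to appending one `AND column = ?` clause per `Some` filter, so the SQL stays exact-match and index-friendly.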
If you want, I can produce a single consolidated “iteration 5” PRD diff that merges these into your exact section ordering and updates the dependency graph/migration numbering end-to-end.