Here are the highest-impact revisions I’d make. None of these repeat anything in your `## Rejected Recommendations`.

1. **Add immutable reviewer identity (`author_id`) as a first-class key**

Why this improves the plan: the PRD’s core use case is year-scale reviewer profiling. Usernames are mutable in GitLab, so username-only filtering will fragment one reviewer into multiple identities over time. Adding `author_id` closes that correctness hole and makes historical analysis reliable.

```diff
@@ Problem Statement
-1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread
+1. **Query individual notes by reviewer identity** — support both the mutable username and the immutable GitLab `author_id` for stable longitudinal analysis
@@ Phase 0: Stable Note Identity
+### Work Chunk 0D: Immutable Author Identity Capture
+**Files:** `migrations/025_notes_author_id.sql`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/cli/commands/list.rs`
+
+#### Implementation
+- Add a nullable `notes.author_id INTEGER` column and backfill it from future syncs.
+- Populate `author_id` from the GitLab note payload (`note.author.id`) on both the issue and MR note ingestion paths.
+- Add an `--author-id <id>` filter to `lore notes`.
+- Keep `--author` for ergonomics; when both are provided, require both to match.
+
+#### Indexing
+- Add `idx_notes_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0;`
+
+#### Tests
+- `test_query_notes_filter_author_id_survives_username_change`
+- `test_query_notes_author_and_author_id_intersection`
```

2. **Strengthen partial-fetch safety from a boolean to an explicit fetch-state contract**

Why this improves the plan: `fetch_complete: bool` is easy to misuse and fragile under retries and crashes. A run-scoped state model makes sweep correctness auditable and prevents accidental deletions when ingestion aborts midway.
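The gating logic this buys you can be sketched in a few lines of Rust; `FetchState` and `should_sweep` here are illustrative names, not the plan's exact API:

```rust
// Sketch of the run-scoped fetch-state contract. Names are assumptions:
// the plan's real ingestion code will differ.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FetchState {
    Complete,
    Partial,
    Failed,
}

/// Sweep only when the run finished completely AND `run_seen_at`
/// moved strictly forward for this discussion (monotonicity guard).
fn should_sweep(state: FetchState, last_seen_at: i64, run_seen_at: i64) -> bool {
    state == FetchState::Complete && run_seen_at > last_seen_at
}

fn main() {
    // A failed or partial run must never trigger the stale-note sweep.
    assert!(!should_sweep(FetchState::Failed, 100, 200));
    assert!(!should_sweep(FetchState::Partial, 100, 200));
    // A non-monotonic run_seen_at skips the sweep even on a complete run.
    assert!(!should_sweep(FetchState::Complete, 200, 100));
    // Only a complete run with a newer run_seen_at sweeps.
    assert!(should_sweep(FetchState::Complete, 100, 200));
}
```

The point of funneling every sweep decision through one predicate like this is that the "never delete on partial data" invariant becomes a single testable function rather than a convention scattered across call sites.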
```diff
@@ Phase 0: Stable Note Identity
-### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
+### Work Chunk 0C: Sweep Safety Guard with Run-Scoped Fetch State
@@ Implementation
-Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
+Add a run-scoped fetch state:
+- `FetchState::Complete`
+- `FetchState::Partial`
+- `FetchState::Failed`
+
+Only run the sweep on `FetchState::Complete`.
+Persist `run_seen_at` once per sync run and pass it unchanged through all discussion/note upserts.
+Require `run_seen_at` monotonicity per discussion before sweeping (skip and warn otherwise).
@@ Tests to Write First
+#[test]
+fn test_failed_fetch_never_sweeps_even_after_partial_upserts() { ... }
+#[test]
+fn test_non_monotonic_run_seen_at_skips_sweep() { ... }
+#[test]
+fn test_retry_after_failed_fetch_then_complete_sweeps_correctly() { ... }
```

3. **Add DB-level cleanup triggers for note-document referential integrity**

Why this improves the plan: Work Chunk 0B handles the sweep path, but not every possible delete path. DB triggers give defense in depth, so stale note documents cannot survive even if a future code path deletes notes differently.

```diff
@@ Work Chunk 0B: Immediate Deletion Propagation
-Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL
+Keep set-based SQL in the sweep functions, and add DB-level cleanup triggers as a safety net.
@@ Work Chunk 2A: Schema Migration (023)
+-- Cleanup trigger: deleting a non-system note must delete its note document + dirty queue row
+CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
+WHEN old.is_system = 0
+BEGIN
+  DELETE FROM documents
+  WHERE source_type = 'note' AND source_id = old.id;
+  DELETE FROM dirty_sources
+  WHERE source_type = 'note' AND source_id = old.id;
+END;
+
+-- Cleanup trigger: if a note flips to system, remove its document artifacts
+CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
+WHEN old.is_system = 0 AND new.is_system = 1
+BEGIN
+  DELETE FROM documents
+  WHERE source_type = 'note' AND source_id = new.id;
+  DELETE FROM dirty_sources
+  WHERE source_type = 'note' AND source_id = new.id;
+END;
```

4. **Eliminate N+1 extraction cost with parent metadata caching in regeneration**

Why this improves the plan: backfilling ~8k notes with per-note parent/label lookups creates avoidable query amplification. Batch caching turns repeated joins into one-time lookups per parent entity and materially reduces rebuild time.

```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
+**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
+
+#### Implementation
+- Add a `NoteExtractionContext` cache keyed by `(noteable_type, parent_id)` containing:
+  - parent iid/title/url
+  - parent labels
+  - project path
+- In batch regeneration, prefetch parent metadata for the note IDs in the current chunk.
+- Use the cached metadata in `extract_note_document()` to avoid repeated parent/label queries.
+
+#### Tests
+- `test_note_regeneration_uses_parent_cache_consistently`
+- `test_note_regeneration_cache_hit_preserves_hash_determinism`
```

5. **Add embedding dedup cache keyed by semantic text hash**

Why this improves the plan: note documents will contain repeated short comments (“LGTM”, “nit: …”).
Current doc-level hashing includes metadata, so identical semantic comments still re-embed many times. A semantic embedding-hash cache cuts cost and speeds up full rebuild/backfill without changing search behavior.

```diff
@@ Phase 2: Per-Note Documents
+### Work Chunk 2J: Semantic Embedding Dedup for Notes
+**Files:** `migrations/026_embedding_cache.sql`, embedding pipeline module(s), `src/documents/extractor.rs`
+
+#### Implementation
+- Compute `embedding_text` for notes as the normalized note body plus compact stable context (`parent_type`, `path`, `resolution`), excluding volatile fields.
+- Compute `embedding_hash = sha256(embedding_text)`.
+- Before embedding generation, look up an existing vector by `(model, embedding_hash)`.
+- Reuse the cached vector when present; only call the embedding model on misses.
+
+#### Tests
+- `test_identical_note_bodies_reuse_embedding_vector`
+- `test_embedding_hash_changes_when_semantic_context_changes`
```

6. **Add deterministic review-signal tags as derived labels**

Why this improves the plan: this makes the output immediately more useful for reviewer-pattern analysis without adding a profile command (which is explicitly out of scope). It increases the practical value of both `lore notes` and `lore search --type note` with low complexity.

```diff
@@ Non-Goals
-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
+- Adding a "reviewer profile" report command (downstream), while allowing low-level derived signal tags as indexing primitives
@@ Phase 2: Per-Note Documents
+### Work Chunk 2K: Derived Review Signal Labels
+**Files:** `src/documents/extractor.rs`
+
+#### Implementation
+- Derive deterministic labels from note text + metadata:
+  - `signal:nit`
+  - `signal:blocking`
+  - `signal:security`
+  - `signal:performance`
+  - `signal:testing`
+- Attach via the existing `document_labels` flow for note documents.
+- No new CLI mode required; existing label filters can consume these labels.
+
+#### Tests
+- `test_note_document_derives_signal_labels_nit`
+- `test_note_document_derives_signal_labels_security`
+- `test_signal_label_derivation_is_deterministic`
```

7. **Add high-precision note targeting filters (`--note-id`, `--gitlab-note-id`, `--discussion-id`)**

Why this improves the plan: debugging, incident response, and reproducibility all benefit from exact addressing. This is especially useful when validating sync correctness and cross-checking a specific note/document lifecycle.

```diff
@@ Work Chunk 1B: CLI Arguments & Command Wiring
 pub struct NotesArgs {
+    /// Filter by local note row id
+    #[arg(long = "note-id", help_heading = "Filters")]
+    pub note_id: Option<i64>,
+
+    /// Filter by GitLab note id
+    #[arg(long = "gitlab-note-id", help_heading = "Filters")]
+    pub gitlab_note_id: Option<i64>,
+
+    /// Filter by local discussion id
+    #[arg(long = "discussion-id", help_heading = "Filters")]
+    pub discussion_id: Option<i64>,
 }
@@ Work Chunk 1A: Filter struct
 pub struct NoteListFilters<'a> {
+    pub note_id: Option<i64>,
+    pub gitlab_note_id: Option<i64>,
+    pub discussion_id: Option<i64>,
 }
@@ Tests to Write First
+#[test]
+fn test_query_notes_filter_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_gitlab_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_discussion_id_exact() { ... }
```

If you want, I can produce a single consolidated “iteration 5” PRD diff that merges these into your exact section ordering and updates the dependency graph/migration numbering end-to-end.
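As a supplement to item 5, here is a minimal runnable sketch of the dedup idea. It uses std's `DefaultHasher` as a stand-in for sha256 so it needs no external crates; `embedding_key` and the normalization choices are illustrative assumptions, not the plan's exact design:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Build the dedup key from semantic fields only, excluding volatile
/// metadata (timestamps, document ids) so identical comments collide.
/// Real code would use sha256 over a canonical `embedding_text`.
fn embedding_key(body: &str, parent_type: &str, path: &str, resolved: bool) -> u64 {
    let mut h = DefaultHasher::new();
    // Normalize trivial whitespace/case noise so "LGTM" and " lgtm " dedup.
    body.trim().to_lowercase().hash(&mut h);
    parent_type.hash(&mut h);
    path.hash(&mut h);
    resolved.hash(&mut h);
    h.finish()
}

fn main() {
    let mut cache: HashMap<u64, Vec<f32>> = HashMap::new();
    let mut model_calls = 0;

    // Two superficially different "LGTM" notes on the same file:
    // the second is a cache hit, so the model is called once.
    for body in ["LGTM", " lgtm "] {
        let key = embedding_key(body, "MergeRequest", "src/lib.rs", false);
        cache.entry(key).or_insert_with(|| {
            model_calls += 1;
            vec![0.1, 0.2] // placeholder for a real embedding call
        });
    }
    assert_eq!(model_calls, 1);

    // Changing semantic context (here, the resolution flag) changes the key,
    // so context-dependent meaning still gets its own vector.
    let k1 = embedding_key("LGTM", "MergeRequest", "src/lib.rs", false);
    let k2 = embedding_key("LGTM", "MergeRequest", "src/lib.rs", true);
    assert_ne!(k1, k2);
}
```

The design choice worth noting: the key must include everything that changes the *meaning* of the embedded text and nothing that merely changes its storage, otherwise the cache either over-merges distinct comments or never hits.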