gitlore/docs/prd-per-note-search.feedback-5.md at 740607e06d4d7160c16177826d7488cac9467567

Files

teernisse 740607e06d perf: force partial index for DiffNote queries (26-75x), batch stats counts (1.7x)

who.rs: Add INDEXED BY idx_notes_diffnote_path_created to all DiffNote
query paths (expert, expert_details, reviews, path probes, suffix_probe).
SQLite planner was choosing idx_notes_system (106K rows, 38%) over the
partial index (26K rows, 9.3%) when LIKE predicates are present.
Measured: expert 1561ms->59ms (26x), reviews ~1200ms->16ms (75x).

stats.rs: Replace 12+ sequential COUNT(*) queries with conditional
aggregates (SUM(CASE WHEN...)) and use FTS5 shadow table
(documents_fts_docsize) instead of virtual table for counting.
Measured: warm 109ms->65ms (1.68x).

2026-02-12 10:02:48 -05:00

8.8 KiB

Raw Blame History

Here are the highest-impact revisions I’d make. None of these repeat anything in your ## Rejected Recommendations.

Add immutable reviewer identity (author_id) as a first-class key Why this improves the plan: the PRD’s core use case is year-scale reviewer profiling. Usernames are mutable in GitLab, so username-only filtering will fragment one reviewer into multiple identities over time. Adding author_id closes that correctness hole and makes historical analysis reliable.

@@ Problem Statement
-1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread
+1. **Query individual notes by reviewer identity** — support both mutable username and immutable GitLab `author_id` for stable longitudinal analysis

@@ Phase 0: Stable Note Identity
+### Work Chunk 0D: Immutable Author Identity Capture
+**Files:** `migrations/025_notes_author_id.sql`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`, `src/cli/commands/list.rs`
+
+#### Implementation
+- Add nullable `notes.author_id INTEGER` and backfill from future syncs.
+- Populate `author_id` from GitLab note payload (`note.author.id`) on both issue and MR note ingestion paths.
+- Add `--author-id <int>` filter to `lore notes`.
+- Keep `--author` for ergonomics; when both provided, require both to match.
+
+#### Indexing
+- Add `idx_notes_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0;`
+
+#### Tests
+- `test_query_notes_filter_author_id_survives_username_change`
+- `test_query_notes_author_and_author_id_intersection`

Strengthen partial-fetch safety from a boolean to an explicit fetch state contract Why this improves the plan: fetch_complete: bool is easy to misuse and fragile under retries/crashes. A run-scoped state model makes sweep correctness auditable and prevents accidental deletions when ingestion aborts midway.

@@ Phase 0: Stable Note Identity
-### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
+### Work Chunk 0C: Sweep Safety Guard with Run-Scoped Fetch State

@@ Implementation
-Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
+Add a run-scoped fetch state:
+- `FetchState::Complete`
+- `FetchState::Partial`
+- `FetchState::Failed`
+
+Only run sweep on `FetchState::Complete`.
+Persist `run_seen_at` once per sync run and pass unchanged through all discussion/note upserts.
+Require `run_seen_at` monotonicity per discussion before sweep (skip and warn otherwise).

@@ Tests to Write First
+#[test]
+fn test_failed_fetch_never_sweeps_even_after_partial_upserts() { ... }
+#[test]
+fn test_non_monotonic_run_seen_at_skips_sweep() { ... }
+#[test]
+fn test_retry_after_failed_fetch_then_complete_sweeps_correctly() { ... }

Add DB-level cleanup triggers for note-document referential integrity Why this improves the plan: Work Chunk 0B handles the sweep path, but not every possible delete path. DB triggers give defense-in-depth so stale note docs cannot survive even if a future code path deletes notes differently.

@@ Work Chunk 0B: Immediate Deletion Propagation
-Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL
+Keep set-based SQL in sweep functions, and add DB-level cleanup triggers as a safety net.

@@ Work Chunk 2A: Schema Migration (023)
+-- Cleanup trigger: deleting a non-system note must delete note document + dirty queue row
+CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
+WHEN old.is_system = 0
+BEGIN
+  DELETE FROM documents
+   WHERE source_type = 'note' AND source_id = old.id;
+  DELETE FROM dirty_sources
+   WHERE source_type = 'note' AND source_id = old.id;
+END;
+
+-- Cleanup trigger: if note flips to system, remove its document artifacts
+CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
+WHEN old.is_system = 0 AND new.is_system = 1
+BEGIN
+  DELETE FROM documents
+   WHERE source_type = 'note' AND source_id = new.id;
+  DELETE FROM dirty_sources
+   WHERE source_type = 'note' AND source_id = new.id;
+END;

Eliminate N+1 extraction cost with parent metadata caching in regeneration Why this improves the plan: backfilling ~8k notes with per-note parent/label lookups creates avoidable query amplification. Batch caching turns repeated joins into one-time lookups per parent entity and materially reduces rebuild time.

@@ Phase 2: Per-Note Documents
+### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
+**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
+
+#### Implementation
+- Add `NoteExtractionContext` cache keyed by `(noteable_type, parent_id)` containing:
+  - parent iid/title/url
+  - parent labels
+  - project path
+- In batch regeneration, prefetch parent metadata for note IDs in the current chunk.
+- Use cached metadata in `extract_note_document()` to avoid repeated parent/label queries.
+
+#### Tests
+- `test_note_regeneration_uses_parent_cache_consistently`
+- `test_note_regeneration_cache_hit_preserves_hash_determinism`

Add embedding dedup cache keyed by semantic text hash Why this improves the plan: note docs will contain repeated short comments (“LGTM”, “nit: …”). Current doc-level hashing includes metadata, so identical semantic comments still re-embed many times. A semantic embedding hash cache cuts cost and speeds full rebuild/backfill without changing search behavior.

@@ Phase 2: Per-Note Documents
+### Work Chunk 2J: Semantic Embedding Dedup for Notes
+**Files:** `migrations/026_embedding_cache.sql`, embedding pipeline module(s), `src/documents/extractor.rs`
+
+#### Implementation
+- Compute `embedding_text` for notes as: normalized note body + compact stable context (`parent_type`, `path`, `resolution`), excluding volatile fields.
+- Compute `embedding_hash = sha256(embedding_text)`.
+- Before embedding generation, lookup existing vector by `(model, embedding_hash)`.
+- Reuse cached vector when present; only call embedding model on misses.
+
+#### Tests
+- `test_identical_note_bodies_reuse_embedding_vector`
+- `test_embedding_hash_changes_when_semantic_context_changes`

Add deterministic review-signal tags as derived labels Why this improves the plan: this makes output immediately more useful for reviewer-pattern analysis without adding a profile command (which is explicitly out of scope). It increases practical value of both lore notes and lore search --type note with low complexity.

@@ Non-Goals
-- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
+- Adding a "reviewer profile" report command (downstream), while allowing low-level derived signal tags as indexing primitives

@@ Phase 2: Per-Note Documents
+### Work Chunk 2K: Derived Review Signal Labels
+**Files:** `src/documents/extractor.rs`
+
+#### Implementation
+- Derive deterministic labels from note text + metadata:
+  - `signal:nit`
+  - `signal:blocking`
+  - `signal:security`
+  - `signal:performance`
+  - `signal:testing`
+- Attach via existing `document_labels` flow for note documents.
+- No new CLI mode required; existing label filters can consume these labels.
+
+#### Tests
+- `test_note_document_derives_signal_labels_nit`
+- `test_note_document_derives_signal_labels_security`
+- `test_signal_label_derivation_is_deterministic`

Add high-precision note targeting filters (--note-id, --gitlab-note-id, --discussion-id) Why this improves the plan: debugging, incident response, and reproducibility all benefit from exact addressing. This is especially useful when validating sync correctness and cross-checking a specific note/document lifecycle.

@@ Work Chunk 1B: CLI Arguments & Command Wiring
 pub struct NotesArgs {
+    /// Filter by local note row id
+    #[arg(long = "note-id", help_heading = "Filters")]
+    pub note_id: Option<i64>,
+
+    /// Filter by GitLab note id
+    #[arg(long = "gitlab-note-id", help_heading = "Filters")]
+    pub gitlab_note_id: Option<i64>,
+
+    /// Filter by local discussion id
+    #[arg(long = "discussion-id", help_heading = "Filters")]
+    pub discussion_id: Option<i64>,
 }

@@ Work Chunk 1A: Filter struct
 pub struct NoteListFilters<'a> {
+    pub note_id: Option<i64>,
+    pub gitlab_note_id: Option<i64>,
+    pub discussion_id: Option<i64>,
 }

@@ Tests to Write First
+#[test]
+fn test_query_notes_filter_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_gitlab_note_id_exact() { ... }
+#[test]
+fn test_query_notes_filter_discussion_id_exact() { ... }

If you want, I can produce a single consolidated “iteration 5” PRD diff that merges these into your exact section ordering and updates the dependency graph/migration numbering end-to-end.

8.8 KiB Raw Blame History Unescape Escape

8.8 KiB

Raw Blame History