gitlore/docs/prd-per-note-search.feedback-2.md at e9bacc94e1148bf867e5961d61f7a2009a84812d

Files

teernisse e9bacc94e1 perf: force partial index for DiffNote queries (26-75x), batch stats counts (1.7x)

who.rs: Add INDEXED BY idx_notes_diffnote_path_created to all DiffNote
query paths (expert, expert_details, reviews, path probes, suffix_probe).
SQLite planner was choosing idx_notes_system (106K rows, 38%) over the
partial index (26K rows, 9.3%) when LIKE predicates are present.
Measured: expert 1561ms->59ms (26x), reviews ~1200ms->16ms (75x).

stats.rs: Replace 12+ sequential COUNT(*) queries with conditional
aggregates (SUM(CASE WHEN...)) and use FTS5 shadow table
(documents_fts_docsize) instead of virtual table for counting.
Measured: warm 109ms->65ms (1.68x).

2026-02-12 09:59:46 -05:00

8.8 KiB

Raw Blame History

Below are the strongest revisions I’d make, excluding everything in your ## Rejected Recommendations list.

Add a Phase 0 for stable note identity before any note-doc generation Rationale: your current plan still allows note document churn because Issue discussion ingestion is delete/reinsert-based. That makes local notes.id unstable, causing unnecessary dirtying/regeneration and potential stale-doc edge cases. Stabilizing identity first (upsert-by-GitLab-ID + sweep stale) improves correctness and cuts repeated work.

@@ ## Design
-Two phases, shipped together as one feature:
+Three phases, shipped together as one feature:
+- **Phase 0 (Foundation):** Stable note identity in local DB (upsert + sweep, no delete/reinsert churn)
 - **Phase 1 (Option A):** `lore notes` command — direct SQL query over the `notes` table with rich filtering
 - **Phase 2 (Option B):** Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline
@@
+## Phase 0: Stable Note Identity
+
+### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
+**Files:** `src/ingestion/discussions.rs`, `migrations/022_notes_identity_index.sql`, `src/core/db.rs`
+**Implementation:**
+- Add unique index: `UNIQUE(project_id, gitlab_id)` on `notes`
+- Replace delete/reinsert issue-note flow with upsert + `last_seen_at` sweep (same durability model as MR note sweep)
+- Ensure `insert_note/upsert_note` returns the stable local row id for both insert and update paths

Replace source_type CHECK constraints with a registry table + FK in migration Rationale: table CHECKs force full table rebuild for every new source type forever. A source_types table with FK keeps DB-level integrity and future extensibility without rebuilding documents/dirty_sources every time. This is a major architecture hardening win.

@@ ### Work Chunk 2A: Schema Migration (023)
-Current migration ... CHECK constraints limiting `source_type` ...
+Current migration ... CHECK constraints limiting `source_type` ...
+Revision: migrate to `source_types` registry table + FK constraints.
@@
-1. `dirty_sources` — add `'note'` to source_type CHECK
-2. `documents` — add `'note'` to source_type CHECK
+1. Create `source_types(name TEXT PRIMARY KEY)` and seed: `issue, merge_request, discussion, note`
+2. Rebuild `dirty_sources` and `documents` to replace CHECK with `REFERENCES source_types(name)`
+3. Future source-type additions become `INSERT INTO source_types(name) VALUES (?)` (no table rebuild)
@@
+#### Additional integrity tests
+#[test]
+fn test_source_types_registry_contains_note() { ... }
+#[test]
+fn test_documents_source_type_fk_enforced() { ... }
+#[test]
+fn test_dirty_sources_source_type_fk_enforced() { ... }

Mark note documents dirty only when note semantics actually changed Rationale: current loops mark every non-system note dirty every sync. With 8k+ notes this creates avoidable queue pressure and regeneration time. Change-aware dirtying (inserted/changed only) gives major performance and stability improvements.

@@ ### Work Chunk 2D: Regenerator & Dirty Tracking Integration
-for note in notes {
-    let local_note_id = insert_note(&tx, local_discussion_id, &note, None)?;
-    if !note.is_system {
-        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, local_note_id)?;
-    }
-}
+for note in notes {
+    let outcome = upsert_note(&tx, local_discussion_id, &note, None)?;
+    if !note.is_system && outcome.changed_semantics {
+        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
+    }
+}
@@
+// changed_semantics should include: body, note_type, path/line positions, resolvable/resolved/resolved_by, updated_at

Expand filters to support real analysis windows and resolution state Rationale: reviewer profiling usually needs bounded windows and both resolved/unresolved views. Current unresolved: bool is too narrow and one-sided. Add --until and tri-state resolution filtering for better analytical power.

@@ pub struct NoteListFilters<'a> {
-    pub since: Option<&'a str>,
+    pub since: Option<&'a str>,
+    pub until: Option<&'a str>,
@@
-    pub unresolved: bool,
+    pub resolution: &'a str, // "any" (default) | "unresolved" | "resolved"
@@
-    pub author: Option<&'a str>,
+    pub author: Option<&'a str>, // case-insensitive match
@@
-    // Filter by time (7d, 2w, 1m, or YYYY-MM-DD)
+    // Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
     pub since: Option<String>,
+    /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
+    #[arg(long, help_heading = "Filters")]
+    pub until: Option<String>,
@@
-    /// Only show unresolved review comments
-    pub unresolved: bool,
+    /// Resolution filter: any, unresolved, resolved
+    #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
+    pub resolution: String,

Broaden index strategy to match actual query shapes, not just author queries Rationale: idx_notes_user_created helps one path, but common usage also includes project+time scans and unresolved filters. Add two more partial composites for high-selectivity paths.

@@ ### Work Chunk 1E: Composite Query Index
 CREATE INDEX IF NOT EXISTS idx_notes_user_created
 ON notes(project_id, author_username, created_at DESC, id DESC)
 WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_unresolved_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0 AND resolvable = 1 AND resolved = 0;
@@
+#[test]
+fn test_notes_query_plan_uses_project_created_index_for_default_listing() { ... }
+#[test]
+fn test_notes_query_plan_uses_unresolved_index_when_resolution_unresolved() { ... }

Improve per-note document payload with structured metadata header + minimal thread context Rationale: isolated single-note docs can lose meaning. A small structured header plus lightweight context (parent + one preceding note excerpt) improves semantic retrieval quality substantially without re-bundling full threads.

@@ ### Work Chunk 2C: Note Document Extractor
-// 6. Format content:
-//    [[Note]] {note_type or "Comment"} on {parent_type_prefix}: {parent_title}
-//    Project: {path_with_namespace}
-//    URL: {url}
-//    Author: @{author}
-//    Date: {format_date(created_at)}
-//    Labels: {labels_json}
-//    File: {position_new_path}:{position_new_line}  (if DiffNote)
-//
-//    --- Body ---
-//
-//    {body}
+// 6. Format content with machine-readable header:
+//    [[Note]]
+//    source_type: note
+//    note_gitlab_id: {gitlab_id}
+//    project: {path_with_namespace}
+//    parent_type: {Issue|MergeRequest}
+//    parent_iid: {iid}
+//    note_type: {DiffNote|DiscussionNote|Comment}
+//    author: @{author}
+//    created_at: {iso8601}
+//    resolved: {true|false}
+//    path: {position_new_path}:{position_new_line}
+//    url: {url}
+//
+//    --- Context ---
+//    parent_title: {title}
+//    previous_note_excerpt: {optional, max 200 chars}
+//
+//    --- Body ---
+//    {body}

Add first-class export modes for downstream profiling pipelines Rationale: this makes the feature much more useful immediately (LLM prompts, notebook analysis, external scripts) without adding a profiling command. It stays within your non-goals and increases adoption.

@@ pub struct NotesArgs {
+    /// Output format
+    #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
+    pub format: String,
@@
-    if robot_mode {
+    if robot_mode || args.format == "json" || args.format == "jsonl" || args.format == "csv" {
         print_list_notes_json(...)
     } else {
         print_list_notes(&result);
     }
@@ ### Work Chunk 1C: Human & Robot Output Formatting
+Add `print_list_notes_csv()` and `print_list_notes_jsonl()`:
+- CSV columns mirror `NoteListRowJson` field names
+- JSONL emits one note object per line for streaming pipelines

Strengthen verification with idempotence + migration data-preservation checks Rationale: this feature touches ingestion, migrations, indexing, and regeneration. Add explicit idempotence/perf checks so regressions surface early.

@@ ## Verification Checklist
 cargo test
 cargo clippy --all-targets -- -D warnings
 cargo fmt --check
+cargo test test_note_ingestion_idempotent_across_two_syncs
+cargo test test_note_document_count_stable_after_second_generate_docs_full
@@
+lore sync
+lore generate-docs --full
+lore -J stats > /tmp/stats1.json
+lore generate-docs --full
+lore -J stats > /tmp/stats2.json
+# assert note doc count unchanged and dirty queue drains to zero

If you want, I can turn this into a fully rewritten PRD v2 draft with these changes merged in-place and renumbered work chunks end-to-end.

8.8 KiB Raw Blame History Unescape Escape

8.8 KiB

Raw Blame History