---
plan: true
title: ""
status: iterating
iteration: 6
target_iterations: 8
beads_revision: 0
related_plans: []
created: 2026-02-11
updated: 2026-02-12
---

# PRD: Per-Note Search & Reviewer Profiling

## Problem Statement

Lore ingests all GitLab discussion notes with full metadata (author, body, diff positions, timestamps), but the data is only accessible through aggregated discussion documents. There is no way to:

1. **Query individual notes by author**: the `--author` filter on `lore search` only matches the first note's author per discussion thread, and relies solely on mutable usernames (no immutable author identity for longitudinal analysis)
2. **List raw notes with metadata**: no CLI surface exposes the `notes` table directly
3. **Semantically search individual comments**: notes are bundled into thread documents, diluting per-note relevance

**Use case:** "Search through jdefting's code review comments over the past year to build a comprehensive report of their code smell preferences and review patterns."

## Design

Three phases, shipped together as one feature:

- **Phase 0 (Foundation):** Stable note identity. Unify issue discussion note ingestion to use upsert+sweep (matching the MR pattern), ensuring `notes.id` is stable across syncs, capturing immutable `author_id` for longitudinal analysis, and enabling change-aware dirty marking
- **Phase 1 (Option A):** `lore notes` command. Direct SQL query over the `notes` table with rich filtering and multiple export formats (table, JSON, JSONL, CSV)
- **Phase 2 (Option B):** Per-note documents. Each non-system note becomes its own searchable document in the FTS/embedding pipeline

All three phases are required. Phase 0 gives stable identity for reliable document tracking; Phase 1 gives structured data extraction; Phase 2 gives semantic search.
## Non-Goals

- Changing existing discussion document behavior (those remain as-is)
- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
- Modifying the ingestion pipeline from GitLab (data is already captured; Phase 0 only changes local storage strategy)
- Adding pagination/cursor support (no existing list command has this; high `--limit` covers the year-long analysis use case)
- Feature flag gating (no external users; single-user CLI in early dev)

---

## Phase 0: Stable Note Identity

### Rationale

Issue discussion note ingestion currently uses a delete/reinsert pattern (`DELETE FROM notes WHERE discussion_id = ?` then re-insert). This makes `notes.id` (the local row ID) unstable across syncs: every sync assigns new IDs to the same notes. MR discussion notes already use an upsert pattern (`ON CONFLICT(gitlab_id) DO UPDATE`), producing stable IDs.

Phase 2 depends on `notes.id` as the `source_id` for note documents. Unstable IDs would cause:

- Unnecessary document churn (old ID deleted, new ID created for identical content)
- Stale document accumulation (orphaned documents from old IDs)
- Wasted regeneration cycles on every sync

Unifying both paths to upsert+sweep gives stable identity, enables change-aware dirty marking, and reduces sync overhead.

### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes

**Files:** `src/ingestion/discussions.rs`

**Depends on:** Nothing (standalone)

**Context:** `src/ingestion/mr_discussions.rs` already has `upsert_note()` (line ~470) which uses `ON CONFLICT(gitlab_id) DO UPDATE` and `sweep_stale_notes()` (line ~551) which deletes notes with `last_seen_at < run_seen_at`. The `notes` table already has `UNIQUE(gitlab_id)`. We need to bring the issue discussion path to the same pattern.
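To make the identity argument concrete, here is a minimal in-memory sketch of upsert+sweep, assuming a `HashMap` stands in for the `notes` table and its `UNIQUE(gitlab_id)` constraint. `NoteStore`, `NoteRow`, and their methods are hypothetical names for illustration only; the real implementation runs against SQLite and scopes the sweep to a single discussion.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for one row of the `notes` table.
#[derive(Debug, Clone, PartialEq)]
struct NoteRow {
    local_id: i64,
    body: String,
    last_seen_at: i64,
}

/// Hypothetical store keyed by `gitlab_id`, modelling UNIQUE(gitlab_id).
#[derive(Default)]
struct NoteStore {
    next_id: i64,
    by_gitlab_id: HashMap<i64, NoteRow>,
}

impl NoteStore {
    /// Upsert: an existing row keeps its local id; only new notes get a fresh one.
    fn upsert(&mut self, gitlab_id: i64, body: &str, run_seen_at: i64) -> i64 {
        if let Some(row) = self.by_gitlab_id.get_mut(&gitlab_id) {
            row.body = body.to_string();
            row.last_seen_at = run_seen_at;
            return row.local_id;
        }
        self.next_id += 1;
        let local_id = self.next_id;
        self.by_gitlab_id.insert(
            gitlab_id,
            NoteRow { local_id, body: body.to_string(), last_seen_at: run_seen_at },
        );
        local_id
    }

    /// Sweep: drop rows that were not seen during this run; returns how many were removed.
    fn sweep(&mut self, run_seen_at: i64) -> usize {
        let before = self.by_gitlab_id.len();
        self.by_gitlab_id.retain(|_, row| row.last_seen_at >= run_seen_at);
        before - self.by_gitlab_id.len()
    }

    /// Looks up the local id for a GitLab note id, if the row still exists.
    fn local_id_of(&self, gitlab_id: i64) -> Option<i64> {
        self.by_gitlab_id.get(&gitlab_id).map(|row| row.local_id)
    }
}
```

In this model, re-syncing a note returns the same local id it had before, while a note missing from the latest run is removed by the sweep. A delete/reinsert strategy would instead hand out a new id on every sync.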
#### Tests to Write First ```rust #[test] fn test_issue_note_upsert_stable_id() { // Insert a discussion with 2 notes via the new upsert path // Record their local IDs // "Re-sync" the same notes (same gitlab_ids, same content) // Assert: local IDs are unchanged } #[test] fn test_issue_note_upsert_detects_body_change() { // Insert note with body "old text" // Re-sync same gitlab_id with body "new text" // Assert: upsert returns changed_semantics = true // Assert: local ID is unchanged } #[test] fn test_issue_note_upsert_unchanged_returns_false() { // Insert note, re-sync identical note // Assert: upsert returns changed_semantics = false } #[test] fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() { // Insert note with body "text", updated_at = 1000 // Re-sync same gitlab_id with body "text", updated_at = 2000 // Assert: upsert returns changed_semantics = false // Assert: updated_at is updated in the DB (housekeeping fields always refresh) } #[test] fn test_issue_note_sweep_removes_stale() { // Insert 3 notes for a discussion // Re-sync with only 2 of the 3 (different last_seen_at) // Run sweep // Assert: stale note is deleted, 2 remain } #[test] fn test_issue_note_upsert_returns_local_id() { // Insert a note via upsert // Assert: returned local_id matches conn.last_insert_rowid() // Or for update path: matches the existing row's id } ``` #### Implementation **1. Create shared `NoteUpsertOutcome` struct** (in `src/ingestion/discussions.rs` or a shared module): ```rust pub struct NoteUpsertOutcome { pub local_note_id: i64, pub changed_semantics: bool, } ``` **2. 
Refactor `insert_note()` → `upsert_note_for_issue()`:** Replace the current `DELETE FROM notes WHERE discussion_id = ?` + loop insert pattern (lines 132-139) with: ```rust for note in &normalized_notes { let outcome = upsert_note_for_issue(&tx, local_discussion_id, note, last_seen_at)?; // outcome.local_note_id and outcome.changed_semantics available for Phase 2 } // After loop: sweep stale notes for this discussion sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?; ``` The upsert SQL follows the MR pattern: ```sql INSERT INTO notes (gitlab_id, discussion_id, project_id, author_username, body, note_type, is_system, created_at, updated_at, last_seen_at, ...) VALUES (?1, ?2, ?3, ...) ON CONFLICT(gitlab_id) DO UPDATE SET body = excluded.body, note_type = excluded.note_type, updated_at = excluded.updated_at, last_seen_at = excluded.last_seen_at, ... ``` **Change detection:** Semantic change is computed separately from housekeeping updates. The upsert always updates persistence fields (`updated_at`, `last_seen_at`), but `changed_semantics` is derived only from fields that affect note documents and search filters: ```sql ON CONFLICT(gitlab_id) DO UPDATE SET body = excluded.body, note_type = excluded.note_type, updated_at = excluded.updated_at, last_seen_at = excluded.last_seen_at, resolved = excluded.resolved, resolved_by = excluded.resolved_by, position_new_path = excluded.position_new_path, position_new_line = excluded.position_new_line, position_old_path = excluded.position_old_path, position_old_line = excluded.position_old_line, ... 
``` Then detect semantic change with a separate check that excludes `updated_at` and `last_seen_at` (housekeeping-only fields): ```sql WHERE notes.body IS NOT excluded.body OR notes.note_type IS NOT excluded.note_type OR notes.author_username IS NOT excluded.author_username OR notes.resolved IS NOT excluded.resolved OR notes.resolved_by IS NOT excluded.resolved_by OR notes.position_new_path IS NOT excluded.position_new_path OR notes.position_new_line IS NOT excluded.position_new_line ``` **Why `author_username` is semantic:** Note documents embed the username in both the content header (`author: @{author}`) and the title (`Note by @{author} on Issue #42`). If a GitLab user changes their username (e.g., `jdefting` -> `jd-engineering`), the existing note documents become stale — search results show the old username, inconsistent with what the API returns. Treating username changes as semantic ensures documents stay accurate. **Note:** `author_id` changes do NOT trigger `changed_semantics`. The `author_id` is an immutable identity anchor — it never changes in practice, and even if it did (data migration), it doesn't affect document content. **Rationale:** `updated_at` changes alone (e.g., GitLab touching the timestamp without modifying content) should NOT trigger document regeneration. This avoids unnecessary dirty queue churn on large datasets. The WHERE clause fires the DO UPDATE unconditionally (to refresh `last_seen_at`), and `changed_semantics` is derived from `conn.changes()` after a second query that checks only semantic fields: ```rust // Two-step: always upsert (refreshes housekeeping), then check semantic change let upserted = /* run upsert SQL */; let local_id = conn.query_row("SELECT id FROM notes WHERE gitlab_id = ?", [gitlab_id], |r| r.get(0))?; let changed = conn.query_row( "SELECT COUNT(*) FROM notes WHERE id = ? AND (body IS NOT ? OR note_type IS NOT ? OR ...)", params![local_id, body, note_type, ...], |r| r.get::<_, i64>(0), )? 
== 0 && /* was an update, not insert */;
```

A simpler approach: read the existing row before the upsert, run the upsert (which always runs the SET clause), then compare the semantic fields in Rust:

```rust
use rusqlite::OptionalExtension; // brings `.optional()` into scope

// Check if note exists before upsert
let existed = conn.query_row(
    "SELECT id, body, note_type, author_username, resolved, resolved_by,
            position_new_path, position_new_line
     FROM notes WHERE gitlab_id = ?",
    [gitlab_id],
    |r| Ok((r.get::<_, i64>(0)?, r.get::<_, Option<String>>(1)?, /* ... */)),
).optional()?;

// Run upsert (always updates housekeeping fields)
conn.execute(upsert_sql, params![...])?;

let local_id = match &existed {
    Some((id, ..)) => *id,
    None => conn.last_insert_rowid(),
};

let changed_semantics = match &existed {
    None => true, // New insert = always changed
    Some((_, old_body, old_note_type, old_author_username, old_resolved, old_resolved_by, old_path, old_line)) => {
        old_body.as_deref() != body
            || old_note_type.as_deref() != note_type
            || old_author_username.as_deref() != author_username
            || /* ... */
    }
};
```

This pre-read approach is clearest and avoids any SQLite edge cases with `changes()` counting. The pre-read is a single-row lookup on the UNIQUE(gitlab_id) index, so its cost is negligible.

**3. Also update `upsert_note()` in `src/ingestion/mr_discussions.rs`** to return `Result<NoteUpsertOutcome>` instead of `Result<()>`. Same semantic-change-only detection (exclude `updated_at`).

**4. Sweep function for issue notes:**

```rust
fn sweep_stale_issue_notes(conn: &Connection, discussion_id: i64, last_seen_at: i64) -> Result<()> {
    conn.execute(
        "DELETE FROM notes WHERE discussion_id = ?
AND last_seen_at < ?", rusqlite::params![discussion_id, last_seen_at], )?; Ok(()) } ``` --- ### Work Chunk 0B: Immediate Deletion Propagation **Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs` **Depends on:** Work Chunk 0A (uses sweep functions from 0A), Work Chunk 2A (documents table must accept `source_type='note'`) **Context:** When sweep deletes stale notes, the current plan relies on eventual cleanup via `generate-docs --full` for orphaned note documents. This creates a window where deleted notes still appear in search results, eroding trust in the dataset. Instead, propagate deletion to documents immediately in the same transaction. #### Tests to Write First ```rust #[test] fn test_issue_note_sweep_deletes_note_documents_immediately() { // Setup: project, issue, discussion, 3 non-system notes // Generate note documents for all 3 // Re-sync with only 2 of the 3 notes (different last_seen_at) // Run sweep // Assert: stale note row is deleted // Assert: stale note's document is deleted from documents table // Assert: stale note's dirty_sources entry (if any) is deleted // Assert: remaining 2 notes' documents are untouched } #[test] fn test_mr_note_sweep_deletes_note_documents_immediately() { // Same pattern as above but for MR discussion notes } #[test] fn test_sweep_deletion_handles_note_without_document() { // Setup: note exists but was never turned into a document (e.g., system note) // Sweep deletes the note // Assert: no error (DELETE WHERE on non-existent document is a no-op) } #[test] fn test_set_based_deletion_atomicity() { // Setup: project, issue, discussion, 5 non-system notes with documents // Mark 3 as stale (different last_seen_at) // Run sweep // Assert: exactly 3 note rows deleted // Assert: exactly 3 documents deleted // Assert: exactly 3 dirty_sources entries deleted (if any existed) // Assert: remaining 2 note rows, documents, and dirty_sources untouched } ``` #### Implementation Update both sweep functions to 
propagate deletion to documents and dirty_sources using **set-based SQL** (not per-note loops). This is both faster on large threads and simpler to reason about atomically:

```rust
fn sweep_stale_issue_notes(conn: &Connection, discussion_id: i64, last_seen_at: i64) -> Result<()> {
    // Set-based: identify stale non-system note IDs, delete their documents
    // and dirty_sources entries, then delete the note rows themselves.
    // All in one transaction scope; no per-note loop needed.

    // Step 1: Delete documents for stale non-system notes (cascades to
    // document_labels and document_paths via ON DELETE CASCADE;
    // FTS trigger documents_ad auto-removes FTS entry)
    conn.execute(
        "DELETE FROM documents
         WHERE source_type = 'note' AND source_id IN (
             SELECT id FROM notes
             WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0
         )",
        rusqlite::params![discussion_id, last_seen_at],
    )?;

    // Step 2: Delete dirty_sources entries for stale non-system notes
    conn.execute(
        "DELETE FROM dirty_sources
         WHERE source_type = 'note' AND source_id IN (
             SELECT id FROM notes
             WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0
         )",
        rusqlite::params![discussion_id, last_seen_at],
    )?;

    // Step 3: Delete all stale note rows (system and non-system)
    conn.execute(
        "DELETE FROM notes WHERE discussion_id = ? AND last_seen_at < ?",
        rusqlite::params![discussion_id, last_seen_at],
    )?;

    Ok(())
}
```

Same pattern for `sweep_stale_notes()` in `src/ingestion/mr_discussions.rs`.

**Note:** The document DELETE cascades to `document_labels` and `document_paths` via ON DELETE CASCADE. The FTS trigger (`documents_ad`) automatically removes the FTS entry. No additional cleanup needed.

**Why set-based:** The subquery `SELECT id FROM notes WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0` runs once per step, filtering on `discussion_id` and `last_seen_at`. (The UNIQUE(gitlab_id) index does not help here; an index on `notes.discussion_id`, if one exists, is what keeps this lookup cheap.) This is O(1) SQL statements regardless of how many stale notes exist, vs O(N) individual DELETE statements in a loop.
On large threads (100+ notes), this is measurably faster and avoids the risk of partial completion if the loop is interrupted. **Defense-in-depth:** Work Chunk 2A's migration also creates DB-level cleanup triggers (`notes_ad_cleanup`, `notes_au_system_cleanup`) that fire on ANY note deletion/system-flip, not just sweep. The sweep functions handle the common path with explicit set-based SQL; the triggers are a safety net for any future code path that deletes notes outside the sweep functions. Both mechanisms coexist — the explicit SQL in sweep is preferred (clearer intent, predictable cost), and the triggers catch edge cases. --- ### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection) **Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs` **Depends on:** Work Chunk 0A (modifies the sweep call site from 0A) **Context:** The sweep-based deletion pattern (delete notes where `last_seen_at < run_seen_at`) is correct only when a discussion's notes were fully fetched from GitLab. If a page fails mid-fetch (network timeout, rate limit, partial API response), the current logic would incorrectly delete valid notes that simply weren't seen during the incomplete fetch. This is especially dangerous for long threads with many notes that span multiple API pages. 
#### Tests to Write First ```rust #[test] fn test_partial_fetch_does_not_sweep_notes() { // Setup: project, issue, discussion, 5 notes already in DB // Simulate a partial fetch: only 2 of 5 notes returned // (set last_seen_at for 2 notes to current run, 3 to previous run) // Call the ingestion function with fetch_complete = false // Assert: all 5 notes still exist (sweep was skipped) // Assert: the 2 re-synced notes have updated last_seen_at } #[test] fn test_complete_fetch_runs_sweep_normally() { // Setup: project, issue, discussion, 5 notes // Simulate a complete fetch: all 5 notes returned // Call the ingestion function with fetch_complete = true // Assert: sweep runs normally (no stale notes in this case) } #[test] fn test_partial_fetch_then_complete_fetch_cleans_up() { // Setup: project, issue, discussion, 5 notes // First sync: partial fetch (3 of 5), sweep skipped // Second sync: complete fetch (only 3 notes exist on GitLab now) // Assert: sweep runs and removes the 2 notes no longer on GitLab } ``` #### Implementation Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully: ```rust // In the discussion ingestion loop, after upserting all notes: if fetch_complete { sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?; } else { tracing::warn!( discussion_id = local_discussion_id, "Skipping stale note sweep due to partial/incomplete fetch" ); } ``` **Determining `fetch_complete`:** The discussion notes come from the GitLab API response. When the API returns all notes for a discussion in a single response (no pagination error, no timeout), `fetch_complete = true`. When the fetch encounters a network error, rate limit, or is interrupted, `fetch_complete = false`. The exact signaling mechanism depends on how the existing ingestion pipeline handles partial API responses — look at the MR discussion ingestion path for the existing pattern. 
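One way the `fetch_complete` signal could be derived is by folding over per-page results and stopping at the first failure. The sketch below is an assumption about shape, not the existing pipeline's mechanism: `FetchError`, `collect_pages`, and `apply_sync` are hypothetical names, and note ids stand in for full note payloads.

```rust
/// Hypothetical error type for a failed page fetch.
#[derive(Debug)]
enum FetchError {
    Timeout,
    RateLimited,
}

/// Drains pages until they are exhausted or the first error occurs.
/// Returns the note ids gathered so far plus a `fetch_complete` flag.
fn collect_pages<I>(pages: I) -> (Vec<i64>, bool)
where
    I: IntoIterator<Item = Result<Vec<i64>, FetchError>>,
{
    let mut seen = Vec::new();
    for page in pages {
        match page {
            Ok(ids) => seen.extend(ids),
            // Any failed page means the thread was only partially observed.
            Err(_) => return (seen, false),
        }
    }
    (seen, true)
}

/// Models the guarded sync: upsert everything that was seen, but only
/// sweep unseen notes when the fetch is known to be complete.
fn apply_sync(existing: &[i64], seen: &[i64], fetch_complete: bool) -> Vec<i64> {
    let mut stored: Vec<i64> = existing.to_vec();
    for id in seen {
        if !stored.contains(id) {
            stored.push(*id);
        }
    }
    if fetch_complete {
        stored.retain(|id| seen.contains(id));
    }
    stored
}
```

With a timeout on the second page, `collect_pages` reports `fetch_complete = false` and `apply_sync` leaves all existing notes in place. With every page successful, notes absent from the response are swept.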
**Note:** This is a safety guard, not a completeness guarantee. The sweep will still run on the next successful full fetch. The guard prevents data loss during transient failures, not during permanent API changes.

---

### Work Chunk 0D: Immutable Author Identity Capture

**Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`

**Depends on:** Work Chunk 0A (modifies the upsert functions from 0A)

**Context:** The core use case is year-scale reviewer profiling ("search through jdefting's code review comments over the past year"). GitLab usernames are mutable: a user can change their username at any time. If a reviewer changes their username from `jdefting` to `jd-engineering` mid-year, author-based queries fragment their identity into two separate result sets. The `notes` table already captures `author_username` from the API response, but this only reflects the username at ingestion time. GitLab note payloads include `note.author.id` (an immutable integer). Capturing this alongside the username provides a stable identity anchor for longitudinal analysis, even across username changes.

**Scope:** This chunk adds the column and populates it during ingestion. A `--author-id` CLI filter for `lore notes` is wired up in Phase 1 (Work Chunk 1A/1B) to make the immutable identity immediately usable for the core longitudinal analysis use case. The value here is data capture and query foundation: an `author_id` that is not captured at ingestion cannot be backfilled later from local data, so it has to be captured now.
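The fragmentation problem can be illustrated with a small std-only sketch. The `Note` struct and both helper functions are hypothetical and exist only to contrast a username match with an `author_id` match; they are not part of the planned query layer.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down note record for illustration.
struct Note {
    author_id: Option<i64>,
    author_username: &'static str,
    body: &'static str,
}

/// Groups note bodies by immutable `author_id`; notes without an id are skipped.
fn bodies_by_author_id(notes: &[Note]) -> BTreeMap<i64, Vec<&str>> {
    let mut out: BTreeMap<i64, Vec<&str>> = BTreeMap::new();
    for note in notes {
        if let Some(id) = note.author_id {
            out.entry(id).or_default().push(note.body);
        }
    }
    out
}

/// Username-based lookup, shown for contrast: it only sees one "era" of a renamed user.
fn bodies_by_username<'a>(notes: &'a [Note], username: &str) -> Vec<&'a str> {
    notes
        .iter()
        .filter(|n| n.author_username.eq_ignore_ascii_case(username))
        .map(|n| n.body)
        .collect()
}
```

For a reviewer who was renamed from `jdefting` to `jd-engineering`, the username lookup returns only the notes written under one name, while the `author_id` grouping returns both.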
#### Tests to Write First ```rust #[test] fn test_issue_note_upsert_captures_author_id() { // Insert a note with author_id = 12345 // Assert: notes.author_id == 12345 // Assert: notes.author_username == "jdefting" } #[test] fn test_mr_note_upsert_captures_author_id() { // Same pattern for MR notes } #[test] fn test_note_upsert_author_id_nullable() { // Insert a note with author_id = None (older API responses may lack this) // Assert: notes.author_id IS NULL // Assert: no error (column is nullable) } #[test] fn test_note_author_id_survives_username_change() { // Insert note with author_username = "jdefting", author_id = 12345 // Re-upsert same gitlab_id with author_username = "jd-engineering", author_id = 12345 // Assert: author_id unchanged (12345) // Assert: author_username updated to "jd-engineering" // Assert: changed_semantics = true (username is embedded in document content/title) } ``` #### Implementation **1. Migration** — Add `author_id` column to `notes` table. This goes in migration 022 (combined with the query index migration from Work Chunk 1E to avoid an extra migration): Add to the query index migration SQL: ```sql -- Add immutable author identity column (nullable for backcompat with pre-existing notes) ALTER TABLE notes ADD COLUMN author_id INTEGER; -- Composite index for author_id lookups — used by `lore notes --author-id` -- for immutable identity queries. Includes project_id and created_at for -- the common "all notes by this person in this project" pattern. CREATE INDEX IF NOT EXISTS idx_notes_project_author_id_created ON notes(project_id, author_id, created_at DESC, id DESC) WHERE is_system = 0 AND author_id IS NOT NULL; ``` **2. Populate `author_id` during upsert** — In both `upsert_note_for_issue()` (discussions.rs) and `upsert_note()` (mr_discussions.rs), add `author_id` to the INSERT and ON CONFLICT DO UPDATE SET clauses. Extract from the GitLab API note payload's `author.id` field. **3. 
Semantic change detection** — `author_id` changes should NOT trigger `changed_semantics = true`. The `author_id` is an identity anchor, not a content field. It's excluded from the semantic change comparison alongside `updated_at` and `last_seen_at`. However, `author_username` changes DO trigger `changed_semantics = true` because the username appears in document content and title (see Work Chunk 0A semantic detection). **4. Note document extraction** — Work Chunk 2C's `extract_note_document()` function includes both `author_username` (in the document content header and title) and `author_id` (in the metadata header). The `author_id` field enables downstream tools to reliably identify the same person even after username changes. --- ## Phase 1: `lore notes` Command ### Work Chunk 1A: Data Types & Query Layer **Files:** `src/cli/commands/list.rs` **Context:** This file contains `IssueListRow`, `MrListRow`, their JSON counterparts, `ListFilters`, `MrListFilters`, and the `query_issues()`/`query_mrs()` functions. The new code follows these exact patterns. **Depends on:** Nothing (standalone) #### Tests to Write First Add to `src/cli/commands/list.rs` in the `#[cfg(test)] mod tests` block. The test DB setup requires the `notes` and `discussions` tables — reuse the patterns from `src/documents/extractor.rs::setup_discussion_test_db()`. 
```rust
// Test helper: create in-memory DB with projects, issues, MRs, discussions, notes tables
// Pattern: same as extractor.rs::setup_discussion_test_db() but also include
// merge_requests, mr_labels, issue_labels, labels tables

#[test]
fn test_query_notes_empty_db() {
    // Setup DB with no notes
    // Call query_notes with default NoteListFilters
    // Assert: total_count == 0, notes.is_empty()
}

#[test]
fn test_query_notes_returns_user_notes_only() {
    // Insert 2 user notes and 1 system note into same discussion
    // Call query_notes with default filters (include_system = false by default)
    // Assert: returns 2 notes, system note excluded
}

#[test]
fn test_query_notes_include_system() {
    // Insert 2 user notes and 1 system note
    // Call query_notes with include_system = true
    // Assert: returns 3 notes
}

#[test]
fn test_query_notes_filter_author() {
    // Insert notes from "alice" and "bob"
    // Call query_notes with author = Some("alice")
    // Assert: only alice's notes returned
}

#[test]
fn test_query_notes_filter_author_strips_at() {
    // Call query_notes with author = Some("@alice")
    // Assert: still matches "alice" (@ prefix stripped)
}

#[test]
fn test_query_notes_filter_author_case_insensitive() {
    // Insert note from "Alice" (capital A)
    // Call query_notes with author = Some("alice")
    // Assert: matches (COLLATE NOCASE)
}

#[test]
fn test_query_notes_filter_author_id() {
    // Insert notes from author_id = 100 (username "alice") and author_id = 200 (username "bob")
    // Call query_notes with author_id = Some(100)
    // Assert: only alice's notes returned (by immutable identity)
}

#[test]
fn test_query_notes_filter_author_id_and_author_combined() {
    // Insert notes from author_id=100/username="alice" and author_id=100/username="alice-renamed"
    // Call query_notes with author_id = Some(100), author = Some("alice")
    // Assert: only notes where BOTH match (AND semantics): returns alice's notes before rename
}

#[test]
fn test_query_notes_filter_note_type() {
    // Insert notes with note_type =
Some("DiffNote") and Some("DiscussionNote") and None // Call query_notes with note_type = Some("DiffNote") // Assert: only DiffNote notes returned } #[test] fn test_query_notes_filter_project() { // Insert 2 projects, notes in each // Call query_notes with project = Some("group/project-one") // Assert: only project-one notes returned (uses resolve_project()) } #[test] fn test_query_notes_filter_project_uses_default() { // Insert 2 projects, notes in each // Call query_notes with project = None, config.default_project = Some("group/project-one") // Assert: only project-one notes returned when for_issue_iid or for_mr_iid is set } #[test] fn test_query_notes_filter_since() { // Insert notes at created_at = 1000, 2000, 3000 // Call with since cutoff that excludes the first // Assert: only notes after cutoff returned } #[test] fn test_query_notes_filter_until() { // Insert notes at created_at = 1000, 2000, 3000 // Call with until cutoff that excludes the last // Assert: only notes before cutoff returned } #[test] fn test_query_notes_filter_since_and_until_combined() { // Insert notes at created_at = 1000, 2000, 3000, 4000 // Call with since=1500, until=3500 // Assert: only notes at 2000 and 3000 returned } #[test] fn test_query_notes_invalid_time_window_rejected() { // Call with since pointing to a time AFTER until // (e.g., since = "30d", until = "90d" — 30 days ago is after 90 days ago) // Assert: returns a clear error, not an empty result set } #[test] fn test_query_notes_until_date_uses_end_of_day() { // Insert notes at various times on 2025-06-15 // Call with until = "2025-06-15" // Assert: all notes on that day are included (end-of-day, not start-of-day) } #[test] fn test_query_notes_filter_contains() { // Insert notes with body "This function is too complex" and "LGTM" // Call with contains = Some("complex") // Assert: only the first note returned } #[test] fn test_query_notes_filter_contains_case_insensitive() { // Insert note with body "Use EXPECT instead of 
unwrap" // Call with contains = Some("expect") // Assert: matches (COLLATE NOCASE) } #[test] fn test_query_notes_filter_contains_escapes_like_wildcards() { // Insert notes with body "100% coverage" and "100 tests" // Call with contains = Some("100%") // Assert: only "100% coverage" returned (% is literal, not wildcard) } #[test] fn test_query_notes_filter_path() { // Insert DiffNotes on "src/auth.rs" and "src/config.rs" // Call with path = Some("src/auth.rs") // Assert: only auth.rs notes returned } #[test] fn test_query_notes_filter_path_prefix() { // Insert DiffNotes on "src/auth/login.rs" and "test/auth_test.rs" // Call with path = Some("src/") (trailing slash = prefix) // Assert: only src/ notes returned } #[test] fn test_query_notes_filter_for_issue_requires_project() { // Insert issue with iid=42 in project-one, same iid=42 in project-two // Call with for_issue_iid = Some(42), project = Some("group/project-one") // Assert: only notes from project-one's issue #42 } #[test] fn test_query_notes_filter_for_mr_requires_project() { // Insert MR with iid=10 in project-one, same iid=10 in project-two // Call with for_mr_iid = Some(10), project = Some("group/project-one") // Assert: only notes from project-one's MR !10 } #[test] fn test_query_notes_filter_for_issue_uses_default_project() { // Insert issue with iid=42 in project-one // Call with for_issue_iid = Some(42), project = None, config.default_project = Some("group/project-one") // Assert: resolves via defaultProject fallback — returns notes from project-one's issue #42 } #[test] fn test_query_notes_filter_for_mr_uses_default_project() { // Insert MR with iid=10 in project-one // Call with for_mr_iid = Some(10), project = None, config.default_project = Some("group/project-one") // Assert: resolves via defaultProject fallback } #[test] fn test_query_notes_filter_for_issue_without_project_context_errors() { // Call with for_issue_iid = Some(42), project = None, no defaultProject // Assert: returns error (IID 
requires project context) } #[test] fn test_query_notes_filter_resolution_unresolved() { // Insert 2 notes: one with resolvable=1,resolved=0 and one with resolvable=1,resolved=1 // Call with resolution = "unresolved" // Assert: only the unresolved note returned } #[test] fn test_query_notes_filter_resolution_resolved() { // Same setup as above // Call with resolution = "resolved" // Assert: only the resolved note returned } #[test] fn test_query_notes_filter_resolution_any() { // Same setup as above // Call with resolution = "any" (default) // Assert: both notes returned } #[test] fn test_query_notes_sort_created_desc() { // Insert notes with created_at = 1000, 3000, 2000 // Call with sort = "created", order = "desc" // Assert: notes ordered 3000, 2000, 1000 } #[test] fn test_query_notes_sort_created_asc() { // Same data, order = "asc" // Assert: ordered 1000, 2000, 3000 } #[test] fn test_query_notes_deterministic_tiebreak() { // Insert 3 notes with identical created_at timestamps // Call twice with same sort/order // Assert: order is identical both times (n.id tiebreak) } #[test] fn test_query_notes_limit() { // Insert 10 notes // Call with limit = 3 // Assert: notes.len() == 3, total_count == 10 } #[test] fn test_query_notes_combined_filters() { // Insert notes from multiple authors, types, projects, paths // Call with author + note_type + project + since combined // Assert: intersection of all filters } #[test] fn test_query_notes_filter_note_id_exact() { // Insert 3 notes with known local IDs // Call with note_id = Some(2) // Assert: only the note with local id 2 returned } #[test] fn test_query_notes_filter_gitlab_note_id_exact() { // Insert notes with gitlab_id = 12345 and gitlab_id = 67890 // Call with gitlab_note_id = Some(12345) // Assert: only the note with gitlab_id 12345 returned } #[test] fn test_query_notes_filter_discussion_id_exact() { // Insert 2 discussions, each with 2 notes // Call with discussion_id = Some(1) // Assert: only notes from 
    // discussion 1 returned
}

#[test]
fn test_note_list_row_json_conversion() {
    // Create NoteListRow with known ms timestamps
    // Convert to NoteListRowJson
    // Assert: created_at_iso and updated_at_iso are correct ISO strings
    // Assert: all fields carry over
}
```

#### Implementation

**Data structures** (add in `src/cli/commands/list.rs` after `MrListResultJson`):

```rust
// NoteListRow: raw query result, ms timestamps
pub struct NoteListRow {
    pub id: i64,                           // notes.id (local)
    pub gitlab_id: i64,                    // notes.gitlab_id
    pub author_username: Option<String>,
    pub body: Option<String>,
    pub note_type: Option<String>,         // "DiffNote" | "DiscussionNote" | null
    pub is_system: bool,
    pub created_at: i64,
    pub updated_at: i64,
    pub position_new_path: Option<String>,
    pub position_new_line: Option<i64>,
    pub position_old_path: Option<String>,
    pub position_old_line: Option<i64>,
    pub resolvable: bool,
    pub resolved: bool,
    pub resolved_by: Option<String>,
    pub noteable_type: String,             // "Issue" | "MergeRequest"
    pub parent_iid: i64,                   // parent issue/MR iid
    pub parent_title: Option<String>,
    pub project_path: String,
}

// NoteListRowJson: ISO timestamps, serde for JSON output
pub struct NoteListRowJson { ... } // with created_at_iso, updated_at_iso

impl From<&NoteListRow> for NoteListRowJson { ... }

// NoteListResult
pub struct NoteListResult {
    pub notes: Vec<NoteListRow>,
    pub total_count: usize,
}

// NoteListResultJson
pub struct NoteListResultJson {
    pub notes: Vec<NoteListRowJson>,
    pub total_count: usize,
    pub showing: usize,
}

impl From<&NoteListResult> for NoteListResultJson { ...
}
```

**Filter struct:**

```rust
pub struct NoteListFilters<'a> {
    pub limit: usize,
    pub project: Option<&'a str>,
    pub author: Option<&'a str>,       // username filter, case-insensitive via COLLATE NOCASE
    pub author_id: Option<i64>,        // immutable identity filter (exact match)
    pub note_type: Option<&'a str>,    // "DiffNote" | "DiscussionNote"
    pub include_system: bool,          // default false
    pub for_issue_iid: Option<i64>,    // filter by parent issue iid
    pub for_mr_iid: Option<i64>,       // filter by parent MR iid
    pub note_id: Option<i64>,          // filter by local note row id (exact)
    pub gitlab_note_id: Option<i64>,   // filter by GitLab note id (exact)
    pub discussion_id: Option<i64>,    // filter by local discussion id (exact)
    pub since: Option<&'a str>,
    pub until: Option<&'a str>,        // end of time window
    pub path: Option<&'a str>,         // exact or prefix (trailing /)
    pub contains: Option<&'a str>,     // case-insensitive body substring match
    pub resolution: &'a str,           // "any" (default) | "unresolved" | "resolved"
    pub sort: &'a str,                 // "created" (default) | "updated"
    pub order: &'a str,                // "desc" (default) | "asc"
}
```

**Query function** `query_notes(conn, filters) -> Result<NoteListResult>`:

**Time window parsing:** Parse `since` and `until` relative to a single anchored `now_ms` captured once at the start of `query_notes()`. This prevents subtle drift if parsing happens at different times (e.g., across a midnight boundary). When `--until` is a date string (`YYYY-MM-DD`), interpret it as end-of-day (`23:59:59.999 UTC`) so that `--until 2025-06-15` includes every event on June 15th rather than cutting off at the start of that day.
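The end-of-day rule for `--until` can be sketched in std-only Rust. Everything here is an assumption for illustration: the function names (`sketch_parse_since`, `sketch_parse_until`) are hypothetical, the sketch returns `Option` rather than the project's error type, it treats `1m` as 30 days, and it does not reject impossible dates such as February 30th. The day-count helper uses the well-known civil-date-to-epoch-days arithmetic.

```rust
const MS_PER_DAY: i64 = 86_400_000;

/// Days since 1970-01-01 (UTC) for a proleptic Gregorian calendar date.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y }; // treat Jan/Feb as end of previous year
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (m + 9) % 12; // month index with March = 0
    let doy = (153 * mp + 2) / 5 + d - 1; // day of (March-based) year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parses `YYYY-MM-DD` into start-of-day epoch milliseconds (UTC).
fn parse_ymd_start_ms(s: &str) -> Option<i64> {
    let mut parts = s.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d) * MS_PER_DAY)
}

/// Parses relative inputs such as `7d`, `2w`, `1m` against a fixed anchor.
fn parse_relative_ms(s: &str, now_ms: i64) -> Option<i64> {
    let (num, days_per_unit) = if let Some(n) = s.strip_suffix('d') {
        (n, 1)
    } else if let Some(n) = s.strip_suffix('w') {
        (n, 7)
    } else if let Some(n) = s.strip_suffix('m') {
        (n, 30) // assumption: a "month" is 30 days
    } else {
        return None;
    };
    let n: u32 = num.parse().ok()?;
    Some(now_ms - i64::from(n) * days_per_unit * MS_PER_DAY)
}

/// Lower bound: a date means start-of-day.
fn sketch_parse_since(s: &str, now_ms: i64) -> Option<i64> {
    parse_relative_ms(s, now_ms).or_else(|| parse_ymd_start_ms(s))
}

/// Upper bound: a date means end-of-day (23:59:59.999 UTC); relative inputs are unchanged.
fn sketch_parse_until(s: &str, now_ms: i64) -> Option<i64> {
    parse_relative_ms(s, now_ms)
        .or_else(|| parse_ymd_start_ms(s).map(|start| start + MS_PER_DAY - 1))
}
```

Because both functions take the same `now_ms`, a relative input such as `7d` resolves to the same instant whether it is used as a lower or an upper bound.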
After both values are parsed, validate `since_ms <= until_ms` — if the window is inverted (e.g., `--since 30d --until 90d`, which means "30 days ago to 90 days ago" — an inverted range), return a clear error: ```rust let now_ms = Utc::now().timestamp_millis(); let since_ms = filters.since.map(|s| parse_since_with_anchor(s, now_ms)).transpose()?; let until_ms = filters.until.map(|s| parse_until_with_anchor(s, now_ms)).transpose()?; if let (Some(s), Some(u)) = (since_ms, until_ms) { if s > u { return Err(LoreError::Usage(format!( "Invalid time window: --since ({}) is after --until ({}). \ Did you mean --since {} --until {}?", format_iso(s), format_iso(u), filters.until.unwrap(), filters.since.unwrap(), ))); } } ``` **`parse_until_with_anchor`** differs from `parse_since_with_anchor` in one way: when the input is a `YYYY-MM-DD` date, it returns end-of-day (23:59:59.999 UTC) instead of start-of-day. For relative formats (`7d`, `2w`, `1m`), it behaves identically to `parse_since_with_anchor`. Core SQL shape: ```sql SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type, n.is_system, n.created_at, n.updated_at, n.position_new_path, n.position_new_line, n.position_old_path, n.position_old_line, n.resolvable, n.resolved, n.resolved_by, d.noteable_type, COALESCE(i.iid, m.iid) AS parent_iid, COALESCE(i.title, m.title) AS parent_title, p.path_with_namespace FROM notes n JOIN discussions d ON n.discussion_id = d.id JOIN projects p ON n.project_id = p.id LEFT JOIN issues i ON d.issue_id = i.id LEFT JOIN merge_requests m ON d.merge_request_id = m.id WHERE {dynamic_filters} ORDER BY {sort_column} {order}, n.id {order} LIMIT ? ``` **Important:** The `ORDER BY` includes `n.id` as a deterministic tiebreaker. Notes with identical timestamps will always sort in the same order. This follows SQLite best practice for reproducible result sets. Dynamic WHERE clauses follow the same `where_clauses` + `params` vec pattern as `query_issues()` (see `list.rs:287-374`). 
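The `where_clauses` + `params` pattern referenced above can be sketched as follows. This is a reduced illustration under stated assumptions: `FiltersSketch`, `SqlParam`, `build_where`, and `escape_like_sketch` are hypothetical names, only a handful of filters are shown, and the real code would bind values through the database driver instead of a local enum.

```rust
/// A bound SQL parameter (stand-in for whatever the real query layer binds).
#[derive(Debug, Clone, PartialEq)]
enum SqlParam {
    Int(i64),
    Text(String),
}

/// Reduced, hypothetical version of `NoteListFilters` with a few fields only.
struct FiltersSketch<'a> {
    author: Option<&'a str>,
    author_id: Option<i64>,
    include_system: bool,
    path: Option<&'a str>,
    resolution: &'a str,
}

/// Escape LIKE wildcards so user input matches literally
/// (assumed to behave like the existing `escape_like` helper).
fn escape_like_sketch(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if matches!(c, '\\' | '%' | '_') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Build the WHERE fragment and its positional parameters in lockstep,
/// so the n-th `?` always corresponds to the n-th entry in `params`.
fn build_where(f: &FiltersSketch) -> (String, Vec<SqlParam>) {
    let mut where_clauses: Vec<&str> = Vec::new();
    let mut params: Vec<SqlParam> = Vec::new();

    if !f.include_system {
        where_clauses.push("n.is_system = 0");
    }
    if let Some(author) = f.author {
        let name = author.strip_prefix('@').unwrap_or(author);
        where_clauses.push("n.author_username = ? COLLATE NOCASE");
        params.push(SqlParam::Text(name.to_string()));
    }
    if let Some(id) = f.author_id {
        where_clauses.push("n.author_id = ?");
        params.push(SqlParam::Int(id));
    }
    if let Some(path) = f.path {
        if path.ends_with('/') {
            // Prefix match: escaped prefix followed by a single wildcard.
            where_clauses.push("n.position_new_path LIKE ? ESCAPE '\\'");
            params.push(SqlParam::Text(format!("{}%", escape_like_sketch(path))));
        } else {
            where_clauses.push("n.position_new_path = ?");
            params.push(SqlParam::Text(path.to_string()));
        }
    }
    match f.resolution {
        "unresolved" => where_clauses.push("n.resolvable = 1 AND n.resolved = 0"),
        "resolved" => where_clauses.push("n.resolvable = 1 AND n.resolved = 1"),
        _ => {} // "any": no filter
    }

    let sql = if where_clauses.is_empty() {
        String::new()
    } else {
        format!("WHERE {}", where_clauses.join(" AND "))
    };
    (sql, params)
}
```

Keeping the clause and its parameter in the same `if` branch is what makes the pattern safe to extend: a new filter cannot shift the positions of existing placeholders.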
Filter mappings:

- `include_system = false` (default): `n.is_system = 0`
- `author`: strip `@` prefix, `n.author_username = ? COLLATE NOCASE`
- `author_id`: `n.author_id = ?` (exact immutable identity match). If both `author` and `author_id` are provided, both are applied (AND) for precision — this lets users query "notes by user 12345 when they were known as jdefting"
- `note_type`: `n.note_type = ?`
- `project`: `resolve_project(conn, project)?` then `n.project_id = ?`
- `note_id`: `n.id = ?` (exact local row ID match — useful for debugging sync correctness)
- `gitlab_note_id`: `n.gitlab_id = ?` (exact GitLab note ID match — cross-reference with GitLab API)
- `discussion_id`: `n.discussion_id = ?` (all notes in a specific discussion thread)
- `since`: parsed via `parse_since_with_anchor(since_str, now_ms)` then `n.created_at >= ?`
- `until`: parsed via `parse_until_with_anchor(until_str, now_ms)` then `n.created_at <= ?`
- `path` with trailing `/`: `n.position_new_path LIKE ? ESCAPE '\'`, binding the escaped prefix followed by `%` (use `escape_like` from `filters.rs`)
- `path` without trailing `/`: `n.position_new_path = ?`
- `contains`: `n.body LIKE ? ESCAPE '\'`, binding `%` + escaped text + `%` (use `escape_like`; SQLite's `LIKE` is case-insensitive for ASCII characters by default)
- `resolution = "unresolved"`: `n.resolvable = 1 AND n.resolved = 0`
- `resolution = "resolved"`: `n.resolvable = 1 AND n.resolved = 1`
- `resolution = "any"`: no filter (default)
- `for_issue_iid`: requires resolved project_id (from `--project` flag or `defaultProject` config). SQL: `d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)` — the project_id param comes from the already-resolved project context
- `for_mr_iid`: same pattern — `d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?)` — requires resolved project_id

**IID scoping rule:** `for_issue_iid` and `for_mr_iid` require a project context because IIDs are only unique within a project. The query layer validates this: if `for_issue_iid` or `for_mr_iid` is set without a resolved project_id, return an error.
The project can come from either the `--project` flag or `defaultProject` in config (resolved via the existing `resolve_project()`, which already handles the `defaultProject` fallback). Note: the CLI does NOT use clap's `requires = "project"` constraint for these flags, because that would block `defaultProject` resolution — the validation happens at the query layer instead.

The query runs a COUNT first (same pattern as issues), then the SELECT with LIMIT.

**Public entry points:**

```rust
/// Buffered query — materializes full result set. Used by table and JSON output.
pub fn run_list_notes(config: &Config, filters: NoteListFilters) -> Result<NoteListResult> {
    let db_path = get_db_path(config.storage.db_path.as_deref());
    let conn = create_connection(&db_path)?;
    query_notes(&conn, &filters)
}

/// Streaming query — calls row_handler for each row without full materialization.
/// Used by JSONL and CSV output (Work Chunk 1C). Skips COUNT query.
pub fn query_notes_stream<F>(
    conn: &Connection,
    filters: &NoteListFilters,
    mut row_handler: F,
) -> Result<()>
where
    F: FnMut(NoteListRow) -> Result<()>,
{
    // Same SQL as query_notes() but iterates with Statement::query_map()
    // instead of collecting into Vec
}
```

---

### Work Chunk 1B: CLI Arguments & Command Wiring

**Files:** `src/cli/mod.rs`, `src/main.rs`, `src/cli/commands/mod.rs`, `src/cli/robot.rs`

**Depends on:** Work Chunk 1A

#### Tests to Write First

No unit tests for CLI arg parsing (clap handles this). Integration-level assertions:

```rust
// In src/cli/robot.rs tests (or new test module):
#[test]
fn test_expand_fields_preset_notes() {
    let fields = vec!["minimal".to_string()];
    let expanded = expand_fields_preset(&fields, "notes");
    assert_eq!(expanded, vec!["id", "author_username", "body", "created_at_iso"]);
}
```

#### Implementation

**1.
Add `NotesArgs` to `src/cli/mod.rs`** (after `MrsArgs`, around line 472):

```rust
#[derive(Parser)]
#[command(after_help = "\x1b[1mExamples:\x1b[0m
  lore notes --author jdefting --since 365d              # All of jdefting's notes in past year
  lore notes --author jdefting --note-type DiffNote      # Only code review comments
  lore notes --path src/auth/ --resolution unresolved    # Unresolved comments on auth code
  lore notes --for-mr 456 -p group/repo                  # All notes on MR !456
  lore notes --since 180d --until 90d                    # Notes from 180 to 90 days ago
  lore notes --author jdefting --format jsonl            # Stream notes for LLM analysis
  lore notes --contains \"unwrap\" --note-type DiffNote    # Find review comments mentioning unwrap")]
pub struct NotesArgs {
    /// Maximum results
    #[arg(short = 'n', long = "limit", default_value = "50", help_heading = "Output")]
    pub limit: usize,

    /// Select output fields (comma-separated, or 'minimal' preset)
    #[arg(long, help_heading = "Output", value_delimiter = ',')]
    pub fields: Option<Vec<String>>,

    /// Output format (table, json, jsonl, csv)
    #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
    pub format: String,

    /// Filter by author username (case-insensitive)
    #[arg(short = 'a', long, help_heading = "Filters")]
    pub author: Option<String>,

    /// Filter by immutable GitLab author id (stable across username changes)
    #[arg(long = "author-id", help_heading = "Filters")]
    pub author_id: Option<i64>,

    /// Filter by note type (DiffNote, DiscussionNote)
    #[arg(long = "note-type", value_parser = ["DiffNote", "DiscussionNote"], help_heading = "Filters")]
    pub note_type: Option<String>,

    /// Filter by case-insensitive substring in note body
    #[arg(long, help_heading = "Filters")]
    pub contains: Option<String>,

    /// Filter by local note row id (exact match, for debugging)
    #[arg(long = "note-id", help_heading = "Filters")]
    pub note_id: Option<i64>,

    /// Filter by GitLab note id (exact match, for cross-referencing)
    #[arg(long = "gitlab-note-id", help_heading = "Filters")]
    pub gitlab_note_id: Option<i64>,

    /// Filter by local discussion id (all notes in a thread)
    #[arg(long = "discussion-id", help_heading = "Filters")]
    pub discussion_id: Option<i64>,

    /// Include system-generated notes (excluded by default)
    #[arg(long = "include-system", help_heading = "Filters", overrides_with = "no_include_system")]
    pub include_system: bool,

    #[arg(long = "no-include-system", hide = true, overrides_with = "include_system")]
    pub no_include_system: bool,

    /// Filter to notes on a specific issue IID (requires --project or defaultProject)
    #[arg(long = "for-issue", help_heading = "Filters", conflicts_with = "for_mr")]
    pub for_issue: Option<i64>,

    /// Filter to notes on a specific MR IID (requires --project or defaultProject)
    #[arg(long = "for-mr", help_heading = "Filters", conflicts_with = "for_issue")]
    pub for_mr: Option<i64>,

    /// Filter by project path
    #[arg(short = 'p', long, help_heading = "Filters")]
    pub project: Option<String>,

    /// Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
    #[arg(long, help_heading = "Filters")]
    pub since: Option<String>,

    /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
    #[arg(long, help_heading = "Filters")]
    pub until: Option<String>,

    /// Filter by file path (trailing / for prefix match)
    #[arg(long, help_heading = "Filters")]
    pub path: Option<String>,

    /// Resolution filter: any (default), unresolved, resolved
    #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
    pub resolution: String,

    /// Sort field (created, updated)
    #[arg(long, value_parser = ["created", "updated"], default_value = "created", help_heading = "Sorting")]
    pub sort: String,

    /// Sort ascending (default: descending)
    #[arg(long, help_heading = "Sorting", overrides_with = "no_asc")]
    pub asc: bool,

    #[arg(long = "no-asc", hide = true, overrides_with = "asc")]
    pub no_asc: bool,
}
```

**Note on `--for-issue` / `--for-mr`:** These flags do NOT use clap's `requires = "project"` constraint.
The `defaultProject` config option provides the project context without the `--project` flag being explicitly passed. Validation happens at the query layer (Work Chunk 1A) — if neither `--project` nor `defaultProject` resolves a project, the query returns a clear error. **2. Add `Notes` variant to `Commands` enum** in `src/cli/mod.rs` (around line 113): ```rust /// List discussion notes with filtering Notes(NotesArgs), ``` **3. Add `"notes"` minimal preset to `expand_fields_preset()`** in `src/cli/robot.rs` (around line 42): ```rust "notes" => ["id", "author_username", "body", "created_at_iso"] .iter() .map(|s| (*s).to_string()) .collect(), ``` **4. Add handler in `src/main.rs`** (follow `handle_issues`/`handle_mrs` pattern): ```rust fn handle_notes(config_path: Option<&str>, args: NotesArgs, robot_mode: bool) -> Result<()> { let config = load_config(config_path)?; let start = std::time::Instant::now(); let filters = NoteListFilters { limit: args.limit, project: args.project.as_deref(), author: args.author.as_deref(), author_id: args.author_id, note_type: args.note_type.as_deref(), include_system: args.include_system, for_issue_iid: args.for_issue, for_mr_iid: args.for_mr, note_id: args.note_id, gitlab_note_id: args.gitlab_note_id, discussion_id: args.discussion_id, since: args.since.as_deref(), until: args.until.as_deref(), path: args.path.as_deref(), contains: args.contains.as_deref(), resolution: &args.resolution, sort: &args.sort, order: if args.asc { "asc" } else { "desc" }, }; // JSONL and CSV use streaming path (no full materialization in memory) // Table and JSON use buffered path (need total_count for envelope/summary) match (robot_mode, args.format.as_str()) { (_, "jsonl") => { let conn = open_db(&config)?; print_list_notes_jsonl_stream(&conn, &filters)?; } (_, "csv") => { let conn = open_db(&config)?; print_list_notes_csv_stream(&conn, &filters)?; } _ => { let result = run_list_notes(&config, filters)?; match (robot_mode, args.format.as_str()) { (true, 
_) | (_, "json") => {
                    print_list_notes_json(&result, start.elapsed().as_millis() as u64, args.fields.as_deref());
                }
                _ => {
                    print_list_notes(&result);
                }
            }
        }
    }

    Ok(())
}
```

Add dispatch in main match (around line 175):

```rust
Some(Commands::Notes(args)) => handle_notes(cli.config.as_deref(), args, robot_mode),
```

**5. Re-export in `src/cli/commands/mod.rs`:**

```rust
pub use list::{run_list_notes, query_notes_stream, print_list_notes, print_list_notes_json, print_list_notes_jsonl_stream, print_list_notes_csv_stream};
```

---

### Work Chunk 1C: Human & Robot Output Formatting

**Files:** `src/cli/commands/list.rs`

**Depends on:** Work Chunk 1A

#### Tests to Write First

```rust
#[test]
fn test_truncate_note_body() {
    // A 200-char body truncates to 80 chars total, including the trailing "..."
    let body = "x".repeat(200);
    let truncated = truncate_with_ellipsis(&body, 80);
    assert_eq!(truncated.len(), 80);
    assert!(truncated.ends_with("..."));
}

#[test]
fn test_csv_stream_output_roundtrip() {
    // Setup DB with notes containing commas, quotes, newlines, and multi-byte chars in body
    // Run print_list_notes_csv_stream, capture stdout, parse back with csv::ReaderBuilder
    // Assert: all fields roundtrip correctly
    // Assert: header row present
}

#[test]
fn test_jsonl_stream_output_one_per_line() {
    // Setup DB with 3 non-system notes
    // Run print_list_notes_jsonl_stream, capture stdout, split by newline
    // Assert: each line parses as valid JSON
    // Assert: 3 lines total (no envelope, no metadata line)
}

#[test]
fn test_streaming_matches_buffered_content() {
    // Setup DB with 5 non-system notes
    // Run buffered query_notes() and streaming query_notes_stream()
    // Assert: identical note data in same order (streaming omits total_count, but
    // the content of each row must match the buffered path)
}
```

#### Implementation

**`print_list_notes(result: &NoteListResult)`** — human-readable table:

Table columns: `ID | Author | Type | Body (truncated 60) | Path:Line | Parent | Created`

- ID: `colored_cell(note.gitlab_id, Color::Cyan)`
- Author: `colored_cell(format!("@{}", author), Color::Magenta)`
- Type: "Diff" or "Disc" or "-" (colored)
- Body: first line, truncated to 60 chars
- Path:Line: `position_new_path:position_new_line` or "-"
- Parent: `Issue #42` or `MR !456` (from noteable_type + parent_iid)
- Created: `format_relative_time(created_at)`

**`print_list_notes_json(result, elapsed_ms, fields)`** — robot JSON:

Follows exact envelope pattern:

```json
{
  "ok": true,
  "data": {
    "notes": [...],
    "total_count": N,
    "showing": M
  },
  "meta": { "elapsed_ms": U64 }
}
```

Supports `--fields` via `filter_fields(&mut output, "notes", &expanded)`.

**`print_list_notes_jsonl_stream` / `print_list_notes_csv_stream`** — streaming output:

For JSONL and CSV formats, use a **streaming path** that writes rows directly to stdout as they're read from the database, avoiding full materialization in memory. This matters for the year-long analysis use case where `--limit 10000` or higher is common, and for piped workflows where downstream consumers (jq, LLM ingestion) can begin processing before the query completes.

**`print_list_notes_jsonl_stream(conn, filters)`** — streaming JSONL:

```rust
// Execute query, iterate over rows with a callback
query_notes_stream(&conn, &filters, |row| {
    let json_row = NoteListRowJson::from(&row);
    println!("{}", serde_json::to_string(&json_row).unwrap());
    Ok(())
})?;
```

Each line is a complete `NoteListRowJson` object. No envelope, no metadata. This format is ideal for streaming into LLM prompts, `jq` pipelines, or notebook ingestion.
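The one-object-per-line guarantee depends on escaping newlines inside note bodies. The following standard-library-only sketch illustrates that property; `NoteLineSketch`, `json_string`, and `write_jsonl` are hypothetical names, the row carries only three fields, and the planned implementation would serialize `NoteListRowJson` with `serde_json` instead of hand-written escaping.

```rust
use std::io::{self, Write};

/// Reduced, hypothetical row with only the fields needed for the illustration.
struct NoteLineSketch {
    id: i64,
    author_username: Option<String>,
    body: Option<String>,
}

/// Encode a string as a JSON string literal, escaping quotes, backslashes,
/// and control characters so the result never contains a raw newline.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Encode an optional string as a JSON string or `null`.
fn json_opt(v: &Option<String>) -> String {
    v.as_deref().map(json_string).unwrap_or_else(|| "null".to_string())
}

/// Write one JSON object per row and return the number of rows written.
/// Rows are consumed lazily, so nothing is buffered beyond the current line.
fn write_jsonl<W, I>(out: &mut W, rows: I) -> io::Result<usize>
where
    W: Write,
    I: IntoIterator<Item = NoteLineSketch>,
{
    let mut written = 0;
    for row in rows {
        writeln!(
            out,
            "{{\"id\":{},\"author_username\":{},\"body\":{}}}",
            row.id,
            json_opt(&row.author_username),
            json_opt(&row.body)
        )?;
        written += 1;
    }
    out.flush()?;
    Ok(written)
}
```

Writing to a generic `Write` instead of calling `println!` also lets a test capture the output in a `Vec<u8>`, and lets a write error such as a closed pipe propagate as an `io::Error`.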
**`print_list_notes_csv_stream(conn, filters)`** — streaming CSV:

```rust
let mut wtr = csv::Writer::from_writer(std::io::stdout());
wtr.write_record(&["id", "gitlab_id", "author_username", "body", "note_type", ...])?;

query_notes_stream(&conn, &filters, |row| {
    let json_row = NoteListRowJson::from(&row);
    wtr.write_record(&[json_row.id.to_string(), ...])?;
    Ok(())
})?;
wtr.flush()?;
```

Columns mirror `NoteListRowJson` field names. Uses the `csv` crate (`csv::Writer`) for RFC 4180-compliant escaping, handling commas, quotes, newlines, and multi-byte characters correctly.

**`query_notes_stream(conn, filters, row_handler)`** — forward-only row iteration that calls `row_handler` for each row. Uses the same SQL as `query_notes()` but iterates with `rusqlite::Statement::query_map()` instead of collecting into a Vec. The table and JSON formats continue to use the buffered `query_notes()` path since they need `total_count` and `showing` metadata.

**Note:** The streaming path skips the COUNT query since there's no envelope to report `total_count` in. For JSONL, this is expected — consumers count lines themselves. For CSV, the header row provides column names; row count is implicit.

**Dependency:** Add `csv = "1"` to `Cargo.toml` under `[dependencies]`. The `csv` crate is well-maintained, widely adopted (~100M downloads), and has zero unsafe code.

---

### Work Chunk 1D: robot-docs Integration

**Files:** Wherever `robot-docs` manifest is generated (search for `robot-docs` or `RobotDocs` command handler)

**Depends on:** Work Chunks 1A-1C complete

Add the `notes` command to the robot-docs manifest with:

- Command name, description, flags (including `--format`, `--until`, `--resolution`, `--contains`)
- Response schema for robot mode
- Exit codes

Also update the `--type` value_parser on `SearchArgs` (line 542 of `src/cli/mod.rs`) to accept `"note"` and `"notes"` as source-type values (this is forward-prep for Phase 2 but doesn't break anything until Phase 2 lands).
--- ### Work Chunk 1E: Composite Query Index **Files:** `migrations/022_notes_query_index.sql`, `src/core/db.rs` **Depends on:** Nothing (standalone, can run in parallel with 1A) **Context:** The `notes` table already has single-column indexes on `author_username`, `discussion_id`, `note_type`, `position_new_path`, and a composite `idx_notes_diffnote_path_created`. However, the new `query_notes()` function's most common query patterns would benefit from composite covering indexes. #### Tests to Write First ```rust #[test] fn test_migration_022_indexes_exist() { let conn = create_connection(Path::new(":memory:")).unwrap(); run_migrations(&conn).unwrap(); // Verify all indexes were created let count: i64 = conn.query_row( "SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND name IN ( 'idx_notes_user_created', 'idx_notes_project_created', 'idx_notes_project_path_created', 'idx_discussions_issue_id', 'idx_discussions_mr_id' )", [], |r| r.get(0), ).unwrap(); assert_eq!(count, 5); } ``` #### Implementation **Migration SQL** (`migrations/022_notes_query_index.sql`): ```sql -- Composite index for the common "notes by author" query pattern: -- non-system notes filtered by author, sorted by created_at DESC with id tiebreaker. -- The is_system partial index condition avoids indexing system notes (which are -- filtered out by default and typically comprise 30-50% of all notes). -- Uses COLLATE NOCASE to match the query's case-insensitive author comparison. CREATE INDEX IF NOT EXISTS idx_notes_user_created ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC) WHERE is_system = 0; -- Composite index for the common "all notes in project by date" query pattern: -- serves project-scoped listings without author filter. CREATE INDEX IF NOT EXISTS idx_notes_project_created ON notes(project_id, created_at DESC, id DESC) WHERE is_system = 0; -- Composite index for path-centric note queries (--path with project/date filters). 
-- DiffNote reviews on specific files are a stated hot path for the reviewer -- profiling use case. Only indexes rows where position_new_path is populated. CREATE INDEX IF NOT EXISTS idx_notes_project_path_created ON notes(project_id, position_new_path, created_at DESC, id DESC) WHERE is_system = 0 AND position_new_path IS NOT NULL; -- Index on discussions.issue_id for efficient JOIN when filtering by parent issue. -- The query_notes() function JOINs discussions to reach parent entities. CREATE INDEX IF NOT EXISTS idx_discussions_issue_id ON discussions(issue_id); -- Index on discussions.merge_request_id for efficient JOIN when filtering by parent MR. CREATE INDEX IF NOT EXISTS idx_discussions_mr_id ON discussions(merge_request_id); ``` The first partial index serves the primary use case (author-scoped queries) with `COLLATE NOCASE` matching the query's case-insensitive author comparison. The second serves project-scoped date-range queries (`--since`/`--until` without `--author`). The third serves path-centric DiffNote queries (`--path src/auth/` combined with project and date filters). All three exclude system notes, which are filtered out by default. The discussion indexes accelerate the JOIN path used by all note queries. **Register in `src/core/db.rs`:** Add to the `MIGRATIONS` array (after migration 021): ```rust ( "022", include_str!("../../migrations/022_notes_query_index.sql"), ), ``` **Note:** This bumps the migration number, so Work Chunk 2A's schema migration (which was originally numbered 022) becomes migration **023** instead. --- ## Phase 2: Per-Note Documents ### Work Chunk 2A: Schema Migration (023) **Files:** `migrations/023_note_documents.sql`, `src/core/db.rs` **Depends on:** Work Chunk 1E (must come after migration 022) **Context:** Current migration is 021 (022 after Work Chunk 1E). The `documents` and `dirty_sources` tables have CHECK constraints limiting `source_type` to `('issue','merge_request','discussion')`. 
SQLite doesn't support `ALTER TABLE ... ALTER CONSTRAINT`, so we use the table-rebuild pattern. #### Tests to Write First ```rust // In src/core/db.rs tests or a new migration test: #[test] fn test_migration_023_allows_note_source_type() { let conn = create_connection(Path::new(":memory:")).unwrap(); run_migrations(&conn).unwrap(); // Should NOT error — note is now a valid source_type conn.execute( "INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('note', 1, 1000)", [], ).unwrap(); conn.execute( "INSERT INTO documents (source_type, source_id, project_id, content_text, content_hash, is_truncated) VALUES ('note', 1, 1, 'test', 'abc123', 0)", [], ).unwrap(); } #[test] fn test_migration_023_preserves_existing_data() { let conn = create_connection(Path::new(":memory:")).unwrap(); run_migrations(&conn).unwrap(); // Insert with old source types still works conn.execute( "INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('issue', 1, 1000)", [], ).unwrap(); conn.execute( "INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('discussion', 2, 1000)", [], ).unwrap(); let count: i64 = conn.query_row("SELECT COUNT(*) FROM dirty_sources", [], |r| r.get(0)).unwrap(); assert_eq!(count, 2); } #[test] fn test_migration_023_fts_triggers_intact() { let conn = create_connection(Path::new(":memory:")).unwrap(); run_migrations(&conn).unwrap(); // Insert a note document conn.execute( "INSERT INTO documents (source_type, source_id, project_id, title, content_text, content_hash, is_truncated) VALUES ('note', 1, 1, 'Test Note', 'This is the note body', 'hash123', 0)", [], ).unwrap(); // FTS should auto-sync via trigger let count: i64 = conn.query_row( "SELECT COUNT(*) FROM documents_fts WHERE documents_fts MATCH 'note'", [], |r| r.get(0), ).unwrap(); assert_eq!(count, 1); } #[test] fn test_migration_023_row_counts_preserved() { // This test verifies the migration doesn't lose data during table rebuild. 
    // It runs all migrations up to version 22, inserts test data into documents/dirty_sources/
    // document_labels/document_paths BEFORE migration 023, then verifies
    // counts are identical after migration 023 runs.
    // (Implementation: create_connection_at_version(22) + insert data + run_migration(23) + assert counts)
    // Note: This may require a test helper that runs migrations up to a specific version.
}
```

#### Implementation

**Migration SQL** (`migrations/023_note_documents.sql`):

The tables with CHECK constraints that need rebuilding:

1. `dirty_sources` — add `'note'` to source_type CHECK
2. `documents` — add `'note'` to source_type CHECK

Pattern: create new table, copy data, drop old, rename. Must also recreate FTS triggers (they reference the table by name) and all indexes.

**CRITICAL:** The `documents_fts` external content table references `documents` by rowid. Rebuilding `documents` changes rowids unless we preserve them. Use `INSERT INTO documents_new SELECT * FROM documents` to preserve the `id` (PRIMARY KEY = rowid).

**CRITICAL:** The FTS triggers (`documents_ai`, `documents_ad`, `documents_au`) must be dropped and recreated after the table rebuild because they reference `documents` which was dropped/renamed.

**Migration safety requirements:**

- The migration executes as a single transaction (SQLite migration runner wraps each migration in a transaction). `PRAGMA foreign_keys` is a no-op inside an open transaction, so foreign keys must be disabled before the runner's `BEGIN` and re-enabled after `COMMIT`.
- After the table rebuild, verify row counts match: `SELECT COUNT(*) FROM documents` must equal the pre-rebuild count. SQLite has no SQL-level variables or assertions, so the pre-rebuild counts must be captured outside plain SQL (for example by the migration runner, or in a temp table that it compares afterwards).
- Run `PRAGMA foreign_key_check` after the rebuild and abort on any violation.
- Rebuild FTS index and verify `documents_fts` row count matches `documents` row count.

The migration must:

1. Drop FTS triggers
2. Create `documents_new` with updated CHECK (adding `'note'`)
3. `INSERT INTO documents_new SELECT * FROM documents`
4. Copy `document_labels` and `document_paths` data aside before touching `documents`: dropping `documents` cascades into both tables through ON DELETE CASCADE
5. With foreign keys disabled, drop the old tables, rename `documents_new` to `documents`, recreate the junction tables, restore their data, recreate the FTS triggers, and recreate all indexes
6. Same pattern for `dirty_sources` (simpler — no dependents)

After the rebuild, the migration ends with a backfill:

```sql
-- Backfill: seed all existing non-system notes into the dirty queue
-- so the next generate-docs run creates documents for them.
-- Uses LEFT JOIN to skip notes that already have documents (idempotent).
-- ON CONFLICT DO NOTHING handles notes already in the dirty queue.
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, CAST(strftime('%s', 'now') AS INTEGER) * 1000
FROM notes n
LEFT JOIN documents d ON d.source_type = 'note' AND d.source_id = n.id
WHERE n.is_system = 0 AND d.id IS NULL
ON CONFLICT(source_type, source_id) DO NOTHING;
```

**Register in `src/core/db.rs`:**

Add to the `MIGRATIONS` array (after migration 022):

```rust
(
    "023",
    include_str!("../../migrations/023_note_documents.sql"),
),
```

**Note:** The backfill statement is data-only and makes no schema changes of its own. It's safe to run on empty databases (no notes = no-op). On databases with existing notes, it queues them for document generation on the next `lore generate-docs` or `lore sync` run.
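One way to express the post-rebuild safety checks is a small pure function that the migration runner or a test could call with counts it has queried. This is only a sketch under assumptions: `RebuildCounts` and `verify_rebuild` are hypothetical names, and the "after" snapshot is assumed to be taken before the backfill runs, since the backfill intentionally adds rows to `dirty_sources`.

```rust
/// Row counts captured around the table rebuild (hypothetical struct).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct RebuildCounts {
    documents: i64,
    document_labels: i64,
    document_paths: i64,
    dirty_sources: i64,
}

/// Compare counts taken before the rebuild with counts taken after it
/// (but before the backfill), and check the FTS and foreign-key results.
fn verify_rebuild(
    before: RebuildCounts,
    after: RebuildCounts,
    fts_rows: i64,
    fk_violations: usize,
) -> Result<(), String> {
    if before != after {
        return Err(format!(
            "row counts changed: before={before:?} after={after:?}"
        ));
    }
    if fts_rows != after.documents {
        return Err(format!(
            "documents_fts has {fts_rows} rows, documents has {}",
            after.documents
        ));
    }
    if fk_violations != 0 {
        return Err(format!(
            "PRAGMA foreign_key_check reported {fk_violations} violation(s)"
        ));
    }
    Ok(())
}
```

Returning an error here would let the runner roll back the surrounding transaction instead of committing a lossy rebuild.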
--- ### Work Chunk 2B: SourceType Enum Extension **Files:** `src/documents/extractor.rs` **Depends on:** Work Chunk 2A (migration must exist so test DBs have the right schema) #### Tests to Write First Add to `src/documents/extractor.rs` in the existing test module: ```rust #[test] fn test_source_type_parse_note() { assert_eq!(SourceType::parse("note"), Some(SourceType::Note)); assert_eq!(SourceType::parse("notes"), Some(SourceType::Note)); assert_eq!(SourceType::parse("NOTE"), Some(SourceType::Note)); } #[test] fn test_source_type_note_as_str() { assert_eq!(SourceType::Note.as_str(), "note"); } #[test] fn test_source_type_note_display() { assert_eq!(format!("{}", SourceType::Note), "note"); } #[test] fn test_source_type_note_serde_roundtrip() { let st = SourceType::Note; let json = serde_json::to_string(&st).unwrap(); assert_eq!(json, "\"note\""); let parsed: SourceType = serde_json::from_str(&json).unwrap(); assert_eq!(parsed, SourceType::Note); } ``` #### Implementation In `src/documents/extractor.rs`: 1. Add `Note` variant to `SourceType` enum (line 18): ```rust Note, ``` 2. Add match arm to `as_str()` (line 27): ```rust Self::Note => "note", ``` 3. Add parse aliases (line 35): ```rust "note" | "notes" => Some(Self::Note), ``` --- ### Work Chunk 2C: Note Document Extractor **Files:** `src/documents/extractor.rs` **Depends on:** Work Chunk 2B **Context:** Follows the exact pattern of `extract_issue_document()` (lines 85-184) and `extract_discussion_document()` (lines 302-516). The new function extracts a single non-system note into a `DocumentData` struct. #### Tests to Write First Add to `src/documents/extractor.rs` test module. Uses `setup_discussion_test_db()` (line 1025) and `insert_note()` (line 1086) helpers that already exist. 
```rust
#[test]
fn test_note_document_basic_format() {
    let conn = setup_discussion_test_db();
    insert_issue(
        &conn, 1, 42, Some("Auth redesign"), Some("desc"), "opened", Some("alice"),
        Some("https://gitlab.example.com/group/project-one/-/issues/42"),
    );
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(
        &conn, 1, 12345, 1, Some("jdefting"),
        Some("This function is too complex, consider extracting the validation logic."),
        1710460800000, false, None, None,
    );

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.source_type, SourceType::Note);
    assert_eq!(doc.source_id, 1);
    assert_eq!(doc.project_id, 1);
    assert_eq!(doc.author_username, Some("jdefting".to_string()));
    assert!(doc.content_text.contains("[[Note]]"));
    assert!(doc.content_text.contains("author: @jdefting"));
    assert!(doc.content_text.contains("This function is too complex"));
    assert!(doc.content_text.contains("Issue #42: Auth redesign"));
    assert!(doc.content_text.contains("group/project-one"));
    assert_eq!(doc.title, Some("Note by @jdefting on Issue #42".to_string()));
    assert!(!doc.is_truncated);
}

#[test]
fn test_note_document_diffnote_with_path() {
    let conn = setup_discussion_test_db();
    insert_mr(
        &conn, 1, 99, Some("JWT Auth"), Some("desc"), Some("opened"), Some("alice"),
        Some("feat/jwt"), Some("main"),
        Some("https://gitlab.example.com/group/project-one/-/merge_requests/99"),
    );
    insert_discussion(&conn, 1, "MergeRequest", None, Some(1));
    insert_note(
        &conn, 1, 54321, 1, Some("jdefting"),
        Some("This should use a match statement"),
        1710460800000, false, Some("src/old_auth.rs"), Some("src/auth.rs"),
    );

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.paths, vec!["src/auth.rs", "src/old_auth.rs"]);
    assert!(doc.content_text.contains("path: src/auth.rs"));
    assert!(doc.content_text.contains("MR !99: JWT Auth"));
    assert_eq!(doc.title, Some("Note by @jdefting on MR !99".to_string()));
    assert!(doc.url.unwrap().contains("#note_54321"));
}

#[test]
fn test_note_document_inherits_parent_labels() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_label(&conn, 1, "backend");
    insert_label(&conn, 2, "security");
    link_issue_label(&conn, 1, 1);
    link_issue_label(&conn, 1, 2);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Hello"), 1710460800000, false, None, None);

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.labels, vec!["backend", "security"]);
}

#[test]
fn test_note_document_mr_labels() {
    let conn = setup_discussion_test_db();
    insert_mr(&conn, 1, 10, Some("Test"), None, Some("opened"), None, None, None, None);
    insert_label(&conn, 1, "review");
    link_mr_label(&conn, 1, 1);
    insert_discussion(&conn, 1, "MergeRequest", None, Some(1));
    insert_note(&conn, 1, 100, 1, Some("reviewer"), Some("LGTM"), 1710460800000, false, None, None);

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.labels, vec!["review"]);
}

#[test]
fn test_note_document_system_note_returns_none() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("bot"), Some("assigned to @alice"), 1710460800000, true, None, None);

    let result = extract_note_document(&conn, 1).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_not_found() {
    let conn = setup_discussion_test_db();
    let result = extract_note_document(&conn, 999).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_orphaned_discussion() {
    // Discussion exists but parent issue was deleted
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 99, 10, Some("Deleted"), None, "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(99), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Hello"), 1710460800000, false, None, None);
    conn.execute("PRAGMA foreign_keys = OFF", []).unwrap();
    conn.execute("DELETE FROM issues WHERE id = 99", []).unwrap();
    conn.execute("PRAGMA foreign_keys = ON", []).unwrap();

    let result = extract_note_document(&conn, 1).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_hash_deterministic() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Comment"), 1710460800000, false, None, None);

    let doc1 = extract_note_document(&conn, 1).unwrap().unwrap();
    let doc2 = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc1.content_hash, doc2.content_hash);
    assert_eq!(doc1.labels_hash, doc2.labels_hash);
    assert_eq!(doc1.paths_hash, doc2.paths_hash);
}

#[test]
fn test_note_document_empty_body() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some(""), 1710460800000, false, None, None);

    // Should still produce a document (body is optional in schema)
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert!(doc.content_text.contains("[[Note]]"));
}

#[test]
fn test_note_document_null_body() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), None, 1710460800000, false, None, None);

    // Should still produce a document (body is optional in schema)
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert!(doc.content_text.contains("[[Note]]"));
}
```

#### Implementation

Add `extract_note_document()` to `src/documents/extractor.rs` (after `extract_discussion_document`, around line 516):

```rust
pub fn extract_note_document(
    conn: &Connection,
    note_id: i64,
) -> Result<Option<DocumentData>> {
    // 1. Fetch the note
    let note_row = conn.query_row(
        "SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type, n.is_system,
                n.created_at, n.updated_at, n.position_old_path, n.position_new_path,
                n.position_new_line, n.resolvable, n.resolved,
                d.noteable_type, d.issue_id, d.merge_request_id,
                p.path_with_namespace, p.id AS project_id
         FROM notes n
         JOIN discussions d ON n.discussion_id = d.id
         JOIN projects p ON n.project_id = p.id
         WHERE n.id = ?1",
        rusqlite::params![note_id],
        |row| { /* map all fields */ },
    );
    // Handle QueryReturnedNoRows -> Ok(None)

    // 2. Skip system notes
    if is_system {
        return Ok(None);
    }

    // 3. Fetch parent entity (Issue or MR) — same pattern as extract_discussion_document lines 332-401
    //    Get parent_iid, parent_title, parent_web_url, labels

    // 4. Build paths BTreeSet from position_old_path, position_new_path

    // 5. Build URL: parent_web_url + "#note_{gitlab_id}"

    // 6. Format content with structured metadata header:
    //    [[Note]]
    //    source_type: note
    //    note_gitlab_id: {gitlab_id}
    //    project: {path_with_namespace}
    //    parent_type: {Issue|MergeRequest}
    //    parent_iid: {iid}
    //    parent_title: {title}
    //    note_type: {DiffNote|DiscussionNote|Comment}
    //    author: @{author}
    //    author_id: {author_id}  (only if non-null)
    //    created_at: {iso8601}
    //    resolved: {true|false}  (only if resolvable)
    //    path: {position_new_path}:{position_new_line}  (only if DiffNote)
    //    labels: {comma-separated}
    //    url: {url}
    //
    //    --- Body ---
    //
    //    {body}

    // 7. Title: "Note by @{author} on {parent_type_prefix}"

    // 8. Compute hashes, apply truncate_hard_cap, return DocumentData
}
```

The content format uses a structured key-value header optimized for machine parsing and semantic search, followed by the raw note body. This is deliberately different from discussion documents — it's optimized for individual note semantics rather than thread context.
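As a rough illustration of the header layout listed in step 6, the sketch below renders the key-value header and body into one string. It is not the planned implementation: `NoteHeaderInput` and `format_note_content` are hypothetical names, the `", "` label separator is an assumption, and the conditional lines simply follow the "only if" annotations in that comment.

```rust
/// Hypothetical input type; the fields mirror the columns named in the
/// extractor outline, but the struct itself is an assumption.
struct NoteHeaderInput<'a> {
    gitlab_id: i64,
    project_path: &'a str,
    parent_type: &'a str, // "Issue" or "MergeRequest"
    parent_iid: i64,
    parent_title: &'a str,
    note_type: &'a str, // "DiffNote", "DiscussionNote", or "Comment"
    author: &'a str,
    author_id: Option<i64>,
    created_at_iso: &'a str,
    resolvable: bool,
    resolved: bool,
    new_path: Option<&'a str>,
    new_line: Option<i64>,
    labels: &'a [&'a str],
    url: &'a str,
    body: &'a str,
}

/// Renders the structured header followed by the raw body.
fn format_note_content(n: &NoteHeaderInput<'_>) -> String {
    let mut out = String::from("[[Note]]\n");
    out.push_str("source_type: note\n");
    out.push_str(&format!("note_gitlab_id: {}\n", n.gitlab_id));
    out.push_str(&format!("project: {}\n", n.project_path));
    out.push_str(&format!("parent_type: {}\n", n.parent_type));
    out.push_str(&format!("parent_iid: {}\n", n.parent_iid));
    out.push_str(&format!("parent_title: {}\n", n.parent_title));
    out.push_str(&format!("note_type: {}\n", n.note_type));
    out.push_str(&format!("author: @{}\n", n.author));
    // author_id is emitted only when it is known.
    if let Some(id) = n.author_id {
        out.push_str(&format!("author_id: {}\n", id));
    }
    out.push_str(&format!("created_at: {}\n", n.created_at_iso));
    // resolved is emitted only for resolvable notes.
    if n.resolvable {
        out.push_str(&format!("resolved: {}\n", n.resolved));
    }
    // path is emitted only for DiffNotes that carry a new-side path.
    if n.note_type == "DiffNote" {
        if let Some(path) = n.new_path {
            match n.new_line {
                Some(line) => out.push_str(&format!("path: {}:{}\n", path, line)),
                None => out.push_str(&format!("path: {}\n", path)),
            }
        }
    }
    // Assumption: labels are joined with ", ".
    out.push_str(&format!("labels: {}\n", n.labels.join(", ")));
    out.push_str(&format!("url: {}\n", n.url));
    out.push_str("\n--- Body ---\n\n");
    out.push_str(n.body);
    out
}
```

Because optional fields are omitted rather than printed as empty values, two extractions of an unchanged note should yield byte-identical text, which is what the determinism test for `content_hash` relies on.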
**Structured header rationale:** The key-value format allows the embedding model and FTS to index structured fields (author, project, parent reference) alongside the free-text body, improving search precision for queries like "jdefting's comments on authentication issues."

---

### Work Chunk 2D: Regenerator & Dirty Tracking Integration

**Files:** `src/documents/regenerator.rs`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`

**Depends on:** Work Chunks 0A, 2B, 2C

#### Tests to Write First

**In `src/documents/regenerator.rs` tests:**

```rust
#[test]
fn test_regenerate_note_document() {
    let conn = setup_db();
    // Add discussions + notes tables to setup_db() (or use a richer setup)
    // Insert: project, issue, discussion, non-system note
    // mark_dirty(SourceType::Note, note_id)
    // regenerate_dirty_documents()
    // Assert: document created with source_type = 'note'
    // Assert: document content contains note body
}

#[test]
fn test_regenerate_note_system_note_deletes() {
    // Insert system note, mark dirty
    // regenerate_dirty_documents()
    // Assert: no document created (extract returns None -> delete path)
}

#[test]
fn test_regenerate_note_unchanged() {
    // Create note, regenerate, mark dirty again, regenerate
    // Assert: second run returns unchanged = 1
}

#[test]
fn test_note_ingestion_idempotent_across_two_syncs() {
    // Setup: project, issue, discussion, 3 non-system notes
    // Run ingestion once -> verify 3 dirty notes queued
    // Regenerate documents -> verify 3 note documents created
    // Run ingestion again with identical data
    // Assert: no new dirty entries (changed_semantics = false for all)
}
```

**In `src/ingestion/dirty_tracker.rs` tests:**

```rust
#[test]
fn test_mark_dirty_note_type() {
    // Update the test DB setup to include 'note' in CHECK constraint
    let conn = setup_db(); // This needs the new CHECK
    mark_dirty(&conn, SourceType::Note, 1).unwrap();
    let results = get_dirty_sources(&conn).unwrap();
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].0, SourceType::Note);
}
```

#### Implementation

**1. Update `regenerate_one()` in `src/documents/regenerator.rs`** (line 90):

```rust
SourceType::Note => extract_note_document(conn, source_id)?,
```

And add the import at line 8:

```rust
use crate::documents::{
    DocumentData, SourceType, extract_discussion_document, extract_issue_document,
    extract_mr_document, extract_note_document,
};
```

**2. Add change-aware dirty marking in `src/ingestion/discussions.rs`** (in the new upsert loop from Phase 0):

```rust
for note in &normalized_notes {
    let outcome = upsert_note_for_issue(&tx, local_discussion_id, &note, last_seen_at)?;
    if !note.is_system && outcome.changed_semantics {
        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
    }
}
sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;
```

**3. Same change-aware dirty marking in `src/ingestion/mr_discussions.rs`** (update the existing upsert loop):

```rust
let outcome = upsert_note(&tx, local_discussion_id, &note, last_seen_at, None)?;
if !note.is_system && outcome.changed_semantics {
    dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
}
```

**4. Update `dirty_tracker.rs` test `setup_db()`** to include `'note'` in the CHECK constraint (line 134).

**5. Update `regenerator.rs` test `setup_db()`** to include the discussions + notes tables so note-type regeneration tests can run.

---

### Work Chunk 2E: Generate-Docs Full Rebuild Support

**Files:** Search for where `robot-docs` manifest is generated (search for `robot-docs` or `RobotDocs` command handler)

**Depends on:** Work Chunk 2D

**Context:** When `lore generate-docs --full` runs, it seeds ALL issues, MRs, and discussions into the dirty queue. Notes must be seeded too.
#### Tests to Write First

```rust
#[test]
fn test_full_seed_includes_notes() {
    // Setup DB with project, issue, discussion, 3 non-system notes, 1 system note
    // Call seed_all_dirty(conn) or whatever the full-rebuild seeder is named
    // Assert: dirty_sources contains 3 entries with source_type = 'note'
    // Assert: system note is NOT in dirty_sources
}

#[test]
fn test_note_document_count_stable_after_second_generate_docs_full() {
    // Setup DB with project, issue, discussion, 5 non-system notes
    // Run generate-docs --full equivalent (seed + regenerate)
    // Record document count
    // Run generate-docs --full again
    // Assert: document count unchanged (idempotent)
    // Assert: dirty queue is empty after second run
}
```

#### Implementation

Find the function that seeds dirty_sources for `--full` mode (likely in the generate-docs handler or a dedicated seeder function). Add:

```sql
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, ?1
FROM notes n
WHERE n.is_system = 0
ON CONFLICT(source_type, source_id) DO UPDATE SET
    queued_at = excluded.queued_at,
    attempt_count = 0,
    last_attempt_at = NULL,
    last_error = NULL,
    next_attempt_at = NULL
```

---

### Work Chunk 2F: Search CLI `--type note` Support

**Files:** `src/cli/mod.rs`, `src/cli/commands/search.rs` (display code)

**Depends on:** Work Chunks 2A-2E (documents must exist to be searched)

#### Tests to Write First

Integration/smoke test:

```rust
#[test]
fn test_search_source_type_note_filter() {
    // This is essentially testing that SourceType::Note flows through
    // the existing search pipeline correctly. Since the search filter
    // code is generic (filters.rs:70-73), the main test is that
    // SourceType::parse("note") works — already covered in 2B.
    // Add a smoke test that the CLI accepts --type note.
}
```

#### Implementation

1. Update `SearchArgs.source_type` value_parser in `src/cli/mod.rs` (line 542):

```rust
#[arg(long = "type", value_name = "TYPE", value_parser = ["issue", "mr", "discussion", "note", "notes"], help_heading = "Filters")]
pub source_type: Option<String>,
```

2. Update the search results display to show `"Note"` prefix for note-type results (check `print_search_results` in `src/cli/commands/search.rs`).

---

### Work Chunk 2G: Parent Metadata Change Propagation

**Files:** `src/ingestion/orchestrator.rs` (or wherever parent entity updates trigger dirty marking), `src/documents/regenerator.rs`

**Depends on:** Work Chunk 2D

**Context:** Note documents inherit metadata from their parent issue/MR — specifically labels and title. When a parent's title or labels change, the note documents derived from that parent become stale. The existing ingestion pipeline already marks discussion documents dirty when parent metadata changes. Note documents need the same treatment.

#### Problem

If issue #42's title changes from "Auth redesign" to "Auth overhaul", all note documents under that issue still say "Issue #42: Auth redesign" until their content is regenerated. Similarly, label changes on the parent propagate into the note document's `labels` field and `label_names` text.
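To make the staleness rule concrete, here is a minimal in-memory sketch of which notes would need to be queued when an issue changes. The `Discussion` and `Note` structs and the `notes_to_mark_dirty` function are hypothetical stand-ins for the `discussions` and `notes` tables; the real propagation is expected to run as SQL next to the existing discussion dirty-marking code.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a row of the `discussions` table.
struct Discussion {
    id: i64,
    issue_id: Option<i64>,
}

/// Hypothetical stand-in for a row of the `notes` table.
struct Note {
    id: i64,
    discussion_id: i64,
    is_system: bool,
}

/// Returns the local IDs of all non-system notes that hang off the given
/// issue, i.e. the note documents that become stale when the issue's title
/// or labels change.
fn notes_to_mark_dirty(
    discussions: &[Discussion],
    notes: &[Note],
    issue_id: i64,
) -> BTreeSet<i64> {
    // Collect the discussions that belong to the changed issue.
    let affected: BTreeSet<i64> = discussions
        .iter()
        .filter(|d| d.issue_id == Some(issue_id))
        .map(|d| d.id)
        .collect();

    // Keep only user-authored notes in those discussions.
    notes
        .iter()
        .filter(|n| !n.is_system && affected.contains(&n.discussion_id))
        .map(|n| n.id)
        .collect()
}
```

System notes are excluded because they never produce note documents, so queuing them would only create work that the regenerator discards.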
#### Tests to Write First

```rust
#[test]
fn test_parent_title_change_marks_notes_dirty() {
    // Setup: project, issue, discussion, 2 non-system notes
    // Generate note documents (verify they exist)
    // Change the issue title
    // Trigger the parent-change propagation
    // Assert: both note documents are in dirty_sources
}

#[test]
fn test_parent_label_change_marks_notes_dirty() {
    // Setup: project, issue with label "backend", discussion, note
    // Generate note document (verify labels = ["backend"])
    // Add label "security" to the issue
    // Trigger the parent-change propagation
    // Assert: note document is in dirty_sources
    // Regenerate and verify labels = ["backend", "security"]
}
```

#### Implementation

Find where the ingestion pipeline detects parent entity changes and marks discussion documents dirty. Add the same logic for note documents:

```sql
-- When an issue's title or labels change, mark all its non-system notes dirty
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, ?1
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE d.issue_id = ?2 AND n.is_system = 0
ON CONFLICT(source_type, source_id) DO UPDATE SET
    queued_at = excluded.queued_at,
    attempt_count = 0
```

Same pattern for MR parent changes. The exact integration point depends on how the existing discussion dirty-marking works — it should be adjacent to that code.

**Note on deletion handling:** Note deletion is handled by two complementary mechanisms:

1. **Immediate propagation (Work Chunk 0B):** When sweep deletes stale notes, documents and dirty_sources entries are cleaned up in the same transaction. No stale search results.
2. **Eventual consistency (generate-docs --full):** For edge cases where a note was deleted outside the normal sweep path, the full rebuild catches orphaned documents since the note row no longer exists and `extract_note_document()` returns `None` -> document deleted.
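The immediate-propagation mechanism can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `Store`, `NoteRow`, and the method below are hypothetical, the model ignores per-discussion scoping, and a single `&mut self` call stands in for the database transaction.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a row of the `notes` table.
struct NoteRow {
    id: i64,
    last_seen_at: i64,
}

/// Hypothetical in-memory model of the three tables touched by a sweep.
struct Store {
    notes: Vec<NoteRow>,
    /// Note documents keyed by note ID (the document's `source_id`).
    note_documents: BTreeMap<i64, String>,
    /// Dirty-queue entries for source_type = 'note'.
    dirty_notes: BTreeSet<i64>,
}

impl Store {
    /// Deletes notes not seen in the current run and removes their
    /// documents and dirty-queue entries in the same step.
    /// Returns the IDs of the deleted notes.
    fn sweep_stale_notes(&mut self, run_seen_at: i64) -> Vec<i64> {
        let stale: Vec<i64> = self
            .notes
            .iter()
            .filter(|n| n.last_seen_at < run_seen_at)
            .map(|n| n.id)
            .collect();

        self.notes.retain(|n| n.last_seen_at >= run_seen_at);
        for id in &stale {
            self.note_documents.remove(id);
            self.dirty_notes.remove(id);
        }
        stale
    }
}
```

Removing the dirty-queue entry together with the document matters: otherwise the regenerator would later pick up an ID whose note row is already gone.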
No additional deletion logic is needed beyond Work Chunk 0B + the existing regenerator orphan cleanup.

---

### Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)

**Files:** `migrations/024_note_dirty_backfill.sql`, `src/core/db.rs`

**Depends on:** Work Chunk 2A (migration 023 must exist so `dirty_sources` accepts `source_type='note'`)

**Context:** When a user upgrades to a version with note document support, existing notes in the database have no corresponding documents. Without a backfill, only notes that change after the upgrade would get documents — historical notes remain invisible to search. This migration seeds all existing non-system notes into the dirty queue so the next `generate-docs` run creates documents for them.

#### Tests to Write First

```rust
#[test]
fn test_migration_024_backfills_existing_notes() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run migrations up through 023
    // Insert: project, issue, discussion, 5 non-system notes, 2 system notes
    // Run migration 024
    // Assert: dirty_sources contains 5 entries with source_type = 'note'
    // Assert: system notes are NOT in dirty_sources
}

#[test]
fn test_migration_024_idempotent_with_existing_documents() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run all migrations including 024
    // Insert: project, issue, discussion, 3 non-system notes
    // Create note documents for 2 of 3 notes (simulate partial state)
    // Re-run the backfill SQL manually
    // Assert: only the 1 note without a document is in dirty_sources
    // Assert: ON CONFLICT DO NOTHING prevents duplicates
}

#[test]
fn test_migration_024_skips_notes_already_in_dirty_queue() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run all migrations
    // Insert note and manually add to dirty_sources
    // Re-run backfill SQL
    // Assert: no duplicate entries (ON CONFLICT DO NOTHING)
}
```

#### Implementation

**Migration SQL** (`migrations/024_note_dirty_backfill.sql`):

```sql
-- Backfill: seed all existing non-system notes into the dirty queue
-- so the next generate-docs run creates documents for them.
-- Uses LEFT JOIN to skip notes that already have documents (idempotent).
-- ON CONFLICT DO NOTHING handles notes already in the dirty queue.
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, CAST(strftime('%s', 'now') AS INTEGER) * 1000
FROM notes n
LEFT JOIN documents d ON d.source_type = 'note' AND d.source_id = n.id
WHERE n.is_system = 0 AND d.id IS NULL
ON CONFLICT(source_type, source_id) DO NOTHING;
```

**Register in `src/core/db.rs`:**

Add to the `MIGRATIONS` array (after migration 023):

```rust
(
    "024",
    include_str!("../../migrations/024_note_dirty_backfill.sql"),
),
```

**Note:** This is a data-only migration — no schema changes. It's safe to run on empty databases (no notes = no-op). On databases with existing notes, it queues them for document generation on the next `lore generate-docs` or `lore sync` run.

---

### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration

**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`

**Depends on:** Work Chunk 2C (extractor function must exist)

**Context:** The `extract_note_document()` function fetches parent entity metadata (issue/MR title, labels, project path) via individual SQL queries per note. During the initial backfill of ~8,000 existing notes, this creates N+1 query amplification: each note triggers its own parent metadata lookup, even though many notes share the same parent entity. For example, 50 notes on the same MR would execute 50 identical parent metadata queries.

This is a performance optimization for batch regeneration, not a correctness change. Individual note regeneration (dirty tracking during incremental sync) is unaffected — the N+1 cost is negligible for the typical 1-10 dirty notes per sync.
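The N+1 pattern described above can be avoided by memoizing parent lookups for the duration of one batch. The following is a minimal sketch, assuming the database query is abstracted as a closure: `ParentMeta`, `ParentCache`, and the `fetches` counter are hypothetical and exist only to show that repeated lookups for the same `(noteable_type, parent_id)` key invoke the loader once.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down parent metadata.
#[derive(Debug, Clone, PartialEq)]
struct ParentMeta {
    iid: i64,
    title: String,
}

/// Hypothetical per-batch cache keyed by (noteable_type, parent_id).
struct ParentCache {
    entries: HashMap<(String, i64), ParentMeta>,
    /// Counts loader invocations so the saving is observable.
    fetches: usize,
}

impl ParentCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), fetches: 0 }
    }

    /// Returns the cached metadata, calling `fetch` only on a cache miss.
    /// In the real extractor, `fetch` would be the parent SQL query.
    fn get_or_fetch<F>(&mut self, noteable_type: &str, parent_id: i64, fetch: F) -> &ParentMeta
    where
        F: FnOnce() -> ParentMeta,
    {
        // Borrow the counter separately so the closure does not need `self`.
        let fetches = &mut self.fetches;
        self.entries
            .entry((noteable_type.to_string(), parent_id))
            .or_insert_with(|| {
                *fetches += 1;
                fetch()
            })
    }
}
```

Keying on the type as well as the ID is what keeps an issue and a merge request with the same local ID from sharing an entry.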
#### Tests to Write First

```rust
#[test]
fn test_note_regeneration_batch_uses_cache() {
    // Setup: project, issue with 10 non-system notes
    // Mark all 10 as dirty
    // Run regenerate_dirty_documents()
    // Assert: all 10 documents created correctly
    // Assert: parent metadata query count == 1 (not 10)
    // (Use a query counter or verify via cache hit metrics)
}

#[test]
fn test_note_regeneration_cache_consistent_with_direct_extraction() {
    // Setup: project, issue with labels, discussion, 3 notes
    // Extract note document directly (no cache)
    // Extract via cached batch path
    // Assert: content_hash is identical for both paths
    // Assert: labels_hash is identical for both paths
}

#[test]
fn test_note_regeneration_cache_invalidates_across_parents() {
    // Setup: 2 issues, each with notes
    // Regenerate notes from both issues in one batch
    // Assert: each issue's notes get correct parent metadata
    // (cache keyed by (noteable_type, parent_id), not globally shared)
}
```

#### Implementation

**1. Add `ParentMetadataCache` struct** in `src/documents/extractor.rs`:

```rust
use std::collections::HashMap;

/// Cache for parent entity metadata during batch note document extraction.
/// Keyed by (noteable_type, parent_local_id) to avoid repeated lookups
/// when multiple notes share the same parent issue/MR.
pub struct ParentMetadataCache {
    cache: HashMap<(String, i64), ParentMetadata>,
}

pub struct ParentMetadata {
    pub iid: i64,
    pub title: Option<String>,
    pub web_url: Option<String>,
    pub labels: Vec<String>,
    pub project_path: String,
}

impl ParentMetadataCache {
    pub fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    pub fn get_or_fetch(
        &mut self,
        conn: &Connection,
        noteable_type: &str,
        parent_id: i64,
    ) -> Result<&ParentMetadata> {
        // HashMap entry API: fetch from DB on miss, return cached on hit
    }
}
```

**2. Add `extract_note_document_cached()` variant** that accepts `&mut ParentMetadataCache` and uses it instead of inline parent metadata queries.
The uncached `extract_note_document()` remains for single-note regeneration.

**3. Update batch regeneration loop** in `src/documents/regenerator.rs`:

```rust
// In the regeneration loop, when processing a batch of dirty sources:
let mut parent_cache = ParentMetadataCache::new();
for (source_type, source_id) in dirty_batch {
    match source_type {
        SourceType::Note => extract_note_document_cached(conn, source_id, &mut parent_cache)?,
        // Other source types use existing extraction functions (no cache needed)
        _ => regenerate_one(conn, source_type, source_id)?,
    };
}
```

**Scope limit:** The cache is created fresh per regeneration batch and discarded after. No cross-batch persistence, no invalidation complexity. The cache is purely an optimization for batch processing where many notes share parents.

---

## Verification Checklist

After all chunks are complete, run the full quality gate:

```bash
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
```

Then functional smoke tests:

```bash
# Phase 0 verification
# Sync twice and verify note IDs are stable:
lore sync
lore -J notes --limit 5   # Record gitlab_ids and local ids
lore sync
lore -J notes --limit 5   # Verify same local ids for same gitlab_ids

# Phase 1 verification
lore -J notes --author jdefting --since 365d --limit 5
lore -J notes --note-type DiffNote --path src/ --limit 10
lore notes --for-mr 456 -p group/repo
lore notes --for-issue 42 -p group/repo                # Verify project-scoping works
lore notes --since 180d --until 90d                    # Bounded window (180 days ago to 90 days ago)
lore notes --resolution unresolved                     # Tri-state resolution filter
lore notes --contains "unwrap" --note-type DiffNote    # Body substring + type filter
lore notes --author jdefting --format jsonl | wc -l    # JSONL streaming
lore notes --format csv > /tmp/notes.csv && head -1 /tmp/notes.csv   # CSV header
lore -J notes --author-id 12345 --since 365d           # Immutable identity filter
lore -J notes --author-id 12345 --author jdefting      # Combined: both must match (AND)
lore -J notes --gitlab-note-id 12345                   # Precision filter: exact GitLab note
lore -J notes --discussion-id 42                       # Precision filter: all notes in thread

# Phase 2 verification
lore sync    # Should generate note documents
lore -J stats    # Should show note document count in source_type breakdown
lore -J search "code complexity" --type note --author jdefting
lore -J search "error handling" --type note --since 180d
```

Idempotence checks:

```bash
# Verify generate-docs --full is idempotent
lore generate-docs --full
lore -J stats > /tmp/stats1.json
lore generate-docs --full
lore -J stats > /tmp/stats2.json
diff /tmp/stats1.json /tmp/stats2.json   # Should be identical (modulo timing metadata)
```

Deletion propagation checks:

```bash
# Verify that deleted notes don't leave stale documents
# (Manual test: delete a note on GitLab, sync, verify document is gone)
lore sync
lore -J search "specific phrase from deleted note" --type note   # Should return no results
```

Performance and query plan verification:

```bash
# Verify indexes are used for common query patterns
# Run EXPLAIN QUERY PLAN for the hot paths:
sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
       n.is_system, n.created_at, n.updated_at,
       n.position_new_path, n.position_new_line,
       n.position_old_path, n.position_old_line,
       n.resolvable, n.resolved, n.resolved_by,
       d.noteable_type,
       COALESCE(i.iid, m.iid) AS parent_iid,
       COALESCE(i.title, m.title) AS parent_title,
       p.path_with_namespace
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
JOIN projects p ON n.project_id = p.id
LEFT JOIN issues i ON d.issue_id = i.id
LEFT JOIN merge_requests m ON d.merge_request_id = m.id
WHERE n.is_system = 0
  AND n.author_username = 'jdefting' COLLATE NOCASE
  AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_user_created

sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE n.is_system = 0
  AND n.project_id = 1
  AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_project_created

sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE n.is_system = 0
  AND d.issue_id = (SELECT id FROM issues WHERE iid = 42 AND project_id = 1)
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_discussions_issue_id for the join

sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
JOIN projects p ON n.project_id = p.id
WHERE n.is_system = 0
  AND n.project_id = 1
  AND n.position_new_path LIKE 'src/auth/%' ESCAPE '\'
  AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_project_path_created
```

Operational checks:

- `lore -J stats` output includes `documents.notes` count (the stats command queries by hardcoded source_type strings — verify `'note'` is added)
- Verify `lore -J count notes` still reports user vs system breakdown correctly after the changes
- After a full `lore generate-docs --full`, verify note document count approximately matches non-system note count from `lore count notes`

---

## Work Chunk Dependency Graph

```
0A (stable note identity) ──┬────────────────────────────────────┐ │ │ ├── 0B (deletion propagation) ◄──────┤── 2A (migration 023, + cleanup triggers) │ │ ├── 0C (sweep safety guard) │ │ │ ├── 0D (author_id capture) │ │ │ 1A (data types + query) ──┐ │ │ ├── 1B (CLI args + wiring) ──┐ │ ├── 1C (output formatting) ├── 1D (robot-docs) 1E (query index, mig 022 │ │ │ + author_id column) ──┘ │ │ │ │ 2A (migration 023) ───────┐ │ │ ├── 2B (SourceType enum) │ │ │ │ │ │ │ ├── 2C (extractor fn) │ │ │ │ │ │ │ │ │ ├── 2D (regenerator + dirty tracking) ◄─┘ │ │ │ │ │ │ │ ├── 2E (generate-docs --full) │ │ │ │ │ │ │ ├── 2F (search --type note) │ │ │ │ │ │ │ ├── 2G (parent change propagation) │ │ │ │ │ │ │ ├── 2H (backfill migration 024) │ │ │ │ │ │ ├── 2I (batch parent metadata cache)
```

**Parallelizable pairs:**

- 0A, 1A, 1E, and 2A can all run simultaneously (no code overlap)
- 0C and 0D can run immediately after 0A (both modify upsert functions from 0A)
- 1C and 2B can run simultaneously
- 2E, 2F, 2G, 2H, and 2I can run simultaneously after 2D (2I only needs 2C)
- 0B depends on both 0A and 2A (needs sweep functions from 0A and documents table accepting 'note' from 2A)

**Critical path:** 0A -> 0C -> 2D -> 2G (Phase 0 must land before dirty tracking integrates with upsert outcomes)

**Secondary critical path:** 2A -> 2B -> 2C -> 2D (document pipeline chain)

---

## Estimated Document Volume Impact

| Entity | Typical Count | Documents Before | Documents After |
|--------|--------------|-----------------|-----------------|
| Issues | 500 | 500 | 500 |
| MRs | 300 | 300 | 300 |
| Discussions | 2,000 | 2,000 | 2,000 |
| Notes (non-system) | ~8,000 | 0 | **+8,000** |
| **Total** | | **2,800** | **10,800** |

FTS5 handles this comfortably. Embedding generation time scales linearly (~4x increase). The three-hash dedup means incremental syncs remain fast. With Phase 0's change-aware dirty marking, only genuinely modified notes trigger regeneration — typical incremental syncs will dirty a small fraction of the 8k total.

---

## Rejected Recommendations

These recommendations were proposed during review and deliberately rejected. Documenting here to prevent re-proposal.

- **Feature flag gating / gated rollout** — rejected because this is a single-user CLI tool in early development with no external users. Adding runtime feature gates (`feature.notes_cli`, `feature.note_documents`) for a feature we're building from scratch adds complexity with no benefit. Both phases ship together; there's no "blast radius" to manage.
- **Keyset pagination / cursor support** — rejected because no existing list command (`lore issues`, `lore mrs`) has pagination. Adding it just for `notes` would be inconsistent. The year-long analysis use case works fine with `--limit 10000`. If pagination becomes needed across all list commands, that's a separate horizontal feature.
- **Path filtering upgrade (`--path-mode exact|prefix|glob`, `--match-old-path`)** — rejected because the trailing-slash prefix convention is already established across the codebase (issues/MRs use the same pattern). Adding glob mode and old-path matching adds multiple CLI flags for a niche use case. Can be added later if users request it.
- **Embedding policy knobs (`documents.note_embeddings.min_chars`, `documents.note_embeddings.enabled`, prioritize unresolved DiffNotes)** — rejected because the embedding pipeline already handles volume scaling. Adding per-source-type enable flags and minimum character thresholds is premature optimization. Short notes (e.g., "LGTM", "nit: use `expect()` here") are still semantically valuable for reviewer profiling. The existing embedding batch system handles the volume.
- **Structured reviewer profiling command (`lore notes profile --author `)** — rejected because this is explicitly a non-goal in the PRD. The reviewer profile report is a downstream consumer of the infrastructure we're building. Adding it here is scope creep. It belongs in a separate PRD after this infrastructure lands.
- **Operational SLOs / queue lag metrics** — rejected because this is a local CLI tool, not a service. "Oldest dirty note age" and "retry backlog size" are service-oriented metrics that don't apply. The existing `lore stats` and `lore -J count` commands provide sufficient observability. If the dirty queue becomes problematic, we add diagnostics then.
- **Replace CHECK constraints with source_types registry table + FK** — rejected because the table rebuild for adding a new source type to a CHECK constraint is a rare, one-time cost (done 4 times across 23 migrations). A registry table adds per-insert FK lookup overhead, complicates the migration (still requires a table rebuild to change from CHECK to FK), and optimizes for a hypothetical future where we frequently add source types. The current CHECK approach is simpler, self-documenting, and sufficient.
- **Unresolved-specific partial index (`idx_notes_unresolved_project_created`)** — rejected because the selectivity is too narrow. Unresolved notes are a small, unpredictable subset. The `idx_notes_project_created` index already covers the project+date scan; adding `WHERE resolvable = 1 AND resolved = 0` provides marginal benefit at the cost of index maintenance overhead. SQLite can filter the small remaining set in-memory efficiently.
- **Previous note excerpt in document content (`previous_note_excerpt`)** — rejected because it adds a query per note extraction (fetch the preceding note in the same discussion), increases document size, creates coupling between note documents (changing one note's body would stale the next note's document), and the semantic benefit is marginal. The parent title and labels provide sufficient context. Full thread context is available via the existing discussion documents.
- **Compact/slim metadata header for note documents** — rejected because the verbose key-value header is intentional. The structured fields (`source_type`, `note_gitlab_id`, `project`, `parent_type`, `parent_iid`, etc.) are what enable precise FTS and embedding search for queries like "jdefting's comments on authentication issues in project-one." The compact format (`@author on Issue#42 in project`) loses machine-parseable structure and reduces search precision. Metadata stored in document columns/labels/paths is not searchable via FTS — only `content_text` is FTS-indexed. The token cost of the header (~50 tokens) is negligible compared to typical note body length.
- **Replace `fetch_complete: bool` with `FetchState` enum (`Complete`/`Partial`/`Failed`) and run_seen_at monotonicity checks** (feedback-5, rec #2) — rejected because the boolean captures the one bit of information that matters: did the fetch complete? `FetchState::Failed` is redundant with not reaching the sweep call site — if the fetch fails, we don't call sweep at all. The monotonicity check on `run_seen_at` adds complexity for a condition that can't occur in practice: `run_seen_at` is generated once per sync run and passed unchanged through all upserts. The boolean is sufficient and self-documenting.
- **Embedding dedup cache keyed by semantic text hash** (feedback-5, rec #5) — rejected because the existing `content_hash` dedup already prevents re-embedding unchanged documents. A semantic-text-only hash that ignores metadata would conflate genuinely different review contexts: two "LGTM" notes from different authors on different MRs are semantically distinct for the reviewer profiling use case (who said it, where, and when matters). The embedding pipeline handles ~8k notes comfortably without dedup optimization.
- **Derived review signal labels (`signal:nit`, `signal:blocking`, `signal:security`)** (feedback-5, rec #6) — rejected because (a) it encroaches on the explicitly excluded reviewer profiling scope, (b) heuristic signal derivation (regex for "nit:", keyword matching for "security") is inherently fragile and would require ongoing maintenance as review vocabulary evolves, and (c) the raw note text already supports downstream LLM-based analysis that produces far more accurate signal classification than static keyword matching. This belongs in the downstream profiling PRD where LLM-based classification can be done properly.
- **Replace `last_seen_at` sweep marker with monotonic `sync_run_id`** (feedback-6, rec #3) — rejected because it introduces a new `sync_runs` table, a new column (`last_seen_run_id`), and changes sweep mechanics across both issue and MR ingestion paths. The `last_seen_at` approach is already battle-tested in the MR discussion path and works correctly for a single-user local CLI. Clock skew isn't a real concern when the same process generates timestamps within a single sync run. The engineering cost (new table, migration, plumbing through all callers) far exceeds the theoretical risk it mitigates.
- **Materialize stale-note set with temp table during sweep** (feedback-6, rec #4) — rejected because the subquery `SELECT id FROM notes WHERE discussion_id = ? AND last_seen_at < ?` runs against a UNIQUE(gitlab_id) index and SQLite's query optimizer handles repeated identical subqueries efficiently. Adding `CREATE TEMP TABLE` / `DROP TABLE` DDL adds transaction complexity for negligible performance gain on typical thread sizes (< 100 notes per discussion). The defense-in-depth triggers from Work Chunk 2A already guarantee consistency even if a subquery somehow produced different results across statements (which it can't within a transaction).
- **Move historical note backfill from migration to resumable runtime job** (feedback-6, rec #5) — rejected because the migration backfill is a single `INSERT...SELECT` that seeds dirty_sources from the notes table. On 8k notes, this executes in under a second on SQLite. Moving it to a runtime job adds resumability state tracking, a new code path in `generate-docs`/`sync`, and the risk that users forget to run the backfill. The migration approach is simpler, atomic, runs exactly once, and is guaranteed to execute on upgrade. If the note count were 1M+, runtime batching would be justified — at 8k it's premature.
- **Property/invariant tests with proptest** (feedback-6, rec #8) — rejected because the plan already has extensive example-based tests covering all four invariants mentioned (stable local IDs across re-syncs, no orphan documents after sweeps, partial-fetch safety, idempotent full rebuilds). Adding `proptest` as a dependency for randomized testing introduces nondeterministic CI behavior, slower test runs, and harder-to-debug failures. The deterministic example-based tests provide equivalent coverage with better debuggability. If specific invariants prove fragile in practice, targeted property tests can be added later — but speculative fuzz testing at plan time is premature.
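
The reasoning behind keeping `fetch_complete: bool` and the `last_seen_at` sweep marker can be illustrated with a minimal in-memory sketch. This is not the real ingestion code: `NoteStore`, `NoteRow`, `upsert`, `sweep`, and `local_id` are hypothetical names, a `HashMap` stands in for the `notes` table, and timestamps are plain integers. It only assumes what the entries above state: one `run_seen_at` per sync run, upsert keyed on `gitlab_id`, and a sweep that is skipped unless the fetch completed.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one row of the `notes` table.
struct NoteRow {
    id: i64,           // local row ID; must stay stable across syncs
    body: String,
    last_seen_at: i64, // set to the run's `run_seen_at` on every upsert
}

/// Hypothetical in-memory table keyed by `gitlab_id` (the upsert conflict key).
#[derive(Default)]
struct NoteStore {
    next_id: i64,
    by_gitlab_id: HashMap<i64, NoteRow>,
}

impl NoteStore {
    /// Insert or update by `gitlab_id`. Returns true when the row is new or its
    /// body changed, which is the signal a caller could use for dirty marking.
    fn upsert(&mut self, gitlab_id: i64, body: &str, run_seen_at: i64) -> bool {
        if let Some(row) = self.by_gitlab_id.get_mut(&gitlab_id) {
            let changed = row.body != body;
            row.body = body.to_string();
            row.last_seen_at = run_seen_at; // mark as seen in this run
            return changed;
        }
        self.next_id += 1;
        let row = NoteRow {
            id: self.next_id,
            body: body.to_string(),
            last_seen_at: run_seen_at,
        };
        self.by_gitlab_id.insert(gitlab_id, row);
        true
    }

    /// Delete rows not seen in this run. A partial or failed fetch must not
    /// delete anything, so the single boolean is enough to gate the sweep.
    fn sweep(&mut self, run_seen_at: i64, fetch_complete: bool) -> usize {
        if !fetch_complete {
            return 0;
        }
        let before = self.by_gitlab_id.len();
        self.by_gitlab_id.retain(|_, row| row.last_seen_at >= run_seen_at);
        before - self.by_gitlab_id.len()
    }

    /// Look up the local row ID for a given `gitlab_id`.
    fn local_id(&self, gitlab_id: i64) -> Option<i64> {
        self.by_gitlab_id.get(&gitlab_id).map(|row| row.id)
    }
}
```

In this sketch a partial run (`fetch_complete = false`) leaves unseen notes in place, while a complete run removes them, and the local ID of a surviving note never changes. A `Failed` state would add nothing here because a failed fetch simply never reaches `sweep`.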
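
The objection to a semantic-text-only embedding cache can also be shown with a small sketch. Everything here is illustrative: `NoteDoc`, `render`, `content_hash`, and `body_only_hash` are hypothetical, the header layout is invented for the example, and the standard library's `DefaultHasher` stands in for whatever hash the real `content_hash` column uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical note document: metadata header fields plus the note body.
struct NoteDoc<'a> {
    author: &'a str,
    parent: &'a str,
    body: &'a str,
}

/// Hash a string with the std hasher (a stand-in, not the real hash function).
fn hash_str(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Render the full document text: an assumed key-value header, then the body.
fn render(doc: &NoteDoc) -> String {
    format!("author: @{}\nparent: {}\n\n{}", doc.author, doc.parent, doc.body)
}

/// Hash over header + body: unchanged documents dedupe, distinct contexts do not.
fn content_hash(doc: &NoteDoc) -> u64 {
    hash_str(&render(doc))
}

/// Hash over the body alone: the rejected "semantic text" cache key.
fn body_only_hash(doc: &NoteDoc) -> u64 {
    hash_str(doc.body)
}
```

With two "LGTM" notes from different authors on different MRs, `body_only_hash` returns the same key for both, so a cache keyed on it would treat them as one document. `content_hash` keeps them apart while still returning an identical value when the same document is hashed again, which is the only dedup the pipeline needs.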