plan: true
title:
status: iterating
iteration: 6
target_iterations: 8
beads_revision: 0
related_plans:
created: 2026-02-11
updated: 2026-02-12

PRD: Per-Note Search & Reviewer Profiling

Problem Statement

Lore ingests all GitLab discussion notes with full metadata (author, body, diff positions, timestamps), but the data is only accessible through aggregated discussion documents. There is no way to:

  1. Query individual notes by author — the --author filter on lore search only matches the first note's author per discussion thread, and relies solely on mutable usernames (no immutable author identity for longitudinal analysis)
  2. List raw notes with metadata — no CLI surface exposes the notes table directly
  3. Semantically search individual comments — notes are bundled into thread documents, diluting per-note relevance

Use case: "Search through jdefting's code review comments over the past year to build a comprehensive report of their code smell preferences and review patterns."

Design

Three phases, shipped together as one feature:

  • Phase 0 (Foundation): Stable note identity — unify issue discussion note ingestion to use upsert+sweep (matching the MR pattern), ensuring notes.id is stable across syncs, capturing immutable author_id for longitudinal analysis, and enabling change-aware dirty marking
  • Phase 1 (Option A): lore notes command — direct SQL query over the notes table with rich filtering and multiple export formats (table, JSON, JSONL, CSV)
  • Phase 2 (Option B): Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline

All three phases are required. Phase 0 gives stable identity for reliable document tracking; Phase 1 gives structured data extraction; Phase 2 gives semantic search.

Non-Goals

  • Changing existing discussion document behavior (those remain as-is)
  • Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
  • Modifying the ingestion pipeline from GitLab (data is already captured — Phase 0 only changes local storage strategy)
  • Adding pagination/cursor support (no existing list command has this; high --limit covers the year-long analysis use case)
  • Feature flag gating (no external users; single-user CLI in early dev)

Phase 0: Stable Note Identity

Rationale

Issue discussion note ingestion currently uses a delete/reinsert pattern (DELETE FROM notes WHERE discussion_id = ? then re-insert). This makes notes.id (the local row ID) unstable across syncs — every sync assigns new IDs to the same notes. MR discussion notes already use an upsert pattern (ON CONFLICT(gitlab_id) DO UPDATE), producing stable IDs.

Phase 2 depends on notes.id as the source_id for note documents. Unstable IDs would cause:

  • Unnecessary document churn (old ID deleted, new ID created for identical content)
  • Stale document accumulation (orphaned documents from old IDs)
  • Wasted regeneration cycles on every sync

Unifying both paths to upsert+sweep gives stable identity, enables change-aware dirty marking, and reduces sync overhead.
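The ID-churn problem can be shown with a small std-only simulation; a hypothetical `NoteStore` stands in for the SQLite `notes` table:

```rust
use std::collections::HashMap;

/// Minimal in-memory stand-in for the notes table: maps gitlab_id -> local id.
struct NoteStore {
    next_id: i64,
    by_gitlab_id: HashMap<i64, i64>,
}

impl NoteStore {
    fn new() -> Self {
        Self { next_id: 1, by_gitlab_id: HashMap::new() }
    }

    /// Delete/reinsert pattern: every sync wipes the rows and assigns fresh ids.
    fn sync_delete_reinsert(&mut self, gitlab_ids: &[i64]) -> Vec<i64> {
        for gid in gitlab_ids {
            self.by_gitlab_id.remove(gid);
        }
        let mut ids = Vec::new();
        for gid in gitlab_ids {
            let id = self.next_id;
            self.next_id += 1;
            self.by_gitlab_id.insert(*gid, id);
            ids.push(id);
        }
        ids
    }

    /// Upsert pattern: existing gitlab_ids keep their local id.
    fn sync_upsert(&mut self, gitlab_ids: &[i64]) -> Vec<i64> {
        let mut ids = Vec::new();
        for gid in gitlab_ids {
            let id = match self.by_gitlab_id.get(gid) {
                Some(existing) => *existing,
                None => {
                    let id = self.next_id;
                    self.next_id += 1;
                    self.by_gitlab_id.insert(*gid, id);
                    id
                }
            };
            ids.push(id);
        }
        ids
    }
}

fn main() {
    let mut store = NoteStore::new();
    let first = store.sync_delete_reinsert(&[100, 101]);
    let second = store.sync_delete_reinsert(&[100, 101]);
    assert_ne!(first, second); // ids churn on every sync

    let mut store = NoteStore::new();
    let first = store.sync_upsert(&[100, 101]);
    let second = store.sync_upsert(&[100, 101]);
    assert_eq!(first, second); // ids are stable across syncs
}
```

With stable ids, Phase 2 documents keyed by `source_id = notes.id` survive re-syncs unchanged.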

Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes

Files: src/ingestion/discussions.rs

Depends on: Nothing (standalone)

Context: src/ingestion/mr_discussions.rs already has upsert_note() (line ~470) which uses ON CONFLICT(gitlab_id) DO UPDATE and sweep_stale_notes() (line ~551) which deletes notes with last_seen_at < run_seen_at. The notes table already has UNIQUE(gitlab_id). We need to bring the issue discussion path to the same pattern.

Tests to Write First

#[test]
fn test_issue_note_upsert_stable_id() {
    // Insert a discussion with 2 notes via the new upsert path
    // Record their local IDs
    // "Re-sync" the same notes (same gitlab_ids, same content)
    // Assert: local IDs are unchanged
}

#[test]
fn test_issue_note_upsert_detects_body_change() {
    // Insert note with body "old text"
    // Re-sync same gitlab_id with body "new text"
    // Assert: upsert returns changed_semantics = true
    // Assert: local ID is unchanged
}

#[test]
fn test_issue_note_upsert_unchanged_returns_false() {
    // Insert note, re-sync identical note
    // Assert: upsert returns changed_semantics = false
}

#[test]
fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() {
    // Insert note with body "text", updated_at = 1000
    // Re-sync same gitlab_id with body "text", updated_at = 2000
    // Assert: upsert returns changed_semantics = false
    // Assert: updated_at is updated in the DB (housekeeping fields always refresh)
}

#[test]
fn test_issue_note_sweep_removes_stale() {
    // Insert 3 notes for a discussion
    // Re-sync with only 2 of the 3 (different last_seen_at)
    // Run sweep
    // Assert: stale note is deleted, 2 remain
}

#[test]
fn test_issue_note_upsert_returns_local_id() {
    // Insert a note via upsert
    // Assert: returned local_id matches conn.last_insert_rowid()
    // Or for update path: matches the existing row's id
}

Implementation

1. Create shared NoteUpsertOutcome struct (in src/ingestion/discussions.rs or a shared module):

pub struct NoteUpsertOutcome {
    pub local_note_id: i64,
    pub changed_semantics: bool,
}

2. Refactor insert_note() -> upsert_note_for_issue():

Replace the current DELETE FROM notes WHERE discussion_id = ? + loop insert pattern (lines 132-139) with:

for note in &normalized_notes {
    let outcome = upsert_note_for_issue(&tx, local_discussion_id, note, last_seen_at)?;
    // outcome.local_note_id and outcome.changed_semantics available for Phase 2
}
// After loop: sweep stale notes for this discussion
sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;

The upsert SQL follows the MR pattern:

INSERT INTO notes (gitlab_id, discussion_id, project_id, author_username, body, note_type,
    is_system, created_at, updated_at, last_seen_at, ...)
VALUES (?1, ?2, ?3, ...)
ON CONFLICT(gitlab_id) DO UPDATE SET
    body = excluded.body,
    note_type = excluded.note_type,
    updated_at = excluded.updated_at,
    last_seen_at = excluded.last_seen_at,
    ...

Change detection: Semantic change is computed separately from housekeeping updates. The upsert always updates persistence fields (updated_at, last_seen_at), but changed_semantics is derived only from fields that affect note documents and search filters:

ON CONFLICT(gitlab_id) DO UPDATE SET
    body = excluded.body,
    note_type = excluded.note_type,
    updated_at = excluded.updated_at,
    last_seen_at = excluded.last_seen_at,
    resolved = excluded.resolved,
    resolved_by = excluded.resolved_by,
    position_new_path = excluded.position_new_path,
    position_new_line = excluded.position_new_line,
    position_old_path = excluded.position_old_path,
    position_old_line = excluded.position_old_line,
    ...

Then detect semantic change with a separate check that excludes updated_at and last_seen_at (housekeeping-only fields):

WHERE notes.body IS NOT excluded.body
   OR notes.note_type IS NOT excluded.note_type
   OR notes.author_username IS NOT excluded.author_username
   OR notes.resolved IS NOT excluded.resolved
   OR notes.resolved_by IS NOT excluded.resolved_by
   OR notes.position_new_path IS NOT excluded.position_new_path
   OR notes.position_new_line IS NOT excluded.position_new_line

Why author_username is semantic: Note documents embed the username in both the content header (author: @{author}) and the title (Note by @{author} on Issue #42). If a GitLab user changes their username (e.g., jdefting -> jd-engineering), the existing note documents become stale — search results show the old username, inconsistent with what the API returns. Treating username changes as semantic ensures documents stay accurate.

Note: author_id changes do NOT trigger changed_semantics. The author_id is an immutable identity anchor — it never changes in practice, and even if it did (data migration), it doesn't affect document content.

Rationale: updated_at changes alone (e.g., GitLab touching the timestamp without modifying content) should NOT trigger document regeneration. This avoids unnecessary dirty-queue churn on large datasets. Because the upsert's SET clause always runs (to refresh last_seen_at), conn.changes() cannot distinguish a semantic edit from a housekeeping refresh. Instead, pre-read the existing row before the upsert, then compare the semantic fields directly:

// Check if note exists before upsert
let existed = conn.query_row(
    "SELECT id, body, note_type, author_username, resolved, resolved_by, position_new_path, position_new_line FROM notes WHERE gitlab_id = ?",
    [gitlab_id],
    |r| Ok((r.get::<_, i64>(0)?, r.get::<_, Option<String>>(1)?, /* ... */)),
).optional()?;

// Run upsert (always updates housekeeping fields)
conn.execute(upsert_sql, params![...])?;

let local_id = match &existed {
    Some((id, ..)) => *id,
    None => conn.last_insert_rowid(),
};

let changed_semantics = match &existed {
    None => true, // New insert = always changed
    Some((_, old_body, old_note_type, old_author_username, old_resolved, old_resolved_by, old_path, old_line)) => {
        old_body.as_deref() != body || old_note_type.as_deref() != note_type || old_author_username.as_deref() != author_username || /* ... */
    }
};

This pre-read approach is clearest and avoids any SQLite edge cases with changes() counting. The pre-read is a single-row lookup on the UNIQUE(gitlab_id) index — negligible cost.
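Under those rules, the semantic-field set can be captured in a single comparable struct. A std-only sketch with a hypothetical NoteSemantics type (the real code compares the pre-read tuple field by field):

```rust
/// Hypothetical flattened view of the document-affecting note columns.
/// Housekeeping fields (updated_at, last_seen_at) and the identity anchor
/// (author_id) are deliberately absent, so they can never trigger
/// regeneration.
#[derive(Clone, Copy, PartialEq)]
struct NoteSemantics<'a> {
    body: Option<&'a str>,
    note_type: Option<&'a str>,
    author_username: Option<&'a str>,
    resolved: bool,
    resolved_by: Option<&'a str>,
    position_new_path: Option<&'a str>,
    position_new_line: Option<i64>,
}

/// True when any document-affecting field differs.
fn changed_semantics(old: &NoteSemantics, new: &NoteSemantics) -> bool {
    old != new
}

fn main() {
    let old = NoteSemantics {
        body: Some("LGTM"),
        note_type: Some("DiffNote"),
        author_username: Some("jdefting"),
        resolved: false,
        resolved_by: None,
        position_new_path: Some("src/auth.rs"),
        position_new_line: Some(42),
    };
    // Username change is semantic: it appears in document content/title.
    let renamed = NoteSemantics { author_username: Some("jd-engineering"), ..old };
    assert!(changed_semantics(&old, &renamed));
    // Identical semantic fields: updated_at-only refreshes report no change.
    assert!(!changed_semantics(&old, &old));
}
```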

3. Also update upsert_note() in src/ingestion/mr_discussions.rs to return Result<NoteUpsertOutcome> instead of Result<()>. Same semantic-change-only detection (exclude updated_at).

4. Sweep function for issue notes:

fn sweep_stale_issue_notes(conn: &Connection, discussion_id: i64, last_seen_at: i64) -> Result<()> {
    conn.execute(
        "DELETE FROM notes WHERE discussion_id = ? AND last_seen_at < ?",
        rusqlite::params![discussion_id, last_seen_at],
    )?;
    Ok(())
}

Work Chunk 0B: Immediate Deletion Propagation

Files: src/ingestion/discussions.rs, src/ingestion/mr_discussions.rs

Depends on: Work Chunk 0A (uses sweep functions from 0A), Work Chunk 2A (documents table must accept source_type='note')

Context: When sweep deletes stale notes, the current plan relies on eventual cleanup via generate-docs --full for orphaned note documents. This creates a window where deleted notes still appear in search results, eroding trust in the dataset. Instead, propagate deletion to documents immediately in the same transaction.

Tests to Write First

#[test]
fn test_issue_note_sweep_deletes_note_documents_immediately() {
    // Setup: project, issue, discussion, 3 non-system notes
    // Generate note documents for all 3
    // Re-sync with only 2 of the 3 notes (different last_seen_at)
    // Run sweep
    // Assert: stale note row is deleted
    // Assert: stale note's document is deleted from documents table
    // Assert: stale note's dirty_sources entry (if any) is deleted
    // Assert: remaining 2 notes' documents are untouched
}

#[test]
fn test_mr_note_sweep_deletes_note_documents_immediately() {
    // Same pattern as above but for MR discussion notes
}

#[test]
fn test_sweep_deletion_handles_note_without_document() {
    // Setup: note exists but was never turned into a document (e.g., system note)
    // Sweep deletes the note
    // Assert: no error (DELETE WHERE on non-existent document is a no-op)
}

#[test]
fn test_set_based_deletion_atomicity() {
    // Setup: project, issue, discussion, 5 non-system notes with documents
    // Mark 3 as stale (different last_seen_at)
    // Run sweep
    // Assert: exactly 3 note rows deleted
    // Assert: exactly 3 documents deleted
    // Assert: exactly 3 dirty_sources entries deleted (if any existed)
    // Assert: remaining 2 note rows, documents, and dirty_sources untouched
}

Implementation

Update both sweep functions to propagate deletion to documents and dirty_sources using set-based SQL (not per-note loops). This is both faster on large threads and simpler to reason about atomically:

fn sweep_stale_issue_notes(conn: &Connection, discussion_id: i64, last_seen_at: i64) -> Result<()> {
    // Set-based: identify stale non-system note IDs, delete their documents
    // and dirty_sources entries, then delete the note rows themselves.
    // All in one transaction scope — no per-note loop needed.

    // Step 1: Delete documents for stale non-system notes (cascades to
    // document_labels and document_paths via ON DELETE CASCADE;
    // FTS trigger documents_ad auto-removes FTS entry)
    conn.execute(
        "DELETE FROM documents WHERE source_type = 'note' AND source_id IN (
            SELECT id FROM notes
            WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0
        )",
        rusqlite::params![discussion_id, last_seen_at],
    )?;

    // Step 2: Delete dirty_sources entries for stale non-system notes
    conn.execute(
        "DELETE FROM dirty_sources WHERE source_type = 'note' AND source_id IN (
            SELECT id FROM notes
            WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0
        )",
        rusqlite::params![discussion_id, last_seen_at],
    )?;

    // Step 3: Delete all stale note rows (system and non-system)
    conn.execute(
        "DELETE FROM notes WHERE discussion_id = ? AND last_seen_at < ?",
        rusqlite::params![discussion_id, last_seen_at],
    )?;

    Ok(())
}

Same pattern for sweep_stale_notes() in src/ingestion/mr_discussions.rs.

Note: The document DELETE cascades to document_labels and document_paths via ON DELETE CASCADE. The FTS trigger (documents_ad) automatically removes the FTS entry. No additional cleanup needed.

Why set-based: The subquery SELECT id FROM notes WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0 runs once per step, scoped to a single discussion's notes. This is O(1) SQL statements regardless of how many stale notes exist, vs O(N) individual DELETE statements in a loop. On large threads (100+ notes), this is measurably faster and avoids the risk of partial completion if the loop is interrupted.

Defense-in-depth: Work Chunk 2A's migration also creates DB-level cleanup triggers (notes_ad_cleanup, notes_au_system_cleanup) that fire on ANY note deletion/system-flip, not just sweep. The sweep functions handle the common path with explicit set-based SQL; the triggers are a safety net for any future code path that deletes notes outside the sweep functions. Both mechanisms coexist — the explicit SQL in sweep is preferred (clearer intent, predictable cost), and the triggers catch edge cases.
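A plausible shape for the AFTER DELETE safety-net trigger, assuming the documents/dirty_sources schemas used above (the actual migration in Work Chunk 2A is authoritative):

```sql
-- Hypothetical sketch of notes_ad_cleanup: on any note deletion, drop the
-- corresponding note document and dirty-queue entry, regardless of which
-- code path ran the DELETE.
CREATE TRIGGER IF NOT EXISTS notes_ad_cleanup
AFTER DELETE ON notes
WHEN OLD.is_system = 0
BEGIN
    DELETE FROM documents
    WHERE source_type = 'note' AND source_id = OLD.id;
    DELETE FROM dirty_sources
    WHERE source_type = 'note' AND source_id = OLD.id;
END;
```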


Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)

Files: src/ingestion/discussions.rs, src/ingestion/mr_discussions.rs

Depends on: Work Chunk 0A (modifies the sweep call site from 0A)

Context: The sweep-based deletion pattern (delete notes where last_seen_at < run_seen_at) is correct only when a discussion's notes were fully fetched from GitLab. If a page fails mid-fetch (network timeout, rate limit, partial API response), the current logic would incorrectly delete valid notes that simply weren't seen during the incomplete fetch. This is especially dangerous for long threads with many notes that span multiple API pages.

Tests to Write First

#[test]
fn test_partial_fetch_does_not_sweep_notes() {
    // Setup: project, issue, discussion, 5 notes already in DB
    // Simulate a partial fetch: only 2 of 5 notes returned
    // (set last_seen_at for 2 notes to current run, 3 to previous run)
    // Call the ingestion function with fetch_complete = false
    // Assert: all 5 notes still exist (sweep was skipped)
    // Assert: the 2 re-synced notes have updated last_seen_at
}

#[test]
fn test_complete_fetch_runs_sweep_normally() {
    // Setup: project, issue, discussion, 5 notes
    // Simulate a complete fetch: all 5 notes returned
    // Call the ingestion function with fetch_complete = true
    // Assert: sweep runs normally (no stale notes in this case)
}

#[test]
fn test_partial_fetch_then_complete_fetch_cleans_up() {
    // Setup: project, issue, discussion, 5 notes
    // First sync: partial fetch (3 of 5), sweep skipped
    // Second sync: complete fetch (only 3 notes exist on GitLab now)
    // Assert: sweep runs and removes the 2 notes no longer on GitLab
}

Implementation

Add a fetch_complete parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:

// In the discussion ingestion loop, after upserting all notes:
if fetch_complete {
    sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;
} else {
    tracing::warn!(
        discussion_id = local_discussion_id,
        "Skipping stale note sweep due to partial/incomplete fetch"
    );
}

Determining fetch_complete: The discussion notes come from the GitLab API response. When the API returns all notes for a discussion in a single response (no pagination error, no timeout), fetch_complete = true. When the fetch encounters a network error, rate limit, or is interrupted, fetch_complete = false. The exact signaling mechanism depends on how the existing ingestion pipeline handles partial API responses — look at the MR discussion ingestion path for the existing pattern.
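The control flow can be sketched std-only, with pages stubbed as Results (hypothetical FetchOutcome type; the real ingestion code gets pages from the GitLab client):

```rust
/// Hypothetical pagination outcome: the notes gathered so far plus a flag
/// recording whether every page was fetched.
struct FetchOutcome {
    notes: Vec<String>,
    fetch_complete: bool,
}

/// Drain pages until exhaustion or the first error. On error, keep what
/// was fetched but mark the fetch incomplete so the caller skips the sweep.
fn fetch_all_pages(pages: Vec<Result<Vec<String>, String>>) -> FetchOutcome {
    let mut notes = Vec::new();
    for page in pages {
        match page {
            Ok(batch) => notes.extend(batch),
            Err(_) => return FetchOutcome { notes, fetch_complete: false },
        }
    }
    FetchOutcome { notes, fetch_complete: true }
}

fn main() {
    // Second page times out: notes from page 1 are kept, sweep is skipped.
    let partial = fetch_all_pages(vec![
        Ok(vec!["note-1".into(), "note-2".into()]),
        Err("timeout".into()),
        Ok(vec!["note-3".into()]),
    ]);
    assert_eq!(partial.notes.len(), 2);
    assert!(!partial.fetch_complete);

    let complete = fetch_all_pages(vec![Ok(vec!["note-1".into()])]);
    assert!(complete.fetch_complete);
}
```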

Note: This is a safety guard, not a completeness guarantee. The sweep will still run on the next successful full fetch. The guard prevents data loss during transient failures, not during permanent API changes.


Work Chunk 0D: Immutable Author Identity Capture

Files: src/ingestion/discussions.rs, src/ingestion/mr_discussions.rs

Depends on: Work Chunk 0A (modifies the upsert functions from 0A)

Context: The core use case is year-scale reviewer profiling ("search through jdefting's code review comments over the past year"). GitLab usernames are mutable — a user can change their username at any time. If a reviewer changes their username from jdefting to jd-engineering mid-year, author-based queries fragment their identity into two separate result sets. The notes table already captures author_username from the API response, but this only reflects the username at ingestion time.

GitLab note payloads include note.author.id (an immutable integer). Capturing this alongside the username provides a stable identity anchor for longitudinal analysis, even across username changes.

Scope: This chunk adds the column and populates it during ingestion. A --author-id CLI filter for lore notes is wired up in Phase 1 (Work Chunk 1A/1B) to make the immutable identity immediately usable for the core longitudinal analysis use case. The value here is data capture and query foundation: once author_id is stored, it can never be retroactively recovered if we don't capture it now.

Tests to Write First

#[test]
fn test_issue_note_upsert_captures_author_id() {
    // Insert a note with author_id = 12345
    // Assert: notes.author_id == 12345
    // Assert: notes.author_username == "jdefting"
}

#[test]
fn test_mr_note_upsert_captures_author_id() {
    // Same pattern for MR notes
}

#[test]
fn test_note_upsert_author_id_nullable() {
    // Insert a note with author_id = None (older API responses may lack this)
    // Assert: notes.author_id IS NULL
    // Assert: no error (column is nullable)
}

#[test]
fn test_note_author_id_survives_username_change() {
    // Insert note with author_username = "jdefting", author_id = 12345
    // Re-upsert same gitlab_id with author_username = "jd-engineering", author_id = 12345
    // Assert: author_id unchanged (12345)
    // Assert: author_username updated to "jd-engineering"
    // Assert: changed_semantics = true (username is embedded in document content/title)
}

Implementation

1. Migration — Add author_id column to notes table. This goes in migration 022 (combined with the query index migration from Work Chunk 1E to avoid an extra migration):

Add to the query index migration SQL:

-- Add immutable author identity column (nullable for backcompat with pre-existing notes)
ALTER TABLE notes ADD COLUMN author_id INTEGER;

-- Composite index for author_id lookups — used by `lore notes --author-id`
-- for immutable identity queries. Includes project_id and created_at for
-- the common "all notes by this person in this project" pattern.
CREATE INDEX IF NOT EXISTS idx_notes_project_author_id_created
ON notes(project_id, author_id, created_at DESC, id DESC)
WHERE is_system = 0 AND author_id IS NOT NULL;

2. Populate author_id during upsert — In both upsert_note_for_issue() (discussions.rs) and upsert_note() (mr_discussions.rs), add author_id to the INSERT and ON CONFLICT DO UPDATE SET clauses. Extract from the GitLab API note payload's author.id field.

3. Semantic change detection — author_id changes should NOT trigger changed_semantics = true. The author_id is an identity anchor, not a content field. It's excluded from the semantic change comparison alongside updated_at and last_seen_at. However, author_username changes DO trigger changed_semantics = true because the username appears in document content and title (see Work Chunk 0A semantic detection).

4. Note document extraction — Work Chunk 2C's extract_note_document() function includes both author_username (in the document content header and title) and author_id (in the metadata header). The author_id field enables downstream tools to reliably identify the same person even after username changes.
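The header and title shapes described above can be sketched as pure formatting helpers (note_title/note_header are hypothetical names; only the Issue title format is specified above, and the MR variant using !iid is an assumption):

```rust
/// Sketch of the document title: "Note by @{author} on Issue #42".
/// The MergeRequest branch ("MR !10") is assumed, not specified.
fn note_title(author: &str, noteable_type: &str, iid: i64) -> String {
    if noteable_type == "MergeRequest" {
        format!("Note by @{author} on MR !{iid}")
    } else {
        format!("Note by @{author} on Issue #{iid}")
    }
}

/// Sketch of the content/metadata header: mutable username plus the
/// immutable identity anchor when the API provided one.
fn note_header(author: &str, author_id: Option<i64>) -> String {
    let mut header = format!("author: @{author}\n");
    if let Some(id) = author_id {
        // Immutable identity anchor: survives username changes.
        header.push_str(&format!("author_id: {id}\n"));
    }
    header
}

fn main() {
    assert_eq!(note_title("jdefting", "Issue", 42), "Note by @jdefting on Issue #42");
    assert!(note_header("jdefting", Some(12345)).contains("author_id: 12345"));
}
```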


Phase 1: lore notes Command

Work Chunk 1A: Data Types & Query Layer

Files: src/cli/commands/list.rs

Context: This file contains IssueListRow, MrListRow, their JSON counterparts, ListFilters, MrListFilters, and the query_issues()/query_mrs() functions. The new code follows these exact patterns.

Depends on: Nothing (standalone)

Tests to Write First

Add to src/cli/commands/list.rs in the #[cfg(test)] mod tests block. The test DB setup requires the notes and discussions tables — reuse the patterns from src/documents/extractor.rs::setup_discussion_test_db().

// Test helper — create in-memory DB with projects, issues, MRs, discussions, notes tables
// Pattern: same as extractor.rs::setup_discussion_test_db() but also include
// merge_requests, mr_labels, issue_labels, labels tables

#[test]
fn test_query_notes_empty_db() {
    // Setup DB with no notes
    // Call query_notes with default NoteListFilters
    // Assert: total_count == 0, notes.is_empty()
}

#[test]
fn test_query_notes_returns_user_notes_only() {
    // Insert 2 user notes and 1 system note into same discussion
    // Call query_notes with default filters (no_system = true by default)
    // Assert: returns 2 notes, system note excluded
}

#[test]
fn test_query_notes_include_system() {
    // Insert 2 user notes and 1 system note
    // Call query_notes with include_system = true
    // Assert: returns 3 notes
}

#[test]
fn test_query_notes_filter_author() {
    // Insert notes from "alice" and "bob"
    // Call query_notes with author = Some("alice")
    // Assert: only alice's notes returned
}

#[test]
fn test_query_notes_filter_author_strips_at() {
    // Call query_notes with author = Some("@alice")
    // Assert: still matches "alice" (@ prefix stripped)
}
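The normalization these author tests exercise can be sketched as a tiny helper (hypothetical name; the case-insensitive match comes from COLLATE NOCASE in the SQL, not from this function):

```rust
/// Strip a leading '@' from the --author value before binding it as a
/// SQL parameter, so "@alice" and "alice" are equivalent inputs.
fn normalize_author(input: &str) -> &str {
    input.strip_prefix('@').unwrap_or(input)
}

fn main() {
    assert_eq!(normalize_author("@alice"), "alice");
    assert_eq!(normalize_author("alice"), "alice");
}
```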

#[test]
fn test_query_notes_filter_author_case_insensitive() {
    // Insert note from "Alice" (capital A)
    // Call query_notes with author = Some("alice")
    // Assert: matches (COLLATE NOCASE)
}

#[test]
fn test_query_notes_filter_author_id() {
    // Insert notes from author_id = 100 (username "alice") and author_id = 200 (username "bob")
    // Call query_notes with author_id = Some(100)
    // Assert: only alice's notes returned (by immutable identity)
}

#[test]
fn test_query_notes_filter_author_id_and_author_combined() {
    // Insert notes from author_id=100/username="alice" and author_id=100/username="alice-renamed"
    // Call query_notes with author_id = Some(100), author = Some("alice")
    // Assert: only notes where BOTH match (AND semantics) — returns alice's notes before rename
}

#[test]
fn test_query_notes_filter_note_type() {
    // Insert notes with note_type = Some("DiffNote") and Some("DiscussionNote") and None
    // Call query_notes with note_type = Some("DiffNote")
    // Assert: only DiffNote notes returned
}

#[test]
fn test_query_notes_filter_project() {
    // Insert 2 projects, notes in each
    // Call query_notes with project = Some("group/project-one")
    // Assert: only project-one notes returned (uses resolve_project())
}

#[test]
fn test_query_notes_filter_project_uses_default() {
    // Insert 2 projects, notes in each
    // Call query_notes with project = None, config.default_project = Some("group/project-one")
    // Assert: only project-one notes returned when for_issue_iid or for_mr_iid is set
}

#[test]
fn test_query_notes_filter_since() {
    // Insert notes at created_at = 1000, 2000, 3000
    // Call with since cutoff that excludes the first
    // Assert: only notes after cutoff returned
}

#[test]
fn test_query_notes_filter_until() {
    // Insert notes at created_at = 1000, 2000, 3000
    // Call with until cutoff that excludes the last
    // Assert: only notes before cutoff returned
}

#[test]
fn test_query_notes_filter_since_and_until_combined() {
    // Insert notes at created_at = 1000, 2000, 3000, 4000
    // Call with since=1500, until=3500
    // Assert: only notes at 2000 and 3000 returned
}

#[test]
fn test_query_notes_invalid_time_window_rejected() {
    // Call with since pointing to a time AFTER until
    // (e.g., since = "30d", until = "90d" — 30 days ago is after 90 days ago)
    // Assert: returns a clear error, not an empty result set
}

#[test]
fn test_query_notes_until_date_uses_end_of_day() {
    // Insert notes at various times on 2025-06-15
    // Call with until = "2025-06-15"
    // Assert: all notes on that day are included (end-of-day, not start-of-day)
}
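The window-validation and end-of-day semantics in the tests above can be sketched std-only (hypothetical helpers operating on already-resolved epoch-millisecond values, assuming UTC day boundaries):

```rust
const MS_PER_DAY: i64 = 86_400_000;

/// Validate the resolved time window: `since` must not be after `until`.
/// Inputs are epoch milliseconds, already resolved from values like "30d"
/// or "2025-06-15" by the (assumed) time-parsing layer.
fn validate_time_window(since: Option<i64>, until: Option<i64>) -> Result<(), String> {
    if let (Some(s), Some(u)) = (since, until) {
        if s > u {
            return Err(format!(
                "invalid time window: --since ({s}) is after --until ({u})"
            ));
        }
    }
    Ok(())
}

/// A bare --until date should include the whole day: extend the parsed
/// start-of-day timestamp to the last millisecond of that day before
/// using it as the inclusive upper bound.
fn end_of_day_ms(start_of_day_ms: i64) -> i64 {
    start_of_day_ms + MS_PER_DAY - 1
}

fn main() {
    // "--since 30d --until 90d": 30 days ago is after 90 days ago.
    assert!(validate_time_window(Some(2_000), Some(1_000)).is_err());
    assert!(validate_time_window(Some(1_000), Some(2_000)).is_ok());

    let midnight = 1_750_000_000_000 / MS_PER_DAY * MS_PER_DAY;
    assert_eq!((end_of_day_ms(midnight) + 1) % MS_PER_DAY, 0);
}
```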

#[test]
fn test_query_notes_filter_contains() {
    // Insert notes with body "This function is too complex" and "LGTM"
    // Call with contains = Some("complex")
    // Assert: only the first note returned
}

#[test]
fn test_query_notes_filter_contains_case_insensitive() {
    // Insert note with body "Use EXPECT instead of unwrap"
    // Call with contains = Some("expect")
    // Assert: matches (COLLATE NOCASE)
}

#[test]
fn test_query_notes_filter_contains_escapes_like_wildcards() {
    // Insert notes with body "100% coverage" and "100 tests"
    // Call with contains = Some("100%")
    // Assert: only "100% coverage" returned (% is literal, not wildcard)
}
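The wildcard-escaping behavior this test pins down can be sketched as a helper, assuming the query uses `body LIKE ? ESCAPE '\' COLLATE NOCASE` and wraps the escaped term in `%` on both sides:

```rust
/// Escape LIKE metacharacters so --contains matches them literally.
fn escape_like(term: &str) -> String {
    let mut out = String::with_capacity(term.len());
    for ch in term.chars() {
        if ch == '%' || ch == '_' || ch == '\\' {
            out.push('\\'); // backslash is the declared ESCAPE character
        }
        out.push(ch);
    }
    out
}

fn main() {
    assert_eq!(escape_like("100%"), "100\\%");
    assert_eq!(escape_like("snake_case"), "snake\\_case");
    // Full bound value for --contains "100%":
    let pattern = format!("%{}%", escape_like("100%"));
    assert_eq!(pattern, "%100\\%%");
}
```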

#[test]
fn test_query_notes_filter_path() {
    // Insert DiffNotes on "src/auth.rs" and "src/config.rs"
    // Call with path = Some("src/auth.rs")
    // Assert: only auth.rs notes returned
}

#[test]
fn test_query_notes_filter_path_prefix() {
    // Insert DiffNotes on "src/auth/login.rs" and "test/auth_test.rs"
    // Call with path = Some("src/") (trailing slash = prefix)
    // Assert: only src/ notes returned
}
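The trailing-slash convention from the two path tests can be sketched as a small predicate builder (hypothetical helper shape; the real query builder may differ):

```rust
/// Trailing slash means prefix match; otherwise exact match. Returns the
/// SQL predicate fragment and the value to bind for it.
fn path_predicate(path: &str) -> (&'static str, String) {
    if path.ends_with('/') {
        // Prefix match. Assumes LIKE-special characters in the path are
        // escaped the same way as --contains before appending '%'.
        ("position_new_path LIKE ?", format!("{path}%"))
    } else {
        ("position_new_path = ?", path.to_string())
    }
}

fn main() {
    assert_eq!(
        path_predicate("src/auth.rs"),
        ("position_new_path = ?", "src/auth.rs".to_string())
    );
    assert_eq!(
        path_predicate("src/"),
        ("position_new_path LIKE ?", "src/%".to_string())
    );
}
```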

#[test]
fn test_query_notes_filter_for_issue_requires_project() {
    // Insert issue with iid=42 in project-one, same iid=42 in project-two
    // Call with for_issue_iid = Some(42), project = Some("group/project-one")
    // Assert: only notes from project-one's issue #42
}

#[test]
fn test_query_notes_filter_for_mr_requires_project() {
    // Insert MR with iid=10 in project-one, same iid=10 in project-two
    // Call with for_mr_iid = Some(10), project = Some("group/project-one")
    // Assert: only notes from project-one's MR !10
}

#[test]
fn test_query_notes_filter_for_issue_uses_default_project() {
    // Insert issue with iid=42 in project-one
    // Call with for_issue_iid = Some(42), project = None, config.default_project = Some("group/project-one")
    // Assert: resolves via defaultProject fallback — returns notes from project-one's issue #42
}

#[test]
fn test_query_notes_filter_for_mr_uses_default_project() {
    // Insert MR with iid=10 in project-one
    // Call with for_mr_iid = Some(10), project = None, config.default_project = Some("group/project-one")
    // Assert: resolves via defaultProject fallback
}

#[test]
fn test_query_notes_filter_for_issue_without_project_context_errors() {
    // Call with for_issue_iid = Some(42), project = None, no defaultProject
    // Assert: returns error (IID requires project context)
}

#[test]
fn test_query_notes_filter_resolution_unresolved() {
    // Insert 2 notes: one with resolvable=1,resolved=0 and one with resolvable=1,resolved=1
    // Call with resolution = "unresolved"
    // Assert: only the unresolved note returned
}

#[test]
fn test_query_notes_filter_resolution_resolved() {
    // Same setup as above
    // Call with resolution = "resolved"
    // Assert: only the resolved note returned
}

#[test]
fn test_query_notes_filter_resolution_any() {
    // Same setup as above
    // Call with resolution = "any" (default)
    // Assert: both notes returned
}

#[test]
fn test_query_notes_sort_created_desc() {
    // Insert notes with created_at = 1000, 3000, 2000
    // Call with sort = "created", order = "desc"
    // Assert: notes ordered 3000, 2000, 1000
}

#[test]
fn test_query_notes_sort_created_asc() {
    // Same data, order = "asc"
    // Assert: ordered 1000, 2000, 3000
}

#[test]
fn test_query_notes_deterministic_tiebreak() {
    // Insert 3 notes with identical created_at timestamps
    // Call twice with same sort/order
    // Assert: order is identical both times (n.id tiebreak)
}

#[test]
fn test_query_notes_limit() {
    // Insert 10 notes
    // Call with limit = 3
    // Assert: notes.len() == 3, total_count == 10
}

#[test]
fn test_query_notes_combined_filters() {
    // Insert notes from multiple authors, types, projects, paths
    // Call with author + note_type + project + since combined
    // Assert: intersection of all filters
}
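How combined filters compose can be sketched as a dynamic WHERE builder in the style the existing query_issues()/query_mrs() functions are said to follow (all names here are illustrative): each Some(filter) appends a predicate and a bound parameter, and the predicates are ANDed.

```rust
/// Illustrative subset of the filter struct; the real NoteListFilters has
/// more fields, all composed the same way.
struct Filters<'a> {
    author: Option<&'a str>,
    note_type: Option<&'a str>,
    since: Option<i64>,
    include_system: bool,
}

/// Build the WHERE clause plus its positional parameters.
fn build_where(f: &Filters) -> (String, Vec<String>) {
    let mut preds = Vec::new();
    let mut params = Vec::new();
    if !f.include_system {
        preds.push("n.is_system = 0".to_string());
    }
    if let Some(a) = f.author {
        preds.push("n.author_username = ? COLLATE NOCASE".to_string());
        params.push(a.trim_start_matches('@').to_string());
    }
    if let Some(t) = f.note_type {
        preds.push("n.note_type = ?".to_string());
        params.push(t.to_string());
    }
    if let Some(s) = f.since {
        preds.push("n.created_at >= ?".to_string());
        params.push(s.to_string());
    }
    if preds.is_empty() {
        (String::new(), params)
    } else {
        (format!("WHERE {}", preds.join(" AND ")), params)
    }
}

fn main() {
    let f = Filters {
        author: Some("@alice"),
        note_type: Some("DiffNote"),
        since: Some(1_000),
        include_system: false,
    };
    let (sql, params) = build_where(&f);
    assert!(sql.starts_with("WHERE n.is_system = 0 AND "));
    assert_eq!(params, vec!["alice", "DiffNote", "1000"]);
}
```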

#[test]
fn test_query_notes_filter_note_id_exact() {
    // Insert 3 notes with known local IDs
    // Call with note_id = Some(2)
    // Assert: only the note with local id 2 returned
}

#[test]
fn test_query_notes_filter_gitlab_note_id_exact() {
    // Insert notes with gitlab_id = 12345 and gitlab_id = 67890
    // Call with gitlab_note_id = Some(12345)
    // Assert: only the note with gitlab_id 12345 returned
}

#[test]
fn test_query_notes_filter_discussion_id_exact() {
    // Insert 2 discussions, each with 2 notes
    // Call with discussion_id = Some(1)
    // Assert: only notes from discussion 1 returned
}

#[test]
fn test_note_list_row_json_conversion() {
    // Create NoteListRow with known ms timestamps
    // Convert to NoteListRowJson
    // Assert: created_at_iso and updated_at_iso are correct ISO strings
    // Assert: all fields carry over
}

Implementation

Data structures (add in src/cli/commands/list.rs after MrListResultJson):

// NoteListRow — raw query result, ms timestamps
pub struct NoteListRow {
    pub id: i64,              // notes.id (local)
    pub gitlab_id: i64,       // notes.gitlab_id
    pub author_username: Option<String>,
    pub body: Option<String>,
    pub note_type: Option<String>,   // "DiffNote" | "DiscussionNote" | null
    pub is_system: bool,
    pub created_at: i64,
    pub updated_at: i64,
    pub position_new_path: Option<String>,
    pub position_new_line: Option<i64>,
    pub position_old_path: Option<String>,
    pub position_old_line: Option<i64>,
    pub resolvable: bool,
    pub resolved: bool,
    pub resolved_by: Option<String>,
    pub noteable_type: String,        // "Issue" | "MergeRequest"
    pub parent_iid: i64,             // parent issue/MR iid
    pub parent_title: Option<String>,
    pub project_path: String,
}

// NoteListRowJson — ISO timestamps, serde for JSON output
pub struct NoteListRowJson { ... }  // with created_at_iso, updated_at_iso

impl From<&NoteListRow> for NoteListRowJson { ... }

// NoteListResult
pub struct NoteListResult {
    pub notes: Vec<NoteListRow>,
    pub total_count: usize,
}

// NoteListResultJson
pub struct NoteListResultJson {
    pub notes: Vec<NoteListRowJson>,
    pub total_count: usize,
    pub showing: usize,
}

impl From<&NoteListResult> for NoteListResultJson { ... }
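The From<&NoteListRow> conversion's main job is turning the ms timestamps into ISO strings. A stdlib-only sketch of that conversion (the real code presumably uses chrono; civil_from_days is Howard Hinnant's civil-date algorithm):

```rust
// Convert days-since-1970-01-01 to a (year, month, day) civil date.
fn civil_from_days(mut z: i64) -> (i64, u32, u32) {
    z += 719_468;
    let era = (if z >= 0 { z } else { z - 146_096 }) / 146_097;
    let doe = z - era * 146_097;                                   // [0, 146096]
    let yoe = (doe - doe / 1460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100);             // [0, 365]
    let mp = (5 * doy + 2) / 153;                                  // [0, 11]
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;                 // [1, 31]
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;        // [1, 12]
    (era * 400 + yoe + if m <= 2 { 1 } else { 0 }, m, d)
}

// Epoch milliseconds -> "YYYY-MM-DDTHH:MM:SS.mmmZ" (UTC).
fn ms_to_iso_utc(ms: i64) -> String {
    let days = ms.div_euclid(86_400_000);
    let rem = ms.rem_euclid(86_400_000);
    let (y, mo, d) = civil_from_days(days);
    let (h, mi, s, milli) = (rem / 3_600_000, rem / 60_000 % 60, rem / 1000 % 60, rem % 1000);
    format!("{y:04}-{mo:02}-{d:02}T{h:02}:{mi:02}:{s:02}.{milli:03}Z")
}
```

Using div_euclid/rem_euclid keeps pre-1970 timestamps correct (negative ms still yield a valid date and a non-negative time-of-day remainder).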

Filter struct:

pub struct NoteListFilters<'a> {
    pub limit: usize,
    pub project: Option<&'a str>,
    pub author: Option<&'a str>,          // display-name filter, case-insensitive via COLLATE NOCASE
    pub author_id: Option<i64>,           // immutable identity filter (exact match)
    pub note_type: Option<&'a str>,       // "DiffNote" | "DiscussionNote"
    pub include_system: bool,             // default false
    pub for_issue_iid: Option<i64>,       // filter by parent issue iid
    pub for_mr_iid: Option<i64>,          // filter by parent MR iid
    pub note_id: Option<i64>,             // filter by local note row id (exact)
    pub gitlab_note_id: Option<i64>,      // filter by GitLab note id (exact)
    pub discussion_id: Option<i64>,       // filter by local discussion id (exact)
    pub since: Option<&'a str>,
    pub until: Option<&'a str>,           // end of time window
    pub path: Option<&'a str>,            // exact or prefix (trailing /)
    pub contains: Option<&'a str>,        // case-insensitive body substring match
    pub resolution: &'a str,              // "any" (default) | "unresolved" | "resolved"
    pub sort: &'a str,                    // "created" (default) | "updated"
    pub order: &'a str,                   // "desc" (default) | "asc"
}

Query function query_notes(conn, filters) -> Result<NoteListResult>:

Time window parsing: Parse since and until relative to a single anchored now_ms captured once at the start of query_notes(). Anchoring both to the same instant prevents subtle drift when the two values would otherwise be parsed at different times (e.g., across a midnight boundary). When --until is a date string (YYYY-MM-DD), interpret it as end-of-day (23:59:59.999 UTC) so that --until 2025-06-15 includes all events on June 15th, not just those before midnight. After both values are parsed, validate since_ms <= until_ms; if the window is inverted (e.g., --since 30d --until 90d, i.e., "from 30 days ago until 90 days ago"), return a clear error:

let now_ms = Utc::now().timestamp_millis();

let since_ms = filters.since.map(|s| parse_since_with_anchor(s, now_ms)).transpose()?;
let until_ms = filters.until.map(|s| parse_until_with_anchor(s, now_ms)).transpose()?;

if let (Some(s), Some(u)) = (since_ms, until_ms) {
    if s > u {
        return Err(LoreError::Usage(format!(
            "Invalid time window: --since ({}) is after --until ({}). \
             Did you mean --since {} --until {}?",
            format_iso(s), format_iso(u),
            filters.until.unwrap(), filters.since.unwrap(),
        )));
    }
}

parse_until_with_anchor differs from parse_since_with_anchor in one way: when the input is a YYYY-MM-DD date, it returns end-of-day (23:59:59.999 UTC) instead of start-of-day. For relative formats (7d, 2w, 1m), it behaves identically to parse_since_with_anchor.
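The end-of-day rule can be sketched in stdlib Rust, assuming a well-formed YYYY-MM-DD input (days_from_civil is Howard Hinnant's civil-date algorithm; the real parse_until_with_anchor also handles relative formats and validation):

```rust
// Civil date (year, month, day) -> days since 1970-01-01.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;                                        // [0, 399]
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;                // [0, 146096]
    era * 146_097 + doe - 719_468
}

// "2025-06-15" -> epoch ms for 2025-06-15T23:59:59.999Z.
fn end_of_day_utc_ms(date: &str) -> Option<i64> {
    let mut it = date.splitn(3, '-');
    let y: i64 = it.next()?.parse().ok()?;
    let m: i64 = it.next()?.parse().ok()?;
    let d: i64 = it.next()?.parse().ok()?;
    // Start of day in ms, plus one full day minus 1 ms.
    Some(days_from_civil(y, m, d) * 86_400_000 + 86_399_999)
}
```

The start-of-day variant used by parse_since_with_anchor would simply omit the + 86_399_999 term.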

Core SQL shape:

SELECT
    n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
    n.is_system, n.created_at, n.updated_at,
    n.position_new_path, n.position_new_line,
    n.position_old_path, n.position_old_line,
    n.resolvable, n.resolved, n.resolved_by,
    d.noteable_type,
    COALESCE(i.iid, m.iid) AS parent_iid,
    COALESCE(i.title, m.title) AS parent_title,
    p.path_with_namespace
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
JOIN projects p ON n.project_id = p.id
LEFT JOIN issues i ON d.issue_id = i.id
LEFT JOIN merge_requests m ON d.merge_request_id = m.id
WHERE {dynamic_filters}
ORDER BY {sort_column} {order}, n.id {order}
LIMIT ?

Important: The ORDER BY includes n.id as a deterministic tiebreaker. Notes with identical timestamps will always sort in the same order. This follows SQLite best practice for reproducible result sets.

Dynamic WHERE clauses follow the same where_clauses + params vec pattern as query_issues() (see list.rs:287-374).
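The pattern can be sketched for a subset of the filters (names hypothetical; the real code binds params as rusqlite ToSql values rather than strings, and covers all filters listed below):

```rust
// Build the dynamic WHERE fragment plus positional params, mirroring the
// where_clauses + params vec pattern from query_issues().
fn build_where(
    author: Option<&str>,
    include_system: bool,
    note_type: Option<&str>,
) -> (String, Vec<String>) {
    let mut clauses: Vec<String> = Vec::new();
    let mut params: Vec<String> = Vec::new();
    if !include_system {
        clauses.push("n.is_system = 0".into());
    }
    if let Some(a) = author {
        clauses.push("n.author_username = ? COLLATE NOCASE".into());
        params.push(a.trim_start_matches('@').to_string()); // strip @ prefix
    }
    if let Some(t) = note_type {
        clauses.push("n.note_type = ?".into());
        params.push(t.to_string());
    }
    let sql = if clauses.is_empty() { "1=1".into() } else { clauses.join(" AND ") };
    (sql, params)
}
```

Because every clause uses a `?` placeholder and pushes a matching param, the fragment composes safely regardless of which filters are present.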

Filter mappings:

  • include_system = false (default): n.is_system = 0
  • author: strip @ prefix, n.author_username = ? COLLATE NOCASE
  • author_id: n.author_id = ? (exact immutable identity match). If both author and author_id are provided, both are applied (AND) for precision — this lets users query "notes by user 12345 when they were known as jdefting"
  • note_type: n.note_type = ?
  • project: resolve_project(conn, project)? then n.project_id = ?
  • note_id: n.id = ? (exact local row ID match — useful for debugging sync correctness)
  • gitlab_note_id: n.gitlab_id = ? (exact GitLab note ID match — cross-reference with GitLab API)
  • discussion_id: n.discussion_id = ? (all notes in a specific discussion thread)
  • since: parsed via parse_since_with_anchor(since_str, now_ms) then n.created_at >= ?
  • until: parsed via parse_until_with_anchor(until_str, now_ms) then n.created_at <= ?
  • path with trailing /: n.position_new_path LIKE ? ESCAPE '\' (use escape_like from filters.rs)
  • path without trailing /: n.position_new_path = ?
  • resolution = "unresolved": n.resolvable = 1 AND n.resolved = 0
  • resolution = "resolved": n.resolvable = 1 AND n.resolved = 1
  • resolution = "any": no filter (default)
  • for_issue_iid: requires resolved project_id (from --project flag or defaultProject config). SQL: d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?) — the project_id param comes from the already-resolved project context
  • for_mr_iid: same pattern — d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?) — requires resolved project_id

IID scoping rule: for_issue_iid and for_mr_iid require a project context because IIDs are only unique within a project. The query layer validates this: if for_issue_iid or for_mr_iid is set without a resolved project_id, return an error. The project can come from either --project flag or defaultProject in config (resolved via the existing resolve_project() which already handles defaultProject fallback). Note: the CLI does NOT use clap's requires = "project" constraint for these flags, because that would block defaultProject resolution — the validation happens at the query layer instead.
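The escape_like helper referenced for the prefix-path case could look like this (a sketch; the actual helper lives in filters.rs and may differ in detail):

```rust
// Escape LIKE metacharacters so user-supplied paths match literally
// under `LIKE ? ESCAPE '\'`.
fn escape_like(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if c == '%' || c == '_' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

// Prefix pattern for a trailing-slash path filter like "src/auth/":
fn path_prefix_pattern(path: &str) -> String {
    format!("{}%", escape_like(path))
}
```

Without escaping, a path containing `_` (common in filenames) would act as a single-character wildcard and over-match.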

COUNT query first (same pattern as issues), then SELECT with LIMIT.

Public entry point:

pub fn run_list_notes(config: &Config, filters: NoteListFilters) -> Result<NoteListResult> {
    let db_path = get_db_path(config.storage.db_path.as_deref());
    let conn = create_connection(&db_path)?;
    query_notes(&conn, &filters)
}

Work Chunk 1B: CLI Arguments & Command Wiring

Files: src/cli/mod.rs, src/main.rs, src/cli/commands/mod.rs, src/cli/robot.rs

Depends on: Work Chunk 1A

Tests to Write First

No unit tests for CLI arg parsing (clap handles this). Integration-level assertions:

// In src/cli/robot.rs tests (or new test module):

#[test]
fn test_expand_fields_preset_notes() {
    let fields = vec!["minimal".to_string()];
    let expanded = expand_fields_preset(&fields, "notes");
    assert_eq!(expanded, vec!["id", "author_username", "body", "created_at_iso"]);
}

Implementation

1. Add NotesArgs to src/cli/mod.rs (after MrsArgs, around line 472):

#[derive(Parser)]
#[command(after_help = "\x1b[1mExamples:\x1b[0m
  lore notes --author jdefting --since 365d           # All of jdefting's notes in past year
  lore notes --author jdefting --note-type DiffNote    # Only code review comments
  lore notes --path src/auth/ --resolution unresolved  # Unresolved comments on auth code
  lore notes --for-mr 456 -p group/repo               # All notes on MR !456
  lore notes --since 180d --until 90d                  # Notes from 180 to 90 days ago
  lore notes --author jdefting --format jsonl           # Stream notes for LLM analysis
  lore notes --contains \"unwrap\" --note-type DiffNote  # Find review comments mentioning unwrap")]
pub struct NotesArgs {
    /// Maximum results
    #[arg(short = 'n', long = "limit", default_value = "50", help_heading = "Output")]
    pub limit: usize,

    /// Select output fields (comma-separated, or 'minimal' preset)
    #[arg(long, help_heading = "Output", value_delimiter = ',')]
    pub fields: Option<Vec<String>>,

    /// Output format (table, json, jsonl, csv)
    #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
    pub format: String,

    /// Filter by author username (case-insensitive)
    #[arg(short = 'a', long, help_heading = "Filters")]
    pub author: Option<String>,

    /// Filter by immutable GitLab author id (stable across username changes)
    #[arg(long = "author-id", help_heading = "Filters")]
    pub author_id: Option<i64>,

    /// Filter by note type (DiffNote, DiscussionNote)
    #[arg(long = "note-type", value_parser = ["DiffNote", "DiscussionNote"], help_heading = "Filters")]
    pub note_type: Option<String>,

    /// Filter by case-insensitive substring in note body
    #[arg(long, help_heading = "Filters")]
    pub contains: Option<String>,

    /// Filter by local note row id (exact match, for debugging)
    #[arg(long = "note-id", help_heading = "Filters")]
    pub note_id: Option<i64>,

    /// Filter by GitLab note id (exact match, for cross-referencing)
    #[arg(long = "gitlab-note-id", help_heading = "Filters")]
    pub gitlab_note_id: Option<i64>,

    /// Filter by local discussion id (all notes in a thread)
    #[arg(long = "discussion-id", help_heading = "Filters")]
    pub discussion_id: Option<i64>,

    /// Include system-generated notes (excluded by default)
    #[arg(long = "include-system", help_heading = "Filters", overrides_with = "no_include_system")]
    pub include_system: bool,

    #[arg(long = "no-include-system", hide = true, overrides_with = "include_system")]
    pub no_include_system: bool,

    /// Filter to notes on a specific issue IID (requires --project or defaultProject)
    #[arg(long = "for-issue", help_heading = "Filters", conflicts_with = "for_mr")]
    pub for_issue: Option<i64>,

    /// Filter to notes on a specific MR IID (requires --project or defaultProject)
    #[arg(long = "for-mr", help_heading = "Filters", conflicts_with = "for_issue")]
    pub for_mr: Option<i64>,

    /// Filter by project path
    #[arg(short = 'p', long, help_heading = "Filters")]
    pub project: Option<String>,

    /// Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
    #[arg(long, help_heading = "Filters")]
    pub since: Option<String>,

    /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
    #[arg(long, help_heading = "Filters")]
    pub until: Option<String>,

    /// Filter by file path (trailing / for prefix match)
    #[arg(long, help_heading = "Filters")]
    pub path: Option<String>,

    /// Resolution filter: any (default), unresolved, resolved
    #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
    pub resolution: String,

    /// Sort field (created, updated)
    #[arg(long, value_parser = ["created", "updated"], default_value = "created", help_heading = "Sorting")]
    pub sort: String,

    /// Sort ascending (default: descending)
    #[arg(long, help_heading = "Sorting", overrides_with = "no_asc")]
    pub asc: bool,

    #[arg(long = "no-asc", hide = true, overrides_with = "asc")]
    pub no_asc: bool,
}

Note on --for-issue / --for-mr: These flags do NOT use clap's requires = "project" constraint. The defaultProject config option provides the project context without the --project flag being explicitly passed. Validation happens at the query layer (Work Chunk 1A) — if neither --project nor defaultProject resolves a project, the query returns a clear error.

2. Add Notes variant to Commands enum in src/cli/mod.rs (around line 113):

/// List discussion notes with filtering
Notes(NotesArgs),

3. Add "notes" minimal preset to expand_fields_preset() in src/cli/robot.rs (around line 42):

"notes" => ["id", "author_username", "body", "created_at_iso"]
    .iter()
    .map(|s| (*s).to_string())
    .collect(),

4. Add handler in src/main.rs (follow handle_issues/handle_mrs pattern):

fn handle_notes(config_path: Option<&str>, args: NotesArgs, robot_mode: bool) -> Result<()> {
    let config = load_config(config_path)?;
    let start = std::time::Instant::now();

    let filters = NoteListFilters {
        limit: args.limit,
        project: args.project.as_deref(),
        author: args.author.as_deref(),
        author_id: args.author_id,
        note_type: args.note_type.as_deref(),
        include_system: args.include_system,
        for_issue_iid: args.for_issue,
        for_mr_iid: args.for_mr,
        note_id: args.note_id,
        gitlab_note_id: args.gitlab_note_id,
        discussion_id: args.discussion_id,
        since: args.since.as_deref(),
        until: args.until.as_deref(),
        path: args.path.as_deref(),
        contains: args.contains.as_deref(),
        resolution: &args.resolution,
        sort: &args.sort,
        order: if args.asc { "asc" } else { "desc" },
    };

    // JSONL and CSV use streaming path (no full materialization in memory)
    // Table and JSON use buffered path (need total_count for envelope/summary)
    match (robot_mode, args.format.as_str()) {
        (_, "jsonl") => {
            let conn = open_db(&config)?;
            print_list_notes_jsonl_stream(&conn, &filters)?;
        }
        (_, "csv") => {
            let conn = open_db(&config)?;
            print_list_notes_csv_stream(&conn, &filters)?;
        }
        _ => {
            let result = run_list_notes(&config, filters)?;
            match (robot_mode, args.format.as_str()) {
                (true, _) | (_, "json") => {
                    print_list_notes_json(&result, start.elapsed().as_millis() as u64, args.fields.as_deref());
                }
                _ => {
                    print_list_notes(&result);
                }
            }
        }
    }
    Ok(())
}

Add dispatch in main match (around line 175):

Some(Commands::Notes(args)) => handle_notes(cli.config.as_deref(), args, robot_mode),

5. Re-export in src/cli/commands/mod.rs:

pub use list::{run_list_notes, print_list_notes, print_list_notes_json, print_list_notes_jsonl_stream, print_list_notes_csv_stream};

Work Chunk 1C: Human & Robot Output Formatting

Files: src/cli/commands/list.rs

Depends on: Work Chunk 1A

Tests to Write First

#[test]
fn test_truncate_note_body() {
    // A 200-char body should truncate to 80 chars total, including the "..." suffix
    let body = "x".repeat(200);
    let truncated = truncate_with_ellipsis(&body, 80);
    assert_eq!(truncated.len(), 80);
    assert!(truncated.ends_with("..."));
}

#[test]
fn test_csv_output_roundtrip() {
    // NoteListRow with body containing commas, quotes, newlines, and multi-byte chars
    // Write via print_list_notes_csv, parse back with csv::ReaderBuilder
    // Assert: all fields roundtrip correctly
}

#[test]
fn test_jsonl_output_one_per_line() {
    // NoteListResult with 3 notes
    // Capture stdout, split by newline
    // Assert: each line parses as valid JSON
    // Assert: 3 lines total
}

Implementation

print_list_notes(result: &NoteListResult) — human-readable table:

Table columns: ID | Author | Type | Body (truncated 60) | Path:Line | Parent | Created

  • ID: colored_cell(note.gitlab_id, Color::Cyan)
  • Author: colored_cell(format!("@{}", author), Color::Magenta)
  • Type: "Diff" or "Disc" or "-" (colored)
  • Body: first line, truncated to 60 chars
  • Path:Line: position_new_path:position_new_line or "-"
  • Parent: Issue #42 or MR !456 (from noteable_type + parent_iid)
  • Created: format_relative_time(created_at)
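The Body truncation helper could be sketched as follows (stdlib only; the real truncate_with_ellipsis may count bytes or display width rather than chars):

```rust
// Truncate to at most `max` characters, replacing the tail with "..." when cut.
fn truncate_with_ellipsis(s: &str, max: usize) -> String {
    if s.chars().count() <= max {
        return s.to_string();
    }
    // Keep max-3 chars so the result, including the ellipsis, is exactly `max`.
    let head: String = s.chars().take(max.saturating_sub(3)).collect();
    format!("{}...", head)
}
```

Iterating chars (not bytes) avoids slicing in the middle of a multi-byte UTF-8 sequence, which would panic with a byte-index slice.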

print_list_notes_json(result, elapsed_ms, fields) — robot JSON:

Follows exact envelope pattern:

{
  "ok": true,
  "data": {
    "notes": [...],
    "total_count": N,
    "showing": M
  },
  "meta": { "elapsed_ms": U64 }
}

Supports --fields via filter_fields(&mut output, "notes", &expanded).

print_list_notes_jsonl / print_list_notes_csv — streaming output:

For JSONL and CSV formats, use a streaming path that writes rows directly to stdout as they're read from the database, avoiding full materialization in memory. This matters for the year-long analysis use case where --limit 10000 or higher is common, and for piped workflows where downstream consumers (jq, LLM ingestion) can begin processing before the query completes.

print_list_notes_jsonl_stream(conn, filters) — streaming JSONL:

// Execute query, iterate over rows with a callback
query_notes_stream(&conn, &filters, |row| {
    let json_row = NoteListRowJson::from(&row);
    println!("{}", serde_json::to_string(&json_row).unwrap());
    Ok(())
})?;

Each line is a complete NoteListRowJson object. No envelope, no metadata. This format is ideal for streaming into LLM prompts, jq pipelines, or notebook ingestion.

print_list_notes_csv_stream(conn, filters) — streaming CSV:

let mut wtr = csv::Writer::from_writer(std::io::stdout());
wtr.write_record(&["id", "gitlab_id", "author_username", "body", "note_type", ...])?;
query_notes_stream(&conn, &filters, |row| {
    let json_row = NoteListRowJson::from(&row);
    wtr.write_record(&[json_row.id.to_string(), ...])?;
    Ok(())
})?;
wtr.flush()?;

Columns mirror NoteListRowJson field names. Uses the csv crate (csv::Writer) for RFC 4180-compliant escaping, handling commas, quotes, newlines, and multi-byte characters correctly.
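For illustration, the quoting the csv crate applies per RFC 4180 amounts to (a sketch of the rule, not the crate's implementation):

```rust
// Quote a CSV field only when needed: if it contains a comma, quote,
// CR, or LF, wrap it in quotes and double any embedded quotes.
fn csv_field(s: &str) -> String {
    if s.contains(',') || s.contains('"') || s.contains('\n') || s.contains('\r') {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}
```

Note bodies routinely contain commas, code snippets with quotes, and embedded newlines, which is exactly why hand-rolled `join(",")` output is ruled out here.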

query_notes_stream(conn, filters, row_handler) — forward-only row iteration that calls row_handler for each row. Uses the same SQL as query_notes() but iterates with rusqlite::Statement::query_map() instead of collecting into a Vec. The table and JSON formats continue to use the buffered query_notes() path since they need total_count and showing metadata.

Note: The streaming path skips the COUNT query since there's no envelope to report total_count in. For JSONL, this is expected — consumers count lines themselves. For CSV, the header row provides column names; row count is implicit.

Dependency: Add csv = "1" to Cargo.toml under [dependencies]. The csv crate is well-maintained, widely adopted (~100M downloads), and has zero unsafe code.


Work Chunk 1D: robot-docs Integration

Files: Wherever robot-docs manifest is generated (search for robot-docs or RobotDocs command handler)

Depends on: Work Chunks 1A-1C complete

Add the notes command to the robot-docs manifest with:

  • Command name, description, flags (including --format, --until, --resolution, --contains)
  • Response schema for robot mode
  • Exit codes

Also update the --source-type value_parser on SearchArgs (line 542 of src/cli/mod.rs) to include "note" and "notes" as valid values (this is forward-prep for Phase 2 but doesn't break anything until Phase 2 lands).


Work Chunk 1E: Composite Query Index

Files: migrations/022_notes_query_index.sql, src/core/db.rs

Depends on: Nothing (standalone, can run in parallel with 1A)

Context: The notes table already has single-column indexes on author_username, discussion_id, note_type, position_new_path, and a composite idx_notes_diffnote_path_created. However, the new query_notes() function's most common query patterns would benefit from composite covering indexes.

Tests to Write First

#[test]
fn test_migration_022_indexes_exist() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    run_migrations(&conn).unwrap();

    // Verify all indexes were created
    let count: i64 = conn.query_row(
        "SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND name IN (
            'idx_notes_user_created', 'idx_notes_project_created',
            'idx_notes_project_path_created',
            'idx_discussions_issue_id', 'idx_discussions_mr_id'
        )",
        [],
        |r| r.get(0),
    ).unwrap();
    assert_eq!(count, 5);
}

Implementation

Migration SQL (migrations/022_notes_query_index.sql):

-- Composite index for the common "notes by author" query pattern:
-- non-system notes filtered by author, sorted by created_at DESC with id tiebreaker.
-- The is_system partial index condition avoids indexing system notes (which are
-- filtered out by default and typically comprise 30-50% of all notes).
-- Uses COLLATE NOCASE to match the query's case-insensitive author comparison.
CREATE INDEX IF NOT EXISTS idx_notes_user_created
ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC)
WHERE is_system = 0;

-- Composite index for the common "all notes in project by date" query pattern:
-- serves project-scoped listings without author filter.
CREATE INDEX IF NOT EXISTS idx_notes_project_created
ON notes(project_id, created_at DESC, id DESC)
WHERE is_system = 0;

-- Composite index for path-centric note queries (--path with project/date filters).
-- DiffNote reviews on specific files are a stated hot path for the reviewer
-- profiling use case. Only indexes rows where position_new_path is populated.
CREATE INDEX IF NOT EXISTS idx_notes_project_path_created
ON notes(project_id, position_new_path, created_at DESC, id DESC)
WHERE is_system = 0 AND position_new_path IS NOT NULL;

-- Index on discussions.issue_id for efficient JOIN when filtering by parent issue.
-- The query_notes() function JOINs discussions to reach parent entities.
CREATE INDEX IF NOT EXISTS idx_discussions_issue_id
ON discussions(issue_id);

-- Index on discussions.merge_request_id for efficient JOIN when filtering by parent MR.
CREATE INDEX IF NOT EXISTS idx_discussions_mr_id
ON discussions(merge_request_id);

The first partial index serves the primary use case (author-scoped queries) with COLLATE NOCASE matching the query's case-insensitive author comparison. The second serves project-scoped date-range queries (--since/--until without --author). The third serves path-centric DiffNote queries (--path src/auth/ combined with project and date filters). All three exclude system notes, which are filtered out by default. The discussion indexes accelerate the JOIN path used by all note queries.
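Whether the planner actually chooses these indexes can be checked with EXPLAIN QUERY PLAN (a sketch; exact output text varies by SQLite version, and the planner may still prefer another index depending on statistics):

```sql
EXPLAIN QUERY PLAN
SELECT n.id
FROM notes n
WHERE n.is_system = 0
  AND n.project_id = 1
  AND n.author_username = 'jdefting' COLLATE NOCASE
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;
-- A plan line of the form
--   SEARCH notes USING INDEX idx_notes_user_created (...)
-- with no separate "USE TEMP B-TREE FOR ORDER BY" step indicates the
-- partial index is serving both the filter and the sort.
```

If the plan shows a scan or a temp B-tree sort, re-check that the query's predicates match the partial index's WHERE condition and collation exactly.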

Register in src/core/db.rs:

Add to the MIGRATIONS array (after migration 021):

(
    "022",
    include_str!("../../migrations/022_notes_query_index.sql"),
),

Note: This bumps the migration number, so Work Chunk 2A's schema migration (which was originally numbered 022) becomes migration 023 instead.


Phase 2: Per-Note Documents

Work Chunk 2A: Schema Migration (023)

Files: migrations/023_note_documents.sql, src/core/db.rs

Depends on: Work Chunk 1E (must come after migration 022)

Context: Current migration is 021 (022 after Work Chunk 1E). The documents and dirty_sources tables have CHECK constraints limiting source_type to ('issue','merge_request','discussion'). SQLite doesn't support ALTER TABLE ... ALTER CONSTRAINT, so we use the table-rebuild pattern.

Tests to Write First

// In src/core/db.rs tests or a new migration test:

#[test]
fn test_migration_023_allows_note_source_type() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    run_migrations(&conn).unwrap();

    // Should NOT error — note is now a valid source_type
    conn.execute(
        "INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('note', 1, 1000)",
        [],
    ).unwrap();

    conn.execute(
        "INSERT INTO documents (source_type, source_id, project_id, content_text, content_hash, is_truncated)
         VALUES ('note', 1, 1, 'test', 'abc123', 0)",
        [],
    ).unwrap();
}

#[test]
fn test_migration_023_preserves_existing_data() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    run_migrations(&conn).unwrap();

    // Insert with old source types still works
    conn.execute(
        "INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('issue', 1, 1000)",
        [],
    ).unwrap();
    conn.execute(
        "INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('discussion', 2, 1000)",
        [],
    ).unwrap();

    let count: i64 = conn.query_row("SELECT COUNT(*) FROM dirty_sources", [], |r| r.get(0)).unwrap();
    assert_eq!(count, 2);
}

#[test]
fn test_migration_023_fts_triggers_intact() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    run_migrations(&conn).unwrap();

    // Insert a note document
    conn.execute(
        "INSERT INTO documents (source_type, source_id, project_id, title, content_text, content_hash, is_truncated)
         VALUES ('note', 1, 1, 'Test Note', 'This is the note body', 'hash123', 0)",
        [],
    ).unwrap();

    // FTS should auto-sync via trigger
    let count: i64 = conn.query_row(
        "SELECT COUNT(*) FROM documents_fts WHERE documents_fts MATCH 'note'",
        [],
        |r| r.get(0),
    ).unwrap();
    assert_eq!(count, 1);
}

#[test]
fn test_migration_023_row_counts_preserved() {
    // This test verifies the migration doesn't lose data during table rebuild.
    // It runs all migrations up to version 22, inserts test data into documents/dirty_sources/
    // document_labels/document_paths BEFORE migration 023, then verifies
    // counts are identical after migration 023 runs.
    // (Implementation: create_connection_at_version(22) + insert data + run_migration(23) + assert counts)
    // Note: This may require a test helper that runs migrations up to a specific version.
}

Implementation

Migration SQL (migrations/023_note_documents.sql):

The tables with CHECK constraints that need rebuilding:

  1. dirty_sources — add 'note' to source_type CHECK
  2. documents — add 'note' to source_type CHECK

Pattern: create new table, copy data, drop old, rename. Must also recreate FTS triggers (they reference the table by name) and all indexes.

CRITICAL: The documents_fts external content table references documents by rowid. Rebuilding documents changes rowids unless we preserve them. Use INSERT INTO documents_new SELECT * FROM documents to preserve the id (PRIMARY KEY = rowid).

CRITICAL: The FTS triggers (documents_ai, documents_ad, documents_au) must be dropped and recreated after the table rebuild because they reference documents which was dropped/renamed.

Migration safety requirements:

  • The migration executes as a single transaction (SQLite migration runner wraps each migration in a transaction).
  • After the table rebuild, verify row counts match: SELECT COUNT(*) FROM documents must equal the pre-rebuild count. Plain SQLite SQL has no variables or assertions, so capture the pre-rebuild count into a temp table and have the migration runner compare the counts, aborting the transaction on mismatch.
  • Run PRAGMA foreign_key_check after the rebuild and abort on any violation.
  • Rebuild FTS index and verify documents_fts row count matches documents row count.

The migration must:

  1. Drop FTS triggers
  2. Create documents_new with updated CHECK (adding 'note')
  3. INSERT INTO documents_new SELECT * FROM documents
  4. Save the contents of document_labels and document_paths (dropping documents cascades to them via ON DELETE CASCADE)
  5. With foreign keys disabled: drop the old tables, rename documents_new to documents, recreate the junction tables, restore their saved data, recreate the FTS triggers, and recreate all indexes
  6. Same pattern for dirty_sources (simpler — no dependents)
-- Backfill: seed all existing non-system notes into the dirty queue
-- so the next generate-docs run creates documents for them.
-- Uses LEFT JOIN to skip notes that already have documents (idempotent).
-- ON CONFLICT DO NOTHING handles notes already in the dirty queue.
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, CAST(strftime('%s', 'now') AS INTEGER) * 1000
FROM notes n
LEFT JOIN documents d
  ON d.source_type = 'note' AND d.source_id = n.id
WHERE n.is_system = 0 AND d.id IS NULL
ON CONFLICT(source_type, source_id) DO NOTHING;
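A minimal sketch of the rebuild pattern for dirty_sources, the simpler of the two tables (the column set and primary key are assumptions inferred from the INSERT statements and ON CONFLICT target shown in this document; documents follows the same pattern plus trigger, junction-table, and index handling):

```sql
-- Assumed schema: columns and PK inferred from usage elsewhere in this PRD.
CREATE TABLE dirty_sources_new (
    source_type TEXT NOT NULL
        CHECK (source_type IN ('issue','merge_request','discussion','note')),
    source_id   INTEGER NOT NULL,
    queued_at   INTEGER NOT NULL,
    PRIMARY KEY (source_type, source_id)
);
INSERT INTO dirty_sources_new SELECT * FROM dirty_sources;
DROP TABLE dirty_sources;
ALTER TABLE dirty_sources_new RENAME TO dirty_sources;
```

SELECT * preserves column order and, for documents, the rowid/id values that documents_fts depends on.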

Register in src/core/db.rs:

Add to the MIGRATIONS array (after migration 022):

(
    "023",
    include_str!("../../migrations/023_note_documents.sql"),
),

Note: The backfill statement is a data-only step: it is safe to run on empty databases (no notes means a no-op). On databases with existing notes, it queues them for document generation on the next lore generate-docs or lore sync run.


Work Chunk 2B: SourceType Enum Extension

Files: src/documents/extractor.rs

Depends on: Work Chunk 2A (migration must exist so test DBs have the right schema)

Tests to Write First

Add to src/documents/extractor.rs in the existing test module:

#[test]
fn test_source_type_parse_note() {
    assert_eq!(SourceType::parse("note"), Some(SourceType::Note));
    assert_eq!(SourceType::parse("notes"), Some(SourceType::Note));
    assert_eq!(SourceType::parse("NOTE"), Some(SourceType::Note));
}

#[test]
fn test_source_type_note_as_str() {
    assert_eq!(SourceType::Note.as_str(), "note");
}

#[test]
fn test_source_type_note_display() {
    assert_eq!(format!("{}", SourceType::Note), "note");
}

#[test]
fn test_source_type_note_serde_roundtrip() {
    let st = SourceType::Note;
    let json = serde_json::to_string(&st).unwrap();
    assert_eq!(json, "\"note\"");
    let parsed: SourceType = serde_json::from_str(&json).unwrap();
    assert_eq!(parsed, SourceType::Note);
}

Implementation

In src/documents/extractor.rs:

  1. Add Note variant to SourceType enum (line 18):

    Note,
    
  2. Add match arm to as_str() (line 27):

    Self::Note => "note",
    
  3. Add parse aliases (line 35):

    "note" | "notes" => Some(Self::Note),
    

Work Chunk 2C: Note Document Extractor

Files: src/documents/extractor.rs

Depends on: Work Chunk 2B

Context: Follows the exact pattern of extract_issue_document() (lines 85-184) and extract_discussion_document() (lines 302-516). The new function extracts a single non-system note into a DocumentData struct.

Tests to Write First

Add to src/documents/extractor.rs test module. Uses setup_discussion_test_db() (line 1025) and insert_note() (line 1086) helpers that already exist.

#[test]
fn test_note_document_basic_format() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 42, Some("Auth redesign"), Some("desc"), "opened", Some("alice"),
        Some("https://gitlab.example.com/group/project-one/-/issues/42"));
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 12345, 1, Some("jdefting"), Some("This function is too complex, consider extracting the validation logic."), 1710460800000, false, None, None);

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.source_type, SourceType::Note);
    assert_eq!(doc.source_id, 1);
    assert_eq!(doc.project_id, 1);
    assert_eq!(doc.author_username, Some("jdefting".to_string()));
    assert!(doc.content_text.contains("[[Note]]"));
    assert!(doc.content_text.contains("author: @jdefting"));
    assert!(doc.content_text.contains("This function is too complex"));
    assert!(doc.content_text.contains("Issue #42: Auth redesign"));
    assert!(doc.content_text.contains("group/project-one"));
    assert_eq!(doc.title, Some("Note by @jdefting on Issue #42".to_string()));
    assert!(!doc.is_truncated);
}

#[test]
fn test_note_document_diffnote_with_path() {
    let conn = setup_discussion_test_db();
    insert_mr(&conn, 1, 99, Some("JWT Auth"), Some("desc"), Some("opened"), Some("alice"),
        Some("feat/jwt"), Some("main"), Some("https://gitlab.example.com/group/project-one/-/merge_requests/99"));
    insert_discussion(&conn, 1, "MergeRequest", None, Some(1));
    insert_note(&conn, 1, 54321, 1, Some("jdefting"), Some("This should use a match statement"),
        1710460800000, false, Some("src/old_auth.rs"), Some("src/auth.rs"));

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.paths, vec!["src/auth.rs", "src/old_auth.rs"]);
    assert!(doc.content_text.contains("path: src/auth.rs"));
    assert!(doc.content_text.contains("MR !99: JWT Auth"));
    assert_eq!(doc.title, Some("Note by @jdefting on MR !99".to_string()));
    assert!(doc.url.unwrap().contains("#note_54321"));
}

#[test]
fn test_note_document_inherits_parent_labels() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_label(&conn, 1, "backend");
    insert_label(&conn, 2, "security");
    link_issue_label(&conn, 1, 1);
    link_issue_label(&conn, 1, 2);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Hello"), 1710460800000, false, None, None);

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.labels, vec!["backend", "security"]);
}

#[test]
fn test_note_document_mr_labels() {
    let conn = setup_discussion_test_db();
    insert_mr(&conn, 1, 10, Some("Test"), None, Some("opened"), None, None, None, None);
    insert_label(&conn, 1, "review");
    link_mr_label(&conn, 1, 1);
    insert_discussion(&conn, 1, "MergeRequest", None, Some(1));
    insert_note(&conn, 1, 100, 1, Some("reviewer"), Some("LGTM"), 1710460800000, false, None, None);

    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.labels, vec!["review"]);
}

#[test]
fn test_note_document_system_note_returns_none() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("bot"), Some("assigned to @alice"), 1710460800000, true, None, None);

    let result = extract_note_document(&conn, 1).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_not_found() {
    let conn = setup_discussion_test_db();
    let result = extract_note_document(&conn, 999).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_orphaned_discussion() {
    // Discussion exists but parent issue was deleted
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 99, 10, Some("Deleted"), None, "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(99), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Hello"), 1710460800000, false, None, None);
    conn.execute("PRAGMA foreign_keys = OFF", []).unwrap();
    conn.execute("DELETE FROM issues WHERE id = 99", []).unwrap();
    conn.execute("PRAGMA foreign_keys = ON", []).unwrap();

    let result = extract_note_document(&conn, 1).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_hash_deterministic() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Comment"), 1710460800000, false, None, None);

    let doc1 = extract_note_document(&conn, 1).unwrap().unwrap();
    let doc2 = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc1.content_hash, doc2.content_hash);
    assert_eq!(doc1.labels_hash, doc2.labels_hash);
    assert_eq!(doc1.paths_hash, doc2.paths_hash);
}

#[test]
fn test_note_document_empty_body() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some(""), 1710460800000, false, None, None);

    // Should still produce a document (body is optional in schema)
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert!(doc.content_text.contains("[[Note]]"));
}

#[test]
fn test_note_document_null_body() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), None, 1710460800000, false, None, None);

    // Should still produce a document (body is optional in schema)
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert!(doc.content_text.contains("[[Note]]"));
}

Implementation

Add extract_note_document() to src/documents/extractor.rs (after extract_discussion_document, around line 516):

pub fn extract_note_document(
    conn: &Connection,
    note_id: i64,
) -> Result<Option<DocumentData>> {
    // 1. Fetch the note
    let note_row = conn.query_row(
        "SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
                n.is_system, n.created_at, n.updated_at,
                n.position_old_path, n.position_new_path,
                n.position_new_line,
                n.resolvable, n.resolved,
                d.noteable_type, d.issue_id, d.merge_request_id,
                p.path_with_namespace, p.id AS project_id
         FROM notes n
         JOIN discussions d ON n.discussion_id = d.id
         JOIN projects p ON n.project_id = p.id
         WHERE n.id = ?1",
        rusqlite::params![note_id],
        |row| { /* map all fields */ },
    );
    // Handle QueryReturnedNoRows -> Ok(None)

    // 2. Skip system notes
    if is_system { return Ok(None); }

    // 3. Fetch parent entity (Issue or MR) — same pattern as extract_discussion_document lines 332-401
    //    Get parent_iid, parent_title, parent_web_url, labels

    // 4. Build paths BTreeSet from position_old_path, position_new_path

    // 5. Build URL: parent_web_url + "#note_{gitlab_id}"

    // 6. Format content with structured metadata header:
    //    [[Note]]
    //    source_type: note
    //    note_gitlab_id: {gitlab_id}
    //    project: {path_with_namespace}
    //    parent_type: {Issue|MergeRequest}
    //    parent_iid: {iid}
    //    parent_title: {title}
    //    note_type: {DiffNote|DiscussionNote|Comment}
    //    author: @{author}
    //    author_id: {author_id}             (only if non-null)
    //    created_at: {iso8601}
    //    resolved: {true|false}          (only if resolvable)
    //    path: {position_new_path}:{position_new_line}  (only if DiffNote)
    //    labels: {comma-separated}
    //    url: {url}
    //
    //    --- Body ---
    //
    //    {body}

    // 7. Title: "Note by @{author} on {parent_type_prefix}"

    // 8. Compute hashes, apply truncate_hard_cap, return DocumentData
}

The content format uses a structured key-value header optimized for machine parsing and semantic search, followed by the raw note body. This is deliberately different from discussion documents — it's optimized for individual note semantics rather than thread context.

Structured header rationale: The key-value format allows the embedding model and FTS to index structured fields (author, project, parent reference) alongside the free-text body, improving search precision for queries like "jdefting's comments on authentication issues."
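The header layout and path handling described above can be sketched in isolation. This is a minimal illustration, not the actual extractor code: `NoteMeta`, `format_note_header`, and `note_paths` are hypothetical names, and the field list is trimmed to a few representative keys.

```rust
use std::collections::BTreeSet;

/// Trimmed, hypothetical stand-in for the note fields the extractor reads.
struct NoteMeta {
    gitlab_id: i64,
    project: String,
    parent_type: String, // "Issue" | "MergeRequest"
    parent_iid: i64,
    parent_title: String,
    author: String,
    created_at_iso: String,
    url: String,
    body: String,
}

/// Assemble the structured key-value header followed by the raw body.
fn format_note_header(n: &NoteMeta) -> String {
    format!(
        "[[Note]]\n\
         source_type: note\n\
         note_gitlab_id: {}\n\
         project: {}\n\
         parent_type: {}\n\
         parent_iid: {}\n\
         parent_title: {}\n\
         author: @{}\n\
         created_at: {}\n\
         url: {}\n\
         \n--- Body ---\n\n{}",
        n.gitlab_id, n.project, n.parent_type, n.parent_iid,
        n.parent_title, n.author, n.created_at_iso, n.url, n.body
    )
}

/// Dedup and sort diff-note paths via BTreeSet, matching the expected
/// ["src/auth.rs", "src/old_auth.rs"] ordering in the Chunk 2C tests.
fn note_paths(old: Option<&str>, new: Option<&str>) -> Vec<String> {
    old.into_iter()
        .chain(new)
        .map(str::to_string)
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}
```

The `\n\` line continuations keep the header literal readable without introducing stray indentation into the document text.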


Work Chunk 2D: Regenerator & Dirty Tracking Integration

Files: src/documents/regenerator.rs, src/ingestion/discussions.rs, src/ingestion/mr_discussions.rs

Depends on: Work Chunks 0A, 2B, 2C

Tests to Write First

In src/documents/regenerator.rs tests:

#[test]
fn test_regenerate_note_document() {
    let conn = setup_db();
    // Add discussions + notes tables to setup_db() (or use a richer setup)
    // Insert: project, issue, discussion, non-system note
    // mark_dirty(SourceType::Note, note_id)
    // regenerate_dirty_documents()
    // Assert: document created with source_type = 'note'
    // Assert: document content contains note body
}

#[test]
fn test_regenerate_note_system_note_deletes() {
    // Insert system note, mark dirty
    // regenerate_dirty_documents()
    // Assert: no document created (extract returns None -> delete path)
}

#[test]
fn test_regenerate_note_unchanged() {
    // Create note, regenerate, mark dirty again, regenerate
    // Assert: second run returns unchanged = 1
}

#[test]
fn test_note_ingestion_idempotent_across_two_syncs() {
    // Setup: project, issue, discussion, 3 non-system notes
    // Run ingestion once -> verify 3 dirty notes queued
    // Regenerate documents -> verify 3 note documents created
    // Run ingestion again with identical data
    // Assert: no new dirty entries (changed_semantics = false for all)
}

In src/ingestion/dirty_tracker.rs tests:

#[test]
fn test_mark_dirty_note_type() {
    // Update the test DB setup to include 'note' in CHECK constraint
    let conn = setup_db(); // This needs the new CHECK
    mark_dirty(&conn, SourceType::Note, 1).unwrap();
    let results = get_dirty_sources(&conn).unwrap();
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].0, SourceType::Note);
}

Implementation

1. Update regenerate_one() in src/documents/regenerator.rs (line 90):

SourceType::Note => extract_note_document(conn, source_id)?,

And add the import at line 8:

use crate::documents::{
    DocumentData, SourceType, extract_discussion_document, extract_issue_document,
    extract_mr_document, extract_note_document,
};

2. Add change-aware dirty marking in src/ingestion/discussions.rs (in the new upsert loop from Phase 0):

for note in &normalized_notes {
    let outcome = upsert_note_for_issue(&tx, local_discussion_id, &note, last_seen_at)?;
    if !note.is_system && outcome.changed_semantics {
        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
    }
}
sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;

3. Same change-aware dirty marking in src/ingestion/mr_discussions.rs (update the existing upsert loop):

let outcome = upsert_note(&tx, local_discussion_id, &note, last_seen_at, None)?;
if !note.is_system && outcome.changed_semantics {
    dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
}

4. Update dirty_tracker.rs test setup_db() to include 'note' in the CHECK constraint (line 134).

5. Update regenerator.rs test setup_db() to include the discussions + notes tables so note-type regeneration tests can run.
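The change-aware marking in steps 2 and 3 hinges on the upsert reporting whether semantic fields actually changed. A stdlib-only sketch of that decision, assuming `UpsertOutcome` carries `local_note_id` and `changed_semantics` as in the snippets above; the `NoteSemantics` struct and `upsert_outcome` helper are hypothetical, and the real upsert compares against the existing row inside the transaction:

```rust
/// Hypothetical subset of the fields that affect document content.
/// Metadata-only updates (e.g. last_seen_at) are deliberately excluded.
#[derive(PartialEq)]
struct NoteSemantics {
    body: Option<String>,
    resolved: bool,
    position_new_path: Option<String>,
}

struct UpsertOutcome {
    local_note_id: i64,
    changed_semantics: bool,
}

/// A brand-new note, or any change to a semantic field, marks the note
/// dirty; an identical re-sync does not requeue it.
fn upsert_outcome(
    local_note_id: i64,
    old: Option<&NoteSemantics>,
    new: &NoteSemantics,
) -> UpsertOutcome {
    let changed = match old {
        None => true,
        Some(prev) => prev != new,
    };
    UpsertOutcome { local_note_id, changed_semantics: changed }
}
```

This is what makes `test_note_ingestion_idempotent_across_two_syncs` pass: the second sync sees identical semantics and queues nothing.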


Work Chunk 2E: Generate-Docs Full Rebuild Support

Files: to be located; search for where the robot-docs manifest is generated (grep for robot-docs or the RobotDocs command handler)

Depends on: Work Chunk 2D

Context: When lore generate-docs --full runs, it seeds ALL issues, MRs, and discussions into the dirty queue. Notes must be seeded too.

Tests to Write First

#[test]
fn test_full_seed_includes_notes() {
    // Setup DB with project, issue, discussion, 3 non-system notes, 1 system note
    // Call seed_all_dirty(conn) or whatever the full-rebuild seeder is named
    // Assert: dirty_sources contains 3 entries with source_type = 'note'
    // Assert: system note is NOT in dirty_sources
}

#[test]
fn test_note_document_count_stable_after_second_generate_docs_full() {
    // Setup DB with project, issue, discussion, 5 non-system notes
    // Run generate-docs --full equivalent (seed + regenerate)
    // Record document count
    // Run generate-docs --full again
    // Assert: document count unchanged (idempotent)
    // Assert: dirty queue is empty after second run
}

Implementation

Find the function that seeds dirty_sources for --full mode (likely in the generate-docs handler or a dedicated seeder function). Add:

INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, ?1
FROM notes n
WHERE n.is_system = 0
ON CONFLICT(source_type, source_id) DO UPDATE SET
  queued_at = excluded.queued_at,
  attempt_count = 0,
  last_attempt_at = NULL,
  last_error = NULL,
  next_attempt_at = NULL

Work Chunk 2F: Search CLI --type note Support

Files: src/cli/mod.rs, src/cli/commands/search.rs (display code)

Depends on: Work Chunks 2A-2E (documents must exist to be searched)

Tests to Write First

Integration/smoke test:

#[test]
fn test_search_source_type_note_filter() {
    // This is essentially testing that SourceType::Note flows through
    // the existing search pipeline correctly. Since the search filter
    // code is generic (filters.rs:70-73), the main test is that
    // SourceType::parse("note") works — already covered in 2B.
    // Add a smoke test that the CLI accepts --type note.
}

Implementation

  1. Update SearchArgs.source_type value_parser in src/cli/mod.rs (line 542):

    #[arg(long = "type", value_name = "TYPE",
          value_parser = ["issue", "mr", "discussion", "note", "notes"],
          help_heading = "Filters")]
    pub source_type: Option<String>,
    
  2. Update the search results display to show "Note" prefix for note-type results (check print_search_results in src/cli/commands/search.rs).


Work Chunk 2G: Parent Metadata Change Propagation

Files: src/ingestion/orchestrator.rs (or wherever parent entity updates trigger dirty marking), src/documents/regenerator.rs

Depends on: Work Chunk 2D

Context: Note documents inherit metadata from their parent issue/MR — specifically labels and title. When a parent's title or labels change, the note documents derived from that parent become stale. The existing ingestion pipeline already marks discussion documents dirty when parent metadata changes. Note documents need the same treatment.

Problem

If issue #42's title changes from "Auth redesign" to "Auth overhaul", all note documents under that issue still say "Issue #42: Auth redesign" until their content is regenerated. Similarly, label changes on the parent propagate into the note document's labels field and label_names text.

Tests to Write First

#[test]
fn test_parent_title_change_marks_notes_dirty() {
    // Setup: project, issue, discussion, 2 non-system notes
    // Generate note documents (verify they exist)
    // Change the issue title
    // Trigger the parent-change propagation
    // Assert: both note documents are in dirty_sources
}

#[test]
fn test_parent_label_change_marks_notes_dirty() {
    // Setup: project, issue with label "backend", discussion, note
    // Generate note document (verify labels = ["backend"])
    // Add label "security" to the issue
    // Trigger the parent-change propagation
    // Assert: note document is in dirty_sources
    // Regenerate and verify labels = ["backend", "security"]
}

Implementation

Find where the ingestion pipeline detects parent entity changes and marks discussion documents dirty. Add the same logic for note documents:

-- When an issue's title or labels change, mark all its non-system notes dirty
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, ?1
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE d.issue_id = ?2 AND n.is_system = 0
ON CONFLICT(source_type, source_id) DO UPDATE SET
  queued_at = excluded.queued_at,
  attempt_count = 0,
  last_attempt_at = NULL,
  last_error = NULL,
  next_attempt_at = NULL

Same pattern for MR parent changes. The exact integration point depends on how the existing discussion dirty-marking works — it should be adjacent to that code.

Note on deletion handling: Note deletion is handled by two complementary mechanisms:

  1. Immediate propagation (Work Chunk 0B): When sweep deletes stale notes, documents and dirty_sources entries are cleaned up in the same transaction. No stale search results.
  2. Eventual consistency (generate-docs --full): For edge cases where a note was deleted outside the normal sweep path, the full rebuild catches orphaned documents since the note row no longer exists and extract_note_document() returns None -> document deleted.

No additional deletion logic is needed beyond Work Chunk 0B + the existing regenerator orphan cleanup.


Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)

Files: migrations/024_note_dirty_backfill.sql, src/core/db.rs

Depends on: Work Chunk 2A (migration 023 must exist so dirty_sources accepts source_type='note')

Context: When a user upgrades to a version with note document support, existing notes in the database have no corresponding documents. Without a backfill, only notes that change after the upgrade would get documents — historical notes remain invisible to search. This migration seeds all existing non-system notes into the dirty queue so the next generate-docs run creates documents for them.

Tests to Write First

#[test]
fn test_migration_024_backfills_existing_notes() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run migrations up through 023
    // Insert: project, issue, discussion, 5 non-system notes, 2 system notes
    // Run migration 024
    // Assert: dirty_sources contains 5 entries with source_type = 'note'
    // Assert: system notes are NOT in dirty_sources
}

#[test]
fn test_migration_024_idempotent_with_existing_documents() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run all migrations including 024
    // Insert: project, issue, discussion, 3 non-system notes
    // Create note documents for 2 of 3 notes (simulate partial state)
    // Re-run the backfill SQL manually
    // Assert: only the 1 note without a document is in dirty_sources
    // Assert: ON CONFLICT DO NOTHING prevents duplicates
}

#[test]
fn test_migration_024_skips_notes_already_in_dirty_queue() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run all migrations
    // Insert note and manually add to dirty_sources
    // Re-run backfill SQL
    // Assert: no duplicate entries (ON CONFLICT DO NOTHING)
}

Implementation

Migration SQL (migrations/024_note_dirty_backfill.sql):

-- Backfill: seed all existing non-system notes into the dirty queue
-- so the next generate-docs run creates documents for them.
-- Uses LEFT JOIN to skip notes that already have documents (idempotent).
-- ON CONFLICT DO NOTHING handles notes already in the dirty queue.
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, CAST(strftime('%s', 'now') AS INTEGER) * 1000
FROM notes n
LEFT JOIN documents d
  ON d.source_type = 'note' AND d.source_id = n.id
WHERE n.is_system = 0 AND d.id IS NULL
ON CONFLICT(source_type, source_id) DO NOTHING;

Register in src/core/db.rs:

Add to the MIGRATIONS array (after migration 023):

(
    "024",
    include_str!("../../migrations/024_note_dirty_backfill.sql"),
),

Note: This is a data-only migration — no schema changes. It's safe to run on empty databases (no notes = no-op). On databases with existing notes, it queues them for document generation on the next lore generate-docs or lore sync run.


Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration

Files: src/documents/regenerator.rs, src/documents/extractor.rs

Depends on: Work Chunk 2C (extractor function must exist)

Context: The extract_note_document() function fetches parent entity metadata (issue/MR title, labels, project path) via individual SQL queries per note. During the initial backfill of ~8,000 existing notes, this creates N+1 query amplification: each note triggers its own parent metadata lookup, even though many notes share the same parent entity. For example, 50 notes on the same MR would execute 50 identical parent metadata queries.

This is a performance optimization for batch regeneration, not a correctness change. Individual note regeneration (dirty tracking during incremental sync) is unaffected — the N+1 cost is negligible for the typical 1-10 dirty notes per sync.

Tests to Write First

#[test]
fn test_note_regeneration_batch_uses_cache() {
    // Setup: project, issue with 10 non-system notes
    // Mark all 10 as dirty
    // Run regenerate_dirty_documents()
    // Assert: all 10 documents created correctly
    // Assert: parent metadata query count == 1 (not 10)
    // (Use a query counter or verify via cache hit metrics)
}

#[test]
fn test_note_regeneration_cache_consistent_with_direct_extraction() {
    // Setup: project, issue with labels, discussion, 3 notes
    // Extract note document directly (no cache)
    // Extract via cached batch path
    // Assert: content_hash is identical for both paths
    // Assert: labels_hash is identical for both paths
}

#[test]
fn test_note_regeneration_cache_invalidates_across_parents() {
    // Setup: 2 issues, each with notes
    // Regenerate notes from both issues in one batch
    // Assert: each issue's notes get correct parent metadata
    // (cache keyed by (noteable_type, parent_id), not globally shared)
}

Implementation

1. Add ParentMetadataCache struct in src/documents/extractor.rs:

use std::collections::HashMap;

/// Cache for parent entity metadata during batch note document extraction.
/// Keyed by (noteable_type, parent_local_id) to avoid repeated lookups
/// when multiple notes share the same parent issue/MR.
pub struct ParentMetadataCache {
    cache: HashMap<(String, i64), ParentMetadata>,
}

pub struct ParentMetadata {
    pub iid: i64,
    pub title: Option<String>,
    pub web_url: Option<String>,
    pub labels: Vec<String>,
    pub project_path: String,
}

impl ParentMetadataCache {
    pub fn new() -> Self { Self { cache: HashMap::new() } }

    pub fn get_or_fetch(
        &mut self,
        conn: &Connection,
        noteable_type: &str,
        parent_id: i64,
    ) -> Result<&ParentMetadata> {
        // HashMap entry API: fetch from DB on miss, return cached on hit
    }
}

2. Add extract_note_document_cached() variant that accepts &mut ParentMetadataCache and uses it instead of inline parent metadata queries. The uncached extract_note_document() remains for single-note regeneration.

3. Update batch regeneration loop in src/documents/regenerator.rs:

// In the regeneration loop, when processing a batch of dirty sources:
let mut parent_cache = ParentMetadataCache::new();

for (source_type, source_id) in dirty_batch {
    match source_type {
        // Cached extraction; the returned Option<DocumentData> still flows
        // through the same write/dedup/delete path as the uncached variant.
        SourceType::Note => extract_note_document_cached(conn, source_id, &mut parent_cache)?,
        // Other source types use existing extraction functions (no cache needed)
        _ => regenerate_one(conn, source_type, source_id)?,
    };
}

Scope limit: The cache is created fresh per regeneration batch and discarded after. No cross-batch persistence, no invalidation complexity. The cache is purely an optimization for batch processing where many notes share parents.
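The `get_or_fetch` miss path hinted at above maps naturally onto the `HashMap` entry API. A stdlib-only sketch with the database lookup abstracted as a fallible closure; the closure parameter and `String` error type are illustrative, and the real version would take `&Connection` and return the crate's `Result`:

```rust
use std::collections::HashMap;
use std::collections::hash_map::Entry;

struct ParentMetadata {
    title: String,
    labels: Vec<String>,
}

struct ParentMetadataCache {
    // Keyed by (noteable_type, parent_local_id), per the struct above.
    cache: HashMap<(String, i64), ParentMetadata>,
}

impl ParentMetadataCache {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn get_or_fetch<F>(
        &mut self,
        noteable_type: &str,
        parent_id: i64,
        fetch: F,
    ) -> Result<&ParentMetadata, String>
    where
        F: FnOnce() -> Result<ParentMetadata, String>,
    {
        match self.cache.entry((noteable_type.to_string(), parent_id)) {
            // Hit: later notes sharing this parent skip the DB entirely.
            Entry::Occupied(e) => Ok(e.into_mut()),
            // Miss: run the (DB) fetch exactly once, cache, and return.
            Entry::Vacant(v) => Ok(v.insert(fetch()?)),
        }
    }
}
```

Using `Entry` rather than a `contains_key`/`insert` pair keeps the lookup to a single hash probe and makes the fetch-once guarantee explicit.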


Verification Checklist

After all chunks are complete, run the full quality gate:

cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check

Then functional smoke tests:

# Phase 0 verification
# Sync twice and verify note IDs are stable:
lore sync
lore -J notes --limit 5  # Record gitlab_ids and local ids
lore sync
lore -J notes --limit 5  # Verify same local ids for same gitlab_ids

# Phase 1 verification
lore -J notes --author jdefting --since 365d --limit 5
lore -J notes --note-type DiffNote --path src/ --limit 10
lore notes --for-mr 456 -p group/repo
lore notes --for-issue 42 -p group/repo  # Verify project-scoping works
lore notes --since 180d --until 90d       # Bounded window (180 days ago to 90 days ago)
lore notes --resolution unresolved        # Tri-state resolution filter
lore notes --contains "unwrap" --note-type DiffNote  # Body substring + type filter
lore notes --author jdefting --format jsonl | wc -l  # JSONL streaming
lore -J notes --gitlab-note-id 12345       # Precision filter: exact GitLab note
lore -J notes --discussion-id 42           # Precision filter: all notes in thread

# Phase 2 verification
lore sync  # Should generate note documents
lore -J stats  # Should show note document count in source_type breakdown
lore -J search "code complexity" --type note --author jdefting
lore -J search "error handling" --type note --since 180d

Idempotence checks:

# Verify generate-docs --full is idempotent
lore generate-docs --full
lore -J stats > /tmp/stats1.json
lore generate-docs --full
lore -J stats > /tmp/stats2.json
diff /tmp/stats1.json /tmp/stats2.json  # Should be identical (modulo timing metadata)

Deletion propagation checks:

# Verify that deleted notes don't leave stale documents
# (Manual test: delete a note on GitLab, sync, verify document is gone)
lore sync
lore -J search "specific phrase from deleted note" --type note
# Should return no results

Performance and query plan verification:

# Verify indexes are used for common query patterns
# Run EXPLAIN QUERY PLAN for the hot paths:
sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
       n.is_system, n.created_at, n.updated_at,
       n.position_new_path, n.position_new_line,
       n.position_old_path, n.position_old_line,
       n.resolvable, n.resolved, n.resolved_by,
       d.noteable_type,
       COALESCE(i.iid, m.iid) AS parent_iid,
       COALESCE(i.title, m.title) AS parent_title,
       p.path_with_namespace
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
JOIN projects p ON n.project_id = p.id
LEFT JOIN issues i ON d.issue_id = i.id
LEFT JOIN merge_requests m ON d.merge_request_id = m.id
WHERE n.is_system = 0 AND n.author_username = 'jdefting' COLLATE NOCASE AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_user_created

sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE n.is_system = 0 AND n.project_id = 1 AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_project_created

sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE n.is_system = 0 AND d.issue_id = (SELECT id FROM issues WHERE iid = 42 AND project_id = 1)
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_discussions_issue_id for the join

Operational checks:

  • lore -J stats output includes documents.notes count (the stats command queries by hardcoded source_type strings — verify 'note' is added)
  • Verify lore -J count notes still reports user vs system breakdown correctly after the changes
  • After a full lore generate-docs --full, verify note document count approximately matches non-system note count from lore count notes

Work Chunk Dependency Graph

0A (stable note identity) ──┬────────────────────────────────────┐
                             │                                    │
                             ├── 0B (deletion propagation) ◄──────┤── 2A (migration 023, + cleanup triggers)
                             │                                    │
                             ├── 0C (sweep safety guard)          │
                             │                                    │
                             ├── 0D (author_id capture)           │
                             │                                    │
1A (data types + query) ──┐  │                                    │
                          ├── 1B (CLI args + wiring) ──┐          │
                          ├── 1C (output formatting)    ├── 1D (robot-docs)
1E (query index, mig 022  │                             │          │
    + author_id column) ──┘                             │          │
                                                        │          │
2A (migration 023) ───────┐                             │          │
                          ├── 2B (SourceType enum)      │          │
                          │     │                       │          │
                          │     ├── 2C (extractor fn)   │          │
                          │     │     │                 │          │
                          │     │     ├── 2D (regenerator + dirty tracking) ◄─┘
                          │     │     │     │
                          │     │     │     ├── 2E (generate-docs --full)
                          │     │     │     │
                          │     │     │     ├── 2F (search --type note)
                          │     │     │     │
                          │     │     │     ├── 2G (parent change propagation)
                          │     │     │     │
                          │     │     │     ├── 2H (backfill migration 024)
                          │     │     │     │
                          │     │     ├── 2I (batch parent metadata cache)

Parallelizable pairs:

  • 0A, 1A, 1E, and 2A can all run simultaneously (no code overlap)
  • 0C and 0D can run immediately after 0A (both modify upsert functions from 0A)
  • 1C and 2B can run simultaneously
  • 2E, 2F, 2G, 2H, and 2I can run simultaneously after 2D (2I only needs 2C)
  • 0B depends on both 0A and 2A (needs sweep functions from 0A and documents table accepting 'note' from 2A)

Critical path: 0A -> 0C -> 2D -> 2G (Phase 0 must land before dirty tracking integrates with upsert outcomes)

Secondary critical path: 2A -> 2B -> 2C -> 2D (document pipeline chain)


Estimated Document Volume Impact

| Entity             | Typical Count | Documents Before | Documents After |
|--------------------|---------------|------------------|-----------------|
| Issues             | 500           | 500              | 500             |
| MRs                | 300           | 300              | 300             |
| Discussions        | 2,000         | 2,000            | 2,000           |
| Notes (non-system) | ~8,000        | 0                | +8,000          |
| Total              |               | 2,800            | 10,800          |

FTS5 handles this comfortably. Embedding generation time scales linearly (~4x increase). The three-hash dedup means incremental syncs remain fast. With Phase 0's change-aware dirty marking, only genuinely modified notes trigger regeneration — typical incremental syncs will dirty a small fraction of the 8k total.


Rejected Recommendations

These recommendations were proposed during review and deliberately rejected. Documenting here to prevent re-proposal.

  • Feature flag gating / gated rollout — rejected because this is a single-user CLI tool in early development with no external users. Adding runtime feature gates (feature.notes_cli, feature.note_documents) for a feature we're building from scratch adds complexity with no benefit. Both phases ship together; there's no "blast radius" to manage.

  • Keyset pagination / cursor support — rejected because no existing list command (lore issues, lore mrs) has pagination. Adding it just for notes would be inconsistent. The year-long analysis use case works fine with --limit 10000. If pagination becomes needed across all list commands, that's a separate horizontal feature.

  • Path filtering upgrade (--path-mode exact|prefix|glob, --match-old-path) — rejected because the trailing-slash prefix convention is already established across the codebase (issues/MRs use the same pattern). Adding glob mode and old-path matching adds multiple CLI flags for a niche use case. Can be added later if users request it.

  • Embedding policy knobs (documents.note_embeddings.min_chars, documents.note_embeddings.enabled, prioritize unresolved DiffNotes) — rejected because the embedding pipeline already handles volume scaling. Adding per-source-type enable flags and minimum character thresholds is premature optimization. Short notes (e.g., "LGTM", "nit: use expect() here") are still semantically valuable for reviewer profiling. The existing embedding batch system handles the volume.

  • Structured reviewer profiling command (lore notes profile --author <user>) — rejected because this is explicitly a non-goal in the PRD. The reviewer profile report is a downstream consumer of the infrastructure we're building. Adding it here is scope creep. It belongs in a separate PRD after this infrastructure lands.

  • Operational SLOs / queue lag metrics — rejected because this is a local CLI tool, not a service. "Oldest dirty note age" and "retry backlog size" are service-oriented metrics that don't apply. The existing lore stats and lore -J count commands provide sufficient observability. If the dirty queue becomes problematic, we add diagnostics then.

  • Replace CHECK constraints with source_types registry table + FK — rejected because the table rebuild for adding a new source type to a CHECK constraint is a rare, one-time cost (done 4 times across 23 migrations). A registry table adds per-insert FK lookup overhead, complicates the migration (still requires a table rebuild to change from CHECK to FK), and optimizes for a hypothetical future where we frequently add source types. The current CHECK approach is simpler, self-documenting, and sufficient.
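
Since SQLite cannot alter a CHECK constraint in place, the one-time cost is the standard table-rebuild pattern. A minimal sketch, assuming a hypothetical documents table and source-type set (the real schema differs): create the widened table, copy rows, drop the old table, and rename.

```python
import sqlite3

# Hypothetical schema: SQLite cannot ALTER a CHECK constraint in place,
# so widening the allowed source types means rebuilding the table.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        source_type TEXT NOT NULL
            CHECK (source_type IN ('issue', 'mr', 'discussion'))
    )
""")
con.execute("INSERT INTO documents (source_type) VALUES ('issue')")

# Adding 'note' to the CHECK requires the rebuild dance:
con.executescript("""
    CREATE TABLE documents_new (
        id INTEGER PRIMARY KEY,
        source_type TEXT NOT NULL
            CHECK (source_type IN ('issue', 'mr', 'discussion', 'note'))
    );
    INSERT INTO documents_new SELECT id, source_type FROM documents;
    DROP TABLE documents;
    ALTER TABLE documents_new RENAME TO documents;
""")
con.execute("INSERT INTO documents (source_type) VALUES ('note')")
print(con.execute("SELECT count(*) FROM documents").fetchone()[0])  # 2
```

The rebuild is verbose but mechanical, which is why a rare, one-time cost beats a per-insert FK lookup.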

  • Unresolved-specific partial index (idx_notes_unresolved_project_created) — rejected because the targeted subset is too small and unpredictable to justify a dedicated index. The idx_notes_project_created index already covers the project+date scan; adding WHERE resolvable = 1 AND resolved = 0 provides marginal benefit at the cost of index maintenance overhead. SQLite can efficiently filter the small remaining set in memory.
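
The in-memory filtering claim is easy to check with EXPLAIN QUERY PLAN. A minimal sketch, assuming hypothetical column names for the notes table: the existing project+created_at index drives the search, and the resolvable/resolved predicates are applied to the already-narrowed rows.

```python
import sqlite3

# Hypothetical notes schema; the real table has more columns.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE notes (
        id INTEGER PRIMARY KEY,
        project TEXT NOT NULL,
        created_at TEXT NOT NULL,
        resolvable INTEGER NOT NULL DEFAULT 0,
        resolved INTEGER NOT NULL DEFAULT 0
    )
""")
con.execute(
    "CREATE INDEX idx_notes_project_created ON notes (project, created_at)"
)

# The planner searches via the project+date index; the unresolved
# predicates are evaluated per-row on the narrowed result set.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM notes
    WHERE project = 'group/project-one'
      AND created_at >= '2025-01-01'
      AND resolvable = 1 AND resolved = 0
""").fetchall()
for row in plan:
    print(row[-1])  # e.g. SEARCH notes USING INDEX idx_notes_project_created ...
```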

  • Previous note excerpt in document content (previous_note_excerpt) — rejected because it adds a query per note extraction (fetch the preceding note in the same discussion), increases document size, creates coupling between note documents (changing one note's body would stale the next note's document), and the semantic benefit is marginal. The parent title and labels provide sufficient context. Full thread context is available via the existing discussion documents.

  • Compact/slim metadata header for note documents — rejected because the verbose key-value header is intentional. The structured fields (source_type, note_gitlab_id, project, parent_type, parent_iid, etc.) are what enable precise FTS and embedding search for queries like "jdefting's comments on authentication issues in project-one." The compact format (@author on Issue#42 in project) loses machine-parseable structure and reduces search precision. Metadata stored in document columns/labels/paths is not searchable via FTS — only content_text is FTS-indexed. The token cost of the header (~50 tokens) is negligible compared to typical note body length.
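
To illustrate why the verbose header enables that precision, a minimal sketch, with field names taken from this PRD but a hypothetical layout: the metadata is folded into content_text, the only FTS-indexed column, so author, project, and parent terms become matchable.

```python
# Field names from this PRD; exact header layout is hypothetical.
# Folding metadata into content_text is what makes it FTS-searchable,
# since document columns/labels/paths are not indexed.
def note_document_text(meta: dict, body: str) -> str:
    header = "\n".join(f"{key}: {value}" for key, value in meta.items())
    return f"{header}\n\n{body}"

doc = note_document_text(
    {
        "source_type": "note",
        "author": "jdefting",
        "project": "group/project-one",
        "parent_type": "Issue",
        "parent_iid": 42,
    },
    "This auth check can race; wrap it in a transaction.",
)
print(doc.splitlines()[0])  # source_type: note
```

A query like "jdefting authentication project-one" can now hit the header tokens and the body together in a single FTS match.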

  • Replace fetch_complete: bool with FetchState enum (Complete/Partial/Failed) and run_seen_at monotonicity checks (feedback-5, rec #2) — rejected because the boolean captures the one bit of information that matters: did the fetch complete? FetchState::Failed is redundant with not reaching the sweep call site — if the fetch fails, we don't call sweep at all. The monotonicity check on run_seen_at adds complexity for a condition that can't occur in practice: run_seen_at is generated once per sync run and passed unchanged through all upserts. The boolean is sufficient and self-documenting.

  • Embedding dedup cache keyed by semantic text hash (feedback-5, rec #5) — rejected because the existing content_hash dedup already prevents re-embedding unchanged documents. A semantic-text-only hash that ignores metadata would conflate genuinely different review contexts: two "LGTM" notes from different authors on different MRs are semantically distinct for the reviewer profiling use case (who said it, where, and when matters). The embedding pipeline handles ~8k notes comfortably without dedup optimization.
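
The context argument follows directly from what the hash covers. A minimal sketch, assuming a hypothetical hash layout: because content_hash is computed over the full document text, header included, identical bodies in different review contexts stay distinct, while an unchanged note re-hashes identically and skips re-embedding.

```python
import hashlib

# Hypothetical field layout: the full-text content_hash covers the
# metadata header as well as the body, so "who said it, where" is
# part of the dedup key.
def content_hash(author: str, parent: str, body: str) -> str:
    text = f"author: {author}\nparent: {parent}\n\n{body}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = content_hash("jdefting", "MR !12", "LGTM")
b = content_hash("someone-else", "MR !98", "LGTM")
assert a != b                                           # different context
assert a == content_hash("jdefting", "MR !12", "LGTM")  # unchanged: skip
```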

  • Derived review signal labels (signal:nit, signal:blocking, signal:security) (feedback-5, rec #6) — rejected because (a) it encroaches on the explicitly excluded reviewer profiling scope, (b) heuristic signal derivation (regex for "nit:", keyword matching for "security") is inherently fragile and would require ongoing maintenance as review vocabulary evolves, and (c) the raw note text already supports downstream LLM-based analysis that produces far more accurate signal classification than static keyword matching. This belongs in the downstream profiling PRD where LLM-based classification can be done properly.