gitlore/docs/prd-per-note-search.md
teernisse 125938fba6 docs: add per-note search PRD and user journey documentation
Per-note search PRD: Comprehensive product requirements for evolving
the search system from document-level to note-level granularity.
Includes 6 rounds of iterative feedback refining scope, ranking
strategy, migration path, and robot mode integration.

User journeys: Detailed walkthrough of 8 primary user workflows
covering issue triage, MR review lookup, code archaeology, expert
discovery, sync pipeline operation, and agent integration patterns.
2026-02-12 11:21:23 -05:00


---
plan: true
title: ""
status: iterating
iteration: 5
target_iterations: 8
beads_revision: 0
related_plans: []
created: 2026-02-11
updated: 2026-02-12
---
# PRD: Per-Note Search & Reviewer Profiling
## Problem Statement
Lore ingests all GitLab discussion notes with full metadata (author, body, diff positions, timestamps), but the data is only accessible through aggregated discussion documents. There is no way to:
1. **Query individual notes by author** — the `--author` filter on `lore search` only matches the first note's author per discussion thread, and relies solely on mutable usernames (no immutable author identity for longitudinal analysis)
2. **List raw notes with metadata** — no CLI surface exposes the `notes` table directly
3. **Semantically search individual comments** — notes are bundled into thread documents, diluting per-note relevance
**Use case:** "Search through jdefting's code review comments over the past year to build a comprehensive report of their code smell preferences and review patterns."
## Design
Three phases, shipped together as one feature:
- **Phase 0 (Foundation):** Stable note identity — unify issue discussion note ingestion to use upsert+sweep (matching the MR pattern), ensuring `notes.id` is stable across syncs, capturing immutable `author_id` for longitudinal analysis, and enabling change-aware dirty marking
- **Phase 1 (Option A):** `lore notes` command — direct SQL query over the `notes` table with rich filtering and multiple export formats (table, JSON, JSONL, CSV)
- **Phase 2 (Option B):** Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline
All three phases are required. Phase 0 gives stable identity for reliable document tracking; Phase 1 gives structured data extraction; Phase 2 gives semantic search.
## Non-Goals
- Changing existing discussion document behavior (those remain as-is)
- Adding a "reviewer profile" report command (that's a downstream use case built on this infrastructure)
- Modifying the ingestion pipeline from GitLab (data is already captured — Phase 0 only changes local storage strategy)
- Adding pagination/cursor support (no existing list command has this; high `--limit` covers the year-long analysis use case)
- Feature flag gating (no external users; single-user CLI in early dev)
---
## Phase 0: Stable Note Identity
### Rationale
Issue discussion note ingestion currently uses a delete/reinsert pattern (`DELETE FROM notes WHERE discussion_id = ?` then re-insert). This makes `notes.id` (the local row ID) unstable across syncs — every sync assigns new IDs to the same notes. MR discussion notes already use an upsert pattern (`ON CONFLICT(gitlab_id) DO UPDATE`), producing stable IDs.
Phase 2 depends on `notes.id` as the `source_id` for note documents. Unstable IDs would cause:
- Unnecessary document churn (old ID deleted, new ID created for identical content)
- Stale document accumulation (orphaned documents from old IDs)
- Wasted regeneration cycles on every sync
Unifying both paths to upsert+sweep gives stable identity, enables change-aware dirty marking, and reduces sync overhead.
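The ID-stability argument above can be illustrated with a small in-memory model. This is only a sketch: `NoteStore` and its methods are hypothetical stand-ins for the `notes` table, and the monotonically increasing counter is a simplification of how SQLite assigns row IDs.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the `notes` table:
/// maps gitlab_id -> local row id.
#[derive(Default)]
struct NoteStore {
    next_id: i64,
    rows: HashMap<i64, i64>,
}

impl NoteStore {
    /// Assign a fresh local id to a note (simplified id allocation).
    fn insert(&mut self, gitlab_id: i64) -> i64 {
        self.next_id += 1;
        self.rows.insert(gitlab_id, self.next_id);
        self.next_id
    }

    /// Delete/reinsert: every sync hands out fresh local ids.
    fn sync_delete_reinsert(&mut self, gitlab_ids: &[i64]) {
        self.rows.clear();
        for &g in gitlab_ids {
            self.insert(g);
        }
    }

    /// Upsert: existing rows keep their local id; only new notes get one.
    fn sync_upsert(&mut self, gitlab_ids: &[i64]) {
        for &g in gitlab_ids {
            if !self.rows.contains_key(&g) {
                self.insert(g);
            }
        }
    }
}
```

Syncing the same two notes twice through `sync_delete_reinsert` changes their local IDs, while `sync_upsert` leaves them untouched, which is the property Phase 2 relies on when it uses `notes.id` as `source_id`.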
### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
**Files:** `src/ingestion/discussions.rs`
**Depends on:** Nothing (standalone)
**Context:** `src/ingestion/mr_discussions.rs` already has `upsert_note()` (line ~470) which uses `ON CONFLICT(gitlab_id) DO UPDATE` and `sweep_stale_notes()` (line ~551) which deletes notes with `last_seen_at < run_seen_at`. The `notes` table already has `UNIQUE(gitlab_id)`. We need to bring the issue discussion path to the same pattern.
#### Tests to Write First
```rust
#[test]
fn test_issue_note_upsert_stable_id() {
// Insert a discussion with 2 notes via the new upsert path
// Record their local IDs
// "Re-sync" the same notes (same gitlab_ids, same content)
// Assert: local IDs are unchanged
}
#[test]
fn test_issue_note_upsert_detects_body_change() {
// Insert note with body "old text"
// Re-sync same gitlab_id with body "new text"
// Assert: upsert returns changed_semantics = true
// Assert: local ID is unchanged
}
#[test]
fn test_issue_note_upsert_unchanged_returns_false() {
// Insert note, re-sync identical note
// Assert: upsert returns changed_semantics = false
}
#[test]
fn test_issue_note_upsert_updated_at_only_does_not_mark_semantic_change() {
// Insert note with body "text", updated_at = 1000
// Re-sync same gitlab_id with body "text", updated_at = 2000
// Assert: upsert returns changed_semantics = false
// Assert: updated_at is updated in the DB (housekeeping fields always refresh)
}
#[test]
fn test_issue_note_sweep_removes_stale() {
// Insert 3 notes for a discussion
// Re-sync with only 2 of the 3 (different last_seen_at)
// Run sweep
// Assert: stale note is deleted, 2 remain
}
#[test]
fn test_issue_note_upsert_returns_local_id() {
// Insert a note via upsert
// Assert: returned local_id matches conn.last_insert_rowid()
// Or for update path: matches the existing row's id
}
```
#### Implementation
**1. Create shared `NoteUpsertOutcome` struct** (in `src/ingestion/discussions.rs` or a shared module):
```rust
pub struct NoteUpsertOutcome {
pub local_note_id: i64,
pub changed_semantics: bool,
}
```
**2. Refactor `insert_note()` → `upsert_note_for_issue()`:**
Replace the current `DELETE FROM notes WHERE discussion_id = ?` + loop insert pattern (lines 132-139) with:
```rust
for note in &normalized_notes {
let outcome = upsert_note_for_issue(&tx, local_discussion_id, note, last_seen_at)?;
// outcome.local_note_id and outcome.changed_semantics available for Phase 2
}
// After loop: sweep stale notes for this discussion
sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;
```
The upsert SQL follows the MR pattern:
```sql
INSERT INTO notes (gitlab_id, discussion_id, project_id, author_username, body, note_type,
is_system, created_at, updated_at, last_seen_at, ...)
VALUES (?1, ?2, ?3, ...)
ON CONFLICT(gitlab_id) DO UPDATE SET
body = excluded.body,
note_type = excluded.note_type,
updated_at = excluded.updated_at,
last_seen_at = excluded.last_seen_at,
...
```
**Change detection:** Semantic change is computed separately from housekeeping updates. The upsert always updates persistence fields (`updated_at`, `last_seen_at`), but `changed_semantics` is derived only from fields that affect note documents and search filters:
```sql
ON CONFLICT(gitlab_id) DO UPDATE SET
body = excluded.body,
note_type = excluded.note_type,
updated_at = excluded.updated_at,
last_seen_at = excluded.last_seen_at,
resolved = excluded.resolved,
resolved_by = excluded.resolved_by,
position_new_path = excluded.position_new_path,
position_new_line = excluded.position_new_line,
position_old_path = excluded.position_old_path,
position_old_line = excluded.position_old_line,
...
```
Then detect semantic change with a separate check that excludes `updated_at` and `last_seen_at` (housekeeping-only fields):
```sql
WHERE notes.body IS NOT excluded.body
OR notes.note_type IS NOT excluded.note_type
OR notes.resolved IS NOT excluded.resolved
OR notes.resolved_by IS NOT excluded.resolved_by
OR notes.position_new_path IS NOT excluded.position_new_path
OR notes.position_new_line IS NOT excluded.position_new_line
```
**Rationale:** `updated_at` changes alone (e.g., GitLab touching the timestamp without modifying content) should NOT trigger document regeneration. This avoids unnecessary dirty queue churn on large datasets. The semantic comparison cannot be attached as a WHERE clause on the DO UPDATE itself, because the DO UPDATE must fire unconditionally to refresh `last_seen_at`. A first attempt therefore runs the upsert and then derives `changed_semantics` from a second query that checks only the semantic fields:
```rust
// Two-step: always upsert (refreshes housekeeping), then check semantic change
let upserted = /* run upsert SQL */;
let local_id = conn.query_row("SELECT id FROM notes WHERE gitlab_id = ?", [gitlab_id], |r| r.get(0))?;
let changed = conn.query_row(
"SELECT COUNT(*) FROM notes WHERE id = ? AND (body IS NOT ? OR note_type IS NOT ? OR ...)",
params![local_id, body, note_type, ...],
|r| r.get::<_, i64>(0),
)? == 0 && /* was an update, not insert */;
```
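That post-upsert check cannot detect anything: once the upsert has run, the stored values already equal the incoming ones. Simpler approach: read the note's semantic fields before the upsert, run the upsert (which always runs the SET clause), then compare the pre-read values against the incoming ones: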
```rust
// Check if note exists before upsert (`.optional()` needs `use rusqlite::OptionalExtension;`)
let existed = conn.query_row(
"SELECT id, body, note_type, resolved, resolved_by, position_new_path, position_new_line FROM notes WHERE gitlab_id = ?",
[gitlab_id],
|r| Ok((r.get::<_, i64>(0)?, r.get::<_, Option<String>>(1)?, /* ... */)),
).optional()?;
// Run upsert (always updates housekeeping fields)
conn.execute(upsert_sql, params![...])?;
let local_id = match &existed {
Some((id, ..)) => *id,
None => conn.last_insert_rowid(),
};
let changed_semantics = match &existed {
None => true, // New insert = always changed
Some((_, old_body, old_note_type, old_resolved, old_resolved_by, old_path, old_line)) => {
old_body.as_deref() != body || old_note_type.as_deref() != note_type || /* ... */
}
};
```
This pre-read approach is clearest and avoids any SQLite edge cases with `changes()` counting. The pre-read is a single-row lookup on the UNIQUE(gitlab_id) index — negligible cost.
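The comparison step of the pre-read approach can be isolated as a pure function. The following is a minimal sketch, assuming the semantic fields listed above; `SemanticFields` and the free function `changed_semantics` are hypothetical names, and the field types (for example `resolved: bool`) are guesses at the column types.

```rust
/// Fields that feed note documents and search filters.
/// Housekeeping fields (`updated_at`, `last_seen_at`) are deliberately absent.
#[derive(Debug, Clone, PartialEq)]
struct SemanticFields {
    body: Option<String>,
    note_type: Option<String>,
    resolved: bool,
    resolved_by: Option<String>,
    position_new_path: Option<String>,
    position_new_line: Option<i64>,
}

/// `existing` is the pre-read row (`None` when the note is new).
fn changed_semantics(existing: Option<&SemanticFields>, incoming: &SemanticFields) -> bool {
    match existing {
        // A new insert always counts as a semantic change.
        None => true,
        // `Option` equality treats None == None as equal, which mirrors
        // the null-safe behavior of SQL's `IS NOT`.
        Some(old) => old != incoming,
    }
}
```

Keeping the comparison in one derived `PartialEq` means a field added to the struct is automatically part of the check, which may reduce the chance of the SELECT list and the comparison drifting apart.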
**3. Also update `upsert_note()` in `src/ingestion/mr_discussions.rs`** to return `Result<NoteUpsertOutcome>` instead of `Result<()>`. Same semantic-change-only detection (exclude `updated_at`).
**4. Sweep function for issue notes:**
```rust
fn sweep_stale_issue_notes(conn: &Connection, discussion_id: i64, last_seen_at: i64) -> Result<()> {
conn.execute(
"DELETE FROM notes WHERE discussion_id = ? AND last_seen_at < ?",
rusqlite::params![discussion_id, last_seen_at],
)?;
Ok(())
}
```
---
### Work Chunk 0B: Immediate Deletion Propagation
**Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`
**Depends on:** Work Chunk 0A (uses sweep functions from 0A), Work Chunk 2A (documents table must accept `source_type='note'`)
**Context:** When sweep deletes stale notes, the current plan relies on eventual cleanup via `generate-docs --full` for orphaned note documents. This creates a window where deleted notes still appear in search results, eroding trust in the dataset. Instead, propagate deletion to documents immediately in the same transaction.
#### Tests to Write First
```rust
#[test]
fn test_issue_note_sweep_deletes_note_documents_immediately() {
// Setup: project, issue, discussion, 3 non-system notes
// Generate note documents for all 3
// Re-sync with only 2 of the 3 notes (different last_seen_at)
// Run sweep
// Assert: stale note row is deleted
// Assert: stale note's document is deleted from documents table
// Assert: stale note's dirty_sources entry (if any) is deleted
// Assert: remaining 2 notes' documents are untouched
}
#[test]
fn test_mr_note_sweep_deletes_note_documents_immediately() {
// Same pattern as above but for MR discussion notes
}
#[test]
fn test_sweep_deletion_handles_note_without_document() {
// Setup: note exists but was never turned into a document (e.g., system note)
// Sweep deletes the note
// Assert: no error (DELETE WHERE on non-existent document is a no-op)
}
#[test]
fn test_set_based_deletion_atomicity() {
// Setup: project, issue, discussion, 5 non-system notes with documents
// Mark 3 as stale (different last_seen_at)
// Run sweep
// Assert: exactly 3 note rows deleted
// Assert: exactly 3 documents deleted
// Assert: exactly 3 dirty_sources entries deleted (if any existed)
// Assert: remaining 2 note rows, documents, and dirty_sources untouched
}
```
#### Implementation
Update both sweep functions to propagate deletion to documents and dirty_sources using **set-based SQL** (not per-note loops). This is both faster on large threads and simpler to reason about atomically:
```rust
fn sweep_stale_issue_notes(conn: &Connection, discussion_id: i64, last_seen_at: i64) -> Result<()> {
// Set-based: identify stale non-system note IDs, delete their documents
// and dirty_sources entries, then delete the note rows themselves.
// All in one transaction scope — no per-note loop needed.
// Step 1: Delete documents for stale non-system notes (cascades to
// document_labels and document_paths via ON DELETE CASCADE;
// FTS trigger documents_ad auto-removes FTS entry)
conn.execute(
"DELETE FROM documents WHERE source_type = 'note' AND source_id IN (
SELECT id FROM notes
WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0
)",
rusqlite::params![discussion_id, last_seen_at],
)?;
// Step 2: Delete dirty_sources entries for stale non-system notes
conn.execute(
"DELETE FROM dirty_sources WHERE source_type = 'note' AND source_id IN (
SELECT id FROM notes
WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0
)",
rusqlite::params![discussion_id, last_seen_at],
)?;
// Step 3: Delete all stale note rows (system and non-system)
conn.execute(
"DELETE FROM notes WHERE discussion_id = ? AND last_seen_at < ?",
rusqlite::params![discussion_id, last_seen_at],
)?;
Ok(())
}
```
Same pattern for `sweep_stale_notes()` in `src/ingestion/mr_discussions.rs`.
**Note:** The document DELETE cascades to `document_labels` and `document_paths` via ON DELETE CASCADE. The FTS trigger (`documents_ad`) automatically removes the FTS entry. No additional cleanup needed.
**Why set-based:** The subquery `SELECT id FROM notes WHERE discussion_id = ? AND last_seen_at < ? AND is_system = 0` runs once per step and is scoped to a single discussion by `discussion_id` (the UNIQUE(gitlab_id) index does not apply to this predicate). This is O(1) SQL statements regardless of how many stale notes exist, vs O(N) individual DELETE statements in a loop. On large threads (100+ notes), this should be faster and avoids the risk of partial completion if the loop is interrupted.
**Defense-in-depth:** Work Chunk 2A's migration also creates DB-level cleanup triggers (`notes_ad_cleanup`, `notes_au_system_cleanup`) that fire on ANY note deletion/system-flip, not just sweep. The sweep functions handle the common path with explicit set-based SQL; the triggers are a safety net for any future code path that deletes notes outside the sweep functions. Both mechanisms coexist — the explicit SQL in sweep is preferred (clearer intent, predictable cost), and the triggers catch edge cases.
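The ordering of the three sweep steps matters: dependents are removed while the stale note rows still exist, because the stale set is defined by those rows. A minimal in-memory sketch of that ordering follows; the `Note` and `Document` structs and the `sweep_stale` function are hypothetical simplifications, not the real schema.

```rust
use std::collections::HashSet;

struct Note {
    id: i64,
    discussion_id: i64,
    last_seen_at: i64,
}

struct Document {
    source_type: String,
    source_id: i64,
}

/// Removes stale notes for one discussion and their note documents.
/// Returns the number of note rows removed.
fn sweep_stale(
    notes: &mut Vec<Note>,
    documents: &mut Vec<Document>,
    discussion_id: i64,
    run_seen_at: i64,
) -> usize {
    // Resolve the stale set once, while the note rows still exist.
    let stale: HashSet<i64> = notes
        .iter()
        .filter(|n| n.discussion_id == discussion_id && n.last_seen_at < run_seen_at)
        .map(|n| n.id)
        .collect();
    // Dependents first: drop note documents that point at stale notes.
    documents.retain(|d| !(d.source_type == "note" && stale.contains(&d.source_id)));
    // Then the note rows themselves.
    notes.retain(|n| !stale.contains(&n.id));
    stale.len()
}
```

If the note rows were deleted first, the set of stale IDs could no longer be derived from them, and the documents would be orphaned.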
---
### Work Chunk 0C: Sweep Safety Guard (Partial Fetch Protection)
**Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`
**Depends on:** Work Chunk 0A (modifies the sweep call site from 0A)
**Context:** The sweep-based deletion pattern (delete notes where `last_seen_at < run_seen_at`) is correct only when a discussion's notes were fully fetched from GitLab. If a page fails mid-fetch (network timeout, rate limit, partial API response), the current logic would incorrectly delete valid notes that simply weren't seen during the incomplete fetch. This is especially dangerous for long threads with many notes that span multiple API pages.
#### Tests to Write First
```rust
#[test]
fn test_partial_fetch_does_not_sweep_notes() {
// Setup: project, issue, discussion, 5 notes already in DB
// Simulate a partial fetch: only 2 of 5 notes returned
// (set last_seen_at for 2 notes to current run, 3 to previous run)
// Call the ingestion function with fetch_complete = false
// Assert: all 5 notes still exist (sweep was skipped)
// Assert: the 2 re-synced notes have updated last_seen_at
}
#[test]
fn test_complete_fetch_runs_sweep_normally() {
// Setup: project, issue, discussion, 5 notes
// Simulate a complete fetch: all 5 notes returned
// Call the ingestion function with fetch_complete = true
// Assert: sweep runs normally (no stale notes in this case)
}
#[test]
fn test_partial_fetch_then_complete_fetch_cleans_up() {
// Setup: project, issue, discussion, 5 notes
// First sync: partial fetch (3 of 5), sweep skipped
// Second sync: complete fetch (only 3 notes exist on GitLab now)
// Assert: sweep runs and removes the 2 notes no longer on GitLab
}
```
#### Implementation
Add a `fetch_complete` parameter to the discussion ingestion functions. Only run the stale-note sweep when the fetch completed successfully:
```rust
// In the discussion ingestion loop, after upserting all notes:
if fetch_complete {
sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;
} else {
tracing::warn!(
discussion_id = local_discussion_id,
"Skipping stale note sweep due to partial/incomplete fetch"
);
}
```
**Determining `fetch_complete`:** The discussion notes come from the GitLab API response. When the API returns all notes for a discussion in a single response (no pagination error, no timeout), `fetch_complete = true`. When the fetch encounters a network error, rate limit, or is interrupted, `fetch_complete = false`. The exact signaling mechanism depends on how the existing ingestion pipeline handles partial API responses — look at the MR discussion ingestion path for the existing pattern.
**Note:** This is a safety guard, not a completeness guarantee. The sweep will still run on the next successful full fetch. The guard prevents data loss during transient failures, not during permanent API changes.
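One possible way to derive `fetch_complete` is to drain the paginated fetch and stop at the first failed page. This is a sketch under that assumption only; `FetchOutcome` and `drain_pages` are hypothetical names, and the existing pipeline may signal partial fetches differently.

```rust
/// Hypothetical result of draining a paginated fetch.
struct FetchOutcome<T> {
    items: Vec<T>,
    fetch_complete: bool,
}

/// Collects pages until the first error. Items seen before the error are
/// kept (they can still be upserted), but the fetch is marked incomplete
/// so the caller skips the stale-note sweep.
fn drain_pages<T, E>(pages: impl IntoIterator<Item = Result<Vec<T>, E>>) -> FetchOutcome<T> {
    let mut items = Vec::new();
    for page in pages {
        match page {
            Ok(mut batch) => items.append(&mut batch),
            Err(_) => {
                return FetchOutcome { items, fetch_complete: false };
            }
        }
    }
    FetchOutcome { items, fetch_complete: true }
}
```

With this shape, notes from successful pages still get their `last_seen_at` refreshed, and the sweep only runs when every page arrived.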
---
### Work Chunk 0D: Immutable Author Identity Capture
**Files:** `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`
**Depends on:** Work Chunk 0A (modifies the upsert functions from 0A)
**Context:** The core use case is year-scale reviewer profiling ("search through jdefting's code review comments over the past year"). GitLab usernames are mutable — a user can change their username at any time. If a reviewer changes their username from `jdefting` to `jd-engineering` mid-year, author-based queries fragment their identity into two separate result sets. The `notes` table already captures `author_username` from the API response, but this only reflects the username at ingestion time.
GitLab note payloads include `note.author.id` (an immutable integer). Capturing this alongside the username provides a stable identity anchor for longitudinal analysis, even across username changes.
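The fragmentation problem can be shown with a small grouping sketch. `NoteAuthor` and both counting functions are hypothetical and exist only to illustrate why the stable ID matters; they are not part of the planned CLI surface.

```rust
use std::collections::BTreeMap;

struct NoteAuthor {
    author_id: Option<i64>,
    author_username: String,
}

/// Grouping by username splits one person across every name they have used.
fn count_by_username(notes: &[NoteAuthor]) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for n in notes {
        *counts.entry(n.author_username.clone()).or_insert(0) += 1;
    }
    counts
}

/// Grouping by the immutable id keeps the identity whole.
/// Notes without an id (nullable column) are skipped here.
fn count_by_author_id(notes: &[NoteAuthor]) -> BTreeMap<i64, usize> {
    let mut counts = BTreeMap::new();
    for n in notes {
        if let Some(id) = n.author_id {
            *counts.entry(id).or_insert(0) += 1;
        }
    }
    counts
}
```

For a reviewer who renamed from `jdefting` to `jd-engineering`, the username grouping yields two buckets while the ID grouping yields one.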
**Scope:** This chunk adds the column and populates it during ingestion. It does NOT add a `--author-id` CLI filter — that's deferred to the downstream reviewer profiling PRD. The value here is data capture: an `author_id` that is not captured now cannot be retroactively recovered.
#### Tests to Write First
```rust
#[test]
fn test_issue_note_upsert_captures_author_id() {
// Insert a note with author_id = 12345
// Assert: notes.author_id == 12345
// Assert: notes.author_username == "jdefting"
}
#[test]
fn test_mr_note_upsert_captures_author_id() {
// Same pattern for MR notes
}
#[test]
fn test_note_upsert_author_id_nullable() {
// Insert a note with author_id = None (older API responses may lack this)
// Assert: notes.author_id IS NULL
// Assert: no error (column is nullable)
}
#[test]
fn test_note_author_id_survives_username_change() {
// Insert note with author_username = "jdefting", author_id = 12345
// Re-upsert same gitlab_id with author_username = "jd-engineering", author_id = 12345
// Assert: author_id unchanged (12345)
// Assert: author_username updated to "jd-engineering"
// Assert: changed_semantics = false (username change is not a semantic change for documents)
}
```
#### Implementation
**1. Migration** — Add `author_id` column to `notes` table. This goes in migration 022 (combined with the query index migration from Work Chunk 1E to avoid an extra migration):
Add to the query index migration SQL:
```sql
-- Add immutable author identity column (nullable for backcompat with pre-existing notes)
ALTER TABLE notes ADD COLUMN author_id INTEGER;
-- Index for future author_id lookups (not used by current CLI, but enables
-- the downstream reviewer profiling PRD to query by stable identity)
CREATE INDEX IF NOT EXISTS idx_notes_author_id
ON notes(author_id)
WHERE author_id IS NOT NULL;
```
**2. Populate `author_id` during upsert** — In both `upsert_note_for_issue()` (discussions.rs) and `upsert_note()` (mr_discussions.rs), add `author_id` to the INSERT and ON CONFLICT DO UPDATE SET clauses. Extract from the GitLab API note payload's `author.id` field.
**3. Semantic change detection** — `author_id` changes should NOT trigger `changed_semantics = true`. The `author_id` is an identity anchor, not a content field. It's excluded from the semantic change comparison alongside `updated_at` and `last_seen_at`.
**4. Note document extraction** — No changes needed for this chunk. The `extract_note_document()` function (Work Chunk 2C) uses `author_username` for the document content. The `author_id` is stored for future use but not surfaced in the current document format.
---
## Phase 1: `lore notes` Command
### Work Chunk 1A: Data Types & Query Layer
**Files:** `src/cli/commands/list.rs`
**Context:** This file contains `IssueListRow`, `MrListRow`, their JSON counterparts, `ListFilters`, `MrListFilters`, and the `query_issues()`/`query_mrs()` functions. The new code follows these exact patterns.
**Depends on:** Nothing (standalone)
#### Tests to Write First
Add to `src/cli/commands/list.rs` in the `#[cfg(test)] mod tests` block. The test DB setup requires the `notes` and `discussions` tables — reuse the patterns from `src/documents/extractor.rs::setup_discussion_test_db()`.
```rust
// Test helper — create in-memory DB with projects, issues, MRs, discussions, notes tables
// Pattern: same as extractor.rs::setup_discussion_test_db() but also include
// merge_requests, mr_labels, issue_labels, labels tables
#[test]
fn test_query_notes_empty_db() {
// Setup DB with no notes
// Call query_notes with default NoteListFilters
// Assert: total_count == 0, notes.is_empty()
}
#[test]
fn test_query_notes_returns_user_notes_only() {
// Insert 2 user notes and 1 system note into same discussion
// Call query_notes with default filters (include_system = false by default)
// Assert: returns 2 notes, system note excluded
}
#[test]
fn test_query_notes_include_system() {
// Insert 2 user notes and 1 system note
// Call query_notes with include_system = true
// Assert: returns 3 notes
}
#[test]
fn test_query_notes_filter_author() {
// Insert notes from "alice" and "bob"
// Call query_notes with author = Some("alice")
// Assert: only alice's notes returned
}
#[test]
fn test_query_notes_filter_author_strips_at() {
// Call query_notes with author = Some("@alice")
// Assert: still matches "alice" (@ prefix stripped)
}
#[test]
fn test_query_notes_filter_author_case_insensitive() {
// Insert notes from "Alice" (capital A)
// Call query_notes with author = Some("alice")
// Assert: matches (COLLATE NOCASE)
}
#[test]
fn test_query_notes_filter_note_type() {
// Insert notes with note_type = Some("DiffNote") and Some("DiscussionNote") and None
// Call query_notes with note_type = Some("DiffNote")
// Assert: only DiffNote notes returned
}
#[test]
fn test_query_notes_filter_project() {
// Insert 2 projects, notes in each
// Call query_notes with project = Some("group/project-one")
// Assert: only project-one notes returned (uses resolve_project())
}
#[test]
fn test_query_notes_filter_project_uses_default() {
// Insert 2 projects, notes in each
// Call query_notes with project = None, config.default_project = Some("group/project-one")
// Assert: only project-one notes returned when for_issue_iid or for_mr_iid is set
}
#[test]
fn test_query_notes_filter_since() {
// Insert notes at created_at = 1000, 2000, 3000
// Call with since cutoff that excludes the first
// Assert: only notes after cutoff returned
}
#[test]
fn test_query_notes_filter_until() {
// Insert notes at created_at = 1000, 2000, 3000
// Call with until cutoff that excludes the last
// Assert: only notes before cutoff returned
}
#[test]
fn test_query_notes_filter_since_and_until_combined() {
// Insert notes at created_at = 1000, 2000, 3000, 4000
// Call with since=1500, until=3500
// Assert: only notes at 2000 and 3000 returned
}
#[test]
fn test_query_notes_invalid_time_window_rejected() {
// Call with since pointing to a time AFTER until
// (e.g., since = "30d", until = "90d" — 30 days ago is after 90 days ago)
// Assert: returns a clear error, not an empty result set
}
#[test]
fn test_query_notes_until_date_uses_end_of_day() {
// Insert notes at various times on 2025-06-15
// Call with until = "2025-06-15"
// Assert: all notes on that day are included (end-of-day, not start-of-day)
}
#[test]
fn test_query_notes_filter_contains() {
// Insert notes with body "This function is too complex" and "LGTM"
// Call with contains = Some("complex")
// Assert: only the first note returned
}
#[test]
fn test_query_notes_filter_contains_case_insensitive() {
// Insert note with body "Use EXPECT instead of unwrap"
// Call with contains = Some("expect")
// Assert: matches (COLLATE NOCASE)
}
#[test]
fn test_query_notes_filter_contains_escapes_like_wildcards() {
// Insert notes with body "100% coverage" and "100 tests"
// Call with contains = Some("100%")
// Assert: only "100% coverage" returned (% is literal, not wildcard)
}
#[test]
fn test_query_notes_filter_path() {
// Insert DiffNotes on "src/auth.rs" and "src/config.rs"
// Call with path = Some("src/auth.rs")
// Assert: only auth.rs notes returned
}
#[test]
fn test_query_notes_filter_path_prefix() {
// Insert DiffNotes on "src/auth/login.rs" and "test/auth_test.rs"
// Call with path = Some("src/") (trailing slash = prefix)
// Assert: only src/ notes returned
}
#[test]
fn test_query_notes_filter_for_issue_requires_project() {
// Insert issue with iid=42 in project-one, same iid=42 in project-two
// Call with for_issue_iid = Some(42), project = Some("group/project-one")
// Assert: only notes from project-one's issue #42
}
#[test]
fn test_query_notes_filter_for_mr_requires_project() {
// Insert MR with iid=10 in project-one, same iid=10 in project-two
// Call with for_mr_iid = Some(10), project = Some("group/project-one")
// Assert: only notes from project-one's MR !10
}
#[test]
fn test_query_notes_filter_for_issue_uses_default_project() {
// Insert issue with iid=42 in project-one
// Call with for_issue_iid = Some(42), project = None, config.default_project = Some("group/project-one")
// Assert: resolves via defaultProject fallback — returns notes from project-one's issue #42
}
#[test]
fn test_query_notes_filter_for_mr_uses_default_project() {
// Insert MR with iid=10 in project-one
// Call with for_mr_iid = Some(10), project = None, config.default_project = Some("group/project-one")
// Assert: resolves via defaultProject fallback
}
#[test]
fn test_query_notes_filter_for_issue_without_project_context_errors() {
// Call with for_issue_iid = Some(42), project = None, no defaultProject
// Assert: returns error (IID requires project context)
}
#[test]
fn test_query_notes_filter_resolution_unresolved() {
// Insert 2 notes: one with resolvable=1,resolved=0 and one with resolvable=1,resolved=1
// Call with resolution = "unresolved"
// Assert: only the unresolved note returned
}
#[test]
fn test_query_notes_filter_resolution_resolved() {
// Same setup as above
// Call with resolution = "resolved"
// Assert: only the resolved note returned
}
#[test]
fn test_query_notes_filter_resolution_any() {
// Same setup as above
// Call with resolution = "any" (default)
// Assert: both notes returned
}
#[test]
fn test_query_notes_sort_created_desc() {
// Insert notes with created_at = 1000, 3000, 2000
// Call with sort = "created", order = "desc"
// Assert: notes ordered 3000, 2000, 1000
}
#[test]
fn test_query_notes_sort_created_asc() {
// Same data, order = "asc"
// Assert: ordered 1000, 2000, 3000
}
#[test]
fn test_query_notes_deterministic_tiebreak() {
// Insert 3 notes with identical created_at timestamps
// Call twice with same sort/order
// Assert: order is identical both times (n.id tiebreak)
}
#[test]
fn test_query_notes_limit() {
// Insert 10 notes
// Call with limit = 3
// Assert: notes.len() == 3, total_count == 10
}
#[test]
fn test_query_notes_combined_filters() {
// Insert notes from multiple authors, types, projects, paths
// Call with author + note_type + project + since combined
// Assert: intersection of all filters
}
#[test]
fn test_query_notes_filter_note_id_exact() {
// Insert 3 notes with known local IDs
// Call with note_id = Some(2)
// Assert: only the note with local id 2 returned
}
#[test]
fn test_query_notes_filter_gitlab_note_id_exact() {
// Insert notes with gitlab_id = 12345 and gitlab_id = 67890
// Call with gitlab_note_id = Some(12345)
// Assert: only the note with gitlab_id 12345 returned
}
#[test]
fn test_query_notes_filter_discussion_id_exact() {
// Insert 2 discussions, each with 2 notes
// Call with discussion_id = Some(1)
// Assert: only notes from discussion 1 returned
}
#[test]
fn test_note_list_row_json_conversion() {
// Create NoteListRow with known ms timestamps
// Convert to NoteListRowJson
// Assert: created_at_iso and updated_at_iso are correct ISO strings
// Assert: all fields carry over
}
```
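The `contains` tests above require `%` in user input to match literally. One way to get that is to escape LIKE metacharacters before binding the parameter. The sketch below assumes the query uses something like `n.body LIKE ? ESCAPE '\'`; that SQL fragment and the helper names `escape_like` and `contains_pattern` are assumptions, not taken from the existing code.

```rust
/// Escapes LIKE metacharacters so user input is matched literally.
/// Intended for use with a backslash ESCAPE character.
fn escape_like(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        // The escape character itself must be escaped as well.
        if matches!(ch, '\\' | '%' | '_') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Wraps the escaped needle in wildcards for a substring match.
fn contains_pattern(needle: &str) -> String {
    format!("%{}%", escape_like(needle))
}
```

With this escaping, a search for `100%` would match "100% coverage" but not "100 tests", since the trailing `%` from the input is no longer a wildcard.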
#### Implementation
**Data structures** (add in `src/cli/commands/list.rs` after `MrListResultJson`):
```rust
// NoteListRow — raw query result, ms timestamps
pub struct NoteListRow {
pub id: i64, // notes.id (local)
pub gitlab_id: i64, // notes.gitlab_id
pub author_username: Option<String>,
pub body: Option<String>,
pub note_type: Option<String>, // "DiffNote" | "DiscussionNote" | null
pub is_system: bool,
pub created_at: i64,
pub updated_at: i64,
pub position_new_path: Option<String>,
pub position_new_line: Option<i64>,
pub position_old_path: Option<String>,
pub position_old_line: Option<i64>,
pub resolvable: bool,
pub resolved: bool,
pub resolved_by: Option<String>,
pub noteable_type: String, // "Issue" | "MergeRequest"
pub parent_iid: i64, // parent issue/MR iid
pub parent_title: Option<String>,
pub project_path: String,
}
// NoteListRowJson — ISO timestamps, serde for JSON output
pub struct NoteListRowJson { ... } // with created_at_iso, updated_at_iso
impl From<&NoteListRow> for NoteListRowJson { ... }
// NoteListResult
pub struct NoteListResult {
pub notes: Vec<NoteListRow>,
pub total_count: usize,
}
// NoteListResultJson
pub struct NoteListResultJson {
pub notes: Vec<NoteListRowJson>,
pub total_count: usize,
pub showing: usize,
}
impl From<&NoteListResult> for NoteListResultJson { ... }
```
**Filter struct:**
```rust
pub struct NoteListFilters<'a> {
pub limit: usize,
pub project: Option<&'a str>,
pub author: Option<&'a str>, // case-insensitive match via COLLATE NOCASE
pub note_type: Option<&'a str>, // "DiffNote" | "DiscussionNote"
pub include_system: bool, // default false
pub for_issue_iid: Option<i64>, // filter by parent issue iid
pub for_mr_iid: Option<i64>, // filter by parent MR iid
pub note_id: Option<i64>, // filter by local note row id (exact)
pub gitlab_note_id: Option<i64>, // filter by GitLab note id (exact)
pub discussion_id: Option<i64>, // filter by local discussion id (exact)
pub since: Option<&'a str>,
pub until: Option<&'a str>, // end of time window
pub path: Option<&'a str>, // exact or prefix (trailing /)
pub contains: Option<&'a str>, // case-insensitive body substring match
pub resolution: &'a str, // "any" (default) | "unresolved" | "resolved"
pub sort: &'a str, // "created" (default) | "updated"
pub order: &'a str, // "desc" (default) | "asc"
}
```
**Query function** `query_notes(conn, filters) -> Result<NoteListResult>`:
**Time window parsing:** Parse `since` and `until` relative to a single anchored `now_ms` captured once at the start of `query_notes()`. This prevents subtle drift if parsing happens at different times (e.g., across a midnight boundary). When `--until` is a date string (`YYYY-MM-DD`), interpret it as end-of-day (`23:59:59.999 UTC`) so that `--until 2025-06-15` includes every event on June 15th rather than cutting off at the start of that day. After both values are parsed, validate `since_ms <= until_ms`. If the window is inverted (e.g., `--since 30d --until 90d`, which starts 30 days ago but ends 90 days ago), return a clear error:
```rust
let now_ms = Utc::now().timestamp_millis();
let since_ms = filters.since.map(|s| parse_since_with_anchor(s, now_ms)).transpose()?;
let until_ms = filters.until.map(|s| parse_until_with_anchor(s, now_ms)).transpose()?;
if let (Some(s), Some(u)) = (since_ms, until_ms) {
if s > u {
return Err(LoreError::Usage(format!(
"Invalid time window: --since ({}) is after --until ({}). \
Did you mean --since {} --until {}?",
format_iso(s), format_iso(u),
filters.until.unwrap(), filters.since.unwrap(),
)));
}
}
```
**`parse_until_with_anchor`** differs from `parse_since_with_anchor` in one way: when the input is a `YYYY-MM-DD` date, it returns end-of-day (23:59:59.999 UTC) instead of start-of-day. For relative formats (`7d`, `2w`, `1m`), it behaves identically to `parse_since_with_anchor`.
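A minimal sketch of how the two parsers could differ, using only the standard library. The function names match the ones above, but the bodies, the `String` error type, and the treatment of `1m` as 30 days are assumptions for illustration; the real implementation may rely on `chrono` and define months differently.

```rust
const DAY_MS: i64 = 86_400_000;

/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12; // March-based month: Mar = 0 ... Feb = 11
    let doy = (153 * mp + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146_097 + doe - 719_468
}

/// Parse `YYYY-MM-DD` into start-of-day epoch milliseconds (UTC).
/// Sketch only: does not reject impossible days such as February 31.
fn parse_date_start_ms(s: &str) -> Option<i64> {
    let mut parts = s.split('-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d) * DAY_MS)
}

/// Parse `7d`, `2w`, `1m` relative to the anchor.
/// ASSUMPTION: `m` means 30 days.
fn parse_relative_ms(s: &str, now_ms: i64) -> Option<i64> {
    let idx = s.char_indices().last()?.0;
    let (num, unit) = s.split_at(idx);
    let n = i64::from(num.parse::<u32>().ok()?);
    let days = match unit {
        "d" => n,
        "w" => n * 7,
        "m" => n * 30,
        _ => return None,
    };
    Some(now_ms - days * DAY_MS)
}

pub fn parse_since_with_anchor(s: &str, now_ms: i64) -> Result<i64, String> {
    parse_relative_ms(s, now_ms)
        .or_else(|| parse_date_start_ms(s))
        .ok_or_else(|| format!("invalid --since value: {}", s))
}

pub fn parse_until_with_anchor(s: &str, now_ms: i64) -> Result<i64, String> {
    parse_relative_ms(s, now_ms)
        // The only difference: a calendar date resolves to 23:59:59.999 UTC.
        .or_else(|| parse_date_start_ms(s).map(|start| start + DAY_MS - 1))
        .ok_or_else(|| format!("invalid --until value: {}", s))
}
```

Because both functions receive the same `now_ms`, `--since 7d --until 7d` resolves to an identical instant instead of two slightly different ones.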
Core SQL shape:
```sql
SELECT
n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
n.is_system, n.created_at, n.updated_at,
n.position_new_path, n.position_new_line,
n.position_old_path, n.position_old_line,
n.resolvable, n.resolved, n.resolved_by,
d.noteable_type,
COALESCE(i.iid, m.iid) AS parent_iid,
COALESCE(i.title, m.title) AS parent_title,
p.path_with_namespace
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
JOIN projects p ON n.project_id = p.id
LEFT JOIN issues i ON d.issue_id = i.id
LEFT JOIN merge_requests m ON d.merge_request_id = m.id
WHERE {dynamic_filters}
ORDER BY {sort_column} {order}, n.id {order}
LIMIT ?
```
**Important:** The `ORDER BY` includes `n.id` as a deterministic tiebreaker. Notes with identical timestamps will always sort in the same order. This follows SQLite best practice for reproducible result sets.
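One way to build that clause is sketched below. The function name `order_by_clause` and the mapping of `"created"`/`"updated"` to `n.created_at`/`n.updated_at` are assumptions inferred from the filter struct; the point is that only whitelisted fragments are interpolated and the `n.id` tiebreaker follows the same direction as the primary sort.

```rust
/// Build the ORDER BY clause from user-facing sort/order values.
/// Only whitelisted column names and directions are ever interpolated,
/// so the values never need to be bound as SQL parameters.
pub fn order_by_clause(sort: &str, order: &str) -> Result<String, String> {
    let column = match sort {
        "created" => "n.created_at",
        "updated" => "n.updated_at",
        other => return Err(format!("unknown sort field: {}", other)),
    };
    let dir = match order {
        "asc" => "ASC",
        "desc" => "DESC",
        other => return Err(format!("unknown order: {}", other)),
    };
    // n.id uses the same direction so ties resolve deterministically.
    Ok(format!("ORDER BY {} {}, n.id {}", column, dir, dir))
}
```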
Dynamic WHERE clauses follow the same `where_clauses` + `params` vec pattern as `query_issues()` (see `list.rs:287-374`).
Filter mappings:
- `include_system = false` (default): `n.is_system = 0`
- `author`: strip `@` prefix, `n.author_username = ? COLLATE NOCASE`
- `note_type`: `n.note_type = ?`
- `project`: `resolve_project(conn, project)?` then `n.project_id = ?`
- `note_id`: `n.id = ?` (exact local row ID match — useful for debugging sync correctness)
- `gitlab_note_id`: `n.gitlab_id = ?` (exact GitLab note ID match — cross-reference with GitLab API)
- `discussion_id`: `n.discussion_id = ?` (all notes in a specific discussion thread)
- `since`: parsed via `parse_since_with_anchor(since_str, now_ms)` then `n.created_at >= ?`
- `until`: parsed via `parse_until_with_anchor(until_str, now_ms)` then `n.created_at <= ?`
- `path` with trailing `/`: `n.position_new_path LIKE ? ESCAPE '\'` (use `escape_like` from `filters.rs`)
- `path` without trailing `/`: `n.position_new_path = ?`
- `resolution = "unresolved"`: `n.resolvable = 1 AND n.resolved = 0`
- `resolution = "resolved"`: `n.resolvable = 1 AND n.resolved = 1`
- `resolution = "any"`: no filter (default)
- `for_issue_iid`: requires resolved project_id (from `--project` flag or `defaultProject` config). SQL: `d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)` — the project_id param comes from the already-resolved project context
- `for_mr_iid`: same pattern — `d.merge_request_id = (SELECT id FROM merge_requests WHERE iid = ? AND project_id = ?)` — requires resolved project_id
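The mappings above can be sketched as a small clause builder. This is an illustration under stated assumptions, not the `list.rs` code: `Param`, `Filters`, `build_where`, and `escape_like_sketch` are hypothetical stand-ins (the real code would bind rusqlite parameters and reuse `escape_like` from `filters.rs`), and only a subset of the filters is shown.

```rust
/// Bound parameter value (hypothetical stand-in for a boxed SQL parameter).
#[derive(Debug, Clone, PartialEq)]
pub enum Param {
    Int(i64),
    Text(String),
}

/// Hypothetical, trimmed-down filter set for this sketch.
#[derive(Default)]
pub struct Filters<'a> {
    pub include_system: bool,
    pub author: Option<&'a str>,
    pub project_id: Option<i64>, // already resolved
    pub since_ms: Option<i64>,
    pub until_ms: Option<i64>,
    pub path: Option<&'a str>,
    pub resolution: &'a str,
}

/// Escape LIKE wildcards so user input matches literally under ESCAPE '\'.
fn escape_like_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if matches!(c, '\\' | '%' | '_') {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Returns the WHERE body plus parameters in matching positional order.
pub fn build_where(f: &Filters<'_>) -> (String, Vec<Param>) {
    let mut clauses: Vec<&'static str> = Vec::new();
    let mut params: Vec<Param> = Vec::new();

    if !f.include_system {
        clauses.push("n.is_system = 0");
    }
    if let Some(author) = f.author {
        clauses.push("n.author_username = ? COLLATE NOCASE");
        let name = author.strip_prefix('@').unwrap_or(author);
        params.push(Param::Text(name.to_string()));
    }
    if let Some(pid) = f.project_id {
        clauses.push("n.project_id = ?");
        params.push(Param::Int(pid));
    }
    if let Some(since) = f.since_ms {
        clauses.push("n.created_at >= ?");
        params.push(Param::Int(since));
    }
    if let Some(until) = f.until_ms {
        clauses.push("n.created_at <= ?");
        params.push(Param::Int(until));
    }
    if let Some(path) = f.path {
        if path.ends_with('/') {
            // Trailing slash means prefix match.
            clauses.push("n.position_new_path LIKE ? ESCAPE '\\'");
            params.push(Param::Text(format!("{}%", escape_like_sketch(path))));
        } else {
            clauses.push("n.position_new_path = ?");
            params.push(Param::Text(path.to_string()));
        }
    }
    match f.resolution {
        "unresolved" => clauses.push("n.resolvable = 1 AND n.resolved = 0"),
        "resolved" => clauses.push("n.resolvable = 1 AND n.resolved = 1"),
        _ => {} // "any": no filter
    }

    let sql = if clauses.is_empty() {
        "1 = 1".to_string()
    } else {
        clauses.join(" AND ")
    };
    (sql, params)
}
```

Keeping clauses and parameters in two vectors that grow in lockstep is what keeps the positional `?` placeholders aligned with their values.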
**IID scoping rule:** `for_issue_iid` and `for_mr_iid` require a project context because IIDs are only unique within a project. The query layer validates this: if `for_issue_iid` or `for_mr_iid` is set without a resolved project_id, return an error. The project can come from either `--project` flag or `defaultProject` in config (resolved via the existing `resolve_project()` which already handles `defaultProject` fallback). Note: the CLI does NOT use clap's `requires = "project"` constraint for these flags, because that would block `defaultProject` resolution — the validation happens at the query layer instead.
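A sketch of that query-layer check, assuming the project has already been resolved to an optional id before the filters are applied. The `MissingProject` type and `validate_iid_scope` function are hypothetical names; the real code would presumably return a `LoreError` variant instead.

```rust
use std::fmt;

/// Hypothetical error for an IID filter used without a project context.
#[derive(Debug, PartialEq)]
pub struct MissingProject {
    pub flag: &'static str,
}

impl fmt::Display for MissingProject {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{} requires a project: pass --project or set defaultProject in config",
            self.flag
        )
    }
}

/// Ok when no IID filter is set, or when a project has been resolved
/// (from either --project or the defaultProject fallback).
pub fn validate_iid_scope(
    for_issue_iid: Option<i64>,
    for_mr_iid: Option<i64>,
    resolved_project_id: Option<i64>,
) -> Result<(), MissingProject> {
    if resolved_project_id.is_some() {
        return Ok(());
    }
    if for_issue_iid.is_some() {
        return Err(MissingProject { flag: "--for-issue" });
    }
    if for_mr_iid.is_some() {
        return Err(MissingProject { flag: "--for-mr" });
    }
    Ok(())
}
```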
Run the COUNT query first (same pattern as issues) to populate `total_count`, then run the SELECT with LIMIT.
**Public entry point:**
```rust
pub fn run_list_notes(config: &Config, filters: NoteListFilters) -> Result<NoteListResult> {
let db_path = get_db_path(config.storage.db_path.as_deref());
let conn = create_connection(&db_path)?;
query_notes(&conn, &filters)
}
```
---
### Work Chunk 1B: CLI Arguments & Command Wiring
**Files:** `src/cli/mod.rs`, `src/main.rs`, `src/cli/commands/mod.rs`, `src/cli/robot.rs`
**Depends on:** Work Chunk 1A
#### Tests to Write First
No unit tests for CLI arg parsing (clap handles this). Integration-level assertions:
```rust
// In src/cli/robot.rs tests (or new test module):
#[test]
fn test_expand_fields_preset_notes() {
let fields = vec!["minimal".to_string()];
let expanded = expand_fields_preset(&fields, "notes");
assert_eq!(expanded, vec!["id", "author_username", "body", "created_at_iso"]);
}
```
#### Implementation
**1. Add `NotesArgs` to `src/cli/mod.rs`** (after `MrsArgs`, around line 472):
```rust
#[derive(Parser)]
#[command(after_help = "\x1b[1mExamples:\x1b[0m
lore notes --author jdefting --since 365d # All of jdefting's notes in past year
lore notes --author jdefting --note-type DiffNote # Only code review comments
lore notes --path src/auth/ --resolution unresolved # Unresolved comments on auth code
lore notes --for-mr 456 -p group/repo # All notes on MR !456
lore notes --since 180d --until 90d # Notes from 180 to 90 days ago
lore notes --author jdefting --format jsonl # Stream notes for LLM analysis
lore notes --contains \"unwrap\" --note-type DiffNote # Find review comments mentioning unwrap")]
pub struct NotesArgs {
/// Maximum results
#[arg(short = 'n', long = "limit", default_value = "50", help_heading = "Output")]
pub limit: usize,
/// Select output fields (comma-separated, or 'minimal' preset)
#[arg(long, help_heading = "Output", value_delimiter = ',')]
pub fields: Option<Vec<String>>,
/// Output format (table, json, jsonl, csv)
#[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
pub format: String,
/// Filter by author username (case-insensitive)
#[arg(short = 'a', long, help_heading = "Filters")]
pub author: Option<String>,
/// Filter by note type (DiffNote, DiscussionNote)
#[arg(long = "note-type", value_parser = ["DiffNote", "DiscussionNote"], help_heading = "Filters")]
pub note_type: Option<String>,
/// Filter by case-insensitive substring in note body
#[arg(long, help_heading = "Filters")]
pub contains: Option<String>,
/// Filter by local note row id (exact match, for debugging)
#[arg(long = "note-id", help_heading = "Filters")]
pub note_id: Option<i64>,
/// Filter by GitLab note id (exact match, for cross-referencing)
#[arg(long = "gitlab-note-id", help_heading = "Filters")]
pub gitlab_note_id: Option<i64>,
/// Filter by local discussion id (all notes in a thread)
#[arg(long = "discussion-id", help_heading = "Filters")]
pub discussion_id: Option<i64>,
/// Include system-generated notes (excluded by default)
#[arg(long = "include-system", help_heading = "Filters", overrides_with = "no_include_system")]
pub include_system: bool,
#[arg(long = "no-include-system", hide = true, overrides_with = "include_system")]
pub no_include_system: bool,
/// Filter to notes on a specific issue IID (requires --project or defaultProject)
#[arg(long = "for-issue", help_heading = "Filters", conflicts_with = "for_mr")]
pub for_issue: Option<i64>,
/// Filter to notes on a specific MR IID (requires --project or defaultProject)
#[arg(long = "for-mr", help_heading = "Filters", conflicts_with = "for_issue")]
pub for_mr: Option<i64>,
/// Filter by project path
#[arg(short = 'p', long, help_heading = "Filters")]
pub project: Option<String>,
/// Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
#[arg(long, help_heading = "Filters")]
pub since: Option<String>,
/// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
#[arg(long, help_heading = "Filters")]
pub until: Option<String>,
/// Filter by file path (trailing / for prefix match)
#[arg(long, help_heading = "Filters")]
pub path: Option<String>,
/// Resolution filter: any (default), unresolved, resolved
#[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
pub resolution: String,
/// Sort field (created, updated)
#[arg(long, value_parser = ["created", "updated"], default_value = "created", help_heading = "Sorting")]
pub sort: String,
/// Sort ascending (default: descending)
#[arg(long, help_heading = "Sorting", overrides_with = "no_asc")]
pub asc: bool,
#[arg(long = "no-asc", hide = true, overrides_with = "asc")]
pub no_asc: bool,
}
```
**Note on `--for-issue` / `--for-mr`:** These flags do NOT use clap's `requires = "project"` constraint. The `defaultProject` config option provides the project context without the `--project` flag being explicitly passed. Validation happens at the query layer (Work Chunk 1A) — if neither `--project` nor `defaultProject` resolves a project, the query returns a clear error.
**2. Add `Notes` variant to `Commands` enum** in `src/cli/mod.rs` (around line 113):
```rust
/// List discussion notes with filtering
Notes(NotesArgs),
```
**3. Add `"notes"` minimal preset to `expand_fields_preset()`** in `src/cli/robot.rs` (around line 42):
```rust
"notes" => ["id", "author_username", "body", "created_at_iso"]
.iter()
.map(|s| (*s).to_string())
.collect(),
```
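For context, the match arm above would sit inside a function shaped roughly like the following. This is a simplified, hypothetical version (`expand_fields_preset_sketch`): the pass-through behavior for non-preset input and for unknown entities is an assumption, and the presets for other entities are omitted.

```rust
/// Expand the single-element `["minimal"]` request into the preset field
/// list for `entity`; any other field list is returned unchanged.
pub fn expand_fields_preset_sketch(fields: &[String], entity: &str) -> Vec<String> {
    let is_minimal = fields.len() == 1 && fields[0] == "minimal";
    if !is_minimal {
        return fields.to_vec();
    }
    let preset: &[&str] = match entity {
        "notes" => &["id", "author_username", "body", "created_at_iso"],
        // ASSUMPTION: unknown entities fall back to the input as given.
        _ => return fields.to_vec(),
    };
    preset.iter().map(|s| (*s).to_string()).collect()
}
```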
**4. Add handler in `src/main.rs`** (follow `handle_issues`/`handle_mrs` pattern):
```rust
fn handle_notes(config_path: Option<&str>, args: NotesArgs, robot_mode: bool) -> Result<()> {
let config = load_config(config_path)?;
let start = std::time::Instant::now();
let filters = NoteListFilters {
limit: args.limit,
project: args.project.as_deref(),
author: args.author.as_deref(),
note_type: args.note_type.as_deref(),
include_system: args.include_system,
for_issue_iid: args.for_issue,
for_mr_iid: args.for_mr,
note_id: args.note_id,
gitlab_note_id: args.gitlab_note_id,
discussion_id: args.discussion_id,
since: args.since.as_deref(),
until: args.until.as_deref(),
path: args.path.as_deref(),
contains: args.contains.as_deref(),
resolution: &args.resolution,
sort: &args.sort,
order: if args.asc { "asc" } else { "desc" },
};
let result = run_list_notes(&config, filters)?;
match (robot_mode, args.format.as_str()) {
(true, _) | (_, "json") => {
print_list_notes_json(&result, start.elapsed().as_millis() as u64, args.fields.as_deref());
}
(_, "jsonl") => {
print_list_notes_jsonl(&result);
}
(_, "csv") => {
print_list_notes_csv(&result);
}
_ => {
print_list_notes(&result);
}
}
Ok(())
}
```
Add dispatch in main match (around line 175):
```rust
Some(Commands::Notes(args)) => handle_notes(cli.config.as_deref(), args, robot_mode),
```
**5. Re-export in `src/cli/commands/mod.rs`:**
```rust
pub use list::{run_list_notes, print_list_notes, print_list_notes_json, print_list_notes_jsonl, print_list_notes_csv};
```
---
### Work Chunk 1C: Human & Robot Output Formatting
**Files:** `src/cli/commands/list.rs`
**Depends on:** Work Chunk 1A
#### Tests to Write First
```rust
#[test]
fn test_truncate_note_body() {
// Body with 200 chars should truncate to 80 chars total, including the trailing "..."
let body = "x".repeat(200);
let truncated = truncate_with_ellipsis(&body, 80);
assert_eq!(truncated.len(), 80);
assert!(truncated.ends_with("..."));
}
#[test]
fn test_csv_output_roundtrip() {
// NoteListRow with body containing commas, quotes, newlines, and multi-byte chars
// Write via print_list_notes_csv, parse back with csv::ReaderBuilder
// Assert: all fields roundtrip correctly
}
#[test]
fn test_jsonl_output_one_per_line() {
// NoteListResult with 3 notes
// Capture stdout, split by newline
// Assert: each line parses as valid JSON
// Assert: 3 lines total
}
```
#### Implementation
**`print_list_notes(result: &NoteListResult)`** — human-readable table:
Table columns: `ID | Author | Type | Body (truncated 60) | Path:Line | Parent | Created`
- ID: `colored_cell(note.gitlab_id, Color::Cyan)`
- Author: `colored_cell(format!("@{}", author), Color::Magenta)`
- Type: "Diff" or "Disc" or "-" (colored)
- Body: first line, truncated to 60 chars
- Path:Line: `position_new_path:position_new_line` or "-"
- Parent: `Issue #42` or `MR !456` (from noteable_type + parent_iid)
- Created: `format_relative_time(created_at)`
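The Body, Path:Line, and Parent cells could be derived along these lines. The helper names (`body_cell`, `path_line_cell`, `parent_cell`) are hypothetical, and the sketch counts characters rather than bytes so multi-byte text is never split mid-character; the existing `truncate_with_ellipsis` helper may behave differently.

```rust
/// First line of `body`, cut to at most `max_chars` characters. The
/// trailing "..." counts toward the limit (assumes `max_chars >= 3`).
pub fn body_cell(body: &str, max_chars: usize) -> String {
    let first_line = body.lines().next().unwrap_or("");
    if first_line.chars().count() <= max_chars {
        return first_line.to_string();
    }
    let keep = max_chars.saturating_sub(3);
    let mut out: String = first_line.chars().take(keep).collect();
    out.push_str("...");
    out
}

/// `path:line` when both are present, the bare path when only the path
/// is known, and "-" for notes without a diff position.
pub fn path_line_cell(path: Option<&str>, line: Option<i64>) -> String {
    match (path, line) {
        (Some(p), Some(l)) => format!("{}:{}", p, l),
        (Some(p), None) => p.to_string(),
        _ => "-".to_string(),
    }
}

/// `Issue #42` or `MR !456`, based on the discussion's noteable type.
pub fn parent_cell(noteable_type: &str, parent_iid: Option<i64>) -> String {
    match (noteable_type, parent_iid) {
        ("Issue", Some(iid)) => format!("Issue #{}", iid),
        ("MergeRequest", Some(iid)) => format!("MR !{}", iid),
        _ => "-".to_string(),
    }
}
```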
**`print_list_notes_json(result, elapsed_ms, fields)`** — robot JSON:
Follows the existing robot envelope pattern exactly:
```json
{
"ok": true,
"data": {
"notes": [...],
"total_count": N,
"showing": M
},
"meta": { "elapsed_ms": U64 }
}
```
Supports `--fields` via `filter_fields(&mut output, "notes", &expanded)`.
**`print_list_notes_jsonl(result: &NoteListResult)`** — one JSON object per line:
Each line is a complete `NoteListRowJson` object. No envelope, no metadata. This format is ideal for streaming into LLM prompts, `jq` pipelines, or notebook ingestion.
```rust
for note in &result.notes {
let json_row = NoteListRowJson::from(note);
println!("{}", serde_json::to_string(&json_row).unwrap());
}
```
**`print_list_notes_csv(result: &NoteListResult)`** — CSV with header:
Columns mirror `NoteListRowJson` field names. Uses the `csv` crate (`csv::Writer`) for RFC 4180-compliant escaping, handling commas, quotes, newlines, and multi-byte characters correctly. This avoids the fragility of manual CSV escaping.
```rust
let mut wtr = csv::Writer::from_writer(std::io::stdout());
// Write header
wtr.write_record(&["id", "gitlab_id", "author_username", "body", "note_type", ...])?;
// Write rows
for note in &result.notes {
let json_row = NoteListRowJson::from(note);
wtr.write_record(&[json_row.id.to_string(), ...])?;
}
wtr.flush()?;
```
**Dependency:** Add `csv = "1"` to `Cargo.toml` under `[dependencies]`. The `csv` crate is well-maintained and widely adopted.
---
### Work Chunk 1D: robot-docs Integration
**Files:** Wherever `robot-docs` manifest is generated (search for `robot-docs` or `RobotDocs` command handler)
**Depends on:** Work Chunks 1A-1C complete
Add the `notes` command to the robot-docs manifest with:
- Command name, description, flags (including `--format`, `--until`, `--resolution`, `--contains`)
- Response schema for robot mode
- Exit codes
Also update the `--type` `value_parser` on `SearchArgs` (line 542 of `src/cli/mod.rs`) to accept `"note"` and `"notes"` as valid source-type values (this is forward-prep for Phase 2 and doesn't break anything before Phase 2 lands).
---
### Work Chunk 1E: Composite Query Index
**Files:** `migrations/022_notes_query_index.sql`, `src/core/db.rs`
**Depends on:** Nothing (standalone, can run in parallel with 1A)
**Context:** The `notes` table already has single-column indexes on `author_username`, `discussion_id`, `note_type`, `position_new_path`, and a composite `idx_notes_diffnote_path_created`. However, the most common query patterns of the new `query_notes()` function would benefit from composite indexes that match their filter and sort columns.
#### Tests to Write First
```rust
#[test]
fn test_migration_022_indexes_exist() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Verify all indexes were created
let count: i64 = conn.query_row(
"SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND name IN (
'idx_notes_user_created', 'idx_notes_project_created',
'idx_discussions_issue_id', 'idx_discussions_mr_id'
)",
[],
|r| r.get(0),
).unwrap();
assert_eq!(count, 4);
}
```
#### Implementation
**Migration SQL** (`migrations/022_notes_query_index.sql`):
```sql
-- Composite index for the common "notes by author" query pattern:
-- non-system notes filtered by author, sorted by created_at DESC with id tiebreaker.
-- The is_system partial index condition avoids indexing system notes (which are
-- filtered out by default and typically comprise 30-50% of all notes).
-- Uses COLLATE NOCASE to match the query's case-insensitive author comparison.
CREATE INDEX IF NOT EXISTS idx_notes_user_created
ON notes(project_id, author_username COLLATE NOCASE, created_at DESC, id DESC)
WHERE is_system = 0;
-- Composite index for the common "all notes in project by date" query pattern:
-- serves project-scoped listings without author filter.
CREATE INDEX IF NOT EXISTS idx_notes_project_created
ON notes(project_id, created_at DESC, id DESC)
WHERE is_system = 0;
-- Index on discussions.issue_id for efficient JOIN when filtering by parent issue.
-- The query_notes() function JOINs discussions to reach parent entities.
CREATE INDEX IF NOT EXISTS idx_discussions_issue_id
ON discussions(issue_id);
-- Index on discussions.merge_request_id for efficient JOIN when filtering by parent MR.
CREATE INDEX IF NOT EXISTS idx_discussions_mr_id
ON discussions(merge_request_id);
```
The first partial index serves the primary use case (author-scoped queries), with `COLLATE NOCASE` matching the query's case-insensitive author comparison. Because `project_id` is its leading column, it is most effective when the query also constrains the project. The second serves project-scoped date-range queries (`--since`/`--until` without `--author`). Both exclude system notes, so they apply only to queries that keep the default `n.is_system = 0` filter. The discussion indexes accelerate the JOIN path used by all note queries.
**Register in `src/core/db.rs`:**
Add to the `MIGRATIONS` array (after migration 021):
```rust
(
"022",
include_str!("../../migrations/022_notes_query_index.sql"),
),
```
**Note:** This bumps the migration number, so Work Chunk 2A's schema migration (which was originally numbered 022) becomes migration **023** instead.
---
## Phase 2: Per-Note Documents
### Work Chunk 2A: Schema Migration (023)
**Files:** `migrations/023_note_documents.sql`, `src/core/db.rs`
**Depends on:** Work Chunk 1E (must come after migration 022)
**Context:** Current migration is 021 (022 after Work Chunk 1E). The `documents` and `dirty_sources` tables have CHECK constraints limiting `source_type` to `('issue','merge_request','discussion')`. SQLite doesn't support `ALTER TABLE ... ALTER CONSTRAINT`, so we use the table-rebuild pattern.
#### Tests to Write First
```rust
// In src/core/db.rs tests or a new migration test:
#[test]
fn test_migration_023_allows_note_source_type() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Should NOT error — note is now a valid source_type
conn.execute(
"INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('note', 1, 1000)",
[],
).unwrap();
conn.execute(
"INSERT INTO documents (source_type, source_id, project_id, content_text, content_hash, is_truncated)
VALUES ('note', 1, 1, 'test', 'abc123', 0)",
[],
).unwrap();
}
#[test]
fn test_migration_023_preserves_existing_data() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Insert with old source types still works
conn.execute(
"INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('issue', 1, 1000)",
[],
).unwrap();
conn.execute(
"INSERT INTO dirty_sources (source_type, source_id, queued_at) VALUES ('discussion', 2, 1000)",
[],
).unwrap();
let count: i64 = conn.query_row("SELECT COUNT(*) FROM dirty_sources", [], |r| r.get(0)).unwrap();
assert_eq!(count, 2);
}
#[test]
fn test_migration_023_fts_triggers_intact() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Insert a note document
conn.execute(
"INSERT INTO documents (source_type, source_id, project_id, title, content_text, content_hash, is_truncated)
VALUES ('note', 1, 1, 'Test Note', 'This is the note body', 'hash123', 0)",
[],
).unwrap();
// FTS should auto-sync via trigger
let count: i64 = conn.query_row(
"SELECT COUNT(*) FROM documents_fts WHERE documents_fts MATCH 'note'",
[],
|r| r.get(0),
).unwrap();
assert_eq!(count, 1);
}
#[test]
fn test_migration_023_row_counts_preserved() {
// This test verifies the migration doesn't lose data during table rebuild.
// It runs all migrations up to version 22, inserts test data into documents/dirty_sources/
// document_labels/document_paths BEFORE migration 023, then verifies
// counts are identical after migration 023 runs.
// (Implementation: create_connection_at_version(22) + insert data + run_migration(23) + assert counts)
// Note: This may require a test helper that runs migrations up to a specific version.
}
```
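The version-limited helper mentioned in the last test could look roughly like this. It is a sketch under assumptions: `run_migrations_up_to` is a hypothetical name, migrations are modeled as `(version, sql)` pairs like the `MIGRATIONS` array, and the `exec` closure stands in for whatever actually executes the SQL against the connection.

```rust
/// Apply every migration whose numeric version is <= `target`, in array
/// order, and return the last version applied (0 if none ran).
pub fn run_migrations_up_to<E>(
    migrations: &[(&str, &str)],
    target: u32,
    mut exec: impl FnMut(&str) -> Result<(), E>,
) -> Result<u32, E> {
    let mut applied = 0;
    for &(version, sql) in migrations {
        let v: u32 = version.parse().expect("migration versions are numeric");
        if v > target {
            break;
        }
        // Stop at the first failure, leaving later migrations unapplied.
        exec(sql)?;
        applied = v;
    }
    Ok(applied)
}
```

A row-count test would call this with `target = 22`, insert fixture rows, then execute migration 023 on its own and compare counts.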
#### Implementation
**Migration SQL** (`migrations/023_note_documents.sql`):
The tables with CHECK constraints that need rebuilding:
1. `dirty_sources` — add `'note'` to source_type CHECK
2. `documents` — add `'note'` to source_type CHECK
Pattern: create new table, copy data, drop old, rename. Must also recreate FTS triggers (they reference the table by name) and all indexes.
**CRITICAL:** The `documents_fts` external content table references `documents` by rowid. Rebuilding `documents` changes rowids unless we preserve them. Use `INSERT INTO documents_new SELECT * FROM documents` to preserve the `id` (PRIMARY KEY = rowid).
**CRITICAL:** The FTS triggers (`documents_ai`, `documents_ad`, `documents_au`) must be dropped and recreated after the table rebuild because they reference `documents` which was dropped/renamed.
**Migration safety requirements:**
- The migration executes as a single transaction (SQLite migration runner wraps each migration in a transaction).
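- After the table rebuild, row counts must match: `SELECT COUNT(*) FROM documents` must equal the pre-rebuild count, and likewise for `document_labels`, `document_paths`, and `dirty_sources`.
- `PRAGMA foreign_key_check` must return no violations after the rebuild.
- After the FTS rebuild, the `documents_fts` row count must match the `documents` row count.
- These checks are asserted in the migration test suite (`test_migration_023_integrity_checks_pass`) rather than in the migration SQL, because `RAISE(ABORT, ...)` is not valid in a standalone `SELECT`.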
The migration must:
1. Snapshot pre-migration row counts
2. Rebuild `dirty_sources` with the expanded CHECK (the simple case: no dependents)
3. Back up `document_labels` and `document_paths`, because both reference `documents` with ON DELETE CASCADE and would otherwise lose their rows
4. Drop the FTS triggers, then drop the junction tables
5. Create `documents_new` with updated CHECK (adding `'note'`), run `INSERT INTO documents_new SELECT * FROM documents`, drop `documents`, and rename `documents_new` to `documents`
6. Recreate all indexes and the junction tables, then restore the junction data from the backups
7. Recreate the FTS triggers and rebuild the FTS index
8. Add the defense-in-depth cleanup triggers on `notes`
```sql
-- Capture pre-migration counts for integrity verification
CREATE TEMP TABLE _pre_counts AS
SELECT
(SELECT COUNT(*) FROM documents) AS doc_count,
(SELECT COUNT(*) FROM document_labels) AS label_count,
(SELECT COUNT(*) FROM document_paths) AS path_count,
(SELECT COUNT(*) FROM dirty_sources) AS dirty_count;
-- Rebuild dirty_sources with expanded CHECK
CREATE TABLE dirty_sources_new (
source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion','note')),
source_id INTEGER NOT NULL,
queued_at INTEGER NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
next_attempt_at INTEGER,
PRIMARY KEY(source_type, source_id)
);
INSERT INTO dirty_sources_new SELECT * FROM dirty_sources;
DROP TABLE dirty_sources;
ALTER TABLE dirty_sources_new RENAME TO dirty_sources;
CREATE INDEX idx_dirty_sources_next_attempt ON dirty_sources(next_attempt_at);
-- Rebuild documents (must preserve FTS consistency)
-- Step 1: Save junction table data
CREATE TABLE _doc_labels_backup AS SELECT * FROM document_labels;
CREATE TABLE _doc_paths_backup AS SELECT * FROM document_paths;
-- Step 2: Drop FTS triggers (they reference 'documents')
DROP TRIGGER IF EXISTS documents_ai;
DROP TRIGGER IF EXISTS documents_ad;
DROP TRIGGER IF EXISTS documents_au;
-- Step 3: Drop junction tables (they FK to documents)
DROP TABLE document_labels;
DROP TABLE document_paths;
-- Step 4: Rebuild documents with updated CHECK
CREATE TABLE documents_new (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion','note')),
source_id INTEGER NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT,
label_names TEXT,
created_at INTEGER,
updated_at INTEGER,
url TEXT,
title TEXT,
content_text TEXT NOT NULL,
content_hash TEXT NOT NULL,
labels_hash TEXT NOT NULL DEFAULT '',
paths_hash TEXT NOT NULL DEFAULT '',
is_truncated INTEGER NOT NULL DEFAULT 0,
truncated_reason TEXT CHECK (
truncated_reason IN (
'token_limit_middle_drop','single_note_oversized','first_last_oversized',
'hard_cap_oversized'
)
OR truncated_reason IS NULL
),
UNIQUE(source_type, source_id)
);
INSERT INTO documents_new SELECT * FROM documents;
DROP TABLE documents;
ALTER TABLE documents_new RENAME TO documents;
-- Step 5: Recreate indexes
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
CREATE INDEX idx_documents_hash ON documents(content_hash);
-- Step 6: Recreate junction tables
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
) WITHOUT ROWID;
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
CREATE TABLE document_paths (
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
path TEXT NOT NULL,
PRIMARY KEY(document_id, path)
) WITHOUT ROWID;
CREATE INDEX idx_document_paths_path ON document_paths(path);
-- Step 7: Restore junction table data
INSERT INTO document_labels SELECT * FROM _doc_labels_backup;
INSERT INTO document_paths SELECT * FROM _doc_paths_backup;
DROP TABLE _doc_labels_backup;
DROP TABLE _doc_paths_backup;
-- Step 8: Recreate FTS triggers
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, COALESCE(new.title, ''), new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, COALESCE(old.title, ''), old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents
WHEN old.title IS NOT new.title OR old.content_text != new.content_text
BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, COALESCE(old.title, ''), old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, COALESCE(new.title, ''), new.content_text);
END;
-- Step 9: Rebuild FTS index to be safe
INSERT INTO documents_fts(documents_fts) VALUES('rebuild');
-- Step 10: Defense-in-depth cleanup triggers for note documents.
-- These fire when a note is deleted or flipped to system, ensuring orphaned
-- documents/dirty_sources entries cannot survive even if a future code path
-- deletes notes outside the normal sweep functions (Work Chunk 0B).
-- The sweep functions handle the common path; these triggers are the safety net.
CREATE TRIGGER notes_ad_cleanup AFTER DELETE ON notes
WHEN old.is_system = 0
BEGIN
DELETE FROM documents
WHERE source_type = 'note' AND source_id = old.id;
DELETE FROM dirty_sources
WHERE source_type = 'note' AND source_id = old.id;
END;
-- If a note is reclassified from user to system (unlikely but possible via
-- API changes), remove its document artifacts since system notes don't get documents.
CREATE TRIGGER notes_au_system_cleanup AFTER UPDATE OF is_system ON notes
WHEN old.is_system = 0 AND new.is_system = 1
BEGIN
DELETE FROM documents
WHERE source_type = 'note' AND source_id = new.id;
DELETE FROM dirty_sources
WHERE source_type = 'note' AND source_id = new.id;
END;
-- Step 11: Integrity verification (moved to migration tests)
-- Note: RAISE(ABORT, ...) in standalone SELECT is not valid SQLite usage outside
-- triggers/CHECK constraints. Integrity checks are enforced in the migration test
-- suite instead (see test_migration_023_integrity_checks_pass). This keeps migration
-- SQL portable and avoids relying on SQLite-version-specific behavior.
DROP TABLE _pre_counts;
```
**Register in `src/core/db.rs`:**
Add to the `MIGRATIONS` array (after migration 022):
```rust
(
"023",
include_str!("../../migrations/023_note_documents.sql"),
),
```
`LATEST_SCHEMA_VERSION` auto-derives from `MIGRATIONS.len()` — no manual change needed.
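Deriving the version from the array length only works if the labels are contiguous. A small guard like the following could catch a skipped or duplicated label. It is a hypothetical check, and it assumes labels start at `"001"` and increase by one per entry, which the excerpts above suggest but do not confirm.

```rust
/// Verify that migration labels are "001", "002", ... with no gaps, so
/// that the array length equals the latest schema version.
pub fn latest_schema_version(migrations: &[(&str, &str)]) -> Result<usize, String> {
    for (index, &(label, _sql)) in migrations.iter().enumerate() {
        let expected = index + 1;
        match label.parse::<usize>() {
            Ok(v) if v == expected => {}
            _ => {
                return Err(format!(
                    "migration at position {} is labeled {:?}, expected {:03}",
                    index, label, expected
                ))
            }
        }
    }
    Ok(migrations.len())
}
```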
**Migration integrity tests** (add to migration test module):
```rust
#[test]
fn test_migration_023_integrity_checks_pass() {
let conn = create_connection(Path::new(":memory:")).unwrap();
// Run all migrations up to 022, then insert test data
// Run migration 023
// 1. Verify pre/post row count equality for documents, document_labels,
// document_paths, and dirty_sources
// 2. Verify PRAGMA foreign_key_check returns empty result set
// 3. Verify documents_fts row count matches documents row count after rebuild
}
#[test]
fn test_migration_023_fts_rebuild_consistent() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Insert several documents with different source types
// Verify: SELECT COUNT(*) FROM documents_fts == SELECT COUNT(*) FROM documents
}
#[test]
fn test_migration_023_note_delete_trigger_cleans_document() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Setup: project, issue, discussion, non-system note
// Insert note document (source_type='note', source_id=note.id)
// Delete the note row directly (simulating a non-sweep deletion path)
// Assert: document row for that note is gone (trigger fired)
// Assert: dirty_sources entry (if any) is gone
}
#[test]
fn test_migration_023_note_system_flip_trigger_cleans_document() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Setup: project, issue, discussion, non-system note with a document
// UPDATE notes SET is_system = 1 WHERE id = note_id
// Assert: document row for that note is gone (trigger fired)
}
#[test]
fn test_migration_023_system_note_delete_trigger_does_not_fire() {
let conn = create_connection(Path::new(":memory:")).unwrap();
run_migrations(&conn).unwrap();
// Setup: system note (is_system = 1) — no document exists
// Delete the system note row
// Assert: no error (trigger WHEN clause skips system notes)
}
```
---
### Work Chunk 2B: SourceType Enum Extension
**Files:** `src/documents/extractor.rs`
**Depends on:** Work Chunk 2A (migration must exist so test DBs have the right schema)
#### Tests to Write First
Add to `src/documents/extractor.rs` in the existing test module:
```rust
#[test]
fn test_source_type_parse_note() {
assert_eq!(SourceType::parse("note"), Some(SourceType::Note));
assert_eq!(SourceType::parse("notes"), Some(SourceType::Note));
assert_eq!(SourceType::parse("NOTE"), Some(SourceType::Note));
}
#[test]
fn test_source_type_note_as_str() {
assert_eq!(SourceType::Note.as_str(), "note");
}
#[test]
fn test_source_type_note_display() {
assert_eq!(format!("{}", SourceType::Note), "note");
}
#[test]
fn test_source_type_note_serde_roundtrip() {
let st = SourceType::Note;
let json = serde_json::to_string(&st).unwrap();
assert_eq!(json, "\"note\"");
let parsed: SourceType = serde_json::from_str(&json).unwrap();
assert_eq!(parsed, SourceType::Note);
}
```
#### Implementation
In `src/documents/extractor.rs`:
1. Add `Note` variant to `SourceType` enum (line 18):
```rust
Note,
```
2. Add match arm to `as_str()` (line 27):
```rust
Self::Note => "note",
```
3. Add parse aliases (line 35):
```rust
"note" | "notes" => Some(Self::Note),
```
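Put together, the three changes give an enum shaped roughly like the sketch below. The pre-existing variants and their canonical strings are inferred from the CHECK constraint values (`issue`, `merge_request`, `discussion`); any additional aliases the real `parse()` accepts, and the serde derives exercised by the roundtrip test, are omitted here.

```rust
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SourceType {
    Issue,
    MergeRequest,
    Discussion,
    Note,
}

impl SourceType {
    /// Canonical string stored in the `source_type` column.
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Issue => "issue",
            Self::MergeRequest => "merge_request",
            Self::Discussion => "discussion",
            Self::Note => "note",
        }
    }

    /// Case-insensitive parse; "notes" is accepted as an alias for "note".
    pub fn parse(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "issue" => Some(Self::Issue),
            "merge_request" => Some(Self::MergeRequest),
            "discussion" => Some(Self::Discussion),
            "note" | "notes" => Some(Self::Note),
            _ => None,
        }
    }
}

impl fmt::Display for SourceType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}
```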
---
### Work Chunk 2C: Note Document Extractor
**Files:** `src/documents/extractor.rs`
**Depends on:** Work Chunk 2B
**Context:** Follows the exact pattern of `extract_issue_document()` (lines 85-184) and `extract_discussion_document()` (lines 302-516). The new function extracts a single non-system note into a `DocumentData` struct.
#### Tests to Write First
Add to `src/documents/extractor.rs` test module. Uses `setup_discussion_test_db()` (line 1025) and `insert_note()` (line 1086) helpers that already exist.
```rust
#[test]
fn test_note_document_basic_format() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 42, Some("Auth redesign"), Some("desc"), "opened", Some("alice"),
        Some("https://gitlab.example.com/group/project-one/-/issues/42"));
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 12345, 1, Some("jdefting"), Some("This function is too complex, consider extracting the validation logic."), 1710460800000, false, None, None);
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.source_type, SourceType::Note);
    assert_eq!(doc.source_id, 1);
    assert_eq!(doc.project_id, 1);
    assert_eq!(doc.author_username, Some("jdefting".to_string()));
    assert!(doc.content_text.contains("[[Note]]"));
    assert!(doc.content_text.contains("author: @jdefting"));
    assert!(doc.content_text.contains("This function is too complex"));
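    // Parent context appears as separate keys in the structured header
    assert!(doc.content_text.contains("parent_type: Issue"));
    assert!(doc.content_text.contains("parent_iid: 42"));
    assert!(doc.content_text.contains("parent_title: Auth redesign"));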
    assert!(doc.content_text.contains("group/project-one"));
    assert_eq!(doc.title, Some("Note by @jdefting on Issue #42".to_string()));
    assert!(!doc.is_truncated);
}

#[test]
fn test_note_document_diffnote_with_path() {
    let conn = setup_discussion_test_db();
    insert_mr(&conn, 1, 99, Some("JWT Auth"), Some("desc"), Some("opened"), Some("alice"),
        Some("feat/jwt"), Some("main"), Some("https://gitlab.example.com/group/project-one/-/merge_requests/99"));
    insert_discussion(&conn, 1, "MergeRequest", None, Some(1));
    insert_note(&conn, 1, 54321, 1, Some("jdefting"), Some("This should use a match statement"),
        1710460800000, false, Some("src/old_auth.rs"), Some("src/auth.rs"));
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.paths, vec!["src/auth.rs", "src/old_auth.rs"]);
    assert!(doc.content_text.contains("path: src/auth.rs"));
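    assert!(doc.content_text.contains("parent_type: MergeRequest"));
    assert!(doc.content_text.contains("parent_iid: 99"));
    assert!(doc.content_text.contains("parent_title: JWT Auth"));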
    assert_eq!(doc.title, Some("Note by @jdefting on MR !99".to_string()));
    assert!(doc.url.unwrap().contains("#note_54321"));
}

#[test]
fn test_note_document_inherits_parent_labels() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_label(&conn, 1, "backend");
    insert_label(&conn, 2, "security");
    link_issue_label(&conn, 1, 1);
    link_issue_label(&conn, 1, 2);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("jdefting"), Some("Comment"), 1000, false, None, None);
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.labels, vec!["backend", "security"]);
}

#[test]
fn test_note_document_mr_labels() {
    let conn = setup_discussion_test_db();
    insert_mr(&conn, 1, 10, Some("Test"), None, Some("opened"), None, None, None, None);
    insert_label(&conn, 1, "review");
    link_mr_label(&conn, 1, 1);
    insert_discussion(&conn, 1, "MergeRequest", None, Some(1));
    insert_note(&conn, 1, 100, 1, Some("reviewer"), Some("LGTM"), 1000, false, None, None);
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc.labels, vec!["review"]);
}

#[test]
fn test_note_document_system_note_returns_none() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("bot"), Some("assigned to @alice"), 1000, true, None, None);
    let result = extract_note_document(&conn, 1).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_not_found() {
    let conn = setup_discussion_test_db();
    let result = extract_note_document(&conn, 999).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_orphaned_discussion() {
    // Discussion exists but parent issue was deleted
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 99, 10, Some("Deleted"), None, "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(99), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Hello"), 1000, false, None, None);
    conn.execute("PRAGMA foreign_keys = OFF", []).unwrap();
    conn.execute("DELETE FROM issues WHERE id = 99", []).unwrap();
    conn.execute("PRAGMA foreign_keys = ON", []).unwrap();
    let result = extract_note_document(&conn, 1).unwrap();
    assert!(result.is_none());
}

#[test]
fn test_note_document_hash_deterministic() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some("Comment"), 1000, false, None, None);
    let doc1 = extract_note_document(&conn, 1).unwrap().unwrap();
    let doc2 = extract_note_document(&conn, 1).unwrap().unwrap();
    assert_eq!(doc1.content_hash, doc2.content_hash);
    assert_eq!(doc1.labels_hash, doc2.labels_hash);
    assert_eq!(doc1.paths_hash, doc2.paths_hash);
}

#[test]
fn test_note_document_empty_body() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), Some(""), 1000, false, None, None);
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert!(doc.content_text.contains("--- Body ---"));
}

#[test]
fn test_note_document_null_body() {
    let conn = setup_discussion_test_db();
    insert_issue(&conn, 1, 10, Some("Test"), Some("desc"), "opened", None, None);
    insert_discussion(&conn, 1, "Issue", Some(1), None);
    insert_note(&conn, 1, 100, 1, Some("alice"), None, 1000, false, None, None);
    // Should still produce a document (body is optional in schema)
    let doc = extract_note_document(&conn, 1).unwrap().unwrap();
    assert!(doc.content_text.contains("[[Note]]"));
}
```
#### Implementation
Add `extract_note_document()` to `src/documents/extractor.rs` (after `extract_discussion_document`, around line 516):
```rust
pub fn extract_note_document(
    conn: &Connection,
    note_id: i64,
) -> Result<Option<DocumentData>> {
    // 1. Fetch the note; a missing row means there is nothing to extract
    let note_row = match conn.query_row(
        "SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
                n.is_system, n.created_at, n.updated_at,
                n.position_old_path, n.position_new_path,
                n.position_new_line,
                n.resolvable, n.resolved,
                d.noteable_type, d.issue_id, d.merge_request_id,
                p.path_with_namespace, p.id AS project_id
         FROM notes n
         JOIN discussions d ON n.discussion_id = d.id
         JOIN projects p ON n.project_id = p.id
         WHERE n.id = ?1",
        rusqlite::params![note_id],
        |row| { /* map all fields */ },
    ) {
        Ok(row) => row,
        Err(rusqlite::Error::QueryReturnedNoRows) => return Ok(None),
        // Assumes the crate error type converts from rusqlite::Error
        Err(e) => return Err(e.into()),
    };

    // 2. Skip system notes
    if note_row.is_system { return Ok(None); }

    // 3. Fetch parent entity (Issue or MR), same pattern as extract_discussion_document lines 332-401
    //    Get parent_iid, parent_title, parent_web_url, labels
    //    Return Ok(None) if the parent row is missing (orphaned discussion)
    // 4. Build paths BTreeSet from position_old_path, position_new_path
    // 5. Build URL: parent_web_url + "#note_{gitlab_id}"
    // 6. Format content with structured metadata header:
    //    [[Note]]
    //    source_type: note
    //    note_gitlab_id: {gitlab_id}
    //    project: {path_with_namespace}
    //    parent_type: {Issue|MergeRequest}
    //    parent_iid: {iid}
    //    parent_title: {title}
    //    note_type: {DiffNote|DiscussionNote|Comment}
    //    author: @{author}
    //    created_at: {iso8601}
    //    resolved: {true|false} (only if resolvable)
    //    path: {position_new_path}:{position_new_line} (only if DiffNote)
    //    labels: {comma-separated}
    //    url: {url}
    //
    //    --- Body ---
    //
    //    {body}  (empty when the note body is NULL)
    // 7. Title: "Note by @{author} on {parent_type_prefix}",
    //    where the prefix is "Issue #{iid}" or "MR !{iid}"
    // 8. Compute hashes, apply truncate_hard_cap, return DocumentData
}
```
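Steps 4, 5, and 7 are small pure functions and can be sketched without a database. The helper names below (`note_paths`, `note_url`, `note_title`) are hypothetical and exist only for illustration; the expected values come from the tests above.

```rust
use std::collections::BTreeSet;

/// Step 4: collect old and new diff paths, deduplicated and sorted.
/// (Hypothetical helper; the real extractor may build the set inline.)
pub fn note_paths(old_path: Option<&str>, new_path: Option<&str>) -> Vec<String> {
    let set: BTreeSet<&str> = old_path.into_iter().chain(new_path).collect();
    set.into_iter().map(str::to_string).collect()
}

/// Step 5: anchor the parent URL at the note; `None` when the parent has no URL.
pub fn note_url(parent_web_url: Option<&str>, gitlab_id: i64) -> Option<String> {
    parent_web_url.map(|base| format!("{base}#note_{gitlab_id}"))
}

/// Step 7: "Issue #42" for issues, "MR !99" for merge requests.
pub fn note_title(author: &str, noteable_type: &str, parent_iid: i64) -> String {
    let parent = match noteable_type {
        "MergeRequest" => format!("MR !{parent_iid}"),
        _ => format!("Issue #{parent_iid}"),
    };
    format!("Note by @{author} on {parent}")
}
```

Using a `BTreeSet` gives a stable path order, so the paths hash does not change between runs when the underlying data is the same.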
The content format is a structured key-value header followed by the raw note body. This deliberately differs from discussion documents: it is optimized for the semantics of an individual note rather than for thread context.
**Structured header rationale:** Placing author, project, and parent reference in the header lets the embedding model and FTS index those fields alongside the free-text body, which improves search precision for queries like "jdefting's comments on authentication issues."
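One way to render the header described in step 6 is sketched below. `NoteContext` and `format_note_content` are hypothetical names, the `created_at` value is assumed to be pre-formatted as ISO 8601, and the `", "` label separator and the always-present `labels:` line are assumptions rather than decisions made by this plan.

```rust
/// Hypothetical view of one note plus its parent context;
/// field names mirror the header keys listed in step 6.
pub struct NoteContext<'a> {
    pub gitlab_id: i64,
    pub project: &'a str,
    pub parent_type: &'a str, // "Issue" or "MergeRequest"
    pub parent_iid: i64,
    pub parent_title: &'a str,
    pub note_type: &'a str,
    pub author: &'a str,
    pub created_at: &'a str,                  // already formatted as ISO 8601
    pub resolved: Option<bool>,               // None when the note is not resolvable
    pub path: Option<(&'a str, Option<i64>)>, // (new path, new line) for DiffNotes
    pub labels: &'a [&'a str],
    pub url: &'a str,
    pub body: &'a str,
}

/// Renders the key-value header followed by the raw body.
pub fn format_note_content(n: &NoteContext<'_>) -> String {
    let mut lines = vec![
        "[[Note]]".to_string(),
        "source_type: note".to_string(),
        format!("note_gitlab_id: {}", n.gitlab_id),
        format!("project: {}", n.project),
        format!("parent_type: {}", n.parent_type),
        format!("parent_iid: {}", n.parent_iid),
        format!("parent_title: {}", n.parent_title),
        format!("note_type: {}", n.note_type),
        format!("author: @{}", n.author),
        format!("created_at: {}", n.created_at),
    ];
    // Optional keys are omitted entirely rather than emitted empty.
    if let Some(resolved) = n.resolved {
        lines.push(format!("resolved: {resolved}"));
    }
    if let Some((path, line)) = n.path {
        lines.push(match line {
            Some(l) => format!("path: {path}:{l}"),
            None => format!("path: {path}"),
        });
    }
    lines.push(format!("labels: {}", n.labels.join(", ")));
    lines.push(format!("url: {}", n.url));
    lines.push(String::new());
    lines.push("--- Body ---".to_string());
    lines.push(String::new());
    lines.push(n.body.to_string());
    lines.join("\n")
}
```

Keeping one key per line means a change to a single field (for example the parent title) alters only that line, which makes content-hash differences easy to reason about.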
---
### Work Chunk 2D: Regenerator & Dirty Tracking Integration
**Files:** `src/documents/regenerator.rs`, `src/ingestion/dirty_tracker.rs`, `src/ingestion/discussions.rs`, `src/ingestion/mr_discussions.rs`
**Depends on:** Work Chunks 0A, 2B, 2C
#### Tests to Write First
**In `src/documents/regenerator.rs` tests:**
```rust
#[test]
fn test_regenerate_note_document() {
    let conn = setup_db();
    // Add discussions + notes tables to setup_db() (or use a richer setup)
    // Insert: project, issue, discussion, non-system note
    // mark_dirty(SourceType::Note, note_id)
    // regenerate_dirty_documents()
    // Assert: document created with source_type = 'note'
    // Assert: document content contains note body
}

#[test]
fn test_regenerate_note_system_note_deletes() {
    // Insert system note, mark dirty
    // regenerate_dirty_documents()
    // Assert: no document created (extract returns None -> delete path)
}

#[test]
fn test_regenerate_note_unchanged() {
    // Create note, regenerate, mark dirty again, regenerate
    // Assert: second run returns unchanged = 1
}

#[test]
fn test_note_ingestion_idempotent_across_two_syncs() {
    // Setup: project, issue, discussion, 3 non-system notes
    // Run ingestion once -> verify 3 dirty notes queued
    // Regenerate documents -> verify 3 note documents created
    // Run ingestion again with identical data
    // Assert: no new dirty entries (changed_semantics = false for all)
}
```
**In `src/ingestion/dirty_tracker.rs` tests:**
```rust
#[test]
fn test_mark_dirty_note_type() {
    // Update the test DB setup to include 'note' in CHECK constraint
    let conn = setup_db(); // This needs the new CHECK
    mark_dirty(&conn, SourceType::Note, 1).unwrap();
    let results = get_dirty_sources(&conn).unwrap();
    assert_eq!(results.len(), 1);
    assert_eq!(results[0].0, SourceType::Note);
}
```
#### Implementation
**1. Update `regenerate_one()` in `src/documents/regenerator.rs`** (line 90):
```rust
SourceType::Note => extract_note_document(conn, source_id)?,
```
And add the import at line 8:
```rust
use crate::documents::{
    DocumentData, SourceType, extract_discussion_document, extract_issue_document,
    extract_mr_document, extract_note_document,
};
```
**2. Add change-aware dirty marking in `src/ingestion/discussions.rs`** (in the new upsert loop from Phase 0):
```rust
for note in &normalized_notes {
    // `note` is already a reference here; passing `&note` would trip clippy's needless_borrow
    let outcome = upsert_note_for_issue(&tx, local_discussion_id, note, last_seen_at)?;
    if !note.is_system && outcome.changed_semantics {
        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
    }
}
sweep_stale_issue_notes(&tx, local_discussion_id, last_seen_at)?;
**3. Same change-aware dirty marking in `src/ingestion/mr_discussions.rs`** (update the existing upsert loop):
```rust
let outcome = upsert_note(&tx, local_discussion_id, &note, last_seen_at, None)?;
if !note.is_system && outcome.changed_semantics {
    dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
}
```
**4. Update `dirty_tracker.rs` test `setup_db()`** to include `'note'` in the CHECK constraint (line 134).
**5. Update `regenerator.rs` test `setup_db()`** to include the discussions + notes tables so note-type regeneration tests can run.
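The regeneration outcomes exercised by the tests in this chunk (created, unchanged, deleted) can be modelled as a comparison of the three document hashes. The following is a hypothetical, dependency-free model: `DocHashes`, `RegenOutcome`, and `classify` are illustrative names and not functions from the codebase, and the real regenerator may track more states.

```rust
/// The three hashes stored per document (see test_note_document_hash_deterministic).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DocHashes {
    pub content_hash: String,
    pub labels_hash: String,
    pub paths_hash: String,
}

/// Hypothetical outcome of regenerating one dirty source.
#[derive(Debug, PartialEq, Eq)]
pub enum RegenOutcome {
    Created,
    Updated,
    Unchanged,
    Deleted,
    Nothing,
}

/// `existing` is the stored document (if any); `extracted` is what the
/// extractor returned, where `None` covers system notes and missing rows.
pub fn classify(existing: Option<&DocHashes>, extracted: Option<&DocHashes>) -> RegenOutcome {
    match (existing, extracted) {
        (None, None) => RegenOutcome::Nothing,
        (Some(_), None) => RegenOutcome::Deleted,
        (None, Some(_)) => RegenOutcome::Created,
        (Some(old), Some(new)) if old == new => RegenOutcome::Unchanged,
        (Some(_), Some(_)) => RegenOutcome::Updated,
    }
}
```

Under this model, marking an unmodified note dirty a second time costs one extraction and one hash comparison but no document write.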
---
### Work Chunk 2E: Generate-Docs Full Rebuild Support
**Files:** Search for where `generate-docs --full` seeds the dirty queue
**Depends on:** Work Chunk 2D
**Context:** When `lore generate-docs --full` runs, it seeds ALL issues, MRs, and discussions into the dirty queue. Notes must be seeded too.
#### Tests to Write First
```rust
#[test]
fn test_full_seed_includes_notes() {
    // Setup DB with project, issue, discussion, 3 non-system notes, 1 system note
    // Call seed_all_dirty(conn) or whatever the full-rebuild seeder is named
    // Assert: dirty_sources contains 3 entries with source_type = 'note'
    // Assert: system note is NOT in dirty_sources
}

#[test]
fn test_note_document_count_stable_after_second_generate_docs_full() {
    // Setup DB with project, issue, discussion, 5 non-system notes
    // Run generate-docs --full equivalent (seed + regenerate)
    // Record document count
    // Run generate-docs --full again
    // Assert: document count unchanged (idempotent)
    // Assert: dirty queue is empty after second run
}
```
#### Implementation
Find the function that seeds dirty_sources for `--full` mode (likely in the generate-docs handler or a dedicated seeder function). Add:
```sql
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, ?1
FROM notes n
WHERE n.is_system = 0
ON CONFLICT(source_type, source_id) DO UPDATE SET
queued_at = excluded.queued_at,
attempt_count = 0,
last_attempt_at = NULL,
last_error = NULL,
next_attempt_at = NULL
```
---
### Work Chunk 2F: Search CLI `--type note` Support
**Files:** `src/cli/mod.rs`, `src/cli/commands/search.rs` (display code)
**Depends on:** Work Chunks 2A-2D (documents must exist to be searched)
#### Tests to Write First
Integration/smoke test:
```rust
#[test]
fn test_search_source_type_note_filter() {
    // This is essentially testing that SourceType::Note flows through
    // the existing search pipeline correctly. Since the search filter
    // code is generic (filters.rs:70-73), the main test is that
    // SourceType::parse("note") works, which is already covered in 2B.
    // Add a smoke test that the CLI accepts --type note.
}
```
#### Implementation
1. Update `SearchArgs.source_type` value_parser in `src/cli/mod.rs` (line 542):
```rust
#[arg(long = "type", value_name = "TYPE",
    value_parser = ["issue", "mr", "discussion", "note", "notes"],
    help_heading = "Filters")]
pub source_type: Option<String>,
```
2. Update the search results display to show `"Note"` prefix for note-type results (check `print_search_results` in `src/cli/commands/search.rs`).
---
### Work Chunk 2G: Parent Metadata Change Propagation
**Files:** `src/ingestion/orchestrator.rs` (or wherever parent entity updates trigger dirty marking), `src/documents/regenerator.rs`
**Depends on:** Work Chunk 2D
**Context:** Note documents inherit metadata from their parent issue/MR — specifically labels and title. When a parent's title or labels change, the note documents derived from that parent become stale. The existing ingestion pipeline already marks discussion documents dirty when parent metadata changes. Note documents need the same treatment.
#### Problem
If issue #42's title changes from "Auth redesign" to "Auth overhaul", every note document under that issue keeps the stale title in its header until its content is regenerated. Label changes on the parent go stale in the same way: the note document's `labels` field and `label_names` text keep the old values until regeneration.
#### Tests to Write First
```rust
#[test]
fn test_parent_title_change_marks_notes_dirty() {
    // Setup: project, issue, discussion, 2 non-system notes
    // Generate note documents (verify they exist)
    // Change the issue title
    // Trigger the parent-change propagation
    // Assert: both note documents are in dirty_sources
}

#[test]
fn test_parent_label_change_marks_notes_dirty() {
    // Setup: project, issue with label "backend", discussion, note
    // Generate note document (verify labels = ["backend"])
    // Add label "security" to the issue
    // Trigger the parent-change propagation
    // Assert: note document is in dirty_sources
    // Regenerate and verify labels = ["backend", "security"]
}
```
#### Implementation
Find where the ingestion pipeline detects parent entity changes and marks discussion documents dirty. Add the same logic for note documents:
```sql
-- When an issue's title or labels change, mark all its non-system notes dirty
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, ?1
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE d.issue_id = ?2 AND n.is_system = 0
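-- Reset the full retry state (as the --full seeder does) so an entry in backoff is retried promptly
ON CONFLICT(source_type, source_id) DO UPDATE SET
    queued_at = excluded.queued_at,
    attempt_count = 0,
    last_attempt_at = NULL,
    last_error = NULL,
    next_attempt_at = NULL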
Same pattern for MR parent changes. The exact integration point depends on how the existing discussion dirty-marking works — it should be adjacent to that code.
**Note on deletion handling:** Note deletion is handled by two complementary mechanisms:
1. **Immediate propagation (Work Chunk 0B):** When sweep deletes stale notes, documents and dirty_sources entries are cleaned up in the same transaction. No stale search results.
2. **Eventual consistency (generate-docs --full):** For edge cases where a note was deleted outside the normal sweep path, the full rebuild catches orphaned documents since the note row no longer exists and `extract_note_document()` returns `None` -> document deleted.
No additional deletion logic is needed beyond Work Chunk 0B + the existing regenerator orphan cleanup.
---
### Work Chunk 2H: Backfill Existing Notes After Upgrade (Migration 024)
**Files:** `migrations/024_note_dirty_backfill.sql`, `src/core/db.rs`
**Depends on:** Work Chunk 2A (migration 023 must exist so `dirty_sources` accepts `source_type='note'`)
**Context:** When a user upgrades to a version with note document support, existing notes in the database have no corresponding documents. Without a backfill, only notes that change after the upgrade would get documents — historical notes remain invisible to search. This migration seeds all existing non-system notes into the dirty queue so the next `generate-docs` run creates documents for them.
#### Tests to Write First
```rust
#[test]
fn test_migration_024_backfills_existing_notes() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run migrations up through 023
    // Insert: project, issue, discussion, 5 non-system notes, 2 system notes
    // Run migration 024
    // Assert: dirty_sources contains 5 entries with source_type = 'note'
    // Assert: system notes are NOT in dirty_sources
}

#[test]
fn test_migration_024_idempotent_with_existing_documents() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run all migrations including 024
    // Insert: project, issue, discussion, 3 non-system notes
    // Create note documents for 2 of 3 notes (simulate partial state)
    // Re-run the backfill SQL manually
    // Assert: only the 1 note without a document is in dirty_sources
    // Assert: ON CONFLICT DO NOTHING prevents duplicates
}

#[test]
fn test_migration_024_skips_notes_already_in_dirty_queue() {
    let conn = create_connection(Path::new(":memory:")).unwrap();
    // Run all migrations
    // Insert note and manually add to dirty_sources
    // Re-run backfill SQL
    // Assert: no duplicate entries (ON CONFLICT DO NOTHING)
}
```
#### Implementation
**Migration SQL** (`migrations/024_note_dirty_backfill.sql`):
```sql
-- Backfill: seed all existing non-system notes into the dirty queue
-- so the next generate-docs run creates documents for them.
-- Uses LEFT JOIN to skip notes that already have documents (idempotent).
-- ON CONFLICT DO NOTHING handles notes already in the dirty queue.
INSERT INTO dirty_sources (source_type, source_id, queued_at)
SELECT 'note', n.id, CAST(strftime('%s', 'now') AS INTEGER) * 1000
FROM notes n
LEFT JOIN documents d
ON d.source_type = 'note' AND d.source_id = n.id
WHERE n.is_system = 0 AND d.id IS NULL
ON CONFLICT(source_type, source_id) DO NOTHING;
```
**Register in `src/core/db.rs`:**
Add to the `MIGRATIONS` array (after migration 023):
```rust
(
    "024",
    include_str!("../../migrations/024_note_dirty_backfill.sql"),
),
```
**Note:** This is a data-only migration — no schema changes. It's safe to run on empty databases (no notes = no-op). On databases with existing notes, it queues them for document generation on the next `lore generate-docs` or `lore sync` run.
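The selection logic of the backfill can be modelled in plain Rust to make the three filters explicit. This is an in-memory illustration only, not code for the migration: `NoteRow` and `backfill_dirty_notes` are hypothetical names, and sets of note ids stand in for the `documents` and `dirty_sources` tables.

```rust
use std::collections::BTreeSet;

/// Minimal stand-in for a row of the `notes` table.
pub struct NoteRow {
    pub id: i64,
    pub is_system: bool,
}

/// Mirrors the migration SQL and returns the note ids newly queued by this run.
pub fn backfill_dirty_notes(
    notes: &[NoteRow],
    documented_note_ids: &BTreeSet<i64>,
    dirty_queue: &mut BTreeSet<i64>,
) -> Vec<i64> {
    notes
        .iter()
        // WHERE n.is_system = 0
        .filter(|n| !n.is_system)
        // LEFT JOIN documents ... WHERE d.id IS NULL
        .filter(|n| !documented_note_ids.contains(&n.id))
        // ON CONFLICT DO NOTHING: insert() returns false for ids already queued
        .filter(|n| dirty_queue.insert(n.id))
        .map(|n| n.id)
        .collect()
}
```

Running the function a second time against the same state queues nothing new, which is the idempotence property the tests above check.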
---
### Work Chunk 2I: Batch Parent Metadata Cache for Note Regeneration
**Files:** `src/documents/regenerator.rs`, `src/documents/extractor.rs`
**Depends on:** Work Chunk 2C (extractor function must exist)
**Context:** The `extract_note_document()` function fetches parent entity metadata (issue/MR title, labels, project path) via individual SQL queries per note. During the initial backfill of ~8,000 existing notes, this creates N+1 query amplification: each note triggers its own parent metadata lookup, even though many notes share the same parent entity. For example, 50 notes on the same MR would execute 50 identical parent metadata queries.
This is a performance optimization for batch regeneration, not a correctness change. Individual note regeneration (dirty tracking during incremental sync) is unaffected — the N+1 cost is negligible for the typical 1-10 dirty notes per sync.
#### Tests to Write First
```rust
#[test]
fn test_note_regeneration_batch_uses_cache() {
    // Setup: project, issue with 10 non-system notes
    // Mark all 10 as dirty
    // Run regenerate_dirty_documents()
    // Assert: all 10 documents created correctly
    // Assert: parent metadata query count == 1 (not 10)
    // (Use a query counter or verify via cache hit metrics)
}

#[test]
fn test_note_regeneration_cache_consistent_with_direct_extraction() {
    // Setup: project, issue with labels, discussion, 3 notes
    // Extract note document directly (no cache)
    // Extract via cached batch path
    // Assert: content_hash is identical for both paths
    // Assert: labels_hash is identical for both paths
}

#[test]
fn test_note_regeneration_cache_invalidates_across_parents() {
    // Setup: 2 issues, each with notes
    // Regenerate notes from both issues in one batch
    // Assert: each issue's notes get correct parent metadata
    // (cache keyed by (noteable_type, parent_id), not globally shared)
}
```
#### Implementation
**1. Add `ParentMetadataCache` struct** in `src/documents/extractor.rs`:
```rust
use std::collections::HashMap;

/// Cache for parent entity metadata during batch note document extraction.
/// Keyed by (noteable_type, parent_local_id) to avoid repeated lookups
/// when multiple notes share the same parent issue/MR.
// Default is derived so clippy's new_without_default stays quiet under -D warnings
#[derive(Default)]
pub struct ParentMetadataCache {
    cache: HashMap<(String, i64), ParentMetadata>,
}

pub struct ParentMetadata {
    pub iid: i64,
    pub title: Option<String>,
    pub web_url: Option<String>,
    pub labels: Vec<String>,
    pub project_path: String,
}

impl ParentMetadataCache {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn get_or_fetch(
        &mut self,
        conn: &Connection,
        noteable_type: &str,
        parent_id: i64,
    ) -> Result<&ParentMetadata> {
        // HashMap entry API: fetch from DB on miss, return cached on hit
    }
}
```
**2. Add `extract_note_document_cached()` variant** that accepts `&mut ParentMetadataCache` and uses it instead of inline parent metadata queries. The uncached `extract_note_document()` remains for single-note regeneration.
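Because the fetch can fail, `or_insert_with` is not enough on its own; matching on the `Entry` lets the error propagate without caching anything. The sketch below shows that pattern with the database lookup replaced by a caller-supplied closure so that it compiles without `rusqlite`. The closure parameter and the generic error type `E` are assumptions made for illustration and are not the signature proposed above.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct ParentMetadata {
    pub iid: i64,
    pub title: Option<String>,
    pub web_url: Option<String>,
    pub labels: Vec<String>,
    pub project_path: String,
}

#[derive(Default)]
pub struct ParentMetadataCache {
    cache: HashMap<(String, i64), ParentMetadata>,
}

impl ParentMetadataCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the cached metadata, calling `fetch` only on a miss.
    /// `fetch` stands in for the SQL lookup (assumption for this sketch).
    pub fn get_or_fetch<E>(
        &mut self,
        noteable_type: &str,
        parent_id: i64,
        fetch: impl FnOnce() -> Result<ParentMetadata, E>,
    ) -> Result<&ParentMetadata, E> {
        // The key is allocated on every call, including hits; that cost is
        // small next to a database round trip.
        let key = (noteable_type.to_string(), parent_id);
        let value: &ParentMetadata = match self.cache.entry(key) {
            Entry::Occupied(hit) => hit.into_mut(),
            // On error `?` returns early and the slot stays empty.
            Entry::Vacant(slot) => slot.insert(fetch()?),
        };
        Ok(value)
    }
}
```

A failed fetch leaves the cache untouched, so a later retry for the same parent runs the lookup again instead of returning a poisoned entry.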
**3. Update batch regeneration loop** in `src/documents/regenerator.rs`:
```rust
// In the regeneration loop, when processing a batch of dirty sources:
let mut parent_cache = ParentMetadataCache::new();
for (source_type, source_id) in dirty_batch {
    match source_type {
        // Sketch only: the extracted note document still has to go through the same
        // hash-compare and write path that regenerate_one() applies to other types.
        SourceType::Note => extract_note_document_cached(conn, source_id, &mut parent_cache)?,
        // Other source types use existing extraction functions (no cache needed)
        _ => regenerate_one(conn, source_type, source_id)?,
    };
}
```
**Scope limit:** The cache is created fresh per regeneration batch and discarded after. No cross-batch persistence, no invalidation complexity. The cache is purely an optimization for batch processing where many notes share parents.
---
## Verification Checklist
After all chunks are complete, run the full quality gate:
```bash
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
```
Then functional smoke tests:
```bash
# Phase 0 verification
# Sync twice and verify note IDs are stable:
lore sync
lore -J notes --limit 5 # Record gitlab_ids and local ids
lore sync
lore -J notes --limit 5 # Verify same local ids for same gitlab_ids
# Phase 1 verification
lore -J notes --author jdefting --since 365d --limit 5
lore -J notes --note-type DiffNote --path src/ --limit 10
lore notes --for-mr 456 -p group/repo
lore notes --for-issue 42 -p group/repo # Verify project-scoping works
lore notes --since 180d --until 90d # Bounded window (180 days ago to 90 days ago)
lore notes --resolution unresolved # Tri-state resolution filter
lore notes --contains "unwrap" --note-type DiffNote # Body substring + type filter
lore notes --author jdefting --format jsonl | wc -l # JSONL streaming
lore notes --format csv > /tmp/notes.csv && head -1 /tmp/notes.csv # CSV header
lore -J notes --gitlab-note-id 12345 # Precision filter: exact GitLab note
lore -J notes --discussion-id 42 # Precision filter: all notes in thread
# Phase 2 verification
lore sync # Should generate note documents
lore -J stats # Should show note document count in source_type breakdown
lore -J search "code complexity" --type note --author jdefting
lore -J search "error handling" --type note --since 180d
```
Idempotence checks:
```bash
# Verify generate-docs --full is idempotent
lore generate-docs --full
lore -J stats > /tmp/stats1.json
lore generate-docs --full
lore -J stats > /tmp/stats2.json
diff /tmp/stats1.json /tmp/stats2.json # Should be identical (modulo timing metadata)
```
Deletion propagation checks:
```bash
# Verify that deleted notes don't leave stale documents
# (Manual test: delete a note on GitLab, sync, verify document is gone)
lore sync
lore -J search "specific phrase from deleted note" --type note
# Should return no results
```
Performance and query plan verification:
```bash
# Verify indexes are used for common query patterns
# Run EXPLAIN QUERY PLAN for the hot paths:
sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id, n.gitlab_id, n.author_username, n.body, n.note_type,
n.is_system, n.created_at, n.updated_at,
n.position_new_path, n.position_new_line,
n.position_old_path, n.position_old_line,
n.resolvable, n.resolved, n.resolved_by,
d.noteable_type,
COALESCE(i.iid, m.iid) AS parent_iid,
COALESCE(i.title, m.title) AS parent_title,
p.path_with_namespace
FROM notes n
JOIN discussions d ON n.discussion_id = d.id
JOIN projects p ON n.project_id = p.id
LEFT JOIN issues i ON d.issue_id = i.id
LEFT JOIN merge_requests m ON d.merge_request_id = m.id
WHERE n.is_system = 0 AND n.author_username = 'jdefting' COLLATE NOCASE AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_user_created
sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE n.is_system = 0 AND n.project_id = 1 AND n.created_at >= 1704067200000
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_notes_project_created
sqlite3 ~/.local/share/lore/lore.db "EXPLAIN QUERY PLAN
SELECT n.id FROM notes n
JOIN discussions d ON n.discussion_id = d.id
WHERE n.is_system = 0 AND d.issue_id = (SELECT id FROM issues WHERE iid = 42 AND project_id = 1)
ORDER BY n.created_at DESC, n.id DESC
LIMIT 50;"
# Should show SEARCH using idx_discussions_issue_id for the join
```
Operational checks:
- `lore -J stats` output includes `documents.notes` count (the stats command queries by hardcoded source_type strings — verify `'note'` is added)
- Verify `lore -J count notes` still reports user vs system breakdown correctly after the changes
- After a full `lore generate-docs --full`, verify note document count approximately matches non-system note count from `lore count notes`
---
## Work Chunk Dependency Graph
```
0A (stable note identity) ──┬───────────────────────────────────────────────┐
                            │                                               │
                            ├── 0B (deletion propagation) ◄─────────────────┤── 2A (migration 023, + cleanup triggers)
                            │                                               │
                            ├── 0C (sweep safety guard)                     │
                            │                                               │
                            ├── 0D (author_id capture)                      │
                            │                                               │
1A (data types + query) ──┐ │                                               │
                          ├── 1B (CLI args + wiring) ──┐                    │
                          ├── 1C (output formatting)   ├── 1D (robot-docs)  │
1E (query index, mig 022  │                            │                    │
    + author_id column) ──┘                            │                    │
                                                       │                    │
2A (migration 023) ───────┐                            │                    │
                          ├── 2B (SourceType enum)     │                    │
                          │    │                       │                    │
                          │    ├── 2C (extractor fn)   │                    │
                          │    │    │                  │                    │
                          │    │    ├── 2D (regenerator + dirty tracking) ◄─┘
                          │    │    │    │
                          │    │    │    ├── 2E (generate-docs --full)
                          │    │    │    │
                          │    │    │    ├── 2F (search --type note)
                          │    │    │    │
                          │    │    │    ├── 2G (parent change propagation)
                          │    │    │    │
                          │    │    │    ├── 2H (backfill migration 024)
                          │    │    │    │
                          │    │    ├── 2I (batch parent metadata cache)
```
**Parallelizable groups:**
- 0A, 1A, 1E, and 2A can all run simultaneously (no code overlap)
- 0C and 0D can run immediately after 0A (both modify upsert functions from 0A)
- 1C and 2B can run simultaneously
- 2E, 2F, 2G, 2H, and 2I can run simultaneously after 2D (2I only needs 2C, and 2H's only stated prerequisite is 2A, so both can start earlier)
- 0B depends on both 0A and 2A (needs sweep functions from 0A and documents table accepting 'note' from 2A)
**Critical path:** 0A -> 0C -> 2D -> 2G (Phase 0 must land before dirty tracking integrates with upsert outcomes)
**Secondary critical path:** 2A -> 2B -> 2C -> 2D (document pipeline chain)
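The ordering constraints can be checked mechanically by grouping chunks into waves, where each wave contains every chunk whose prerequisites are already done. The sketch below covers only the Phase 0 and Phase 2 chunks; the edges are transcribed from the "Depends on" lines and the graph above, and the function names are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Chunk -> prerequisites, transcribed from the plan (Phase 1 chunks omitted).
pub fn chunk_dependencies() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("0A", vec![]),
        ("0B", vec!["0A", "2A"]),
        ("0C", vec!["0A"]),
        ("0D", vec!["0A"]),
        ("2A", vec![]),
        ("2B", vec!["2A"]),
        ("2C", vec!["2B"]),
        ("2D", vec!["0A", "2B", "2C"]),
        ("2E", vec!["2D"]),
        ("2F", vec!["2A", "2B", "2C", "2D"]),
        ("2G", vec!["2D"]),
        ("2H", vec!["2A"]),
        ("2I", vec!["2C"]),
    ])
}

/// Groups chunks into waves that can run in parallel.
/// Returns `None` if there is a cycle or an unknown prerequisite.
pub fn execution_waves(
    deps: &BTreeMap<&'static str, Vec<&'static str>>,
) -> Option<Vec<Vec<&'static str>>> {
    let mut done: BTreeSet<&'static str> = BTreeSet::new();
    let mut waves = Vec::new();
    while done.len() < deps.len() {
        // A chunk is ready when it is not done yet and all prerequisites are.
        let ready: Vec<&'static str> = deps
            .iter()
            .filter(|(chunk, prereqs)| {
                !done.contains(*chunk) && prereqs.iter().all(|p| done.contains(p))
            })
            .map(|(chunk, _)| *chunk)
            .collect();
        if ready.is_empty() {
            return None;
        }
        done.extend(ready.iter().copied());
        waves.push(ready);
    }
    Some(waves)
}
```

With these edges the chunks fall into five waves, and 2D cannot start before the fourth, which matches the secondary critical path 2A -> 2B -> 2C -> 2D.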
---
## Estimated Document Volume Impact
| Entity | Typical Count | Documents Before | Documents After |
|--------|--------------|-----------------|-----------------|
| Issues | 500 | 500 | 500 |
| MRs | 300 | 300 | 300 |
| Discussions | 2,000 | 2,000 | 2,000 |
| Notes (non-system) | ~8,000 | 0 | **~8,000** (new) |
| **Total** | | **2,800** | **10,800** |
FTS5 handles this comfortably. Embedding generation time scales linearly with document count, so going from 2,800 to 10,800 documents is roughly a 3.9x increase. The three-hash dedup (`content_hash`, `labels_hash`, `paths_hash`) means incremental syncs remain fast. With Phase 0's change-aware dirty marking, only genuinely modified notes trigger regeneration, so a typical incremental sync will dirty a small fraction of the 8k total.
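The totals in the table follow from simple arithmetic, shown here as a small helper for re-running the estimate with a project's own counts. `document_growth` is a hypothetical name; the inputs below are the typical counts from the table.

```rust
/// Returns (documents before, documents after, growth factor) when
/// `added_notes` note documents are added to the existing document counts.
pub fn document_growth(existing: &[u64], added_notes: u64) -> (u64, u64, f64) {
    let before: u64 = existing.iter().sum();
    let after = before + added_notes;
    (before, after, after as f64 / before as f64)
}
```

For the typical counts, `document_growth(&[500, 300, 2_000], 8_000)` yields 2,800 documents before and 10,800 after, a factor of about 3.86.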
---
## Rejected Recommendations
These recommendations were proposed during review and deliberately rejected. Documenting here to prevent re-proposal.
- **Feature flag gating / gated rollout** — rejected because this is a single-user CLI tool in early development with no external users. Adding runtime feature gates (`feature.notes_cli`, `feature.note_documents`) for a feature we're building from scratch adds complexity with no benefit. Both phases ship together; there's no "blast radius" to manage.
- **Keyset pagination / cursor support** — rejected because no existing list command (`lore issues`, `lore mrs`) has pagination. Adding it just for `notes` would be inconsistent. The year-long analysis use case works fine with `--limit 10000`. If pagination becomes needed across all list commands, that's a separate horizontal feature.
- **Path filtering upgrade (`--path-mode exact|prefix|glob`, `--match-old-path`)** — rejected because the trailing-slash prefix convention is already established across the codebase (issues/MRs use the same pattern). Adding glob mode and old-path matching adds multiple CLI flags for a niche use case. Can be added later if users request it.
- **Embedding policy knobs (`documents.note_embeddings.min_chars`, `documents.note_embeddings.enabled`, prioritize unresolved DiffNotes)** — rejected because the embedding pipeline already handles volume scaling. Adding per-source-type enable flags and minimum character thresholds is premature optimization. Short notes (e.g., "LGTM", "nit: use `expect()` here") are still semantically valuable for reviewer profiling. The existing embedding batch system handles the volume.
- **Structured reviewer profiling command (`lore notes profile --author <user>`)** — rejected because this is explicitly a non-goal in the PRD. The reviewer profile report is a downstream consumer of the infrastructure we're building. Adding it here is scope creep. It belongs in a separate PRD after this infrastructure lands.
- **Operational SLOs / queue lag metrics** — rejected because this is a local CLI tool, not a service. "Oldest dirty note age" and "retry backlog size" are service-oriented metrics that don't apply. The existing `lore stats` and `lore -J count` commands provide sufficient observability. If the dirty queue becomes problematic, we add diagnostics then.
- **Replace CHECK constraints with source_types registry table + FK** — rejected because the table rebuild for adding a new source type to a CHECK constraint is a rare, one-time cost (done 4 times across 23 migrations). A registry table adds per-insert FK lookup overhead, complicates the migration (still requires a table rebuild to change from CHECK to FK), and optimizes for a hypothetical future where we frequently add source types. The current CHECK approach is simpler, self-documenting, and sufficient.
- **Unresolved-specific partial index (`idx_notes_unresolved_project_created`)** — rejected because the selectivity is too narrow. Unresolved notes are a small, unpredictable subset. The `idx_notes_project_created` index already covers the project+date scan; adding `WHERE resolvable = 1 AND resolved = 0` provides marginal benefit at the cost of index maintenance overhead. SQLite can filter the small remaining set in-memory efficiently.
- **Previous note excerpt in document content (`previous_note_excerpt`)** — rejected because it adds a query per note extraction (fetch the preceding note in the same discussion), increases document size, creates coupling between note documents (changing one note's body would stale the next note's document), and the semantic benefit is marginal. The parent title and labels provide sufficient context. Full thread context is available via the existing discussion documents.
- **Compact/slim metadata header for note documents** — rejected because the verbose key-value header is intentional. The structured fields (`source_type`, `note_gitlab_id`, `project`, `parent_type`, `parent_iid`, etc.) are what enable precise FTS and embedding search for queries like "jdefting's comments on authentication issues in project-one." The compact format (`@author on Issue#42 in project`) loses machine-parseable structure and reduces search precision. Metadata stored in document columns/labels/paths is not searchable via FTS — only `content_text` is FTS-indexed. The token cost of the header (~50 tokens) is negligible compared to typical note body length.
- **Replace IID filter subqueries with JOIN predicates** — rejected because the subquery approach (`d.issue_id = (SELECT id FROM issues WHERE iid = ? AND project_id = ?)`) is clearer about intent and the performance difference is negligible. The subquery hits a UNIQUE index for a single-row lookup. The JOIN alternative (`i.iid = ? AND i.project_id = ?`) requires the query planner to choose the right join order, and the LEFT JOIN is already present for fetching parent metadata. Adding a WHERE predicate on a LEFT JOINed table introduces subtle correctness risks: rows with no match carry NULLs on that side, so the predicate silently drops them and the LEFT JOIN behaves like an inner join. The subquery is self-contained and correct by construction.
- **Use `notes.gitlab_id` instead of `notes.id` as document `source_id`** (feedback-4, rec #1) — rejected because the entire existing document pipeline uses local row IDs as `source_id` for issues, MRs, and discussions. Switching to `gitlab_id` only for notes would create an inconsistent pattern where note documents use a different identity scheme than every other document type. This inconsistency would complicate the regenerator (which dispatches by `source_type` + `source_id`), the dirty tracker, and the full-rebuild seeder. Phase 0 specifically stabilizes local IDs via upsert, making them reliable for this purpose. If we ever want to move to `gitlab_id` globally, that's a cross-cutting migration affecting all source types — not a per-type decision.
- **`--aggregate` analytics mode for `lore notes`** (feedback-4, rec #5) — rejected because it's scope creep that edges into the explicitly excluded "reviewer profile" non-goal. The raw note output in JSONL/CSV format already supports downstream analysis via `jq`, `awk`, or LLM ingestion. Adding `--aggregate author|note_type|path|resolution` with `--top N` introduces a new query mode, output format, and interaction model. This belongs in a follow-up PRD focused on analytics primitives, not in the per-note search infrastructure PRD.
- **Source-type fairness / weighted scheduling in dirty queue processing** (feedback-4, rec #6) — rejected because the dirty queue is processed by a single-user CLI tool, not a multi-tenant service. The backfill of ~8k notes is a one-time event after upgrade. After the initial backfill, incremental syncs produce proportional dirty counts across source types. Adding weighted bucket scheduling (issue:3, MR:3, discussion:2, note:1) for a CLI that runs `generate-docs` on demand is premature optimization. If queue starvation becomes a real problem, we can add round-robin by source type then — but it hasn't happened with 2,800 documents and won't happen with 10,800.
- **Replace `fetch_complete: bool` with `FetchState` enum (`Complete`/`Partial`/`Failed`) and `run_seen_at` monotonicity checks** (feedback-5, rec #2) — rejected because the boolean captures the one bit of information that matters: did the fetch complete? `FetchState::Failed` is redundant because a failed fetch never reaches the sweep call site, so sweep is not called at all. The monotonicity check on `run_seen_at` adds complexity for a condition that can't occur in practice: `run_seen_at` is generated once per sync run and passed unchanged through all upserts. The boolean is sufficient and self-documenting.
- **Embedding dedup cache keyed by semantic text hash** (feedback-5, rec #5) — rejected because the existing `content_hash` dedup already prevents re-embedding unchanged documents. A semantic-text-only hash that ignores metadata would conflate genuinely different review contexts: two "LGTM" notes from different authors on different MRs are semantically distinct for the reviewer profiling use case (who said it, where, and when matters). The embedding pipeline handles ~8k notes comfortably without dedup optimization.
- **Derived review signal labels (`signal:nit`, `signal:blocking`, `signal:security`)** (feedback-5, rec #6) — rejected because (a) it encroaches on the explicitly excluded reviewer profiling scope, (b) heuristic signal derivation (regex for "nit:", keyword matching for "security") is inherently fragile and would require ongoing maintenance as review vocabulary evolves, and (c) the raw note text already supports downstream LLM-based analysis that produces far more accurate signal classification than static keyword matching. This belongs in the downstream profiling PRD where LLM-based classification can be done properly.
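
To make the `fetch_complete` reasoning in the rejected `FetchState` item above concrete, here is a minimal sketch of sweep gating under upsert+sweep. It is an illustration only, not the actual implementation: `NoteStore`, `sync_notes`, `sweep_stale`, and the `last_seen_at` map are hypothetical names, and an in-memory `HashMap` stands in for the `notes` table. Only `fetch_complete` and `run_seen_at` come from the text above.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the `notes` table.
/// Key: GitLab note ID. Value: the `run_seen_at` stamp of the last upsert
/// (an assumed `last_seen_at` column).
#[derive(Default)]
pub struct NoteStore {
    last_seen_at: HashMap<i64, i64>,
}

impl NoteStore {
    /// Upsert: insert the note or refresh its stamp with this run's value.
    pub fn upsert(&mut self, gitlab_id: i64, run_seen_at: i64) {
        self.last_seen_at.insert(gitlab_id, run_seen_at);
    }

    /// Sweep: remove every note that was not stamped during this run.
    /// Returns the number of notes removed.
    pub fn sweep_stale(&mut self, run_seen_at: i64) -> usize {
        let before = self.last_seen_at.len();
        self.last_seen_at.retain(|_, seen| *seen >= run_seen_at);
        before - self.last_seen_at.len()
    }

    /// Number of notes currently stored.
    pub fn len(&self) -> usize {
        self.last_seen_at.len()
    }
}

/// Upsert everything that was fetched, then sweep only if the fetch completed.
/// A partial or failed fetch leaves unseen notes untouched, which is why a
/// single boolean is enough to gate the destructive step.
pub fn sync_notes(
    store: &mut NoteStore,
    fetched_ids: &[i64],
    fetch_complete: bool,
    run_seen_at: i64,
) -> usize {
    for &id in fetched_ids {
        store.upsert(id, run_seen_at);
    }
    if fetch_complete {
        store.sweep_stale(run_seen_at)
    } else {
        0 // incomplete fetch: skip sweep so unseen notes are not deleted
    }
}
```

In this sketch, a run that fetches only part of a discussion with `fetch_complete = false` removes nothing, while a later complete run removes exactly the notes it did not see. Because `run_seen_at` is passed unchanged to every upsert within a run, the stale comparison needs no separate monotonicity check.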