Below are the strongest revisions I’d make, excluding everything in your `## Rejected Recommendations` list.

1. **Add a Phase 0 for stable note identity before any note-doc generation**

Rationale: the current plan still allows note-document churn, because issue-discussion ingestion is delete/reinsert-based. That makes the local `notes.id` unstable, causing unnecessary dirtying/regeneration and potential stale-doc edge cases. Stabilizing identity first (upsert by GitLab ID, then sweep stale rows) improves correctness and cuts repeated work.

```diff
@@ ## Design
-Two phases, shipped together as one feature:
+Three phases, shipped together as one feature:
+- **Phase 0 (Foundation):** Stable note identity in local DB (upsert + sweep, no delete/reinsert churn)
 - **Phase 1 (Option A):** `lore notes` command — direct SQL query over the `notes` table with rich filtering
 - **Phase 2 (Option B):** Per-note documents — each non-system note becomes its own searchable document in the FTS/embedding pipeline
@@
+## Phase 0: Stable Note Identity
+
+### Work Chunk 0A: Upsert/Sweep for Issue Discussion Notes
+**Files:** `src/ingestion/discussions.rs`, `migrations/022_notes_identity_index.sql`, `src/core/db.rs`
+**Implementation:**
+- Add unique index: `UNIQUE(project_id, gitlab_id)` on `notes`
+- Replace the delete/reinsert issue-note flow with upsert + `last_seen_at` sweep (same durability model as the MR note sweep)
+- Ensure `insert_note`/`upsert_note` returns the stable local row id on both insert and update paths
```

2. **Replace `source_type` CHECK constraints with a registry table + FK in the migration**

Rationale: CHECK constraints on the table force a full table rebuild for every new source type, forever. A `source_types` table with a foreign key keeps DB-level integrity and future extensibility without rebuilding `documents`/`dirty_sources` every time. This is a major architecture-hardening win.

```diff
@@ ### Work Chunk 2A: Schema Migration (023)
-Current migration ... CHECK constraints limiting `source_type` ...
+Current migration ... CHECK constraints limiting `source_type` ...
+Revision: migrate to a `source_types` registry table + FK constraints.
@@
-1. `dirty_sources` — add `'note'` to source_type CHECK
-2. `documents` — add `'note'` to source_type CHECK
+1. Create `source_types(name TEXT PRIMARY KEY)` and seed: `issue, merge_request, discussion, note`
+2. Rebuild `dirty_sources` and `documents` to replace CHECK with `REFERENCES source_types(name)`
+3. Future source-type additions become `INSERT INTO source_types(name) VALUES (?)` (no table rebuild)
@@
+#### Additional integrity tests
+#[test]
+fn test_source_types_registry_contains_note() { ... }
+#[test]
+fn test_documents_source_type_fk_enforced() { ... }
+#[test]
+fn test_dirty_sources_source_type_fk_enforced() { ... }
```

3. **Mark note documents dirty only when note semantics actually changed**

Rationale: the current loops mark every non-system note dirty on every sync. With 8k+ notes this creates avoidable queue pressure and regeneration time. Change-aware dirtying (inserted or changed notes only) gives major performance and stability improvements.

```diff
@@ ### Work Chunk 2D: Regenerator & Dirty Tracking Integration
-for note in notes {
-    let local_note_id = insert_note(&tx, local_discussion_id, &note, None)?;
-    if !note.is_system {
-        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, local_note_id)?;
-    }
-}
+for note in notes {
+    let outcome = upsert_note(&tx, local_discussion_id, &note, None)?;
+    if !note.is_system && outcome.changed_semantics {
+        dirty_tracker::mark_dirty_tx(&tx, SourceType::Note, outcome.local_note_id)?;
+    }
+}
@@
+// changed_semantics should cover: body, note_type, path/line positions, resolvable/resolved/resolved_by, updated_at
```

4. **Expand filters to support real analysis windows and resolution state**

Rationale: reviewer profiling usually needs bounded time windows and both resolved and unresolved views. The current `unresolved: bool` is too narrow and one-sided. Add `--until` and tri-state resolution filtering for better analytical power.

```diff
@@
 pub struct NoteListFilters<'a> {
-    pub since: Option<&'a str>,
+    pub since: Option<&'a str>,
+    pub until: Option<&'a str>,
@@
-    pub unresolved: bool,
+    pub resolution: &'a str, // "any" (default) | "unresolved" | "resolved"
@@
-    pub author: Option<&'a str>,
+    pub author: Option<&'a str>, // case-insensitive match
@@
-    /// Filter by time (7d, 2w, 1m, or YYYY-MM-DD)
+    /// Filter by start time (7d, 2w, 1m, or YYYY-MM-DD)
     pub since: Option<String>,
+    /// Filter by end time (7d, 2w, 1m, or YYYY-MM-DD)
+    #[arg(long, help_heading = "Filters")]
+    pub until: Option<String>,
@@
-    /// Only show unresolved review comments
-    pub unresolved: bool,
+    /// Resolution filter: any, unresolved, resolved
+    #[arg(long, value_parser = ["any", "unresolved", "resolved"], default_value = "any", help_heading = "Filters")]
+    pub resolution: String,
```

5. **Broaden the index strategy to match actual query shapes, not just author queries**

Rationale: `idx_notes_user_created` helps one path, but common usage also includes project+time scans and unresolved filters. Add two more partial composite indexes for these high-selectivity paths.

```diff
@@ ### Work Chunk 1E: Composite Query Index
 CREATE INDEX IF NOT EXISTS idx_notes_user_created
 ON notes(project_id, author_username, created_at DESC, id DESC)
 WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0;
+
+CREATE INDEX IF NOT EXISTS idx_notes_unresolved_project_created
+ON notes(project_id, created_at DESC, id DESC)
+WHERE is_system = 0 AND resolvable = 1 AND resolved = 0;
@@
+#[test]
+fn test_notes_query_plan_uses_project_created_index_for_default_listing() { ... }
+#[test]
+fn test_notes_query_plan_uses_unresolved_index_when_resolution_unresolved() { ... }
```

6. **Improve the per-note document payload with a structured metadata header + minimal thread context**

Rationale: isolated single-note documents can lose meaning. A small structured header plus lightweight context (parent title + one preceding-note excerpt) substantially improves semantic retrieval quality without re-bundling full threads.

```diff
@@ ### Work Chunk 2C: Note Document Extractor
-// 6. Format content:
-// [[Note]] {note_type or "Comment"} on {parent_type_prefix}: {parent_title}
-// Project: {path_with_namespace}
-// URL: {url}
-// Author: @{author}
-// Date: {format_date(created_at)}
-// Labels: {labels_json}
-// File: {position_new_path}:{position_new_line} (if DiffNote)
-//
-// --- Body ---
-//
-// {body}
+// 6. Format content with a machine-readable header:
+// [[Note]]
+// source_type: note
+// note_gitlab_id: {gitlab_id}
+// project: {path_with_namespace}
+// parent_type: {Issue|MergeRequest}
+// parent_iid: {iid}
+// note_type: {DiffNote|DiscussionNote|Comment}
+// author: @{author}
+// created_at: {iso8601}
+// resolved: {true|false}
+// path: {position_new_path}:{position_new_line}
+// url: {url}
+//
+// --- Context ---
+// parent_title: {title}
+// previous_note_excerpt: {optional, max 200 chars}
+//
+// --- Body ---
+// {body}
```

7. **Add first-class export modes for downstream profiling pipelines**

Rationale: this makes the feature immediately more useful (LLM prompts, notebook analysis, external scripts) without adding a profiling command. It stays within your non-goals and increases adoption.

```diff
@@
 pub struct NotesArgs {
+    /// Output format
+    #[arg(long, value_parser = ["table", "json", "jsonl", "csv"], default_value = "table", help_heading = "Output")]
+    pub format: String,
@@
-    if robot_mode {
+    if robot_mode || args.format == "json" || args.format == "jsonl" || args.format == "csv" {
         print_list_notes_json(...)
     } else {
         print_list_notes(&result);
     }
@@ ### Work Chunk 1C: Human & Robot Output Formatting
+Add `print_list_notes_csv()` and `print_list_notes_jsonl()`:
+- CSV columns mirror the `NoteListRowJson` field names
+- JSONL emits one note object per line for streaming pipelines
```

8. **Strengthen verification with idempotence + migration data-preservation checks**

Rationale: this feature touches ingestion, migrations, indexing, and regeneration. Add explicit idempotence and performance checks so regressions surface early.

```diff
@@ ## Verification Checklist
 cargo test
 cargo clippy --all-targets -- -D warnings
 cargo fmt --check
+cargo test test_note_ingestion_idempotent_across_two_syncs
+cargo test test_note_document_count_stable_after_second_generate_docs_full
@@
+lore sync
+lore generate-docs --full
+lore -J stats > /tmp/stats1.json
+lore generate-docs --full
+lore -J stats > /tmp/stats2.json
+# assert note doc count unchanged and dirty queue drains to zero
```

If you want, I can turn this into a fully rewritten PRD v2 draft with these changes merged in-place and renumbered work chunks end-to-end.
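As an appendix, a minimal sketch of the change-aware dirtying decision from item 3: the question reduces to comparing the semantic fields of the stored note against the freshly fetched one. The `NoteSemantics` and `UpsertOutcome` shapes here are illustrative assumptions, not lore's actual types.

```rust
// Hypothetical types — the field set follows the list in item 3's diff
// comment (body, note_type, path/line, resolution state, updated_at),
// not lore's real schema.
#[derive(Debug, Clone, PartialEq)]
pub struct NoteSemantics {
    pub body: String,
    pub note_type: Option<String>,
    pub new_path: Option<String>,
    pub new_line: Option<i64>,
    pub resolvable: bool,
    pub resolved: bool,
    pub resolved_by: Option<String>,
    pub updated_at: String,
}

/// What an upsert could report back so the caller decides whether to dirty.
pub struct UpsertOutcome {
    pub local_note_id: i64,
    pub changed_semantics: bool,
}

/// A fresh insert (no prior row) always counts as changed; otherwise compare
/// the full semantic field set, so an edit, a resolution flip, or a diff
/// reposition all mark the note dirty, while an untouched note does not.
pub fn semantics_changed(old: Option<&NoteSemantics>, new: &NoteSemantics) -> bool {
    match old {
        None => true,
        Some(prev) => prev != new,
    }
}
```

With this split, the sync loop calls `mark_dirty_tx` only when `changed_semantics` is true, so a no-op sync leaves the dirty queue empty.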
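And for item 4's `--since`/`--until` values, which accept either a relative window (`7d`, `2w`, `1m`) or a literal `YYYY-MM-DD` date: a possible std-only parser sketch. The names (`TimeFilter`, `parse_time_filter`) and the 30-day month approximation are assumptions, not lore's actual implementation.

```rust
// Illustrative result of parsing a --since/--until argument.
#[derive(Debug, PartialEq)]
pub enum TimeFilter {
    RelativeDays(u32), // "7d" -> 7, "2w" -> 14, "1m" -> 30 (month approximated)
    Date(String),      // "2024-05-01" passed through for SQL comparison
}

/// Parse a relative spec (Nd / Nw / Nm) or a YYYY-MM-DD date; None on garbage.
pub fn parse_time_filter(raw: &str) -> Option<TimeFilter> {
    let raw = raw.trim();
    if let Some(num) = raw.strip_suffix('d') {
        return num.parse::<u32>().ok().map(TimeFilter::RelativeDays);
    }
    if let Some(num) = raw.strip_suffix('w') {
        return num.parse::<u32>().ok().map(|w| TimeFilter::RelativeDays(w * 7));
    }
    if let Some(num) = raw.strip_suffix('m') {
        // Approximate a month as 30 days; good enough for list filtering.
        return num.parse::<u32>().ok().map(|m| TimeFilter::RelativeDays(m * 30));
    }
    // Minimal YYYY-MM-DD shape check: 10 chars, dashes at positions 4 and 7,
    // digits everywhere else (calendar validity left to the DB layer).
    let b = raw.as_bytes();
    if b.len() == 10
        && b[4] == b'-'
        && b[7] == b'-'
        && raw
            .chars()
            .enumerate()
            .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit())
    {
        return Some(TimeFilter::Date(raw.to_string()));
    }
    None
}
```

Keeping the parser shared between `since` and `until` guarantees the two flags accept exactly the same grammar, which the help text in item 4's diff already implies.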