diff --git a/AGENTS.md b/AGENTS.md index 0e4181b..44adddf 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -33,38 +33,50 @@ The `lore` CLI has a robot mode optimized for AI agent consumption with structur ```bash # Explicit flag -lore --robot list issues +lore --robot issues -n 10 + +# JSON shorthand (-J) +lore -J issues -n 10 # Auto-detection (when stdout is not a TTY) -lore list issues | jq . +lore issues | jq . # Environment variable -LORE_ROBOT=true lore list issues +LORE_ROBOT=1 lore issues ``` ### Robot Mode Commands ```bash # List issues/MRs with JSON output -lore --robot list issues --limit=10 -lore --robot list mrs --state=opened +lore --robot issues -n 10 +lore --robot mrs -s opened + +# Show detailed entity info +lore --robot issues 123 +lore --robot mrs 456 -p group/repo # Count entities lore --robot count issues -lore --robot count discussions --type=mr +lore --robot count discussions --for mr -# Show detailed entity info -lore --robot show issue 123 -lore --robot show mr 456 --project=group/repo +# Search indexed documents +lore --robot search "authentication bug" # Check sync status -lore --robot sync-status +lore --robot status -# Run ingestion (quiet, JSON summary) -lore --robot ingest --type=issues +# Run full sync pipeline +lore --robot sync + +# Run ingestion only +lore --robot ingest issues # Check environment health lore --robot doctor + +# Document and index statistics +lore --robot stats ``` ### Response Format @@ -102,8 +114,8 @@ Errors return structured JSON to stderr: ### Best Practices -- Use `lore --robot` for all agent interactions +- Use `lore --robot` or `lore -J` for all agent interactions - Check exit codes for error handling - Parse JSON errors from stderr -- Use `--limit` to control response size +- Use `-n` / `--limit` to control response size - TTY detection handles piped commands automatically diff --git a/README.md b/README.md index 742182b..c62c40e 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # Gitlore -Local GitLab data 
management with semantic search. Syncs issues, MRs, discussions, and notes from GitLab to a local SQLite database for fast, offline-capable querying and filtering. +Local GitLab data management with semantic search. Syncs issues, MRs, discussions, and notes from GitLab to a local SQLite database for fast, offline-capable querying, filtering, and hybrid search. ## Features @@ -9,8 +9,10 @@ Local GitLab data management with semantic search. Syncs issues, MRs, discussion - **Full re-sync**: Reset cursors and fetch all data from scratch when needed - **Multi-project**: Track issues and MRs across multiple GitLab projects - **Rich filtering**: Filter by state, author, assignee, labels, milestone, due date, draft status, reviewer, branches +- **Hybrid search**: Combines FTS5 lexical search with Ollama-powered vector embeddings via Reciprocal Rank Fusion - **Raw payload storage**: Preserves original GitLab API responses for debugging - **Discussion threading**: Full support for issue and MR discussions including inline code review comments +- **Robot mode**: Machine-readable JSON output with structured errors and meaningful exit codes ## Installation @@ -32,25 +34,28 @@ cargo build --release lore init # Verify authentication -lore auth-test +lore auth -# Sync issues from GitLab -lore ingest --type issues - -# Sync merge requests from GitLab -lore ingest --type mrs +# Sync everything from GitLab (issues + MRs + docs + embeddings) +lore sync # List recent issues -lore list issues --limit 10 +lore issues -n 10 # List open merge requests -lore list mrs --state opened +lore mrs -s opened # Show issue details -lore show issue 123 --project group/repo +lore issues 123 # Show MR details with discussions -lore show mr 456 --project group/repo +lore mrs 456 + +# Search across all indexed data +lore search "authentication bug" + +# Robot mode (machine-readable JSON) +lore -J issues -n 5 | jq . 
``` ## Configuration @@ -79,6 +84,12 @@ Configuration is stored in `~/.config/lore/config.json` (or `$XDG_CONFIG_HOME/lo }, "storage": { "compressRawPayloads": true + }, + "embedding": { + "provider": "ollama", + "model": "nomic-embed-text", + "baseUrl": "http://localhost:11434", + "concurrency": 4 } } ``` @@ -87,9 +98,9 @@ Configuration is stored in `~/.config/lore/config.json` (or `$XDG_CONFIG_HOME/lo | Section | Field | Default | Description | |---------|-------|---------|-------------| -| `gitlab` | `baseUrl` | — | GitLab instance URL (required) | +| `gitlab` | `baseUrl` | -- | GitLab instance URL (required) | | `gitlab` | `tokenEnvVar` | `GITLAB_TOKEN` | Environment variable containing API token | -| `projects` | `path` | — | Project path (e.g., `group/project`) | +| `projects` | `path` | -- | Project path (e.g., `group/project`) | | `sync` | `backfillDays` | `14` | Days to backfill on initial sync | | `sync` | `staleLockMinutes` | `10` | Minutes before sync lock considered stale | | `sync` | `heartbeatIntervalSeconds` | `30` | Frequency of lock heartbeat updates | @@ -107,7 +118,7 @@ Configuration is stored in `~/.config/lore/config.json` (or `$XDG_CONFIG_HOME/lo ### Config File Resolution The config file is resolved in this order: -1. `--config` CLI flag +1. `--config` / `-c` CLI flag 2. `LORE_CONFIG_PATH` environment variable 3. `~/.config/lore/config.json` (XDG default) 4. `./lore.config.json` (local fallback for development) @@ -116,7 +127,7 @@ The config file is resolved in this order: Create a personal access token with `read_api` scope: -1. Go to GitLab → Settings → Access Tokens +1. Go to GitLab > Settings > Access Tokens 2. Create token with `read_api` scope 3. 
Export it: `export GITLAB_TOKEN=glpat-xxxxxxxxxxxx` @@ -126,12 +137,185 @@ Create a personal access token with `read_api` scope: |----------|---------|----------| | `GITLAB_TOKEN` | GitLab API authentication token (name configurable via `gitlab.tokenEnvVar`) | Yes | | `LORE_CONFIG_PATH` | Override config file location | No | +| `LORE_ROBOT` | Enable robot mode globally (set to `true` or `1`) | No | | `XDG_CONFIG_HOME` | XDG Base Directory for config (fallback: `~/.config`) | No | | `XDG_DATA_HOME` | XDG Base Directory for data (fallback: `~/.local/share`) | No | | `RUST_LOG` | Logging level filter (e.g., `lore=debug`) | No | ## Commands +### `lore issues` + +Query issues from local database, or show a specific issue. + +```bash +lore issues # Recent issues (default 50) +lore issues 123 # Show issue #123 with discussions +lore issues 123 -p group/repo # Disambiguate by project +lore issues -n 100 # More results +lore issues -s opened # Only open issues +lore issues -s closed # Only closed issues +lore issues -a username # By author (@ prefix optional) +lore issues -A username # By assignee (@ prefix optional) +lore issues -l bug # By label (AND logic) +lore issues -l bug -l urgent # Multiple labels +lore issues -m "v1.0" # By milestone title +lore issues --since 7d # Updated in last 7 days +lore issues --since 2w # Updated in last 2 weeks +lore issues --since 2024-01-01 # Updated since date +lore issues --due-before 2024-12-31 # Due before date +lore issues --has-due # Only issues with due dates +lore issues -p group/repo # Filter by project +lore issues --sort created --asc # Sort by created date, ascending +lore issues -o # Open first result in browser +``` + +When listing, output includes: IID, title, state, author, assignee, labels, and update time. + +When showing a single issue (e.g., `lore issues 123`), output includes: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions. 
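
The `--since` grammar above (relative `7d` / `2w` values or an absolute `YYYY-MM-DD` date) can be sketched as a small parser. This is illustrative only: `parse_since` and the `Since` enum are hypothetical names, not gitlore's actual types.

```rust
/// Parse a `--since` value: relative (`7d`, `2w`) or absolute `YYYY-MM-DD`.
/// Sketch under assumed semantics; not gitlore's real parser.
#[derive(Debug, PartialEq)]
pub enum Since {
    /// Look-back window in days.
    Days(u64),
    /// Absolute date, kept as a string here to stay dependency-free.
    Date(String),
}

pub fn parse_since(s: &str) -> Option<Since> {
    if let Some(n) = s.strip_suffix('d') {
        return n.parse().ok().map(Since::Days);
    }
    if let Some(n) = s.strip_suffix('w') {
        return n.parse::<u64>().ok().map(|w| Since::Days(w * 7));
    }
    // Minimal YYYY-MM-DD shape check (no calendar validation).
    let bytes = s.as_bytes();
    if bytes.len() == 10
        && bytes[4] == b'-'
        && bytes[7] == b'-'
        && s.chars().enumerate().all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit())
    {
        return Some(Since::Date(s.to_string()));
    }
    None
}
```

Note that `2w` normalizes to a 14-day window, so relative forms collapse to a single representation.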
+ +### `lore mrs` + +Query merge requests from local database, or show a specific MR. + +```bash +lore mrs # Recent MRs (default 50) +lore mrs 456 # Show MR !456 with discussions +lore mrs 456 -p group/repo # Disambiguate by project +lore mrs -n 100 # More results +lore mrs -s opened # Only open MRs +lore mrs -s merged # Only merged MRs +lore mrs -s closed # Only closed MRs +lore mrs -s locked # Only locked MRs +lore mrs -s all # All states +lore mrs -a username # By author (@ prefix optional) +lore mrs -A username # By assignee (@ prefix optional) +lore mrs -r username # By reviewer (@ prefix optional) +lore mrs -d # Only draft/WIP MRs +lore mrs -D # Exclude draft MRs +lore mrs --target main # By target branch +lore mrs --source feature/foo # By source branch +lore mrs -l needs-review # By label (AND logic) +lore mrs --since 7d # Updated in last 7 days +lore mrs -p group/repo # Filter by project +lore mrs --sort created --asc # Sort by created date, ascending +lore mrs -o # Open first result in browser +``` + +When listing, output includes: IID, title (with [DRAFT] prefix if applicable), state, author, assignee, labels, and update time. + +When showing a single MR (e.g., `lore mrs 456`), output includes: title, description, state, draft status, author, assignees, reviewers, labels, source/target branches, merge status, web URL, and threaded discussions. Inline code review comments (DiffNotes) display file context in the format `[src/file.ts:45]`. + +### `lore search` + +Search across indexed documents using hybrid (lexical + semantic), lexical-only, or semantic-only modes. 
+ +```bash +lore search "authentication bug" # Hybrid search (default) +lore search "login flow" --mode lexical # FTS5 lexical only +lore search "login flow" --mode semantic # Vector similarity only +lore search "auth" --type issue # Filter by source type +lore search "auth" --type mr # MR documents only +lore search "auth" --type discussion # Discussion documents only +lore search "deploy" --author username # Filter by author +lore search "deploy" -p group/repo # Filter by project +lore search "deploy" --label backend # Filter by label (AND logic) +lore search "deploy" --path src/ # Filter by file path (trailing / for prefix) +lore search "deploy" --after 7d # Created after (7d, 2w, or YYYY-MM-DD) +lore search "deploy" --updated-after 2w # Updated after +lore search "deploy" -n 50 # Limit results (default 20, max 100) +lore search "deploy" --explain # Show ranking explanation per result +lore search "deploy" --fts-mode raw # Raw FTS5 query syntax (advanced) +``` + +Requires `lore generate-docs` (or `lore sync`) to have been run at least once. Semantic mode requires Ollama with the configured embedding model. + +### `lore sync` + +Run the full sync pipeline: ingest from GitLab, generate searchable documents, and compute embeddings. + +```bash +lore sync # Full pipeline +lore sync --full # Reset cursors, fetch everything +lore sync --force # Override stale lock +lore sync --no-embed # Skip embedding step +lore sync --no-docs # Skip document regeneration +``` + +### `lore ingest` + +Sync data from GitLab to local database. Runs only the ingestion step (no doc generation or embeddings). + +```bash +lore ingest # Ingest everything (issues + MRs) +lore ingest issues # Issues only +lore ingest mrs # MRs only +lore ingest issues -p group/repo # Single project +lore ingest --force # Override stale lock +lore ingest --full # Full re-sync (reset cursors) +``` + +The `--full` flag resets sync cursors and discussion watermarks, then fetches all data from scratch. 
Useful when: +- Assignee data or other fields were missing from earlier syncs +- You want to ensure complete data after schema changes +- Troubleshooting sync issues + +### `lore generate-docs` + +Extract searchable documents from ingested issues, MRs, and discussions for the FTS5 index. + +```bash +lore generate-docs # Incremental (dirty items only) +lore generate-docs --full # Full rebuild +lore generate-docs -p group/repo # Single project +``` + +### `lore embed` + +Generate vector embeddings for documents via Ollama. Requires Ollama running with the configured embedding model. + +```bash +lore embed # Embed new/changed documents +lore embed --retry-failed # Retry previously failed embeddings +``` + +### `lore count` + +Count entities in local database. + +```bash +lore count issues # Total issues +lore count mrs # Total MRs (with state breakdown) +lore count discussions # Total discussions +lore count discussions --for issue # Issue discussions only +lore count discussions --for mr # MR discussions only +lore count notes # Total notes (system vs user breakdown) +lore count notes --for issue # Issue notes only +``` + +### `lore stats` + +Show document and index statistics, with optional integrity checks. + +```bash +lore stats # Document and index statistics +lore stats --check # Run integrity checks +lore stats --check --repair # Repair integrity issues +``` + +### `lore status` + +Show current sync state and watermarks. + +```bash +lore status +``` + +Displays: +- Last sync run details (status, timing) +- Cursor positions per project and resource type (issues and MRs) +- Data summary counts + ### `lore init` Initialize configuration and database interactively. @@ -142,12 +326,12 @@ lore init --force # Overwrite existing config lore init --non-interactive # Fail if prompts needed ``` -### `lore auth-test` +### `lore auth` Verify GitLab authentication is working. 
```bash -lore auth-test +lore auth # Authenticated as @username (Full Name) # GitLab: https://gitlab.com ``` @@ -157,8 +341,7 @@ lore auth-test Check environment health and configuration. ```bash -lore doctor # Human-readable output -lore doctor --json # JSON output for scripting +lore doctor ``` Checks performed: @@ -168,132 +351,6 @@ Checks performed: - Project accessibility - Ollama connectivity (optional) -### `lore ingest` - -Sync data from GitLab to local database. - -```bash -# Issues -lore ingest --type issues # Sync all projects -lore ingest --type issues --project group/repo # Single project -lore ingest --type issues --force # Override stale lock -lore ingest --type issues --full # Full re-sync (reset cursors) - -# Merge Requests -lore ingest --type mrs # Sync all projects -lore ingest --type mrs --project group/repo # Single project -lore ingest --type mrs --full # Full re-sync (reset cursors) -``` - -The `--full` flag resets sync cursors and discussion watermarks, then fetches all data from scratch. Useful when: -- Assignee data or other fields were missing from earlier syncs -- You want to ensure complete data after schema changes -- Troubleshooting sync issues - -### `lore list issues` - -Query issues from local database. 
- -```bash -lore list issues # Recent issues (default 50) -lore list issues --limit 100 # More results -lore list issues --state opened # Only open issues -lore list issues --state closed # Only closed issues -lore list issues --author username # By author (@ prefix optional) -lore list issues --assignee username # By assignee (@ prefix optional) -lore list issues --label bug # By label (AND logic) -lore list issues --label bug --label urgent # Multiple labels -lore list issues --milestone "v1.0" # By milestone title -lore list issues --since 7d # Updated in last 7 days -lore list issues --since 2w # Updated in last 2 weeks -lore list issues --since 2024-01-01 # Updated since date -lore list issues --due-before 2024-12-31 # Due before date -lore list issues --has-due-date # Only issues with due dates -lore list issues --project group/repo # Filter by project -lore list issues --sort created --order asc # Sort options -lore list issues --open # Open first result in browser -lore list issues --json # JSON output -``` - -Output includes: IID, title, state, author, assignee, labels, and update time. - -### `lore list mrs` - -Query merge requests from local database. 
- -```bash -lore list mrs # Recent MRs (default 50) -lore list mrs --limit 100 # More results -lore list mrs --state opened # Only open MRs -lore list mrs --state merged # Only merged MRs -lore list mrs --state closed # Only closed MRs -lore list mrs --state locked # Only locked MRs -lore list mrs --state all # All states -lore list mrs --author username # By author (@ prefix optional) -lore list mrs --assignee username # By assignee (@ prefix optional) -lore list mrs --reviewer username # By reviewer (@ prefix optional) -lore list mrs --draft # Only draft/WIP MRs -lore list mrs --no-draft # Exclude draft MRs -lore list mrs --target-branch main # By target branch -lore list mrs --source-branch feature/foo # By source branch -lore list mrs --label needs-review # By label (AND logic) -lore list mrs --since 7d # Updated in last 7 days -lore list mrs --project group/repo # Filter by project -lore list mrs --sort created --order asc # Sort options -lore list mrs --open # Open first result in browser -lore list mrs --json # JSON output -``` - -Output includes: IID, title (with [DRAFT] prefix if applicable), state, author, assignee, labels, and update time. - -### `lore show issue` - -Display detailed issue information. - -```bash -lore show issue 123 # Show issue #123 -lore show issue 123 --project group/repo # Disambiguate if needed -``` - -Shows: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions. - -### `lore show mr` - -Display detailed merge request information. - -```bash -lore show mr 456 # Show MR !456 -lore show mr 456 --project group/repo # Disambiguate if needed -``` - -Shows: title, description, state, draft status, author, assignees, reviewers, labels, source/target branches, merge status, web URL, and threaded discussions. Inline code review comments (DiffNotes) display file context in the format `[src/file.ts:45]`. - -### `lore count` - -Count entities in local database. 
- -```bash -lore count issues # Total issues -lore count mrs # Total MRs (with state breakdown) -lore count discussions # Total discussions -lore count discussions --type issue # Issue discussions only -lore count discussions --type mr # MR discussions only -lore count notes # Total notes (shows system vs user breakdown) -``` - -### `lore sync-status` - -Show current sync state and watermarks. - -```bash -lore sync-status -``` - -Displays: -- Last sync run details (status, timing) -- Cursor positions per project and resource type (issues and MRs) -- Data summary counts - ### `lore migrate` Run pending database migrations. @@ -302,8 +359,6 @@ Run pending database migrations. lore migrate ``` -Shows current schema version and applies any pending migrations. - ### `lore version` Show version information. @@ -312,26 +367,67 @@ Show version information. lore version ``` -### `lore backup` +## Robot Mode -Create timestamped database backup. +Machine-readable JSON output for scripting and AI agent consumption. + +### Activation ```bash -lore backup +# Global flag +lore --robot issues -n 5 + +# JSON shorthand (-J) +lore -J issues -n 5 + +# Environment variable +LORE_ROBOT=1 lore issues -n 5 + +# Auto-detection (when stdout is not a TTY) +lore issues -n 5 | jq . ``` -*Note: Not yet implemented.* +### Response Format -### `lore reset` +All commands return consistent JSON: -Delete database and reset all state. 
+```json +{"ok": true, "data": {...}, "meta": {...}} +``` + +Errors return structured JSON to stderr: + +```json +{"error": {"code": "CONFIG_NOT_FOUND", "message": "...", "suggestion": "Run 'lore init'"}} +``` + +### Exit Codes + +| Code | Meaning | +|------|---------| +| 0 | Success | +| 1 | Internal error | +| 2 | Config not found | +| 3 | Config invalid | +| 4 | Token not set | +| 5 | GitLab auth failed | +| 6 | Resource not found | +| 7 | Rate limited | +| 8 | Network error | +| 9 | Database locked | +| 10 | Database error | +| 11 | Migration failed | +| 12 | I/O error | +| 13 | Transform error | + +## Global Options ```bash -lore reset --confirm +lore -c /path/to/config.json # Use alternate config +lore --robot # Machine-readable JSON +lore -J # JSON shorthand ``` -*Note: Not yet implemented.* - ## Database Schema Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables: @@ -350,6 +446,9 @@ Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables: | `mr_reviewers` | Many-to-many MR-reviewer relationships | | `discussions` | Issue/MR discussion threads | | `notes` | Individual notes within discussions (with system note flag and DiffNote position data) | +| `documents` | Extracted searchable text for FTS and embedding | +| `documents_fts` | FTS5 full-text search index | +| `embeddings` | Vector embeddings for semantic search | | `sync_runs` | Audit trail of sync operations | | `sync_cursors` | Cursor positions for incremental sync | | `app_locks` | Crash-safe single-flight lock | @@ -358,12 +457,6 @@ Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables: The database is stored at `~/.local/share/lore/lore.db` by default (XDG compliant). 
-## Global Options - -```bash -lore --config /path/to/config.json # Use alternate config -``` - ## Development ```bash @@ -371,10 +464,10 @@ lore --config /path/to/config.json # Use alternate config cargo test # Run with debug logging -RUST_LOG=lore=debug lore list issues +RUST_LOG=lore=debug lore issues # Run with trace logging -RUST_LOG=lore=trace lore ingest --type issues +RUST_LOG=lore=trace lore ingest issues # Check formatting cargo fmt --check @@ -386,7 +479,8 @@ cargo clippy ## Tech Stack - **Rust** (2024 edition) -- **SQLite** via rusqlite (bundled) +- **SQLite** via rusqlite (bundled) with FTS5 and sqlite-vec +- **Ollama** for vector embeddings (nomic-embed-text) - **clap** for CLI parsing - **reqwest** for HTTP - **tokio** for async runtime @@ -394,23 +488,6 @@ cargo clippy - **tracing** for logging - **indicatif** for progress bars -## Current Status - -This is Checkpoint 2 (CP2) of the Gitlore project. Currently implemented: - -- Issue ingestion with cursor-based incremental sync -- Merge request ingestion with cursor-based incremental sync -- Discussion and note syncing for issues and MRs -- DiffNote support for inline code review comments -- Rich filtering and querying for both issues and MRs -- Full re-sync capability with watermark reset - -Not yet implemented: -- Semantic search with embeddings (CP3+) -- Backup and reset commands - -See [SPEC.md](SPEC.md) for the full project roadmap and architecture. 
- ## License MIT diff --git a/docs/api-efficiency-findings.md b/docs/api-efficiency-findings.md new file mode 100644 index 0000000..c9c9398 --- /dev/null +++ b/docs/api-efficiency-findings.md @@ -0,0 +1,354 @@ +# API Efficiency & Observability Findings + +> **Status:** Draft - working through items +> **Context:** Audit of gitlore's GitLab API usage, data processing, and observability gaps +> **Interactive reference:** `api-review.html` (root of repo, open in browser) + +--- + +## Checkpoint 3 Alignment + +Checkpoint 3 (`docs/prd/checkpoint-3.md`) introduces `lore sync` orchestration, document generation, and search. Several findings here overlap with that work. This section maps the relationship so effort isn't duplicated and so CP3 implementation can absorb the right instrumentation as it's built. + +### Direct overlaps (CP3 partially addresses) + +| Finding | CP3 coverage | Remaining gap | +|---------|-------------|---------------| +| **P0-1** sync_runs never written | `lore sync` step 7 says "record sync_run". `SyncResult` struct defined with counts. | Only covers the new `lore sync` command. Existing `lore ingest` still won't write sync_runs. Either instrument `lore ingest` separately or have `lore sync` subsume it entirely. | +| **P0-2** No timing | `print_sync` captures wall-clock `elapsed_secs` / `elapsed_ms` in robot mode JSON `meta` envelope. | Wall-clock only. No per-phase, per-API-call, or per-DB-write breakdown. The `SyncResult` struct has counts but no duration fields. | +| **P2-1** Discussion full-refresh | CP3 introduces `pending_discussion_fetches` queue with exponential backoff and bounded processing per sync. Structures the work better. | Same full-refresh strategy per entity. The queue adds retry resilience but doesn't reduce the number of API calls for unchanged discussions. 
| + +### Different scope (complementary, no overlap) + +| Finding | Why no overlap | +|---------|---------------| +| **P0-3** metrics_json schema | CP3 doesn't reference the `metrics_json` column. `SyncResult` is printed/returned but not persisted there. | +| **P0-4** Discussion sync telemetry columns | CP3's queue system (`pending_discussion_fetches`) is a replacement architecture. The existing per-MR telemetry columns (`discussions_sync_attempts`, `_last_error`) aren't referenced in CP3. Decide: use CP3's queue table or wire up the existing columns? | +| **P0-5** Progress events lack timing | CP3 lists "Progress visible during long syncs" as acceptance criteria but doesn't spec timing in events. | +| **P1-\*** Free data capture | CP3 doesn't touch GitLab API response field coverage at all. These are independent. | +| **P2-2** Keyset pagination (GitLab API) | CP3 uses keyset pagination for local SQLite queries (document seeding, embedding pipelines). Completely different from using GitLab API keyset pagination. | +| **P2-3** ETags | Not mentioned in CP3. | +| **P2-4** Labels enrichment | Not mentioned in CP3. | +| **P3-\*** Structural improvements | Not in CP3 scope. | + +### Recommendation + +CP3's `lore sync` orchestrator is the natural integration point for P0 instrumentation. Rather than retrofitting `lore ingest` separately, the most efficient path is: + +1. Build P0 timing instrumentation as a reusable layer (e.g., a `SyncMetrics` struct that accumulates phase timings) +2. Wire it into the CP3 `run_sync` implementation as it's built +3. Have `run_sync` persist the full metrics (counts + timing) to `sync_runs.metrics_json` +4. Decide whether `lore ingest` becomes a thin wrapper around `lore sync --no-docs --no-embed` or stays separate with its own sync_runs recording + +This avoids building instrumentation twice and ensures the new sync pipeline is observable from day one. 
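
A minimal sketch of the reusable timing layer described in step 1 above. The names (`SyncMetrics`, `phase`, `to_json`) are assumptions for illustration; the JSON is hand-rolled here to keep the sketch dependency-free, where the real implementation would likely use serde_json and persist via rusqlite.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Accumulates per-phase wall-clock timings for one sync run.
/// Illustrative sketch only; gitlore defines no such struct yet.
pub struct SyncMetrics {
    started: Instant,
    phases: BTreeMap<String, Duration>,
}

impl SyncMetrics {
    pub fn new() -> Self {
        Self { started: Instant::now(), phases: BTreeMap::new() }
    }

    /// Run `f` as a named phase and record how long it took.
    pub fn phase<T>(&mut self, name: &str, f: impl FnOnce() -> T) -> T {
        let t0 = Instant::now();
        let out = f();
        *self.phases.entry(name.to_string()).or_default() += t0.elapsed();
        out
    }

    /// Render a minimal `metrics_json` payload (hand-rolled to avoid a
    /// serde dependency in this sketch).
    pub fn to_json(&self) -> String {
        let phases: Vec<String> = self
            .phases
            .iter()
            .map(|(name, d)| format!("\"{}\":{{\"duration_ms\":{}}}", name, d.as_millis()))
            .collect();
        format!(
            "{{\"wall_clock_ms\":{},\"phases\":{{{}}}}}",
            self.started.elapsed().as_millis(),
            phases.join(",")
        )
    }
}
```

Under this shape, `run_sync` would wrap each stage (e.g. `metrics.phase("primary_fetch", || ...)`) and write `to_json()` into `sync_runs.metrics_json` on completion.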
### Decision: `lore ingest` goes away + +`lore sync` becomes the single command for all data fetching. The first run does a full fetch (equivalent to today's `lore ingest`); subsequent runs are incremental via cursors. `lore ingest` becomes a hidden deprecated alias. + +Implications: +- P0 instrumentation only needs to be built in one place (`run_sync`) +- CP3 Gate C owns the sync_runs lifecycle end-to-end +- The existing `lore ingest issues` / `lore ingest mrs` code becomes internal functions called by `run_sync`, not standalone CLI commands +- `lore sync` always syncs everything: issues, MRs, discussions, documents, embeddings (with `--no-embed` / `--no-docs` to opt out of later stages) + +--- + +## Implementation Sequence + +### Phase A: Before CP3 (independent, enriches data model) + +**Do first.** Migration + struct changes only. No architectural dependency. Gets richer source data into the DB before CP3's document generation pipeline locks in its schema. + +1. **P1 batch: free data capture** - All 11 fields in a single migration. `user_notes_count`, `upvotes`, `downvotes`, `confidential`, `has_conflicts`, `blocking_discussions_resolved`, `merge_commit_sha`, `discussion_locked`, `task_completion_status`, `issue_type`, issue `references`. +2. **P1-10: MR milestones** - Reuse existing issue milestone transformer. Slightly more work, same migration. + +### Phase B: During CP3 Gate C (`lore sync`) + +**Build instrumentation into the sync orchestrator as it's constructed.** Not a separate effort. + +3. **P0-1 + P0-2 + P0-3** - `SyncMetrics` struct accumulating phase timings. `run_sync` writes to `sync_runs` with full `metrics_json` on completion. +4. **P0-4** - Decide: use CP3's `pending_discussion_fetches` queue or existing per-MR telemetry columns. Wire up the winner. +5. **P0-5** - Add `elapsed_ms` to `*Complete` progress event variants. +6. **Deprecate `lore ingest`** - Hidden alias pointing to `lore sync`. Remove from help output. 
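
Item 5 of Phase B (timing on `*Complete` events) could be as small as adding one field. The variant names here mirror the prose but are assumptions, not gitlore's actual enum:

```rust
use std::time::Instant;

/// Illustrative progress events; gitlore's real enum lives in
/// src/ingestion/orchestrator.rs and differs in detail.
#[derive(Debug, PartialEq)]
pub enum ProgressEvent {
    IssueFetched { count: u64 },
    IngestComplete { items: u64, elapsed_ms: u128 },
}

/// Build a completion event carrying wall-clock duration so consumers
/// (progress bars, robot-mode JSON) can report real throughput.
pub fn complete_event(started: Instant, items: u64) -> ProgressEvent {
    ProgressEvent::IngestComplete { items, elapsed_ms: started.elapsed().as_millis() }
}
```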
+ +### Phase C: After CP3 ships, informed by real metrics + +**Only pursue items that P0 data proves matter.** + +7. **P2-1: Discussion optimization** - Check metrics_json from real runs. If discussion phase is <10% of wall-clock, skip. +8. **P2-2: Keyset pagination** - Check primary fetch timing on largest project. If fast, skip. +9. **P2-4: Labels enrichment** - If label colors are needed for any UI surface. + +### Phase D: Future (needs a forcing function) + +10. **P3-1: Users table** - When a UI needs display names / avatars. +11. **P2-3: ETags** - Only if P2-1 doesn't sufficiently reduce discussion overhead. +12. **P3-2/3/4: GraphQL, Events API, Webhooks** - Architectural shifts. Only if pull-based sync hits a scaling wall. + +--- + +## Priority 0: Observability (prerequisite for everything else) + +We can't evaluate any efficiency question without measurement. Gitlore has no runtime performance instrumentation. The infrastructure for it was scaffolded (sync_runs table, metrics_json column, discussion sync telemetry columns) but never wired up. + +### P0-1: sync_runs table is never written to + +**Location:** Schema in `migrations/001_initial.sql:25-34`, read in `src/cli/commands/sync_status.rs:69-72` + +The table exists and `lore status` reads from it, but no code ever INSERTs or UPDATEs rows. The entire audit trail is empty. + +```sql +-- Exists in schema, never populated +CREATE TABLE sync_runs ( + id INTEGER PRIMARY KEY, + started_at INTEGER NOT NULL, + heartbeat_at INTEGER NOT NULL, + finished_at INTEGER, + status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' + command TEXT NOT NULL, + error TEXT, + metrics_json TEXT -- never written +); +``` + +**What to do:** Instrument the ingest orchestrator to record sync runs. 
Each `lore ingest issues` / `lore ingest mrs` invocation should: +- INSERT a row with status='running' at start +- UPDATE with status='succeeded'/'failed' + finished_at on completion +- Populate metrics_json with the IngestProjectResult / IngestMrProjectResult counters + +### P0-2: No operation timing anywhere + +**Location:** Rate limiter in `src/gitlab/client.rs:20-65`, orchestrator in `src/ingestion/orchestrator.rs` + +`Instant::now()` is used only for rate limiter enforcement. No operation durations are measured or logged. We don't know: + +- How long a full issue ingest takes +- How long discussion sync takes per entity +- How long individual API requests take (network latency) +- How long database writes take per batch +- How long rate limiter sleeps accumulate to +- How long pagination takes across pages + +**What to do:** Add timing instrumentation at these levels: + +| Level | What to time | Where | +|-------|-------------|-------| +| **Run** | Total ingest wall-clock time | orchestrator entry/exit | +| **Phase** | Primary fetch vs discussion sync | orchestrator phase boundaries | +| **API call** | Individual HTTP request round-trip | client.rs request method | +| **DB write** | Transaction duration per batch | ingestion store functions | +| **Rate limiter** | Cumulative sleep time per run | client.rs acquire() | + +Store phase-level and run-level timing in `metrics_json`. Log API-call-level timing at debug level. + +### P0-3: metrics_json has no defined schema + +**What to do:** Define what goes in there. 
Strawman based on existing IngestProjectResult fields plus timing: + +```json +{ + "wall_clock_ms": 14200, + "phases": { + "primary_fetch": { + "duration_ms": 8400, + "api_calls": 12, + "items_fetched": 1143, + "items_upserted": 87, + "pages": 12, + "rate_limit_sleep_ms": 1200 + }, + "discussion_sync": { + "duration_ms": 5800, + "entities_checked": 87, + "entities_synced": 14, + "entities_skipped": 73, + "api_calls": 22, + "discussions_fetched": 156, + "notes_upserted": 412, + "rate_limit_sleep_ms": 2200 + } + }, + "db": { + "labels_created": 3, + "raw_payloads_stored": 87, + "raw_payloads_deduped": 42 + } +} +``` + +### P0-4: Discussion sync telemetry columns are dead code + +**Location:** `merge_requests` table columns: `discussions_sync_last_attempt_at`, `discussions_sync_attempts`, `discussions_sync_last_error` + +These exist in the schema but are never read or written. They were designed for tracking retry behavior on failed discussion syncs. + +**What to do:** Wire these up during discussion sync. On attempt: set last_attempt_at and increment attempts. On failure: set last_error. On success: reset attempts to 0. This provides per-entity visibility into discussion sync health. + +### P0-5: Progress events carry no timing + +**Location:** `src/ingestion/orchestrator.rs:28-53` + +ProgressEvent variants (`IssueFetched`, `DiscussionSynced`, etc.) carry only counts. Adding elapsed_ms to at least `*Complete` variants would give callers (CLI progress bars, robot mode output) real throughput numbers. + +--- + +## Priority 1: Free data capture (zero API cost) + +These fields are already in the API responses gitlore receives. Storing them requires only Rust struct additions and DB column migrations. No additional API calls. + +### P1-1: user_notes_count (Issues + MRs) + +**API field:** `user_notes_count` (integer) +**Value:** Could short-circuit discussion re-sync. If count hasn't changed, discussions probably haven't changed either. 
Also useful for "most discussed" queries. +**Effort:** Add field to serde struct, add DB column, store during transform. + +### P1-2: upvotes / downvotes (Issues + MRs) + +**API field:** `upvotes`, `downvotes` (integers) +**Value:** Engagement metrics for triage. "Most upvoted open issues" is a common query. +**Effort:** Same pattern as above. + +### P1-3: confidential (Issues) + +**API field:** `confidential` (boolean) +**Value:** Security-sensitive filtering. Important to know when exposing issue data. +**Effort:** Low. + +### P1-4: has_conflicts (MRs) + +**API field:** `has_conflicts` (boolean) +**Value:** Identify MRs needing rebase. Useful for "stale MR" detection. +**Effort:** Low. + +### P1-5: blocking_discussions_resolved (MRs) + +**API field:** `blocking_discussions_resolved` (boolean) +**Value:** MR readiness indicator without joining the discussions table. +**Effort:** Low. + +### P1-6: merge_commit_sha (MRs) + +**API field:** `merge_commit_sha` (string, nullable) +**Value:** Trace merged MRs to specific commits in git history. +**Effort:** Low. + +### P1-7: discussion_locked (Issues + MRs) + +**API field:** `discussion_locked` (boolean) +**Value:** Know if new comments can be added. Useful for robot mode consumers. +**Effort:** Low. + +### P1-8: task_completion_status (Issues + MRs) + +**API field:** `task_completion_status` (object: `{count, completed_count}`) +**Value:** Track task-list checkbox progress without parsing markdown. +**Effort:** Low. Store as two integer columns or a small JSON blob. + +### P1-9: issue_type (Issues) + +**API field:** `issue_type` (string: "issue" | "incident" | "test_case") +**Value:** Distinguish issues vs incidents vs test cases for filtering. +**Effort:** Low. + +### P1-10: MR milestone (MRs) + +**API field:** `milestone` (object, same structure as on issues) +**Current state:** Milestones are fully stored for issues but completely ignored for MRs. +**Value:** "Which MRs are in milestone X?" 
Currently impossible to query locally. +**Effort:** Medium - reuse existing milestone transformer from issue pipeline. + +### P1-11: Issue references (Issues) + +**API field:** `references` (object: `{short, relative, full}`) +**Current state:** Stored for MRs (`references_short`, `references_full`), dropped for issues. +**Value:** Cross-project issue references (e.g., `group/project#42`). +**Effort:** Low. + +--- + +## Priority 2: Efficiency improvements (requires measurement from P0 first) + +These are potential optimizations. **Do not implement until P0 instrumentation proves they matter.** + +### P2-1: Discussion full-refresh strategy + +**Current behavior:** When an issue/MR's `updated_at` advances, ALL its discussions are deleted and re-fetched from scratch. + +**Potential optimization:** Use `user_notes_count` (P1-1) to detect whether discussions actually changed. Skip re-sync if count is unchanged. + +**Why we need P0 first:** The full-refresh may be fast enough. Since we already fetch the data from GitLab, the DELETE+INSERT is just local SQLite I/O. If discussion sync for a typical entity takes <100ms locally, this isn't worth optimizing. We need the per-entity timing from P0-2 to know. + +**Trade-offs to consider:** +- Full-refresh catches edited and deleted notes. Incremental would miss those. +- `user_notes_count` doesn't change when notes are edited, only when added/removed. +- Full-refresh is simpler to reason about for consistency. + +### P2-2: Keyset pagination + +**Current behavior:** Offset-based (`page=N&per_page=100`). +**Alternative:** Keyset pagination (`pagination=keyset`), O(1) per page instead of O(N). + +**Why we need P0 first:** Only matters for large projects (>10K issues). Most projects will never hit enough pages for this to be measurable. P0 timing of pagination will show if this is a bottleneck. + +**Note:** Gitlore already parses `Link` headers for next-page detection, which is the client-side mechanism keyset pagination uses. 
So partial support exists. + +### P2-3: ETag / conditional requests + +**Current behavior:** All requests are unconditional. +**Alternative:** Cache ETags, send `If-None-Match`, get 304s back. + +**Why we need P0 first:** The cursor-based sync already avoids re-fetching unchanged data for primary resources. ETags would mainly help with discussion re-fetches where nothing changed. If P2-1 (user_notes_count skip) is implemented, ETags become less valuable. + +### P2-4: Labels API enrichment + +**Current behavior:** Labels extracted from the `labels[]` string array in issue/MR responses. The `labels` table has `color` and `description` columns that may not be populated. +**Alternative:** Single call to `GET /projects/:id/labels` per project per sync to populate label metadata. +**Cost:** 1 API call per project per sync run. +**Value:** Label colors for UI rendering, descriptions for tooltips. + +--- + +## Priority 3: Structural improvements (future consideration) + +### P3-1: Users table + +**Current state:** Only `username` stored. Author `name`, `avatar_url`, `web_url`, `state` are in every API response but discarded. +**Proposal:** Create a `users` table, upsert on every encounter. Zero API cost. +**Value:** Richer user display, detect blocked/deactivated users. + +### P3-2: GraphQL API for field-precise fetching + +**Current state:** REST API returns ~40-50 fields per entity. Gitlore uses ~15-23. +**Alternative:** GraphQL API allows requesting exactly the fields needed. +**Trade-offs:** Different pagination model, potentially less stable API, more complex client code. The bandwidth savings are real but likely minor compared to discussion re-fetch overhead. + +### P3-3: Events API for lightweight change detection + +**Endpoint:** `GET /projects/:id/events` +**Value:** Lightweight "has anything changed?" check before running full issue/MR sync. Could replace or supplement the cursor-based approach for very active projects. 
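The Events API check in P3-3 reduces to a small pure decision once timestamps are in hand. A minimal sketch, assuming millisecond-epoch values and an illustrative function name (not gitlore's actual API):

```rust
// Decide whether a full issue/MR sync is worth running, given the newest
// project event timestamp (from GET /projects/:id/events) and the stored
// sync cursor. Both are Option<i64> millisecond epochs.
fn sync_needed(latest_event_ms: Option<i64>, cursor_ms: Option<i64>) -> bool {
    match (latest_event_ms, cursor_ms) {
        // Project has no events at all: nothing to fetch.
        (None, _) => false,
        // Never synced before: do the initial full fetch.
        (Some(_), None) => true,
        // Otherwise sync only if something happened after the cursor.
        (Some(event), Some(cursor)) => event > cursor,
    }
}
```

The interesting property is the asymmetry: a missing cursor forces a sync, while missing events suppress one, so a brand-new project with zero activity costs only the one events call.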
+ +### P3-4: Webhook-based push sync + +**Endpoint:** `POST /projects/:id/hooks` (setup), then receive pushes. +**Value:** Near-real-time sync without polling cost. Eliminates all rate-limit concerns. +**Barrier:** Requires a listener endpoint, which changes the architecture from pull-only CLI to something with a daemon/server component. + +--- + +## Working notes + +_Space for recording decisions as we work through items._ + +### Decisions made + +| Item | Decision | Rationale | +|------|----------|-----------| +| `lore ingest` | Remove. `lore sync` is the single entry point. | No reason to separate initial load from incremental updates. First run = full fetch, subsequent = cursor-based delta. | +| CP3 alignment | Build P0 instrumentation into CP3 Gate C, not separately. | Avoids building in two places. `lore sync` owns the full lifecycle. | +| P2 timing | Defer all efficiency optimizations until P0 metrics from real runs are available. | Can't evaluate trade-offs without measurement. | + +### Open questions + +- What's the typical project size (issue/MR count) for gitlore users? This determines whether keyset pagination (P2-2) matters. +- Is there a plan for a web UI or TUI? That would increase the value of P3-1 (users table) and P2-4 (label colors). diff --git a/docs/phase-a-spec.md b/docs/phase-a-spec.md new file mode 100644 index 0000000..7973795 --- /dev/null +++ b/docs/phase-a-spec.md @@ -0,0 +1,456 @@ +# Phase A: Complete API Field Capture + +> **Status:** Draft +> **Guiding principle:** Mirror everything GitLab gives us. +> - **Lossless mirror:** the raw API JSON stored behind `raw_payload_id`. This is the true complete representation of every API response. +> - **Relational projection:** a stable, query-optimized subset of fields we commit to keeping current on every re-sync. +> This preserves maximum context for processing and analysis while avoiding unbounded schema growth. 
+> **Migration:** 007_complete_field_capture.sql +> **Prerequisite:** None (independent of CP3) + +--- + +## Scope + +One migration. Three categories of work: + +1. **New columns** on `issues` and `merge_requests` for fields currently dropped by serde or dropped during transform +2. **New serde fields** on `GitLabIssue` and `GitLabMergeRequest` to deserialize currently-silently-dropped JSON fields +3. **Transformer + insert updates** to pass the new fields through to the DB + +No new tables. No new API calls. No new endpoints. All data comes from responses we already receive. + +--- + +## Issues: Field Gap Inventory + +### Currently stored +id, iid, project_id, title, description, state, author_username, created_at, updated_at, web_url, due_date, milestone_id, milestone_title, raw_payload_id, last_seen_at, discussions_synced_for_updated_at, labels (junction), assignees (junction) + +### Currently deserialized but dropped during transform +| API Field | Status | Action | +|-----------|--------|--------| +| `closed_at` | Deserialized in serde struct, but no DB column exists and transformer never populates it | Add column in migration 007, wire up in IssueRow + transform + INSERT | +| `author.id` | Deserialized | Store as `author_id` column | +| `author.name` | Deserialized | Store as `author_name` column | + +### Currently silently dropped by serde (not in GitLabIssue struct) +| API Field | Type | DB Column | Notes | +|-----------|------|-----------|-------| +| `issue_type` | Option\<String\> | `issue_type` | Canonical field (lowercase, e.g.
"issue"); preferred for DB storage | +| `upvotes` | i64 | `upvotes` | | +| `downvotes` | i64 | `downvotes` | | +| `user_notes_count` | i64 | `user_notes_count` | Useful for discussion sync optimization | +| `merge_requests_count` | i64 | `merge_requests_count` | Count of linked MRs | +| `confidential` | bool | `confidential` | 0/1 | +| `discussion_locked` | bool | `discussion_locked` | 0/1 | +| `weight` | Option\<i64\> | `weight` | Premium/Ultimate, null on Free | +| `time_stats.time_estimate` | i64 | `time_estimate` | Seconds | +| `time_stats.total_time_spent` | i64 | `time_spent` | Seconds | +| `time_stats.human_time_estimate` | Option\<String\> | `human_time_estimate` | e.g. "3h 30m" | +| `time_stats.human_total_time_spent` | Option\<String\> | `human_time_spent` | e.g. "1h 15m" | +| `task_completion_status.count` | i64 | `task_count` | Checkbox total | +| `task_completion_status.completed_count` | i64 | `task_completed_count` | Checkboxes checked | +| `has_tasks` | bool | `has_tasks` | 0/1 | +| `severity` | Option\<String\> | `severity` | Incident severity | +| `closed_by` | Option\<GitLabClosedBy\> | `closed_by_username` | Who closed it (username only, consistent with author pattern) | +| `imported` | bool | `imported` | 0/1 | +| `imported_from` | Option\<String\> | `imported_from` | Import source | +| `moved_to_id` | Option\<i64\> | `moved_to_id` | Target issue if moved | +| `references.short` | String | `references_short` | e.g. "#42" | +| `references.relative` | String | `references_relative` | e.g. "#42" or "group/proj#42" | +| `references.full` | String | `references_full` | e.g. "group/project#42" | +| `health_status` | Option\<String\> | `health_status` | Ultimate only | +| `type` | Option\<String\> | (transform-only) | Uppercase category (e.g. "ISSUE"); fallback for `issue_type` -- lowercased before storage. Not stored as separate column; raw JSON remains lossless.
| `epic.id` | Option\<i64\> | `epic_id` | Premium/Ultimate, null on Free | +| `epic.iid` | Option\<i64\> | `epic_iid` | | +| `epic.title` | Option\<String\> | `epic_title` | | +| `epic.url` | Option\<String\> | `epic_url` | | +| `epic.group_id` | Option\<i64\> | `epic_group_id` | | +| `iteration.id` | Option\<i64\> | `iteration_id` | Premium/Ultimate, null on Free | +| `iteration.iid` | Option\<i64\> | `iteration_iid` | | +| `iteration.title` | Option\<String\> | `iteration_title` | | +| `iteration.state` | Option\<i64\> | `iteration_state` | Enum: 1=upcoming, 2=current, 3=closed | +| `iteration.start_date` | Option\<String\> | `iteration_start_date` | ISO date | +| `iteration.due_date` | Option\<String\> | `iteration_due_date` | ISO date | + +--- + +## Merge Requests: Field Gap Inventory + +### Currently stored +id, iid, project_id, title, description, state, draft, author_username, source_branch, target_branch, head_sha, references_short, references_full, detailed_merge_status, merge_user_username, created_at, updated_at, merged_at, closed_at, last_seen_at, web_url, raw_payload_id, discussions_synced_for_updated_at, discussions_sync_last_attempt_at, discussions_sync_attempts, discussions_sync_last_error, labels (junction), assignees (junction), reviewers (junction) + +### Currently deserialized but dropped during transform +| API Field | Status | Action | +|-----------|--------|--------| +| `author.id` | Deserialized | Store as `author_id` column | +| `author.name` | Deserialized | Store as `author_name` column | +| `work_in_progress` | Used transiently for `draft` fallback | Already handled, no change needed | +| `merge_status` (legacy) | Used transiently for `detailed_merge_status` fallback | Already handled, no change needed | +| `merged_by` | Used transiently for `merge_user` fallback | Already handled, no change needed | + +### Currently silently dropped by serde (not in GitLabMergeRequest struct) +| API Field | Type | DB Column | Notes | +|-----------|------|-----------|-------| +| `upvotes` | i64 | `upvotes` | | +| `downvotes` | i64
| `downvotes` | | +| `user_notes_count` | i64 | `user_notes_count` | | +| `source_project_id` | i64 | `source_project_id` | Fork source | +| `target_project_id` | i64 | `target_project_id` | Fork target | +| `milestone` | Option\<GitLabMilestone\> | `milestone_id`, `milestone_title` | Reuse issue milestone pattern | +| `merge_when_pipeline_succeeds` | bool | `merge_when_pipeline_succeeds` | 0/1, auto-merge flag | +| `merge_commit_sha` | Option\<String\> | `merge_commit_sha` | Commit ref after merge | +| `squash_commit_sha` | Option\<String\> | `squash_commit_sha` | Commit ref after squash | +| `discussion_locked` | bool | `discussion_locked` | 0/1 | +| `should_remove_source_branch` | Option\<bool\> | `should_remove_source_branch` | 0/1 | +| `force_remove_source_branch` | Option\<bool\> | `force_remove_source_branch` | 0/1 | +| `squash` | bool | `squash` | 0/1 | +| `squash_on_merge` | bool | `squash_on_merge` | 0/1 | +| `has_conflicts` | bool | `has_conflicts` | 0/1 | +| `blocking_discussions_resolved` | bool | `blocking_discussions_resolved` | 0/1 | +| `time_stats.time_estimate` | i64 | `time_estimate` | Seconds | +| `time_stats.total_time_spent` | i64 | `time_spent` | Seconds | +| `time_stats.human_time_estimate` | Option\<String\> | `human_time_estimate` | | +| `time_stats.human_total_time_spent` | Option\<String\> | `human_time_spent` | | +| `task_completion_status.count` | i64 | `task_count` | | +| `task_completion_status.completed_count` | i64 | `task_completed_count` | | +| `closed_by` | Option\<GitLabClosedBy\> | `closed_by_username` | | +| `prepared_at` | Option\<String\> | `prepared_at` | ISO datetime in API; store as ms epoch via `iso_to_ms()`, nullable | +| `merge_after` | Option\<String\> | `merge_after` | ISO datetime in API; store as ms epoch via `iso_to_ms()`, nullable (scheduled merge) | +| `imported` | bool | `imported` | 0/1 | +| `imported_from` | Option\<String\> | `imported_from` | | +| `approvals_before_merge` | Option\<i64\> | `approvals_before_merge` | Deprecated, scheduled for removal in GitLab API v5; store best-effort, keep nullable | +|
`references.relative` | String | `references_relative` | Currently only short + full stored | +| `confidential` | bool | `confidential` | 0/1 (MRs can be confidential too) | +| `iteration.id` | Option\<i64\> | `iteration_id` | Premium/Ultimate, null on Free | +| `iteration.iid` | Option\<i64\> | `iteration_iid` | | +| `iteration.title` | Option\<String\> | `iteration_title` | | +| `iteration.state` | Option\<i64\> | `iteration_state` | | +| `iteration.start_date` | Option\<String\> | `iteration_start_date` | ISO date | +| `iteration.due_date` | Option\<String\> | `iteration_due_date` | ISO date | + +--- + +## Migration 007: complete_field_capture.sql + +```sql +-- Migration 007: Capture all remaining GitLab API response fields. +-- Principle: mirror everything GitLab returns. No field left behind. + +-- ============================================================ +-- ISSUES: new columns +-- ============================================================ + +-- Fields currently deserialized but not stored +ALTER TABLE issues ADD COLUMN closed_at INTEGER; -- ms epoch, deserialized but never stored until now +ALTER TABLE issues ADD COLUMN author_id INTEGER; -- GitLab user ID +ALTER TABLE issues ADD COLUMN author_name TEXT; -- Display name + +-- Issue metadata +ALTER TABLE issues ADD COLUMN issue_type TEXT; -- 'issue' | 'incident' | 'test_case' +ALTER TABLE issues ADD COLUMN confidential INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN discussion_locked INTEGER NOT NULL DEFAULT 0; + +-- Engagement +ALTER TABLE issues ADD COLUMN upvotes INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN downvotes INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN user_notes_count INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN merge_requests_count INTEGER NOT NULL DEFAULT 0; + +-- Time tracking +ALTER TABLE issues ADD COLUMN time_estimate INTEGER NOT NULL DEFAULT 0; -- seconds +ALTER TABLE issues ADD COLUMN time_spent INTEGER NOT NULL DEFAULT 0; -- seconds +ALTER TABLE issues ADD COLUMN
human_time_estimate TEXT; +ALTER TABLE issues ADD COLUMN human_time_spent TEXT; + +-- Task lists +ALTER TABLE issues ADD COLUMN task_count INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN task_completed_count INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN has_tasks INTEGER NOT NULL DEFAULT 0; + +-- References (MRs already have short + full) +ALTER TABLE issues ADD COLUMN references_short TEXT; -- e.g. "#42" +ALTER TABLE issues ADD COLUMN references_relative TEXT; -- context-dependent +ALTER TABLE issues ADD COLUMN references_full TEXT; -- e.g. "group/project#42" + +-- Close/move tracking +ALTER TABLE issues ADD COLUMN closed_by_username TEXT; + +-- Premium/Ultimate fields (nullable, null on Free tier) +ALTER TABLE issues ADD COLUMN weight INTEGER; +ALTER TABLE issues ADD COLUMN severity TEXT; +ALTER TABLE issues ADD COLUMN health_status TEXT; + +-- Import tracking +ALTER TABLE issues ADD COLUMN imported INTEGER NOT NULL DEFAULT 0; +ALTER TABLE issues ADD COLUMN imported_from TEXT; +ALTER TABLE issues ADD COLUMN moved_to_id INTEGER; + +-- Epic (Premium/Ultimate, null on Free) +ALTER TABLE issues ADD COLUMN epic_id INTEGER; +ALTER TABLE issues ADD COLUMN epic_iid INTEGER; +ALTER TABLE issues ADD COLUMN epic_title TEXT; +ALTER TABLE issues ADD COLUMN epic_url TEXT; +ALTER TABLE issues ADD COLUMN epic_group_id INTEGER; + +-- Iteration (Premium/Ultimate, null on Free) +ALTER TABLE issues ADD COLUMN iteration_id INTEGER; +ALTER TABLE issues ADD COLUMN iteration_iid INTEGER; +ALTER TABLE issues ADD COLUMN iteration_title TEXT; +ALTER TABLE issues ADD COLUMN iteration_state INTEGER; +ALTER TABLE issues ADD COLUMN iteration_start_date TEXT; +ALTER TABLE issues ADD COLUMN iteration_due_date TEXT; + +-- ============================================================ +-- MERGE REQUESTS: new columns +-- ============================================================ + +-- Author enrichment +ALTER TABLE merge_requests ADD COLUMN author_id INTEGER; +ALTER TABLE 
merge_requests ADD COLUMN author_name TEXT; + +-- Engagement +ALTER TABLE merge_requests ADD COLUMN upvotes INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN downvotes INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN user_notes_count INTEGER NOT NULL DEFAULT 0; + +-- Fork tracking +ALTER TABLE merge_requests ADD COLUMN source_project_id INTEGER; +ALTER TABLE merge_requests ADD COLUMN target_project_id INTEGER; + +-- Milestone (parity with issues) +ALTER TABLE merge_requests ADD COLUMN milestone_id INTEGER; +ALTER TABLE merge_requests ADD COLUMN milestone_title TEXT; + +-- Merge behavior +ALTER TABLE merge_requests ADD COLUMN merge_when_pipeline_succeeds INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT; +ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT; +ALTER TABLE merge_requests ADD COLUMN squash INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN squash_on_merge INTEGER NOT NULL DEFAULT 0; + +-- Merge readiness +ALTER TABLE merge_requests ADD COLUMN has_conflicts INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN blocking_discussions_resolved INTEGER NOT NULL DEFAULT 0; + +-- Branch cleanup +ALTER TABLE merge_requests ADD COLUMN should_remove_source_branch INTEGER; +ALTER TABLE merge_requests ADD COLUMN force_remove_source_branch INTEGER; + +-- Discussion lock +ALTER TABLE merge_requests ADD COLUMN discussion_locked INTEGER NOT NULL DEFAULT 0; + +-- Time tracking +ALTER TABLE merge_requests ADD COLUMN time_estimate INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN time_spent INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN human_time_estimate TEXT; +ALTER TABLE merge_requests ADD COLUMN human_time_spent TEXT; + +-- Task lists +ALTER TABLE merge_requests ADD COLUMN task_count INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN task_completed_count INTEGER NOT NULL DEFAULT 0; + +-- Close tracking 
+ALTER TABLE merge_requests ADD COLUMN closed_by_username TEXT; + +-- Scheduling (API returns ISO datetimes; we store ms epoch for consistency) +ALTER TABLE merge_requests ADD COLUMN prepared_at INTEGER; -- ms epoch after iso_to_ms() +ALTER TABLE merge_requests ADD COLUMN merge_after INTEGER; -- ms epoch after iso_to_ms() + +-- References (add relative, short + full already exist) +ALTER TABLE merge_requests ADD COLUMN references_relative TEXT; + +-- Import tracking +ALTER TABLE merge_requests ADD COLUMN imported INTEGER NOT NULL DEFAULT 0; +ALTER TABLE merge_requests ADD COLUMN imported_from TEXT; + +-- Premium/Ultimate +ALTER TABLE merge_requests ADD COLUMN approvals_before_merge INTEGER; +ALTER TABLE merge_requests ADD COLUMN confidential INTEGER NOT NULL DEFAULT 0; + +-- Iteration (Premium/Ultimate, null on Free) +ALTER TABLE merge_requests ADD COLUMN iteration_id INTEGER; +ALTER TABLE merge_requests ADD COLUMN iteration_iid INTEGER; +ALTER TABLE merge_requests ADD COLUMN iteration_title TEXT; +ALTER TABLE merge_requests ADD COLUMN iteration_state INTEGER; +ALTER TABLE merge_requests ADD COLUMN iteration_start_date TEXT; +ALTER TABLE merge_requests ADD COLUMN iteration_due_date TEXT; + +-- Record migration version +INSERT INTO schema_version (version, applied_at, description) +VALUES (7, strftime('%s', 'now') * 1000, 'Complete API field capture for issues and merge requests'); +``` + +--- + +## Serde Struct Changes + +### Existing type changes + +``` +GitLabReferences // Add: relative: Option<String> (with #[serde(default)]) + // Existing fields short + full remain unchanged +GitLabIssue // Add #[derive(Default)] for test ergonomics +GitLabMergeRequest // Add #[derive(Default)] for test ergonomics +``` + +### New helper types needed + +``` +GitLabTimeStats { time_estimate, total_time_spent, human_time_estimate, human_total_time_spent } +GitLabTaskCompletionStatus { count, completed_count } +GitLabClosedBy (reuse GitLabAuthor shape: id, username, name) +GitLabEpic { id,
iid, title, url, group_id } +GitLabIteration { id, iid, title, state, start_date, due_date } +``` + +### GitLabIssue: add fields + +``` +type: Option<String> // #[serde(rename = "type")] -- fallback-only (uppercase category); "type" is reserved in Rust +upvotes: i64 // #[serde(default)] +downvotes: i64 // #[serde(default)] +user_notes_count: i64 // #[serde(default)] +merge_requests_count: i64 // #[serde(default)] +confidential: bool // #[serde(default)] +discussion_locked: bool // #[serde(default)] +weight: Option<i64> +time_stats: Option<GitLabTimeStats> +task_completion_status: Option<GitLabTaskCompletionStatus> +has_tasks: bool // #[serde(default)] +references: Option<GitLabReferences> +closed_by: Option<GitLabClosedBy> +severity: Option<String> +health_status: Option<String> +imported: bool // #[serde(default)] +imported_from: Option<String> +moved_to_id: Option<i64> +issue_type: Option<String> // canonical field (lowercase); preferred for DB storage over `type` +epic: Option<GitLabEpic> +iteration: Option<GitLabIteration> +``` + +### GitLabMergeRequest: add fields + +``` +upvotes: i64 // #[serde(default)] +downvotes: i64 // #[serde(default)] +user_notes_count: i64 // #[serde(default)] +source_project_id: Option<i64> +target_project_id: Option<i64> +milestone: Option<GitLabMilestone> // reuse existing type +merge_when_pipeline_succeeds: bool // #[serde(default)] +merge_commit_sha: Option<String> +squash_commit_sha: Option<String> +squash: bool // #[serde(default)] +squash_on_merge: bool // #[serde(default)] +has_conflicts: bool // #[serde(default)] +blocking_discussions_resolved: bool // #[serde(default)] +should_remove_source_branch: Option<bool> +force_remove_source_branch: Option<bool> +discussion_locked: bool // #[serde(default)] +time_stats: Option<GitLabTimeStats> +task_completion_status: Option<GitLabTaskCompletionStatus> +closed_by: Option<GitLabClosedBy> +prepared_at: Option<String> +merge_after: Option<String> +imported: bool // #[serde(default)] +imported_from: Option<String> +approvals_before_merge: Option<i64> +confidential: bool // #[serde(default)] +iteration: Option<GitLabIteration> +``` + +--- + +## Transformer Changes + +### IssueRow: add fields + +All new fields map 1:1 from the serde struct except: +- `closed_at` -> `iso_to_ms()` conversion (already in
serde struct, just not passed through) +- `time_stats` -> flatten to 4 individual fields +- `task_completion_status` -> flatten to 2 individual fields +- `references` -> flatten to 3 individual fields +- `closed_by` -> extract `username` only (consistent with author pattern) +- `author` -> additionally extract `id` and `name` (currently only `username`) +- `issue_type` -> store as-is (canonical, lowercase); fallback to lowercased `type` field if `issue_type` absent +- `epic` -> flatten to 5 individual fields (id, iid, title, url, group_id) +- `iteration` -> flatten to 6 individual fields (id, iid, title, state, start_date, due_date) + +### NormalizedMergeRequest: add fields + +Same patterns as issues, plus: +- `milestone` -> reuse `upsert_milestone_tx` from issue pipeline, add `milestone_id` + `milestone_title` +- `prepared_at`, `merge_after` -> `iso_to_ms()` conversion (API provides ISO datetimes) +- `source_project_id`, `target_project_id` -> direct pass-through +- `iteration` -> flatten to 6 individual fields (same as issues) + +### Insert statement changes + +Both `process_issue_in_transaction` and `process_mr_in_transaction` need their INSERT and ON CONFLICT DO UPDATE statements extended with all new columns. The ON CONFLICT clause should update all new fields on re-sync. + +**Implementation note (reliability):** Define a single authoritative list of persisted columns per entity and generate/compose both SQL fragments from it: +- INSERT column list + VALUES placeholders +- ON CONFLICT DO UPDATE assignments + +This prevents drift where a new field is added to one clause but not the other -- the most likely bug class with 40+ new columns. + +--- + +## Prerequisite refactors (prep commits before main Phase A work) + +### 1. Align issue transformer on `core::time` + +The issue transformer (`transformers/issue.rs`) has a local `parse_timestamp()` that duplicates `iso_to_ms_strict()` from `core::time`. The MR transformer already uses the shared module. 
Before adding Phase A's optional timestamp fields (especially `closed_at` as `Option<i64>`), migrate the issue transformer to use `iso_to_ms_strict()` and `iso_to_ms_opt_strict()` from `core::time`. This avoids duplicating the `opt` variant locally and establishes one timestamp parsing path across the codebase. + +**Changes:** Replace `parse_timestamp()` calls with `iso_to_ms_strict()`, adapt or remove `TransformError::TimestampParse` (MR transformer uses `String` errors; align on that or on a shared error type). + +### 2. Extract shared ingestion helpers + +`upsert_milestone_tx` (in `ingestion/issues.rs`) and `upsert_label_tx` (duplicated in both `ingestion/issues.rs` and `ingestion/merge_requests.rs`) should be moved to a shared module (e.g., `src/ingestion/shared.rs`). MR ingestion needs `upsert_milestone_tx` for Phase A milestone support, and the label helper is already copy-pasted between files. + +**Changes:** Create `src/ingestion/shared.rs`, move `upsert_milestone_tx`, `upsert_label_tx`, and `MilestoneRow` there. Update imports in both issue and MR ingestion modules.
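The single-authoritative-column-list approach recommended under the insert statement changes can be sketched as follows. Table, key, and column names here are illustrative, not gitlore's real lists:

```rust
// Build both the INSERT column/placeholder list and the ON CONFLICT DO
// UPDATE assignments from one shared slice, so a new column can never be
// added to one clause but forgotten in the other.
fn upsert_sql(table: &str, conflict_key: &str, cols: &[&str]) -> String {
    // ?1, ?2, ... positional placeholders, one per column.
    let placeholders: Vec<String> = (1..=cols.len()).map(|i| format!("?{i}")).collect();
    // excluded.* assignments for every column except the conflict key.
    let updates: Vec<String> = cols
        .iter()
        .filter(|c| **c != conflict_key) // never reassign the conflict key
        .map(|c| format!("{c} = excluded.{c}"))
        .collect();
    format!(
        "INSERT INTO {table} ({}) VALUES ({}) ON CONFLICT({conflict_key}) DO UPDATE SET {}",
        cols.join(", "),
        placeholders.join(", "),
        updates.join(", ")
    )
}
```

With 40+ new columns per entity, the win is that adding a column is a one-line change to the slice, and both SQL fragments stay in lockstep by construction.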
+ +--- + +## Files touched + +| File | Change | +|------|--------| +| `migrations/007_complete_field_capture.sql` | New file | +| `src/gitlab/types.rs` | Add `#[derive(Default)]` to `GitLabIssue` and `GitLabMergeRequest`; add `relative: Option<String>` to `GitLabReferences`; add fields to both structs; add `GitLabTimeStats`, `GitLabTaskCompletionStatus`, `GitLabEpic`, `GitLabIteration` | +| `src/gitlab/transformers/issue.rs` | Remove local `parse_timestamp()`, switch to `core::time`; extend IssueRow, IssueWithMetadata, transform_issue() | +| `src/gitlab/transformers/merge_request.rs` | Extend NormalizedMergeRequest, MergeRequestWithMetadata, transform_merge_request(); extract `references_relative` | +| `src/ingestion/shared.rs` | New file: shared `upsert_milestone_tx`, `upsert_label_tx`, `MilestoneRow` | +| `src/ingestion/issues.rs` | Extend INSERT/UPSERT SQL; import from shared module | +| `src/ingestion/merge_requests.rs` | Extend INSERT/UPSERT SQL; import from shared module; add milestone upsert | +| `src/core/db.rs` | Register migration 007 in `MIGRATIONS` array | + +--- + +## What this does NOT include + +- No new API endpoints called +- No new tables (except reusing existing `milestones` for MRs) +- No CLI changes (new fields are stored but not yet surfaced in `lore issues` / `lore mrs` output) +- No changes to discussion/note ingestion (Phase A is issues + MRs only) +- No observability instrumentation (that's Phase B) + +--- + +## Rollout / Backfill Note + +After applying Migration 007 and shipping transformer + UPSERT updates, **existing rows will not have the new columns populated** until issues/MRs are reprocessed. Plan on a **one-time full re-sync** (`lore ingest issues --full` and `lore ingest mrs --full`) to backfill the new fields. Until then, queries on new columns will return NULL/default values for previously-synced entities.
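As one concrete instance of the flattening the transformer changes describe, the nested `time_stats` object projects onto its four flat columns like this. The struct shape below is an assumption mirroring the API object, not gitlore's actual type:

```rust
// Hypothetical shape mirroring GitLab's `time_stats` object. An absent
// object becomes zero/NULL column values, matching the schema's
// NOT NULL DEFAULT 0 integers and nullable TEXT columns.
#[derive(Default)]
struct GitLabTimeStats {
    time_estimate: i64,
    total_time_spent: i64,
    human_time_estimate: Option<String>,
    human_total_time_spent: Option<String>,
}

// Returns (time_estimate, time_spent, human_time_estimate, human_time_spent)
// in DB column order.
fn flatten_time_stats(ts: Option<GitLabTimeStats>) -> (i64, i64, Option<String>, Option<String>) {
    let ts = ts.unwrap_or_default();
    (
        ts.time_estimate,          // time_estimate column (seconds)
        ts.total_time_spent,       // time_spent column (seconds)
        ts.human_time_estimate,    // human_time_estimate column
        ts.human_total_time_spent, // human_time_spent column
    )
}
```

The same pattern (defaulted struct, field-order tuple) applies to `task_completion_status`, `references`, `epic`, and `iteration`.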
+ +--- + +## Resolved decisions + +| Field | Decision | Rationale | +|-------|----------|-----------| +| `subscribed` | **Excluded** | User-relative field (reflects token holder's subscription state, not an entity property). Changes meaning if the token is rotated to a different user. Not entity data. | +| `_links` | **Excluded** | HATEOAS API navigation metadata, not entity data. Every URL is deterministically constructable from `project_id` + `iid` + GitLab base URL. Note: `closed_as_duplicate_of` inside `_links` contains a real entity reference -- extracting that is deferred to a future phase. | +| `epic` / `iteration` | **Flatten to columns** | Same denormalization pattern as milestones. Epic gets 5 columns (`epic_id`, `epic_iid`, `epic_title`, `epic_url`, `epic_group_id`). Iteration gets 6 columns (`iteration_id`, `iteration_iid`, `iteration_title`, `iteration_state`, `iteration_start_date`, `iteration_due_date`). Both nullable (null on Free tier). | +| `approvals_before_merge` | **Store best-effort** | Deprecated and scheduled for removal in GitLab API v5. Keep as `Option<i64>` / nullable column. Never depend on it for correctness -- it may disappear in a future GitLab release. | diff --git a/docs/prd/checkpoint-3.md index 9a23033..5414661 100644 --- a/docs/prd/checkpoint-3.md +++ b/docs/prd/checkpoint-3.md @@ -7,20 +7,29 @@ This checkpoint consolidates SPEC.md checkpoints 3A, 3B, 4, and 5 into a unified implementation plan. The work is structured for parallel agent execution where dependencies allow.
All code integrates with existing `gitlore` infrastructure: -- Error handling via `GiError` and `ErrorCode` in `src/core/error.rs` +- Error handling via `LoreError` and `ErrorCode` in `src/core/error.rs` - CLI patterns matching `src/cli/commands/*.rs` (run functions, JSON/human output) - Database via `rusqlite::Connection` with migrations in `migrations/` +- Database connections via `create_connection(db_path)` in `src/core/db.rs` - Config via `src/core/config.rs` (EmbeddingConfig already defined) - Robot mode JSON with `{"ok": true, "data": {...}}` pattern +**Code Sample Convention:** Rust code samples in this PRD omit `use` imports for brevity. +Implementers should add the appropriate imports (`use crate::...`, `use rusqlite::...`, etc.) +based on the types and functions referenced in each sample. + +**Prerequisite Rename:** Before or during Gate A, rename `GiError` → `LoreError` across the +codebase (16 files). Update `src/core/error.rs` enum name, `src/core/mod.rs` re-export, +`src/lib.rs` re-export, and all `use` statements. This is a mechanical find-and-replace. + --- ## Executive Summary (Gated Milestones) This checkpoint ships in three gates to reduce integration risk. Each gate is independently verifiable and shippable: -**Gate A (Lexical MVP):** documents + FTS + filters + `lore search --mode=lexical` + `lore stats` -**Gate B (Hybrid MVP):** embeddings + vector + RRF fusion + graceful degradation +**Gate A (Lexical MVP):** documents + FTS + filters + `lore search --mode=lexical` + `lore stats` (no sqlite-vec dependency) +**Gate B (Hybrid MVP):** sqlite-vec extension + embeddings + vector + RRF fusion + graceful degradation **Gate C (Sync MVP):** `lore sync` orchestration + queues/backoff + integrity check/repair **Deliverables:** @@ -40,12 +49,20 @@ This checkpoint ships in three gates to reduce integration risk. 
Each gate is in **Key Design Decisions:** - Documents are the search unit (not raw entities) - FTS5 works standalone when Ollama unavailable (graceful degradation) -- sqlite-vec `rowid = documents.id` for simple joins +- Gate A requires only SQLite + FTS5; sqlite-vec is introduced in Gate B (reduces env risk) +- Documents store full untruncated text with a hard safety cap (2MB) for pathological outliers +- Embedding pipeline chunks long documents at embed time (overlap-aware paragraph splitting) +- sqlite-vec `rowid = doc_id * 1000 + chunk_index` encodes chunks; vector search deduplicates by doc_id - RRF ranking avoids score normalization complexity - Queue-based discussion fetching isolates failures - FTS5 query sanitization prevents syntax errors from user input - Exponential backoff on all queues prevents hot-loop retries - Transient embed failures trigger graceful degradation (not hard errors) +- `lore sync` is the unified command (ingest + generate-docs + embed); individual commands exist for recovery +- Dirty source tracking: mark ALL upserted entities inside ingestion transactions via `mark_dirty_tx(&Transaction)` +- Queue draining: `generate-docs` and `sync` drain queues completely using bounded batch loops (not single-pass ceilings) +- Discussion sweep uses CTE to capture stale IDs before cascading deletes +- Unchanged documents are fully skipped (content_hash + labels_hash + paths_hash triple-check) --- @@ -74,7 +91,10 @@ CREATE TABLE documents ( paths_hash TEXT NOT NULL DEFAULT '', -- SHA-256 over sorted paths (write optimization) is_truncated INTEGER NOT NULL DEFAULT 0, truncated_reason TEXT CHECK ( - truncated_reason IN ('token_limit_middle_drop','single_note_oversized','first_last_oversized') + truncated_reason IN ( + 'token_limit_middle_drop','single_note_oversized','first_last_oversized', + 'hard_cap_oversized' + ) OR truncated_reason IS NULL ), UNIQUE(source_type, source_id) @@ -191,55 +211,83 @@ END; --- -### 1.3 Embeddings Schema (Migration 009) 
+### 1.3 Embeddings Schema (Migration 009) — Gate B Only **File:** `migrations/009_embeddings.sql` +> **Gate Boundary:** This migration is intentionally part of Gate B. Gate A ships with +> migrations 007 + 008 only, so it has zero dependency on sqlite-vec being present or +> loadable. The migration runner must load sqlite-vec *before* applying this migration, +> but Gate A can run without it. + ```sql -- NOTE: sqlite-vec vec0 virtual tables cannot participate in FK cascades. -- We must use an explicit trigger to delete orphan embeddings when documents -- are deleted. See documents_embeddings_ad trigger below. -- sqlite-vec virtual table for vector search --- Storage rule: embeddings.rowid = documents.id +-- Storage rule: embeddings.rowid = document_id * 1000 + chunk_index +-- This encodes (document_id, chunk_index) into a single integer rowid. +-- Supports up to 1000 chunks per document (32M chars at 32k/chunk). CREATE VIRTUAL TABLE embeddings USING vec0( embedding float[768] ); --- Embedding provenance + change detection +-- Embedding provenance + change detection (one row per chunk) +-- NOTE: Two hash columns serve different purposes: +-- document_hash: SHA-256 of full documents.content_text (staleness detection) +-- chunk_hash: SHA-256 of this individual chunk's text (debug/provenance) +-- Pending detection uses document_hash (not chunk_hash) because staleness is +-- a document-level condition: if the document changed, ALL chunks need re-embedding. 
CREATE TABLE embedding_metadata ( - document_id INTEGER PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE, + document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE, + chunk_index INTEGER NOT NULL DEFAULT 0, -- 0-indexed position within document model TEXT NOT NULL, -- 'nomic-embed-text' dims INTEGER NOT NULL, -- 768 - content_hash TEXT NOT NULL, -- copied from documents.content_hash + document_hash TEXT NOT NULL, -- SHA-256 of full documents.content_text (staleness) + chunk_hash TEXT NOT NULL, -- SHA-256 of this chunk's text (provenance) created_at INTEGER NOT NULL, -- ms epoch UTC last_error TEXT, -- error message from last failed attempt attempt_count INTEGER NOT NULL DEFAULT 0, - last_attempt_at INTEGER -- ms epoch UTC + last_attempt_at INTEGER, -- ms epoch UTC + PRIMARY KEY(document_id, chunk_index) ); CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL; -CREATE INDEX idx_embedding_metadata_hash ON embedding_metadata(content_hash); +CREATE INDEX idx_embedding_metadata_doc ON embedding_metadata(document_id); --- CRITICAL: Delete orphan embeddings when documents are deleted. +-- CRITICAL: Delete ALL chunk embeddings when a document is deleted. -- vec0 virtual tables don't support FK ON DELETE CASCADE, so we need this trigger. --- embedding_metadata has ON DELETE CASCADE, so only vec0 needs explicit cleanup +-- embedding_metadata has ON DELETE CASCADE, so only vec0 needs explicit cleanup. 
+-- Range: [document_id * 1000, document_id * 1000 + 999]
 CREATE TRIGGER documents_embeddings_ad
 AFTER DELETE ON documents
 BEGIN
-    DELETE FROM embeddings WHERE rowid = old.id;
+    DELETE FROM embeddings
+    WHERE rowid >= old.id * 1000
+      AND rowid < (old.id + 1) * 1000;
 END;
 ```
 
+**Chunking Constants:**
+| Constant | Value | Rationale |
+|----------|-------|-----------|
+| `CHUNK_MAX_CHARS` | 32,000 | ~8k tokens at 4 chars/token, fits nomic-embed-text context window |
+| `CHUNK_OVERLAP_CHARS` | 500 | ~125 tokens overlap preserves context at chunk boundaries |
+| `CHUNK_ROWID_MULTIPLIER` | 1,000 | Encodes (doc_id, chunk_index) into single rowid; supports up to 1000 chunks/doc |
+
 **Acceptance Criteria:**
 - [ ] `embeddings` vec0 table created
-- [ ] `embedding_metadata` tracks provenance
+- [ ] `embedding_metadata` tracks provenance per chunk (composite PK: document_id, chunk_index)
 - [ ] Error tracking fields present for retry logic
-- [ ] Orphan cleanup trigger fires on document deletion
+- [ ] Orphan cleanup trigger deletes ALL chunks (range deletion) on document deletion
+- [ ] Documents under 32k chars produce exactly 1 chunk (chunk_index=0)
+- [ ] Long documents split at paragraph boundaries with overlap
 
 **Dependencies:**
 - Requires sqlite-vec extension loaded at runtime
 - Extension loading already happens in `src/core/db.rs`
 - [ ] Migration runner must load sqlite-vec *before* applying migrations (including on fresh DB)
+- [ ] Add `rand` crate to `Cargo.toml` (used by backoff jitter in `src/core/backoff.rs`)
 
 ---
 
@@ -424,6 +472,104 @@ I think we should move to JWT-based auth...
 Agreed. What about refresh token strategy?
 ```
 
+**Extraction Queries:**
+
+Each `extract_*_document()` function queries the existing ingestion tables and assembles a `DocumentData`.
+Functions return `Result<Option<DocumentData>>` — `None` means the source entity was deleted.
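Before the per-entity queries, a sketch of how the queried fields might be flattened into a document's text. The header layout below is illustrative only: `render_issue_document` is a hypothetical helper, and the field order and label wording are assumptions — the authoritative format is whatever the renderer in `src/documents/` and the acceptance criteria define:

```rust
/// Illustrative assembly of an issue document's text from already-queried
/// fields. Hypothetical helper; the real layout lives in src/documents/.
fn render_issue_document(
    project: &str,
    title: &str,
    url: &str,
    labels: &[String],
    state: &str,
    author: &str,
    description: &str,
) -> String {
    format!(
        "[[Issue]] {title}\nProject: {project}\nURL: {url}\nLabels: {lbls}\nState: {state}\nAuthor: {author}\n\n{description}",
        lbls = labels.join(", "),
    )
}

fn main() {
    let doc = render_issue_document(
        "group/project",
        "Login fails after token rotation",
        "https://gitlab.example.com/group/project/-/issues/42",
        &["bug".to_string(), "auth".to_string()],
        "opened",
        "alice",
        "Steps to reproduce: ...",
    );
    // The [[Issue]] prefix and metadata header precede the description body.
    assert!(doc.starts_with("[[Issue]] Login fails"));
    assert!(doc.contains("Labels: bug, auth"));
    println!("{doc}");
}
```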
+ +**Issue extraction** (`extract_issue_document`): +```sql +-- Main entity +SELECT i.id, i.iid, i.title, i.description, i.state, i.author_username, + i.created_at, i.updated_at, i.web_url, + p.path_with_namespace, p.id AS project_id +FROM issues i +JOIN projects p ON p.id = i.project_id +WHERE i.id = ? + +-- Labels (via junction table) +SELECT l.name +FROM issue_labels il +JOIN labels l ON l.id = il.label_id +WHERE il.issue_id = ? +ORDER BY l.name +``` + +**MR extraction** (`extract_mr_document`): +```sql +-- Main entity +SELECT m.id, m.iid, m.title, m.description, m.state, m.author_username, + m.source_branch, m.target_branch, + m.created_at, m.updated_at, m.web_url, + p.path_with_namespace, p.id AS project_id +FROM merge_requests m +JOIN projects p ON p.id = m.project_id +WHERE m.id = ? + +-- Labels (via junction table) +SELECT l.name +FROM mr_labels ml +JOIN labels l ON l.id = ml.label_id +WHERE ml.merge_request_id = ? +ORDER BY l.name +``` + +**Discussion extraction** (`extract_discussion_document`): +```sql +-- Discussion with parent entity info +SELECT d.id, d.noteable_type, d.issue_id, d.merge_request_id, + p.path_with_namespace, p.id AS project_id +FROM discussions d +JOIN projects p ON p.id = d.project_id +WHERE d.id = ? + +-- Parent issue (when noteable_type = 'Issue') +SELECT i.iid, i.title, i.web_url +FROM issues i WHERE i.id = ? + +-- Parent MR (when noteable_type = 'MergeRequest') +SELECT m.iid, m.title, m.web_url +FROM merge_requests m WHERE m.id = ? + +-- Parent labels (issue or MR labels depending on noteable_type) +-- Use issue_labels or mr_labels junction table accordingly + +-- Non-system notes in thread order +SELECT n.author_username, n.body, n.created_at, n.gitlab_id, + n.note_type, n.position_old_path, n.position_new_path +FROM notes n +WHERE n.discussion_id = ? 
AND n.is_system = 0
+ORDER BY n.created_at ASC, n.id ASC
+```
+
+**Discussion URL construction:**
+Constructed from parent entity's `web_url` + `#note_{first_note.gitlab_id}`:
+```rust
+let url = match first_note_gitlab_id {
+    Some(gid) => Some(format!("{}#note_{}", parent_web_url, gid)),
+    None => Some(parent_web_url.to_string()),
+};
+```
+
+**DiffNote path extraction:**
+Collect both `position_old_path` and `position_new_path` from all non-system notes
+in the discussion thread. Deduplicated automatically by `document_paths` PK:
+```rust
+let mut paths: Vec<String> = Vec::new();
+for note in &notes {
+    if let Some(ref old_path) = note.position_old_path {
+        paths.push(old_path.clone());
+    }
+    if let Some(ref new_path) = note.position_new_path {
+        paths.push(new_path.clone());
+    }
+}
+paths.sort();
+paths.dedup();
+```
+
+**Discussion author:** First non-system note's `author_username`.
+
 **Acceptance Criteria:**
 - [ ] Issue document: structured header with `[[Issue]]` prefix, project, URL, labels, state, author, then description
 - [ ] MR document: structured header with `[[MergeRequest]]` prefix, project, URL, labels, state, author, branches, then description
@@ -438,11 +584,32 @@ Agreed. What about refresh token strategy?
 
 ### 2.3 Truncation Logic
 
+**Scope:** Truncation applies to:
+1. **Discussion documents** — note-boundary truncation at 32,000 chars (multi-note threads)
+2. **All document types** — a hard safety cap at 2,000,000 chars (~2MB) to prevent
+   pathological outliers (pasted log files, vendor lockfiles, base64 blobs) from bloating
+   the DB, causing OOM in embedding chunking, or degrading FTS indexing performance.
+
+Normal issue/MR documents remain untruncated. The safety cap triggers only for extreme
+content sizes that indicate non-prose content. The embedding pipeline handles long (but
+non-pathological) documents via chunking at embed time (see Phase 4.4).
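The two tiers can be sketched as a small decision function. The constants match the definitions in `truncation.rs`; the `truncation_tier` helper is a simplified stand-in that only classifies, whereas the real code performs the note-boundary truncation itself:

```rust
const MAX_DISCUSSION_CHARS: usize = 32_000;
const MAX_DOCUMENT_CHARS_HARD: usize = 2_000_000;

#[derive(Debug, PartialEq)]
enum Tier {
    Untouched,
    NoteBoundaryTruncate, // discussions only, 32k limit
    HardCapTruncate,      // any document type, 2MB safety cap
}

/// Decide which truncation tier applies to a document of `len` chars.
/// The hard cap is checked first because it applies to every document type.
fn truncation_tier(len: usize, is_discussion: bool) -> Tier {
    if len > MAX_DOCUMENT_CHARS_HARD {
        Tier::HardCapTruncate
    } else if is_discussion && len > MAX_DISCUSSION_CHARS {
        Tier::NoteBoundaryTruncate
    } else {
        Tier::Untouched
    }
}

fn main() {
    // A 50k-char issue body is stored in full (chunked later at embed time).
    assert_eq!(truncation_tier(50_000, false), Tier::Untouched);
    // A 50k-char discussion thread gets note-boundary truncation.
    assert_eq!(truncation_tier(50_000, true), Tier::NoteBoundaryTruncate);
    // A pathological 3MB paste hits the hard cap regardless of type.
    assert_eq!(truncation_tier(3_000_000, false), Tier::HardCapTruncate);
}
```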
+ **File:** `src/documents/truncation.rs` ```rust -/// Maximum content length (~8,000 tokens at 4 chars/token estimate). -pub const MAX_CONTENT_CHARS: usize = 32_000; +/// Maximum content length for discussion threads (~8,000 tokens at 4 chars/token estimate). +/// NOTE: This only applies to discussion documents. Issue/MR documents store full text +/// (subject to MAX_DOCUMENT_CHARS_HARD safety cap below). +/// The embedding pipeline chunks long documents at embed time (see pipeline.rs). +pub const MAX_DISCUSSION_CHARS: usize = 32_000; + +/// Hard safety cap for ALL document types (~2MB). +/// Only triggers on pathological content (pasted logs, vendor lockfiles, base64 blobs). +/// Normal issue/MR documents are never this large. This prevents: +/// - DB bloat from single extreme documents +/// - OOM in embedding chunking pipeline +/// - FTS indexing degradation from megabyte-scale content +pub const MAX_DOCUMENT_CHARS_HARD: usize = 2_000_000; /// Truncation result with metadata. #[derive(Debug, Clone)] @@ -458,6 +625,7 @@ pub enum TruncationReason { TokenLimitMiddleDrop, SingleNoteOversized, FirstLastOversized, + HardCapOversized, // Safety cap for pathological content (any doc type) } impl TruncationReason { @@ -466,18 +634,21 @@ impl TruncationReason { Self::TokenLimitMiddleDrop => "token_limit_middle_drop", Self::SingleNoteOversized => "single_note_oversized", Self::FirstLastOversized => "first_last_oversized", + Self::HardCapOversized => "hard_cap_oversized", } } } -/// Truncate content at note boundaries. +/// Truncate discussion thread at note boundaries. /// /// Rules: /// - Max content: 32,000 characters /// - Truncate at NOTE boundaries (never mid-note) /// - Preserve first N notes and last M notes /// - Drop from middle, insert marker -pub fn truncate_content(notes: &[NoteContent], max_chars: usize) -> TruncationResult { +/// +/// NOT used for issue/MR documents — those store full text. 
+pub fn truncate_discussion(notes: &[NoteContent], max_chars: usize) -> TruncationResult { // Implementation handles edge cases per table below todo!() } @@ -493,16 +664,20 @@ pub struct NoteContent { **Edge Cases:** | Scenario | Handling | |----------|----------| -| Single note > 32000 chars | Truncate at char boundary, append `[truncated]`, reason = `single_note_oversized` | +| Single note > 32000 chars | Truncate at UTF-8-safe char boundary via `truncate_utf8()`, append `[truncated]`, reason = `single_note_oversized` | | First + last note > 32000 | Keep only first note (truncated if needed), reason = `first_last_oversized` | | Only one note | Truncate at char boundary if needed | +| Any doc type > 2,000,000 chars | Truncate at UTF-8-safe boundary, reason = `hard_cap_oversized` (applies to issues/MRs/discussions) | **Acceptance Criteria:** -- [ ] Notes never cut mid-content +- [ ] Discussion note-boundary truncation works correctly +- [ ] Issue/MR documents store full text unless exceeding 2MB hard safety cap +- [ ] Hard cap truncation records `hard_cap_oversized` reason for provenance - [ ] First and last notes preserved when possible - [ ] Truncation marker `\n\n[... N notes omitted for length ...]\n\n` inserted - [ ] Metadata fields set correctly - [ ] Edge cases handled per table above +- [ ] All byte-indexed truncation uses `truncate_utf8()` to avoid panics on multi-byte codepoints --- @@ -560,7 +735,7 @@ pub fn run_generate_docs( full: bool, project_filter: Option<&str>, ) -> Result { - let conn = open_db(config)?; + let conn = create_connection(&config.storage.db_path)?; if full { // Full mode: seed dirty_sources with all source entities, then drain. @@ -581,20 +756,33 @@ pub fn run_generate_docs( // - No divergence in write-optimization logic (labels_hash, paths_hash) // - FTS triggers fire identically in both modes - // Seed issues + // Seed issues using set-based INSERT...SELECT for throughput. 
+    // Avoids N individual INSERT calls per chunk; single statement per page.
     let mut last_id: i64 = 0;
     loop {
-        let chunk = query_issue_ids_after(&conn, project_filter, FULL_MODE_CHUNK_SIZE, last_id)?;
-        if chunk.is_empty() { break; }
         let tx = conn.transaction()?;
-        for id in &chunk {
-            mark_dirty(&tx, SourceType::Issue, *id)?;
-        }
+        // Find this page's upper bound from the SOURCE table first. The insert
+        // count cannot serve as a termination signal: ON CONFLICT DO NOTHING
+        // reports 0 rows when a page is already fully queued, which would end
+        // the loop before later pages are seeded.
+        let page_max: Option<i64> = tx.query_row(
+            "SELECT MAX(id) FROM (
+                 SELECT id FROM issues WHERE id > ? ORDER BY id LIMIT ?
+             )",
+            rusqlite::params![last_id, FULL_MODE_CHUNK_SIZE as i64],
+            |row| row.get(0),
+        )?;
+        let Some(page_max) = page_max else { tx.commit()?; break; };
+        tx.execute(
+            "INSERT INTO dirty_sources
+                 (source_type, source_id, queued_at, attempt_count, last_attempt_at, last_error, next_attempt_at)
+             SELECT 'issue', id, ?, 0, NULL, NULL, NULL
+             FROM issues
+             WHERE id > ? AND id <= ?
+             ON CONFLICT(source_type, source_id) DO NOTHING",
+            rusqlite::params![crate::core::time::now_ms(), last_id, page_max],
+        )?;
+        tx.commit()?;
+        // Advance keyset cursor to the page boundary.
+        last_id = page_max;
     }
 
-    // Similar keyset-paginated seeding for MRs and discussions...
+    // Similar set-based keyset-paginated seeding for MRs and discussions...
 
     // Report: seeding complete, now regenerating
 }
 
@@ -662,10 +850,15 @@ pub struct GenerateDocsArgs {
 - [ ] Default mode processes dirty_sources queue only (incremental)
 - [ ] `--full` regenerates all documents from scratch
 - [ ] `--full` uses chunked transactions (2k docs/tx) to bound WAL growth
-- [ ] Final FTS rebuild after all chunks complete
+- [ ] Final FTS optimize after all chunks complete (rebuild reserved for --repair)
 - [ ] Progress bar in human mode (via `indicatif`)
 - [ ] JSON output in robot mode
 
+**Progress bar note:** The existing codebase uses `indicatif` for progress bars and
+`tracing-indicatif` to integrate progress bars with `tracing` log output. New commands
+should follow the same pattern — create a `ProgressBar` with `ProgressStyle::default_bar()`
+and use `tracing` for structured logging.
See existing ingestion commands for reference. + --- ## Phase 3: Lexical Search @@ -732,18 +925,42 @@ pub struct FtsResult { /// max_chars: Maximum snippet length (default 200) /// /// Returns a truncated string with ellipsis if truncated. +/// +/// IMPORTANT: Uses `truncate_utf8()` to avoid panicking on multi-byte UTF-8 +/// codepoint boundaries. Rust `&str` slicing is byte-indexed, so +/// `trimmed[..max_chars]` will panic if `max_chars` falls mid-codepoint. pub fn generate_fallback_snippet(content_text: &str, max_chars: usize) -> String { let trimmed = content_text.trim(); if trimmed.len() <= max_chars { return trimmed.to_string(); } - // Find word boundary near max_chars to avoid cutting mid-word - let truncation_point = trimmed[..max_chars] - .rfind(|c: char| c.is_whitespace()) - .unwrap_or(max_chars); + // Truncate at a valid UTF-8 char boundary first + let safe = truncate_utf8(trimmed, max_chars); - format!("{}...", &trimmed[..truncation_point]) + // Then find word boundary to avoid cutting mid-word + let truncation_point = safe + .rfind(|c: char| c.is_whitespace()) + .unwrap_or(safe.len()); + + format!("{}...", &safe[..truncation_point]) +} + +/// Truncate a string to at most `max_bytes` bytes on a valid UTF-8 char boundary. +/// +/// Rust &str slicing panics if the index falls inside a multi-byte codepoint. +/// This helper walks backward from max_bytes to find the nearest char boundary. +/// Used by snippet generation and discussion truncation to prevent panics on +/// content containing emoji, CJK characters, or other multi-byte UTF-8 text. +fn truncate_utf8(s: &str, max_bytes: usize) -> &str { + if s.len() <= max_bytes { + return s; + } + let mut cut = max_bytes; + while cut > 0 && !s.is_char_boundary(cut) { + cut -= 1; + } + &s[..cut] } /// Get snippet for search result, preferring FTS when available. 
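A quick self-contained check of the helper's boundary behavior on multi-byte input (the `truncate_utf8` body is reproduced from the snippet module above so the example runs standalone):

```rust
/// Same helper as in snippet.rs: truncate to at most max_bytes on a
/// valid UTF-8 char boundary, walking backward from the cut point.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut cut = max_bytes;
    while cut > 0 && !s.is_char_boundary(cut) {
        cut -= 1;
    }
    &s[..cut]
}

fn main() {
    // ASCII: cuts exactly at the byte limit.
    assert_eq!(truncate_utf8("hello world", 5), "hello");
    // "日" is 3 bytes; a cut at byte 4 walks back to the boundary at 3.
    assert_eq!(truncate_utf8("日本語", 4), "日");
    // Naive byte slicing ("héllo"[..2]) would panic; the helper never does.
    assert_eq!(truncate_utf8("héllo", 2), "h");
    // Strings within the limit pass through untouched.
    assert_eq!(truncate_utf8("ok", 10), "ok");
}
```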
@@ -856,6 +1073,8 @@ pub fn search_fts( let mut stmt = conn.prepare( "SELECT rowid, bm25(documents_fts), + -- NOTE: Column index 1 = content_text (0=title, 1=content_text). + -- If FTS column order changes, this index must be updated. snippet(documents_fts, 1, '', '', '...', 64) FROM documents_fts WHERE documents_fts MATCH ? @@ -1084,18 +1303,43 @@ pub fn apply_filters( |--------|-----------|-------| | `--type` | `source_type` | `issue`, `mr`, `discussion` | | `--author` | `author_username` | Exact match | -| `--project` | `project_id` | Resolve path to ID | -| `--after` | `created_at` | `>= date` (ms epoch) | -| `--updated-after` | `updated_at` | `>= date` (ms epoch), common triage filter | +| `--project` | `project_id` | Resolve path to ID (see resolution logic below) | +| `--after` | `created_at` | `>= date` — parsed via `time::parse_since()` (accepts `7d`, `2w`, `YYYY-MM-DD`) | +| `--updated-after` | `updated_at` | `>= date` — parsed via `time::parse_since()`, common triage filter | | `--label` | `document_labels` | JOIN, multiple = AND | | `--path` | `document_paths` | JOIN, trailing `/` = prefix | | `--limit` | N/A | Default 20, max 100 | +**Project Resolution Logic (`--project`):** + +The `--project` flag accepts a string and resolves it to a `project_id` using cascading match: + +1. **Exact match** on `projects.path_with_namespace` (e.g., `group/project`) +2. **Case-insensitive exact match** (e.g., `Group/Project` matches `group/project`) +3. **Suffix match** (e.g., `project-name` matches `group/project-name`) — only if unambiguous (1 result) +4. **No match or ambiguous** — return error listing available projects: + +``` +Error: Project 'auth-service' not found. + +Available projects: + backend/auth-service + frontend/auth-service-ui + infra/auth-proxy + +Hint: Use the full path, e.g., --project=backend/auth-service +``` + +Uses the existing `LoreError::Ambiguous` variant when multiple suffix matches are found. 
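The cascade can be sketched against an in-memory list. The real implementation queries the `projects` table; `resolve_project` is a hypothetical helper name, and returning the full candidate list on failure is one way to drive the "available projects" error message above:

```rust
/// Sketch of the --project cascade: exact -> case-insensitive -> unambiguous
/// suffix. On no match (or an ambiguous suffix), returns the candidate list
/// so the caller can print the "Available projects" hint.
fn resolve_project<'a>(input: &str, paths: &'a [&'a str]) -> Result<&'a str, Vec<&'a str>> {
    // 1. Exact match on path_with_namespace
    if let Some(p) = paths.iter().copied().find(|p| *p == input) {
        return Ok(p);
    }
    // 2. Case-insensitive exact match
    let lower = input.to_lowercase();
    if let Some(p) = paths.iter().copied().find(|p| p.to_lowercase() == lower) {
        return Ok(p);
    }
    // 3. Suffix match on the final path segment, only if unambiguous
    let hits: Vec<&'a str> = paths
        .iter()
        .copied()
        .filter(|p| p.rsplit('/').next() == Some(input))
        .collect();
    match hits.as_slice() {
        [one] => Ok(*one),
        _ => Err(paths.to_vec()),
    }
}

fn main() {
    let paths = ["backend/auth-service", "frontend/auth-service-ui", "infra/auth-proxy"];
    // Exact path wins immediately.
    assert_eq!(resolve_project("backend/auth-service", &paths), Ok("backend/auth-service"));
    // Case-insensitive fallback.
    assert_eq!(resolve_project("Backend/Auth-Service", &paths), Ok("backend/auth-service"));
    // Unambiguous suffix match on the final segment.
    assert_eq!(resolve_project("auth-proxy", &paths), Ok("infra/auth-proxy"));
    // Unknown input: caller gets the candidate list back for the hint.
    assert!(resolve_project("nope", &paths).is_err());
}
```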
+ **Acceptance Criteria:** - [ ] Each filter correctly restricts results - [ ] Multiple `--label` flags use AND logic - [ ] Path prefix vs exact match works correctly - [ ] `--updated-after` filters on updated_at (not created_at) +- [ ] `--after` and `--updated-after` accept relative (`7d`, `2w`) and absolute (`YYYY-MM-DD`) formats via `time::parse_since()` +- [ ] `--project` resolves via exact → case-insensitive → suffix match +- [ ] `--project` with no match or ambiguous match shows available projects list - [ ] Filters compose (all applied together) - [ ] Ranking order preserved after filtering (ORDER BY position) - [ ] Limit clamped to valid range [1, 100] @@ -1176,7 +1420,8 @@ pub fn run_search( // - prefer FTS snippet when doc hit FTS // - fallback: truncated content_text via generate_fallback_snippet() // 6. For --mode=semantic with 0% embedding coverage: - // return early with actionable error message (distinct from "Ollama down") + // return Err(LoreError::EmbeddingsNotBuilt) + // This is distinct from "Ollama unavailable" — tells user to run `lore embed` first todo!() } @@ -1188,7 +1433,7 @@ pub fn run_search( /// ```sql /// SELECT d.id, d.source_type, d.title, d.url, d.author_username, /// d.created_at, d.updated_at, d.content_text, -/// p.path AS project_path, +/// p.path_with_namespace AS project_path, /// (SELECT json_group_array(dl.label_name) /// FROM document_labels dl WHERE dl.document_id = d.id) AS labels, /// (SELECT json_group_array(dp.path) @@ -1370,16 +1615,53 @@ src/embedding/ //! Uses Ollama for embedding generation and sqlite-vec for storage. 
mod change_detector; +mod chunk_ids; mod ollama; mod pipeline; pub use change_detector::detect_embedding_changes; +pub use chunk_ids::{encode_rowid, decode_rowid, CHUNK_ROWID_MULTIPLIER}; pub use ollama::{OllamaClient, OllamaConfig, check_ollama_health}; pub use pipeline::{embed_documents, EmbedResult}; ``` --- +### 4.1b Chunk ID Encoding (Shared Module) + +**File:** `src/embedding/chunk_ids.rs` + +```rust +//! Shared chunk ID encoding for embedding rowids. +//! +//! Encoding: rowid = document_id * CHUNK_ROWID_MULTIPLIER + chunk_index +//! This encodes (document_id, chunk_index) into a single vec0 rowid. +//! Supports up to 1000 chunks per document. +//! +//! Used by: +//! - `embedding::pipeline` (when storing embeddings) +//! - `search::vector` (when decoding search results) +//! - `cli::commands::stats` (when detecting orphaned embeddings) +//! +//! Living in its own module prevents coupling between pipeline and search. + +/// Multiplier for encoding (document_id, chunk_index) into a single vec0 rowid. +/// Supports up to 1000 chunks per document. +pub const CHUNK_ROWID_MULTIPLIER: i64 = 1000; + +/// Encode (document_id, chunk_index) into a vec0 rowid. +pub fn encode_rowid(document_id: i64, chunk_index: usize) -> i64 { + document_id * CHUNK_ROWID_MULTIPLIER + chunk_index as i64 +} + +/// Decode a vec0 rowid into (document_id, chunk_index). +pub fn decode_rowid(rowid: i64) -> (i64, usize) { + (rowid / CHUNK_ROWID_MULTIPLIER, (rowid % CHUNK_ROWID_MULTIPLIER) as usize) +} +``` + +--- + ### 4.2 Ollama Client **File:** `src/embedding/ollama.rs` @@ -1388,7 +1670,7 @@ pub use pipeline::{embed_documents, EmbedResult}; use reqwest::Client; use serde::{Deserialize, Serialize}; -use crate::core::error::{GiError, Result}; +use crate::core::error::{LoreError, Result}; /// Ollama client configuration. 
#[derive(Debug, Clone)] @@ -1454,7 +1736,7 @@ impl OllamaClient { let url = format!("{}/api/tags", self.config.base_url); let response = self.client.get(&url).send().await.map_err(|e| { - GiError::OllamaUnavailable { + LoreError::OllamaUnavailable { base_url: self.config.base_url.clone(), source: Some(e), } @@ -1465,7 +1747,7 @@ impl OllamaClient { let model_available = tags.models.iter().any(|m| m.name.starts_with(&self.config.model)); if !model_available { - return Err(GiError::OllamaModelNotFound { + return Err(LoreError::OllamaModelNotFound { model: self.config.model.clone(), }); } @@ -1489,7 +1771,7 @@ impl OllamaClient { .json(&request) .send() .await - .map_err(|e| GiError::OllamaUnavailable { + .map_err(|e| LoreError::OllamaUnavailable { base_url: self.config.base_url.clone(), source: Some(e), })?; @@ -1497,7 +1779,7 @@ impl OllamaClient { if !response.status().is_success() { let status = response.status(); let body = response.text().await.unwrap_or_default(); - return Err(GiError::EmbeddingFailed { + return Err(LoreError::EmbeddingFailed { document_id: 0, // Batch failure reason: format!("HTTP {}: {}", status, body), }); @@ -1536,11 +1818,10 @@ pub async fn check_ollama_health(base_url: &str) -> bool { **File:** `src/core/error.rs` (extend existing) -Add to `ErrorCode`: +Add to `ErrorCode` (existing exit codes 1-13 are taken): ```rust pub enum ErrorCode { - // ... existing variants ... - InvalidEnumValue, + // ... existing variants (InternalError=1 through TransformError=13) ... OllamaUnavailable, OllamaModelNotFound, EmbeddingFailed, @@ -1550,7 +1831,6 @@ impl ErrorCode { pub fn exit_code(&self) -> i32 { match self { // ... existing mappings ... - Self::InvalidEnumValue => 13, Self::OllamaUnavailable => 14, Self::OllamaModelNotFound => 15, Self::EmbeddingFailed => 16, @@ -1562,7 +1842,6 @@ impl std::fmt::Display for ErrorCode { fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { let code = match self { // ... existing mappings ... 
- Self::InvalidEnumValue => "INVALID_ENUM_VALUE", Self::OllamaUnavailable => "OLLAMA_UNAVAILABLE", Self::OllamaModelNotFound => "OLLAMA_MODEL_NOT_FOUND", Self::EmbeddingFailed => "EMBEDDING_FAILED", @@ -1572,11 +1851,14 @@ impl std::fmt::Display for ErrorCode { } ``` -Add to `GiError`: +Add to `LoreError`: ```rust -pub enum GiError { - // ... existing variants ... +pub enum LoreError { + // ... existing variants (including Http(reqwest::Error) for generic HTTP errors) ... + /// Ollama-specific connection failure. Use this instead of Http for Ollama errors + /// because it includes the base_url for actionable error messages. + /// The embedding pipeline should map Ollama HTTP errors to this variant, not Http. #[error("Cannot connect to Ollama at {base_url}. Is it running?")] OllamaUnavailable { base_url: String, @@ -1589,15 +1871,19 @@ pub enum GiError { #[error("Embedding failed for document {document_id}: {reason}")] EmbeddingFailed { document_id: i64, reason: String }, + + #[error("No embeddings found. Run: lore embed")] + EmbeddingsNotBuilt, } -impl GiError { +impl LoreError { pub fn code(&self) -> ErrorCode { match self { // ... existing mappings ... Self::OllamaUnavailable { .. } => ErrorCode::OllamaUnavailable, Self::OllamaModelNotFound { .. } => ErrorCode::OllamaModelNotFound, Self::EmbeddingFailed { .. } => ErrorCode::EmbeddingFailed, + Self::EmbeddingsNotBuilt => ErrorCode::EmbeddingFailed, } } @@ -1607,6 +1893,7 @@ impl GiError { Self::OllamaUnavailable { .. } => Some("Start Ollama: ollama serve"), Self::OllamaModelNotFound { model } => Some("Pull the model: ollama pull nomic-embed-text"), Self::EmbeddingFailed { .. } => Some("Check Ollama logs or retry with 'lore embed --retry-failed'"), + Self::EmbeddingsNotBuilt => Some("Generate embeddings first: lore embed"), } } } @@ -1625,7 +1912,7 @@ use rusqlite::Connection; use crate::core::error::Result; use crate::embedding::OllamaClient; -/// Batch size for embedding requests. 
+/// Batch size for embedding requests (texts per Ollama API call). const BATCH_SIZE: usize = 32; /// SQLite page size for paging through pending documents. @@ -1636,6 +1923,38 @@ const DB_PAGE_SIZE: usize = 500; /// IMPORTANT: Validates against this to prevent silent corruption. const EXPECTED_DIMS: usize = 768; +/// Maximum characters per chunk for embedding. +/// nomic-embed-text context window is ~8192 tokens. +/// At ~4 chars/token, 32k chars is a safe upper bound. +const CHUNK_MAX_CHARS: usize = 32_000; + +/// Overlap between chunks to preserve context at boundaries. +/// 500 chars (~125 tokens) provides enough context for sentence boundaries. +const CHUNK_OVERLAP_CHARS: usize = 500; + +// NOTE: encode_rowid, decode_rowid, and CHUNK_ROWID_MULTIPLIER are imported +// from `crate::embedding::chunk_ids` (shared module, see 4.1b). +// This prevents coupling between pipeline.rs and search/vector.rs. +use crate::embedding::chunk_ids::{encode_rowid, CHUNK_ROWID_MULTIPLIER}; + +/// Split document text into overlapping chunks at paragraph boundaries. +/// +/// Rules: +/// - Each chunk <= CHUNK_MAX_CHARS +/// - Split at paragraph boundaries (\n\n) when possible +/// - Fall back to sentence boundaries (. ! ?) then word boundaries +/// - Overlap of CHUNK_OVERLAP_CHARS between consecutive chunks +/// - Single documents under CHUNK_MAX_CHARS produce exactly 1 chunk (index 0) +/// +/// Returns: Vec<(chunk_index, chunk_text)> +fn split_into_chunks(content: &str) -> Vec<(usize, String)> { + if content.len() <= CHUNK_MAX_CHARS { + return vec![(0, content.to_string())]; + } + // Split at paragraph boundaries with overlap + todo!() +} + /// Which documents to embed. #[derive(Debug, Clone, Copy)] pub enum EmbedSelection { @@ -1661,11 +1980,17 @@ pub struct EmbedResult { /// - RetryFailed: embedding_metadata.last_error IS NOT NULL /// 2. Page through candidates using keyset pagination (id > last_id) /// to avoid rescanning already-processed rows -/// 3. 
Batch texts -> Ollama `/api/embed` with concurrent HTTP requests
-/// 4. Write embeddings + embedding_metadata in per-batch transactions
-/// 5. Failed batches record `last_error` in embedding_metadata
+/// 3. For each document: split content into chunks via `split_into_chunks()`
+///    - Documents <= 32k chars produce 1 chunk (common case, no overhead)
+///    - Longer documents split at paragraph boundaries with 500-char overlap
+/// 4. Clear existing chunks for changed documents (`clear_document_embeddings()`)
+/// 5. Batch chunk texts -> Ollama `/api/embed` with concurrent HTTP requests
+///    (concurrency = max in-flight HTTP requests to Ollama)
+/// 6. Write chunk embeddings + embedding_metadata in per-batch transactions
+///    - rowid = document_id * 1000 + chunk_index
+/// 7. Failed batches record `last_error` in embedding_metadata
 ///    (excluded from Pending selection; retried via RetryFailed)
-/// 6. Progress reported as (embedded + failed) vs total_pending
+/// 8. Progress reported as (embedded + failed) vs total_pending
 pub async fn embed_documents(
     conn: &Connection,
     client: &OllamaClient,
@@ -1694,12 +2019,29 @@
 
     // Launch concurrent HTTP requests, collect results
     let mut futures = FuturesUnordered::new();
-    for batch in pending.chunks(BATCH_SIZE) {
-        let texts: Vec<String> = batch.iter().map(|d| d.content.clone()).collect();
-        let batch_meta: Vec<(i64, String)> = batch
-            .iter()
-            .map(|d| (d.id, d.content_hash.clone()))
-            .collect();
+    // Build ChunkWork items: split each document into chunks, clear old embeddings
+    let mut work: Vec<ChunkWork> = Vec::new();
+    {
+        let tx = conn.transaction()?;
+        for doc in &pending {
+            clear_document_embeddings(&tx, doc.id)?;
+            for (chunk_index, chunk_text) in split_into_chunks(&doc.content) {
+                work.push(ChunkWork {
+                    doc_id: doc.id,
+                    chunk_index,
+                    doc_hash: doc.content_hash.clone(),
+                    chunk_hash: crate::documents::compute_content_hash(&chunk_text),
+                    text: chunk_text,
+                });
+            }
+        }
+        tx.commit()?;
+    }
+
+    // 
Batch ChunkWork texts into Ollama API calls
+    for batch in work.chunks(BATCH_SIZE) {
+        let texts: Vec<String> = batch.iter().map(|w| w.text.clone()).collect();
+        let batch_meta: Vec<ChunkWork> = batch.to_vec();
 
         futures.push(async move {
             let embed_result = client.embed_batch(texts).await;
@@ -1732,27 +2074,42 @@
     Ok(result)
 }
 
+/// Chunk work item prepared for embedding.
+/// NOTE: Clone is required by `batch.to_vec()` in `embed_documents`.
+#[derive(Clone)]
+struct ChunkWork {
+    doc_id: i64,
+    chunk_index: usize,
+    doc_hash: String,   // SHA-256 of full document content (staleness detection)
+    chunk_hash: String, // SHA-256 of this chunk's text (provenance)
+    text: String,
+}
+
 /// Collect embedding results and write to DB (sequential, on main thread).
 ///
 /// IMPORTANT: Validates embedding dimensions to prevent silent corruption.
 /// If model returns wrong dimensions (e.g., different model configured),
 /// the document is marked as failed rather than storing corrupt data.
+///
+/// NOTE: batch_meta contains ChunkWork items (not raw documents) because the
+/// caller must split documents into chunks before batching for Ollama /api/embed.
+/// Each ChunkWork carries both doc_hash (for staleness) and chunk_hash (for provenance).
 fn collect_writes(
     conn: &Connection,
-    batch_meta: &[(i64, String)],
+    batch_meta: &[ChunkWork],
     embed_result: Result<Vec<Vec<f32>>>,
     result: &mut EmbedResult,
 ) -> Result<()> {
     let tx = conn.transaction()?;

     match embed_result {
         Ok(embeddings) => {
-            for ((doc_id, hash), embedding) in batch_meta.iter().zip(embeddings.iter()) {
+            for (chunk, embedding) in batch_meta.iter().zip(embeddings.iter()) {
                 // Validate dimensions to prevent silent corruption
                 if embedding.len() != EXPECTED_DIMS {
                     record_embedding_error(
                         &tx,
-                        *doc_id,
-                        hash,
+                        chunk.doc_id,
+                        chunk.chunk_index,
+                        &chunk.doc_hash,
+                        &chunk.chunk_hash,
                         &format!(
                             "embedding dimension mismatch: got {}, expected {}",
                             embedding.len(),
@@ -1762,13 +2119,13 @@ fn collect_writes(
                     result.failed += 1;
                     continue;
                 }
-                store_embedding(&tx, *doc_id, embedding, hash)?;
+                store_embedding(&tx, chunk.doc_id, chunk.chunk_index, embedding, &chunk.doc_hash, &chunk.chunk_hash)?;
                 result.embedded += 1;
             }
         }
         Err(e) => {
-            for (doc_id, hash) in batch_meta {
-                record_embedding_error(&tx, *doc_id, hash, &e.to_string())?;
+            for chunk in batch_meta {
+                record_embedding_error(&tx, chunk.doc_id, chunk.chunk_index, &chunk.doc_hash, &chunk.chunk_hash, &e.to_string())?;
                 result.failed += 1;
             }
         }
     }
@@ -1784,16 +2141,25 @@
 struct PendingDocument {
 }

 /// Count total pending documents (for progress reporting).
+///
+/// NOTE: With chunked embeddings, we detect pending at the document level.
+/// A document needs embedding if:
+/// - No embedding_metadata rows exist for it (new document)
+/// - Its document_hash changed (chunk_index=0 row has mismatched hash)
+/// This avoids counting individual chunks as separate pending items.
+///
+/// IMPORTANT: Uses `document_hash` (not `chunk_hash`) because staleness is
+/// a document-level condition. See migration 009 comment for details.
 fn count_pending_documents(conn: &Connection, selection: EmbedSelection) -> Result<i64> {
     let sql = match selection {
         EmbedSelection::Pending =>
             "SELECT COUNT(*)
              FROM documents d
-             LEFT JOIN embedding_metadata em ON d.id = em.document_id
+             LEFT JOIN embedding_metadata em ON d.id = em.document_id AND em.chunk_index = 0
              WHERE em.document_id IS NULL
-                OR em.content_hash != d.content_hash",
+                OR em.document_hash != d.content_hash",
         EmbedSelection::RetryFailed =>
-            "SELECT COUNT(*)
+            "SELECT COUNT(DISTINCT d.id)
              FROM documents d
              JOIN embedding_metadata em ON d.id = em.document_id
              WHERE em.last_error IS NOT NULL",
@@ -1814,18 +2180,22 @@ fn find_pending_documents(
     last_id: i64,
     selection: EmbedSelection,
 ) -> Result<Vec<PendingDocument>> {
+    // NOTE: Finds documents needing embedding at the document level.
+    // The caller is responsible for chunking content_text via split_into_chunks()
+    // and calling clear_document_embeddings() before storing new chunks.
+    // Uses document_hash (not chunk_hash) for staleness detection.
     let sql = match selection {
         EmbedSelection::Pending =>
             "SELECT d.id, d.content_text, d.content_hash
              FROM documents d
-             LEFT JOIN embedding_metadata em ON d.id = em.document_id
+             LEFT JOIN embedding_metadata em ON d.id = em.document_id AND em.chunk_index = 0
              WHERE (em.document_id IS NULL
-                OR em.content_hash != d.content_hash)
+                OR em.document_hash != d.content_hash)
                AND d.id > ?
              ORDER BY d.id
              LIMIT ?",
         EmbedSelection::RetryFailed =>
-            "SELECT d.id, d.content_text, d.content_hash
+            "SELECT DISTINCT d.id, d.content_text, d.content_hash
              FROM documents d
              JOIN embedding_metadata em ON d.id = em.document_id
              WHERE em.last_error IS NOT NULL
@@ -1848,12 +2218,35 @@
     Ok(docs)
 }

+/// Clear all existing chunk embeddings for a document before re-embedding.
+/// Called when content_hash changes (chunk count may differ).
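// --- Example (sketch, assumption): the rowid encoding used below ---
// `encode_rowid`, `decode_rowid`, and CHUNK_ROWID_MULTIPLIER are assumed to
// live in `src/embedding` (search_vector imports them from there); this
// sketch only shows the arithmetic the surrounding code relies on:
// rowid = document_id * 1000 + chunk_index, which round-trips exactly as
// long as chunk_index stays below 1000.
const CHUNK_ROWID_MULTIPLIER: i64 = 1000;

fn encode_rowid(document_id: i64, chunk_index: usize) -> i64 {
    debug_assert!((chunk_index as i64) < CHUNK_ROWID_MULTIPLIER);
    document_id * CHUNK_ROWID_MULTIPLIER + chunk_index as i64
}

fn decode_rowid(rowid: i64) -> (i64, usize) {
    (
        rowid / CHUNK_ROWID_MULTIPLIER,
        (rowid % CHUNK_ROWID_MULTIPLIER) as usize,
    )
}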
+fn clear_document_embeddings(tx: &rusqlite::Transaction, document_id: i64) -> Result<()> {
+    // Delete vec0 rows (range covers all possible chunk indices)
+    tx.execute(
+        "DELETE FROM embeddings WHERE rowid >= ? AND rowid < ?",
+        rusqlite::params![
+            document_id * CHUNK_ROWID_MULTIPLIER,
+            (document_id + 1) * CHUNK_ROWID_MULTIPLIER,
+        ],
+    )?;
+    // Delete metadata rows
+    tx.execute(
+        "DELETE FROM embedding_metadata WHERE document_id = ?",
+        rusqlite::params![document_id],
+    )?;
+    Ok(())
+}
+
 fn store_embedding(
     tx: &rusqlite::Transaction,
     document_id: i64,
+    chunk_index: usize,
     embedding: &[f32],
-    content_hash: &str,
+    doc_hash: &str,
+    chunk_hash: &str,
 ) -> Result<()> {
+    let rowid = encode_rowid(document_id, chunk_index);
+
     // Convert embedding to bytes for sqlite-vec
     // sqlite-vec expects raw little-endian bytes, not the array directly
     let embedding_bytes: Vec<u8> = embedding
@@ -1861,19 +2254,21 @@ fn store_embedding(
         .flat_map(|f| f.to_le_bytes())
         .collect();

-    // Store in sqlite-vec (rowid = document_id)
+    // Store in sqlite-vec (rowid encodes doc_id + chunk_index)
     tx.execute(
         "INSERT OR REPLACE INTO embeddings(rowid, embedding) VALUES (?, ?)",
-        rusqlite::params![document_id, embedding_bytes],
+        rusqlite::params![rowid, embedding_bytes],
     )?;

-    // Update metadata
+    // Update metadata (per-chunk) with both hashes:
+    // - document_hash for staleness detection (compared to documents.content_hash)
+    // - chunk_hash for provenance/debug
     let now = crate::core::time::now_ms();
     tx.execute(
         "INSERT OR REPLACE INTO embedding_metadata
-         (document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
-         VALUES (?, 'nomic-embed-text', 768, ?, ?, NULL, 0, ?)",
-        rusqlite::params![document_id, content_hash, now, now],
+         (document_id, chunk_index, model, dims, document_hash, chunk_hash, created_at, last_error, attempt_count, last_attempt_at)
+         VALUES (?, ?, 'nomic-embed-text', 768, ?, ?, ?, NULL, 0, ?)",
+        rusqlite::params![document_id,
chunk_index, doc_hash, chunk_hash, now, now],
     )?;

     Ok(())
@@ -1882,19 +2277,21 @@
 fn record_embedding_error(
     tx: &rusqlite::Transaction,
     document_id: i64,
-    content_hash: &str,
+    chunk_index: usize,
+    doc_hash: &str,
+    chunk_hash: &str,
     error: &str,
 ) -> Result<()> {
     let now = crate::core::time::now_ms();
     tx.execute(
         "INSERT INTO embedding_metadata
-         (document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
-         VALUES (?, 'nomic-embed-text', 768, ?, ?, ?, 1, ?)
-         ON CONFLICT(document_id) DO UPDATE SET
+         (document_id, chunk_index, model, dims, document_hash, chunk_hash, created_at, last_error, attempt_count, last_attempt_at)
+         VALUES (?, ?, 'nomic-embed-text', 768, ?, ?, ?, ?, 1, ?)
+         ON CONFLICT(document_id, chunk_index) DO UPDATE SET
             last_error = excluded.last_error,
             attempt_count = attempt_count + 1,
             last_attempt_at = excluded.last_attempt_at",
-        rusqlite::params![document_id, content_hash, now, error, now],
+        rusqlite::params![document_id, chunk_index, doc_hash, chunk_hash, now, error, now],
     )?;

     Ok(())
@@ -1902,13 +2299,15 @@
```

**Acceptance Criteria:**
-- [ ] New documents get embedded
-- [ ] Changed documents (hash mismatch) get re-embedded
+- [ ] New documents get embedded (all chunks)
+- [ ] Changed documents (hash mismatch) get re-embedded: existing chunks cleared first via `clear_document_embeddings()`
+- [ ] Documents under 32k chars produce exactly 1 chunk (chunk_index=0)
+- [ ] Long documents split at paragraph boundaries with 500-char overlap
- [ ] Unchanged documents skipped
-- [ ] Failures recorded in `embedding_metadata.last_error`
+- [ ] Failures recorded in `embedding_metadata.last_error` per chunk
- [ ] Failures record the actual `document_hash` and `chunk_hash` (not empty strings)
- [ ] Writes batched in transactions for performance
-- [ ] Concurrency parameter respected
+- [ ] `concurrency` parameter controls max concurrent HTTP requests to Ollama API
- [ ] Progress reported during
embedding
- [ ] Deterministic `ORDER BY d.id` ensures consistent paging
- [ ] `EmbedSelection` parameter controls pending vs retry-failed mode
@@ -1934,7 +2333,7 @@ pub async fn run_embed(
     config: &Config,
     retry_failed: bool,
 ) -> Result<EmbedResult> {
-    use crate::core::db::open_database;
+    use crate::core::db::create_connection;
     use crate::embedding::pipeline::EmbedSelection;

     let ollama_config = OllamaConfig {
@@ -1949,7 +2348,7 @@
     client.health_check().await?;

     // Open database connection
-    let conn = open_database(config)?;
+    let conn = create_connection(&config.storage.db_path)?;

     // Determine selection mode
     let selection = if retry_failed {
@@ -2049,10 +2448,14 @@
 pub struct DocumentStats {
 }

 #[derive(Debug, Serialize)]
 pub struct EmbeddingStats {
+    /// Documents with at least one embedding (chunk_index=0 exists in embedding_metadata)
     pub embedded: usize,
     pub pending: usize,
     pub failed: usize,
+    /// embedded / total_documents * 100 (document-level, not chunk-level)
     pub coverage_pct: f64,
+    /// Total chunks across all embedded documents
+    pub total_chunks: usize,
 }

 #[derive(Debug, Serialize)]
@@ -2098,13 +2501,13 @@ pub fn run_stats(config: &Config) -> Result {
 ///
 /// Verifies:
 /// - documents count == documents_fts count
-/// - embeddings.rowid all exist in documents.id
-/// - embedding_metadata.content_hash == documents.content_hash
+/// - All embeddings.rowid decode to valid document IDs (rowid / 1000 exists in documents.id)
+/// - embedding_metadata.document_hash == documents.content_hash for chunk_index=0 rows
 pub fn run_integrity_check(config: &Config) -> Result {
     // 1. Count documents
     // 2. Count FTS entries
-    // 3. Find orphaned embeddings (no matching document)
-    // 4. Find hash mismatches between embedding_metadata and documents
+    // 3. Find orphaned embeddings (rowid / 1000 not in documents.id)
+    // 4. Find hash mismatches between embedding_metadata (chunk_index=0) and documents
    // 5. Return check results
    todo!()
}
@@ -2130,30 +2533,50 @@
 pub struct RepairResult {
 }

 /// and FTS content diverge in any way, partial fixes can leave the index
 /// in an inconsistent state. A full rebuild is slower but guaranteed correct.
 pub fn run_repair(config: &Config) -> Result<RepairResult> {
-    let conn = open_db(config)?;
+    let conn = create_connection(&config.storage.db_path)?;

-    // Delete orphaned embeddings (no matching document)
+    // Delete orphaned embedding_metadata (no matching document)
     let orphaned_deleted = conn.execute(
         "DELETE FROM embedding_metadata
          WHERE document_id NOT IN (SELECT id FROM documents)",
         [],
     )?;

-    // Also delete from embeddings virtual table (sqlite-vec)
+    // Also delete orphaned chunks from embeddings virtual table (sqlite-vec).
+    // Chunked rowids: doc_id * 1000 + chunk_index. Orphan = rowid / 1000 not in documents.
     conn.execute(
         "DELETE FROM embeddings
-         WHERE rowid NOT IN (SELECT id FROM documents)",
+         WHERE rowid / 1000 NOT IN (SELECT id FROM documents)",
         [],
     )?;

-    // Clear stale embedding_metadata (hash mismatch) - will be re-embedded
-    let stale_cleared = conn.execute(
-        "DELETE FROM embedding_metadata
-         WHERE (document_id, content_hash) NOT IN (
-             SELECT id, content_hash FROM documents
-         )",
-        [],
-    )?;
+    // Clear stale embeddings (document-level hash mismatch) - will be re-embedded.
+    // Uses document_hash (not chunk_hash) because staleness is document-level:
+    // if the document changed, ALL chunks for that document are stale.
+    // Must also delete corresponding vec0 rows (range deletion for all chunks).
+    let stale_doc_ids: Vec<i64> = {
+        let mut stmt = conn.prepare(
+            "SELECT DISTINCT d.id
+             FROM documents d
+             JOIN embedding_metadata em
+               ON em.document_id = d.id AND em.chunk_index = 0
+             WHERE em.document_hash != d.content_hash"
+        )?;
+        stmt.query_map([], |r| r.get(0))?
+            .collect::<Result<Vec<i64>, _>>()?
+    };
+    let mut stale_cleared: usize = 0;
+    for doc_id in &stale_doc_ids {
+        stale_cleared += conn.execute(
+            "DELETE FROM embedding_metadata WHERE document_id = ?",
+            [doc_id],
+        )? as usize;
+        // Also clean up vec0 rows (range covers all chunk indices for this doc)
+        conn.execute(
+            "DELETE FROM embeddings WHERE rowid >= ? AND rowid < ?",
+            rusqlite::params![doc_id * 1000, (doc_id + 1) * 1000],
+        )?;
+    }

     // Rebuild FTS index from scratch — correct-by-construction.
     // This re-reads all rows from the external content table (documents)
@@ -2293,49 +2716,80 @@
 pub struct VectorResult {
 }

 /// Search documents using vector similarity.
 ///
-/// Uses sqlite-vec for efficient vector search.
-/// Returns document IDs sorted by distance (lower = better match).
+/// Returns document-level results by taking the best (lowest distance)
+/// chunk match per document. This deduplicates across chunks so the
+/// caller sees one result per document.
 ///
 /// IMPORTANT: sqlite-vec KNN queries require:
 /// - k parameter for number of results
 /// - embedding passed as raw little-endian bytes
+///
+/// With chunked embeddings, rowid = document_id * 1000 + chunk_index.
+/// We over-fetch chunks (3x limit) to ensure enough unique documents after dedup.
 pub fn search_vector(
     conn: &Connection,
     query_embedding: &[f32],
     limit: usize,
 ) -> Result<Vec<VectorResult>> {
+    use crate::embedding::{decode_rowid, CHUNK_ROWID_MULTIPLIER};
+
     // Convert embedding to bytes for sqlite-vec
     let embedding_bytes: Vec<u8> = query_embedding
         .iter()
         .flat_map(|f| f.to_le_bytes())
         .collect();

+    // Over-fetch to ensure enough unique documents after dedup.
+    // With avg ~1-2 chunks per doc, 3x is conservative.
+    // Explicit i64 type for sqlite-vec k parameter binding.
+    let chunk_limit: i64 = (limit * 3) as i64;
+
     let mut stmt = conn.prepare(
         "SELECT rowid, distance
          FROM embeddings
          WHERE embedding MATCH ?
            AND k = ?
-         ORDER BY distance
-         LIMIT ?"
+         ORDER BY distance"
    )?;

-    let results = stmt
-        .query_map(rusqlite::params![embedding_bytes, limit, limit], |row| {
-            Ok(VectorResult {
-                document_id: row.get(0)?,
-                distance: row.get(1)?,
-            })
+    let chunk_results: Vec<(i64, f64)> = stmt
+        .query_map(rusqlite::params![embedding_bytes, chunk_limit], |row| {
+            Ok((row.get::<_, i64>(0)?, row.get::<_, f64>(1)?))
         })?
         .collect::<Result<Vec<_>, _>>()?;

+    // Deduplicate: keep best (lowest distance) chunk per document
+    let mut best_by_doc: std::collections::HashMap<i64, f64> =
+        std::collections::HashMap::new();
+    for (rowid, distance) in &chunk_results {
+        let (doc_id, _chunk_idx) = decode_rowid(*rowid);
+        best_by_doc
+            .entry(doc_id)
+            .and_modify(|d| { if *distance < *d { *d = *distance; } })
+            .or_insert(*distance);
+    }
+
+    // Sort by distance and take top `limit`
+    let mut results: Vec<VectorResult> = best_by_doc
+        .into_iter()
+        .map(|(doc_id, dist)| VectorResult {
+            document_id: doc_id,
+            distance: dist,
+        })
+        .collect();
+    results.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
+    results.truncate(limit);
+
     Ok(results)
 }
 ```

**Acceptance Criteria:**
-- [ ] Returns document IDs with distances
+- [ ] Returns document IDs with distances (deduplicated across chunks)
+- [ ] Best chunk distance used per document
- [ ] Lower distance = better match
- [ ] Works with 768-dim vectors
- [ ] Uses k parameter for KNN query
+- [ ] Over-fetches chunks (3x limit) to handle dedup without missing documents
- [ ] Embedding passed as bytes

---
@@ -2535,7 +2989,7 @@ pub async fn search_hybrid(
         }
         SearchMode::Semantic => {
             // Vector only - requires client
-            let client = client.ok_or_else(|| crate::core::error::GiError::OllamaUnavailable {
+            let client = client.ok_or_else(|| crate::core::error::LoreError::OllamaUnavailable {
                 base_url: ollama_base_url.unwrap_or("http://localhost:11434").into(),
                 source: None,
             })?;
@@ -2652,26 +3106,57 @@
use crate::core::error::Result;
use crate::core::time::now_ms;
use crate::documents::SourceType;

-/// Maximum dirty sources to
process per sync run. -const MAX_DIRTY_SOURCES_PER_RUN: usize = 500; +/// Batch size when draining dirty_sources queue. +/// We drain repeatedly in a loop until no ready items remain, +/// so this is a batch size (not a hard ceiling per invocation). +/// This bounds WAL growth and memory per iteration while ensuring +/// `generate-docs` and `sync` converge in a single invocation. +const DIRTY_SOURCES_BATCH_SIZE: usize = 500; -/// Mark a source as dirty (needs document regeneration). +/// Mark a source as dirty (needs document regeneration) — transactional variant. /// -/// Called during entity upsert operations. -/// Uses INSERT OR IGNORE to avoid duplicates. -pub fn mark_dirty( - conn: &Connection, +/// Called during entity upsert operations INSIDE ingestion transactions. +/// The `_tx` suffix enforces the invariant that dirty marking happens +/// atomically with the entity upsert — callers cannot accidentally use +/// a bare Connection and break atomicity. +/// +/// IMPORTANT: Uses upsert with backoff reset, NOT `INSERT OR IGNORE`. +/// `INSERT OR IGNORE` would silently drop re-queues for entities that are +/// already in the queue with a long `next_attempt_at` due to previous failures. +/// This means fresh updates would wait behind stale backoff windows — exactly +/// the opposite of what we want for incremental sync. +/// +/// The ON CONFLICT clause resets all backoff/error state so the entity is +/// immediately eligible for processing on the next regeneration pass. 
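// --- Example (sketch, assumption): shared backoff computation ---
// The acceptance criteria call for exponential backoff via `next_attempt_at`
// with jitter; the shared helper's real name and constants are not shown in
// this excerpt, so everything here is illustrative. `jitter_frac` in [0, 1)
// would come from an RNG in production; it is a parameter so the math is
// testable.
fn next_attempt_at_ms(now_ms: i64, attempt_count: u32, jitter_frac: f64) -> i64 {
    const BASE_MS: i64 = 60_000;             // first retry after 1 minute
    const MAX_BACKOFF_MS: i64 = 86_400_000;  // cap at 24 hours
    let exp = attempt_count.min(20);         // clamp shift to avoid overflow
    let backoff = BASE_MS.saturating_mul(1_i64 << exp).min(MAX_BACKOFF_MS);
    // Up to +25% jitter so a burst of failures does not retry in lockstep.
    let jitter = (backoff as f64 * 0.25 * jitter_frac) as i64;
    now_ms + backoff + jitter
}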
+pub fn mark_dirty_tx( + tx: &rusqlite::Transaction<'_>, source_type: SourceType, source_id: i64, ) -> Result<()> { - conn.execute( - "INSERT OR IGNORE INTO dirty_sources (source_type, source_id, queued_at) - VALUES (?, ?, ?)", + tx.execute( + "INSERT INTO dirty_sources + (source_type, source_id, queued_at, attempt_count, last_attempt_at, last_error, next_attempt_at) + VALUES (?, ?, ?, 0, NULL, NULL, NULL) + ON CONFLICT(source_type, source_id) DO UPDATE SET + queued_at = excluded.queued_at, + attempt_count = 0, + last_attempt_at = NULL, + last_error = NULL, + next_attempt_at = NULL", rusqlite::params![source_type.as_str(), source_id, now_ms()], )?; Ok(()) } +/// Convenience wrapper for non-transactional contexts (e.g., CLI `--full` seeding). +/// Wraps a single mark_dirty call in its own transaction. +pub fn mark_dirty(conn: &Connection, source_type: SourceType, source_id: i64) -> Result<()> { + let tx = conn.transaction()?; + mark_dirty_tx(&tx, source_type, source_id)?; + tx.commit()?; + Ok(()) +} + /// Get dirty sources ready for processing. /// /// Uses `next_attempt_at` for efficient, index-friendly backoff queries. @@ -2697,7 +3182,7 @@ pub fn get_dirty_sources(conn: &Connection) -> Result> { )?; let results = stmt - .query_map(rusqlite::params![now, MAX_DIRTY_SOURCES_PER_RUN], |row| { + .query_map(rusqlite::params![now, DIRTY_SOURCES_BATCH_SIZE], |row| { let type_str: String = row.get(0)?; let source_type = match type_str.as_str() { "issue" => SourceType::Issue, @@ -2733,14 +3218,62 @@ pub fn clear_dirty( } ``` +**Integration Points — Where `mark_dirty_tx()` is Called:** + +All calls use `mark_dirty_tx(&tx, ...)` **inside the existing ingestion transaction** for +atomic consistency. The `_tx` suffix on the function signature enforces this at the API level — +callers cannot accidentally use a bare `Connection` and break atomicity. + +Mark ALL upserted entities (not just changed ones). 
The regenerator's hash comparison
+handles "unchanged" detection cheaply — this avoids needing change detection in ingestion.
+
+| Location | Source Type | Call | When |
+|----------|------------|------|------|
+| `src/ingestion/issues.rs` — issue upsert loop | `SourceType::Issue` | `mark_dirty_tx(&tx, ...)` | After each issue INSERT/UPDATE within the batch transaction |
+| `src/ingestion/merge_requests.rs` — MR upsert loop | `SourceType::MergeRequest` | `mark_dirty_tx(&tx, ...)` | After each MR INSERT/UPDATE within the batch transaction |
+| `src/ingestion/discussions.rs` — issue discussion insert | `SourceType::Discussion` | `mark_dirty_tx(&tx, ...)` | After each discussion INSERT within the full-refresh transaction |
+| `src/ingestion/mr_discussions.rs` — MR discussion upsert | `SourceType::Discussion` | `mark_dirty_tx(&tx, ...)` | After each discussion upsert in the write phase |
+
+**Discussion Sweep Cleanup:**
+
+When the MR discussion sweep deletes stale discussions (`last_seen_at < run_start_time`),
+also delete the corresponding document rows directly. The `ON DELETE CASCADE` on
+`document_labels`/`document_paths` and the `documents_embeddings_ad` trigger handle
+all downstream cleanup.
+
+```sql
+-- In src/ingestion/mr_discussions.rs, during sweep phase.
+-- NOTE: a WITH clause scopes to a SINGLE statement in SQLite, so one CTE
+-- cannot feed two DELETEs. Run both statements inside one transaction so
+-- the stale-ID set cannot change between them.
+-- Step 1: delete orphaned documents (must happen while source_id still resolves)
+WITH stale AS (
+    SELECT id FROM discussions
+    WHERE merge_request_id = ? AND last_seen_at < ?
+)
+DELETE FROM documents
+  WHERE source_type = 'discussion' AND source_id IN (SELECT id FROM stale);
+-- Step 2: delete the stale discussions themselves (same predicate as the CTE)
+DELETE FROM discussions
+  WHERE merge_request_id = ? AND last_seen_at < ?;
+-- Alternatively, capture the IDs in Rust first and reuse them in both DELETEs:
+-- let stale_ids: Vec<i64> = query("SELECT id FROM discussions WHERE ...").collect();
+-- DELETE FROM documents WHERE source_type='discussion' AND source_id IN (...)
+-- DELETE FROM discussions WHERE id IN (...)
+```

**Acceptance Criteria:**
-- [ ] Upserted entities added to dirty_sources
-- [ ] Duplicates ignored
+- [ ] Upserted entities added to dirty_sources inside ingestion transactions via `mark_dirty_tx(&tx, ...)`
+- [ ] `mark_dirty_tx` enforces transactional usage at API level (takes `&Transaction`, not `&Connection`)
+- [ ] ALL upserted entities marked dirty (not just changed ones)
+- [ ] Re-queue resets backoff state (upsert with ON CONFLICT, not INSERT OR IGNORE)
- [ ] Queue cleared after document regeneration
-- [ ] Processing bounded per run (max 500)
+- [ ] Queue fully drained via loop (bounded batch size, not hard ceiling per invocation)
- [ ] Exponential backoff uses `next_attempt_at` (index-friendly, no overflow)
- [ ] Backoff computed with jitter to prevent thundering herd
- [ ] Failed items prioritized lower than fresh items (ORDER BY attempt_count ASC)
+- [ ] Discussion sweep captures a consistent stale-ID set (per-statement CTE inside one transaction, or Rust-side ID capture) before cascading deletes

---

@@ -2778,6 +3311,10 @@
 pub struct PendingFetch {
 }

 /// Queue a discussion fetch for an entity.
+///
+/// Uses ON CONFLICT DO UPDATE (not INSERT OR REPLACE) for consistency with
+/// dirty_sources and to avoid DELETE+INSERT semantics that can trigger
+/// unexpected FK behavior and reset fields unintentionally.
pub fn queue_discussion_fetch( conn: &Connection, project_id: i64, @@ -2785,9 +3322,15 @@ pub fn queue_discussion_fetch( noteable_iid: i64, ) -> Result<()> { conn.execute( - "INSERT OR REPLACE INTO pending_discussion_fetches - (project_id, noteable_type, noteable_iid, queued_at, attempt_count, last_attempt_at, last_error) - VALUES (?, ?, ?, ?, 0, NULL, NULL)", + "INSERT INTO pending_discussion_fetches + (project_id, noteable_type, noteable_iid, queued_at, attempt_count, last_attempt_at, last_error, next_attempt_at) + VALUES (?, ?, ?, ?, 0, NULL, NULL, NULL) + ON CONFLICT(project_id, noteable_type, noteable_iid) DO UPDATE SET + queued_at = excluded.queued_at, + attempt_count = 0, + last_attempt_at = NULL, + last_error = NULL, + next_attempt_at = NULL", rusqlite::params![project_id, noteable_type.as_str(), noteable_iid, now_ms()], )?; Ok(()) @@ -2886,6 +3429,8 @@ pub fn record_fetch_error( **Acceptance Criteria:** - [ ] Updated entities queued for discussion fetch +- [ ] Queue uses ON CONFLICT DO UPDATE (not INSERT OR REPLACE) consistent with dirty_sources +- [ ] Re-queue resets backoff state immediately (same semantics as dirty_sources) - [ ] Success removes from queue - [ ] Failure increments attempt_count and sets next_attempt_at - [ ] Processing bounded per run (max 100) @@ -2969,36 +3514,57 @@ pub struct RegenerateResult { /// Regenerate documents from dirty queue. /// /// Process: -/// 1. Query dirty_sources ordered by queued_at +/// 1. Query dirty_sources in bounded batches until queue is drained /// 2. For each: regenerate document, compute new hash /// 3. ALWAYS upsert document (labels/paths may change even if content_hash unchanged) /// 4. Track whether content_hash changed (for stats) /// 5. Delete from dirty_sources (or record error on failure) +/// +/// IMPORTANT: Each dirty item is wrapped in its own transaction. 
This ensures
+/// that a crash or interrupt cannot leave partial writes (e.g., document row
+/// updated but document_labels only partially repopulated). The queue clear
+/// or error record happens inside the same transaction, guaranteeing atomicity.
+///
+/// NOTE: Drains the queue completely in a loop (not single-pass). This ensures
+/// `lore generate-docs` and `lore sync` converge in one invocation even when
+/// the queue has more items than the batch size.
 pub fn regenerate_dirty_documents(conn: &Connection) -> Result<RegenerateResult> {
-    let dirty = get_dirty_sources(conn)?;
     let mut result = RegenerateResult::default();

+    loop {
+        let dirty = get_dirty_sources(conn)?;
+        if dirty.is_empty() { break; }
+
     for (source_type, source_id) in &dirty {
-        match regenerate_one(conn, *source_type, *source_id) {
+        let tx = conn.transaction()?;
+        match regenerate_one_tx(&tx, *source_type, *source_id) {
             Ok(changed) => {
                 if changed { result.regenerated += 1; } else { result.unchanged += 1; }
-                clear_dirty(conn, *source_type, *source_id)?;
+                clear_dirty_tx(&tx, *source_type, *source_id)?;
+                tx.commit()?;
             }
             Err(e) => {
-                // Fail-soft: record error but continue processing remaining items
-                record_dirty_error(conn, *source_type, *source_id, &e.to_string())?;
+                // Fail-soft: record error inside same transaction, then commit
+                record_dirty_error_tx(&tx, *source_type, *source_id, &e.to_string())?;
+                tx.commit()?;
                 result.errored += 1;
             }
         }
     }
+    } // end drain loop

     Ok(result)
 }

+// NOTE: regenerate_one_tx, clear_dirty_tx, record_dirty_error_tx,
+// upsert_document_tx, delete_document_tx all take &Transaction instead
+// of &Connection. This ensures all writes for a single dirty item are
+// atomic. The _tx suffix distinguishes from non-transactional helpers.
+
 /// Regenerate a single document. Returns true if content_hash changed.
/// /// If the source entity has been deleted, the corresponding document @@ -3109,15 +3675,27 @@ fn upsert_document(conn: &Connection, doc: &DocumentData) -> Result<()> { use rusqlite::OptionalExtension; // Check existing hashes before upserting (for write optimization) - let existing: Option<(i64, String, String)> = conn + let existing: Option<(i64, String, String, String)> = conn .query_row( - "SELECT id, labels_hash, paths_hash FROM documents + "SELECT id, content_hash, labels_hash, paths_hash FROM documents WHERE source_type = ? AND source_id = ?", rusqlite::params![doc.source_type.as_str(), doc.source_id], - |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)), + |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?, row.get(3)?)), ) .optional()?; + // Fast path: skip ALL writes when nothing changed. + // This prevents WAL churn on the "mark ALL upserted entities dirty" strategy, + // where most entities will be unchanged on incremental syncs. + if let Some((_, ref old_content_hash, ref old_labels_hash, ref old_paths_hash)) = existing { + if old_content_hash == &doc.content_hash + && old_labels_hash == &doc.labels_hash + && old_paths_hash == &doc.paths_hash + { + return Ok(()); + } + } + // Upsert main document (includes labels_hash, paths_hash) conn.execute( "INSERT INTO documents @@ -3159,13 +3737,13 @@ fn upsert_document(conn: &Connection, doc: &DocumentData) -> Result<()> { // Get document ID (either existing or newly inserted) let doc_id = match existing { - Some((id, _, _)) => id, + Some((id, _, _, _)) => id, None => get_document_id(conn, doc.source_type, doc.source_id)?, }; // Only update labels if hash changed (reduces write amplification) let labels_changed = match &existing { - Some((_, old_hash, _)) => old_hash != &doc.labels_hash, + Some((_, _, old_hash, _)) => old_hash != &doc.labels_hash, None => true, // New document, must insert }; if labels_changed { @@ -3183,7 +3761,7 @@ fn upsert_document(conn: &Connection, doc: &DocumentData) -> Result<()> { // 
Only update paths if hash changed (reduces write amplification) let paths_changed = match &existing { - Some((_, _, old_hash)) => old_hash != &doc.paths_hash, + Some((_, _, _, old_hash)) => old_hash != &doc.paths_hash, None => true, // New document, must insert }; if paths_changed { @@ -3218,9 +3796,12 @@ fn get_document_id( **Acceptance Criteria:** - [ ] Dirty sources get documents regenerated -- [ ] Hash comparison prevents unnecessary updates +- [ ] Queue fully drained via loop (not single-pass ceiling) +- [ ] Each dirty item processed in its own transaction (crash-safe labels/paths writes) +- [ ] Triple-hash fast path (content_hash + labels_hash + paths_hash) skips all writes when unchanged +- [ ] Hash comparison prevents unnecessary updates when only some hashes changed - [ ] FTS triggers fire on document update -- [ ] Queue cleared after processing +- [ ] Queue cleared after processing (inside same transaction as document write) --- @@ -3255,19 +3836,36 @@ pub struct SyncOptions { pub no_docs: bool, // Skip document regeneration } -/// Run sync orchestration. +/// Run sync orchestration — the unified pipeline command. /// -/// Steps: -/// 1. Acquire app lock with heartbeat -/// 2. Ingest delta (issues, MRs) based on cursors +/// `lore sync` is the primary command for keeping the search index up to date. +/// It wraps the existing ingestion orchestrator and adds document generation +/// and embedding as post-ingestion steps. Replaces `lore ingest` as the +/// recommended user-facing command for all data flow. +/// +/// Individual commands (`lore generate-docs`, `lore embed`) still exist for +/// manual recovery and debugging use cases. +/// +/// Steps (sequential): +/// 1. Acquire app lock with heartbeat (via existing `src/core/lock.rs`) +/// 2. 
Ingest delta: fetch issues + MRs via cursor-based sync
+///    (calls existing ingestion orchestrator in `src/ingestion/orchestrator.rs`)
+///    - Each upserted entity marked dirty via mark_dirty_tx(&tx) inside ingestion transaction
 /// 3. Process pending_discussion_fetches queue (bounded)
-/// 4. Apply rolling backfill window (configurable, default 14 days)
-/// 5. Regenerate documents from dirty_sources
-/// 6. Embed documents with changed content_hash
-/// 7. Release lock, record sync_run
+///    - Discussion sweep uses CTE to capture stale IDs, then cascading deletes
+/// 4. Regenerate documents from dirty_sources queue
+///    (unless --no-docs)
+/// 5. Embed documents with changed content_hash
+///    (unless --no-embed; skipped gracefully if Ollama unavailable)
+/// 6. Release lock, record sync_run
+///
+/// NOTE: Rolling backfill window removed — the existing cursor + watermark
+/// design handles old issues with resumed activity. GitLab updates
+/// `updated_at` when new comments are added, so the cursor naturally
+/// picks up old issues that receive new activity.
 pub async fn run_sync(config: &Config, options: SyncOptions) -> Result {
-    // Implementation uses existing ingestion orchestrator
-    // and new document/embedding pipelines
+    // Implementation wraps existing ingestion orchestrator
+    // and chains document generation + embedding pipelines
     todo!()
 }
@@ -3371,6 +3969,11 @@ pub struct SyncArgs {
 Each query must have at least one expected URL in top 10 results.
 
+**Test fixture note:** Golden query tests require a seeded SQLite database with known
+documents, FTS entries, and pre-computed embeddings. The test setup should insert
+deterministic fixture data (not call Ollama) so golden tests run in CI without external
+dependencies. Use fixed embedding vectors that produce known similarity distances.
+
 ---
 
 ## CLI Smoke Tests
 
@@ -3407,11 +4010,12 @@
 - [ ] `documents` count = issues + MRs + discussions
 - [ ] `documents_fts` count = `documents` count
-- [ ] `embeddings` count = `documents` count (after full embed)
-- [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows
+- [ ] `embedding_metadata` has at least one row per document (after full embed)
+- [ ] All `embeddings.rowid / 1000` map to valid `documents.id` (SQLite integer division is floor-correct for integer operands)
+- [ ] `embedding_metadata.document_hash` = `documents.content_hash` for all chunk_index=0 rows
 - [ ] All `document_labels` reference valid documents
 - [ ] All `document_paths` reference valid documents
-- [ ] No orphaned embeddings (embeddings.rowid without matching documents.id)
+- [ ] No orphaned embeddings (embeddings.rowid / 1000 without matching documents.id)
 - [ ] Discussion documents exclude system notes
 - [ ] Discussion documents include parent title
 - [ ] All `dirty_sources` entries reference existing source entities
@@ -3425,30 +4029,41 @@ Each query must have at least one expected URL in top 10 results.
 
 Checkpoint 3 is complete when all three gates pass:
 
-### Gate A: Lexical MVP
+**Prerequisite:** `GiError` → `LoreError` rename completed across codebase (16 files).
+
+### Gate A: Lexical MVP (no sqlite-vec dependency)
 
 1. **Lexical search works without Ollama**
    - `lore search "query" --mode=lexical` returns relevant results
-   - All filters functional (including `--updated-after`)
+   - All filters functional (including `--updated-after` via `time::parse_since()`)
    - FTS5 syntax errors prevented by query sanitization
    - Special characters in queries work correctly (`-DWITH_SSL`, `C++`)
    - Search results hydrated in single DB round-trip (no N+1)
+   - Project filter uses cascading resolution (exact → case-insensitive → suffix → error with suggestions)
 
 2. **Document generation is correct**
    - Full and incremental modes use the same regenerator codepath
-   - `--full` uses keyset pagination (no OFFSET degradation)
+   - `--full` uses set-based INSERT...SELECT with keyset pagination (no OFFSET degradation, no N INSERTs)
    - FTS triggers use COALESCE for NULL-safe operation
+   - Extraction queries pull all relevant metadata (labels, DiffNote old+new paths, discussion URLs)
+   - Discussion URLs reconstructed from parent `web_url` + `#note_{first_note.gitlab_id}`
+   - Hard safety cap (2MB) prevents pathological content from bloating DB
+   - Triple-hash fast path skips all writes when content, labels, and paths are unchanged
 
 ### Gate B: Hybrid MVP
 
 3. **Semantic search works with Ollama**
+   - Gate B introduces sqlite-vec extension and migration 009
   - `lore embed` completes successfully
+   - Chunked embeddings: long documents split at paragraph boundaries with overlap
   - `lore search "query"` returns semantically relevant results
   - `--explain` shows ranking breakdown
-   - `--mode=semantic` with 0% embedding coverage returns actionable error
+   - `--mode=semantic` with 0% embedding coverage returns `LoreError::EmbeddingsNotBuilt`
+     (distinct from `OllamaUnavailable` — tells user to run `lore embed` first)
 
 4. **Hybrid search combines both**
   - Documents appearing in both retrievers rank higher
+   - Vector search over-fetches 3x and deduplicates by document_id (best chunk per doc)
  - Graceful degradation when Ollama unavailable (falls back to FTS)
  - Transient embed failures don't fail the entire search
  - Warning message included in response on degradation
@@ -3457,25 +4072,36 @@ Checkpoint 3 is complete when all three gates pass:
 
 ### Gate C: Sync MVP
 
 5. **Incremental sync is efficient**
-   - `lore sync` only processes changed entities
-   - Re-embedding only happens for changed documents
-   - Progress visible during long syncs
+   - `lore sync` is the unified command (replaces `lore ingest` + `lore generate-docs` + `lore embed`)
+   - Individual commands still exist for recovery/debugging
+   - Re-embedding only happens for changed documents (content_hash comparison)
+   - Progress visible during long syncs (via `indicatif`)
    - Queue backoff actually prevents hot-loop retries (both queues set `next_attempt_at`)
    - Shared backoff utility ensures consistent behavior across queues
+   - Dirty source marking uses `mark_dirty_tx(&tx, ...)` inside ingestion transactions (enforced at API level)
+   - Re-queuing resets backoff state (upsert, not INSERT OR IGNORE) so fresh updates aren't stuck behind stale backoff
+   - Both queues (dirty_sources, pending_discussion_fetches) use ON CONFLICT DO UPDATE consistently
+   - Queue draining is complete (bounded batch loop, not single-pass ceiling)
+   - Discussion sweep uses CTE to capture stale IDs before cascading deletes
 
 6. **Data integrity maintained**
    - All counts match between tables
-   - No orphaned records
-   - Hashes consistent
+   - No orphaned records (embedding orphan check uses `rowid / 1000` for chunked scheme — SQLite integer division is floor-correct)
+   - Hashes consistent (embedding_metadata.document_hash matches documents.content_hash)
    - `get_existing_hash()` properly distinguishes "not found" from DB errors
    - `--repair` uses FTS `rebuild` for correct-by-construction repair
+   - `--repair` orphan deletion uses range-based cleanup for chunked embeddings
+   - `--repair` stale detection uses `document_hash` (not `chunk_hash`) for document-level staleness
+   - Per-item transactions prevent partial label/path writes on crash
+   - Triple-hash fast path prevents WAL churn from unchanged documents
 
 7. **Observability**
    - `lore stats` shows queue depths and failed item counts
+   - `lore stats` embedding coverage is document-level (not chunk-level) with total chunk count
   - Failed items visible for operator intervention
   - Deterministic ordering ensures consistent paging
 
 8. **Tests pass**
   - Unit tests for core algorithms (including FTS sanitization, shared backoff, hydration)
   - Integration tests for pipelines
-  - Golden queries return expected results
+  - Golden queries return expected results (seeded DB with fixed embedding vectors, no Ollama dependency)
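
The `rowid / 1000` integrity checks above assume a fixed chunked-rowid scheme. A minimal sketch of that scheme (the encoding and the 1000-chunk limit are stated in the patch; function names here are illustrative, not from the codebase):

```rust
// Chunked embedding rowids: rowid = document_id * 1000 + chunk_index.
// Integer division recovers the owning document; this is floor-correct
// for non-negative operands in both Rust and SQLite.
fn chunk_rowid(document_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..1000).contains(&chunk_index), "scheme allows at most 1000 chunks per document");
    document_id * 1000 + chunk_index
}

fn document_id_of(rowid: i64) -> i64 {
    rowid / 1000
}

fn main() {
    let rowid = chunk_rowid(42, 7);
    assert_eq!(rowid, 42_007);
    assert_eq!(document_id_of(rowid), 42);
    // Every chunk of a document maps back to the same document id.
    assert!((0..1000).all(|c| document_id_of(chunk_rowid(42, c)) == 42));
}
```

The orphan check in the checklist is then "no `document_id_of(rowid)` without a matching `documents.id`", and range-based `--repair` cleanup can delete `rowid BETWEEN doc_id * 1000 AND doc_id * 1000 + 999`.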
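The golden-query fixture note calls for fixed embedding vectors with known similarity distances. One way to construct such fixtures (vector values and the cosine helper are illustrative; the actual distance metric used by sqlite-vec may differ):

```rust
// Deterministic fixture embeddings: chosen so the expected document is the
// known nearest neighbor, with no Ollama call in the test setup.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    let query = [1.0_f32, 0.0, 0.0];
    let doc_a = [0.9_f32, 0.1, 0.0]; // near-parallel to query: expected winner
    let doc_b = [0.0_f32, 1.0, 0.0]; // orthogonal to query
    // The golden assertion is on ordering, not on exact similarity values,
    // so it stays stable across float rounding.
    assert!(cosine(&query, &doc_a) > cosine(&query, &doc_b));
}
```

Asserting on rank order rather than raw scores keeps the golden tests robust in CI.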
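Gate B's "over-fetch 3x and deduplicate by document_id (best chunk per doc)" step can be sketched as follows (types and field names are illustrative, not the project's actual structs):

```rust
use std::collections::HashMap;

// A raw vector-search hit for one embedded chunk. Lower distance = closer.
#[derive(Clone, Copy, Debug)]
struct ChunkHit {
    document_id: i64,
    chunk_index: i64,
    distance: f32,
}

// Collapse over-fetched chunk hits to the best (lowest-distance) chunk per
// document, then order by distance for ranking.
fn best_chunk_per_document(hits: &[ChunkHit]) -> Vec<ChunkHit> {
    let mut best: HashMap<i64, ChunkHit> = HashMap::new();
    for &hit in hits {
        best.entry(hit.document_id)
            .and_modify(|b| {
                if hit.distance < b.distance {
                    *b = hit;
                }
            })
            .or_insert(hit);
    }
    let mut out: Vec<ChunkHit> = best.into_values().collect();
    out.sort_by(|a, b| a.distance.partial_cmp(&b.distance).unwrap());
    out
}

fn main() {
    let hits = [
        ChunkHit { document_id: 1, chunk_index: 0, distance: 0.50 },
        ChunkHit { document_id: 1, chunk_index: 1, distance: 0.20 },
        ChunkHit { document_id: 2, chunk_index: 0, distance: 0.30 },
    ];
    let deduped = best_chunk_per_document(&hits);
    assert_eq!(deduped.len(), 2);
    assert_eq!(deduped[0].document_id, 1); // doc 1's best chunk (index 1) wins
    assert_eq!(deduped[0].chunk_index, 1);
}
```

Over-fetching by 3x before this collapse compensates for multiple chunks of the same document crowding out other documents in the raw top-k.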
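The queue-backoff bullets in Gate C imply a shared delay schedule feeding `next_attempt_at`, with a fresh upsert resetting the attempt counter. A minimal sketch of that schedule (the base delay and cap are assumptions, not values from the patch):

```rust
// Shared backoff schedule for dirty_sources and pending_discussion_fetches:
// capped exponential delay in seconds. A re-queue via ON CONFLICT DO UPDATE
// would reset attempts to 0, so a fresh update is retried promptly instead of
// waiting out a stale backoff.
fn backoff_secs(attempts: u32) -> u64 {
    let base: u64 = 30; // illustrative base delay
    base.saturating_mul(1u64 << attempts.min(6)) // clamp exponent to cap delay
}

fn main() {
    assert_eq!(backoff_secs(0), 30);
    assert_eq!(backoff_secs(1), 60);
    assert_eq!(backoff_secs(3), 240);
    // Attempts beyond the clamp all hit the same ceiling.
    assert_eq!(backoff_secs(20), backoff_secs(6));
}
```

Setting `next_attempt_at = now + backoff_secs(attempts)` on failure, and filtering the drain query by `next_attempt_at <= now`, is what prevents the hot-loop retries the gate calls out.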