diff --git a/SPEC-REVISIONS-2.md b/SPEC-REVISIONS-2.md new file mode 100644 index 0000000..b134f13 --- /dev/null +++ b/SPEC-REVISIONS-2.md @@ -0,0 +1,373 @@ +# SPEC.md Revision Document - Round 2 + +This document provides git-diff style changes for the second round of improvements from ChatGPT's review. These are primarily correctness fixes and optimizations. + +--- + +## Change 1: Fix Tuple Cursor Correctness Gap (Cursor Rewind + Local Filtering) + +**Why this is critical:** The spec specifies tuple cursor semantics `(updated_at, gitlab_id)` but GitLab's API only supports `updated_after` which is strictly "after" - it cannot express `WHERE updated_at = X AND id > Y` server-side. This creates a real risk of missed items on crash/resume and on dense timestamp buckets. + +**Fix:** Cursor rewind + local filtering. Call GitLab with `updated_after = cursor_updated_at - rewindSeconds`, then locally discard items we've already processed. + +```diff +@@ Correctness Rules (MVP): @@ + 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC` + 2. Cursor is a stable tuple `(updated_at, gitlab_id)`: +- - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)` ++ - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.** ++ - Use **cursor rewind + local filtering**: ++ - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable) ++ - Locally discard items where: ++ - `updated_at < cursor_updated_at`, OR ++ - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id` ++ - This makes the tuple cursor rule true in practice while keeping API calls simple. 
+ - Cursor advances only after successful DB commit for that page + - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)` +``` + +```diff +@@ Configuration (MVP): @@ + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, +- "heartbeatIntervalSeconds": 30 ++ "heartbeatIntervalSeconds": 30, ++ "cursorRewindSeconds": 2 + }, +``` + +--- + +## Change 2: Make App Lock Actually Safe (BEGIN IMMEDIATE CAS) + +**Why this is critical:** INSERT OR REPLACE can overwrite an active lock if two processes start close together (both do "stale check" outside a write transaction, then both INSERT OR REPLACE). SQLite's BEGIN IMMEDIATE provides a proper compare-and-swap. + +```diff +@@ Reliability/Idempotency Rules: @@ + - Every ingest/sync creates a `sync_runs` row + - Single-flight via DB-enforced app lock: +- - On start: INSERT OR REPLACE lock row with new owner token ++ - On start: acquire lock via transactional compare-and-swap: ++ - `BEGIN IMMEDIATE` (acquires write lock immediately) ++ - If no row exists → INSERT new lock ++ - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps ++ - Else if `owner` matches current run → UPDATE heartbeat (re-entrant) ++ - Else → ROLLBACK and fail fast (another run is active) ++ - `COMMIT` + - During run: update `heartbeat_at` every 30 seconds + - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire + - `--force` remains as operator override for edge cases, but should rarely be needed +``` + +--- + +## Change 3: Dependent Resource Pagination + Bounded Concurrency + +**Why this is important:** Discussions endpoints are paginated on many GitLab instances. Without pagination, we silently lose data. Without bounded concurrency, initial sync can become unstable (429s, long tail retries). 
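The pagination and bounded-concurrency rules above can be sketched in a few lines of TypeScript. This is an illustrative sketch only, not part of the spec: `fetchPage` stands in for a hypothetical GitLab client call, and a short page (fewer than `perPage` items) is taken to signal the last page.

```typescript
// Drain every page of a paginated dependent-resource endpoint.
// A short page (fewer than perPage items) means we hit the last page.
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<T[]>,
  perPage = 100,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; ; page++) {
    const items = await fetchPage(page);
    all.push(...items);
    if (items.length < perPage) return all;
  }
}

// Run `fn` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        // Claiming the index is synchronous, so single-threaded JS
        // guarantees no two workers take the same item.
        const i = next++;
        results[i] = await fn(items[i]);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

With `dependentConcurrency = 2`, discussion ingestion then becomes roughly `mapWithConcurrency(parents, 2, p => fetchAllPages(page => client.discussions(p, page)))`, where `client.discussions` is a placeholder for the real client method.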
+ +```diff +@@ Dependent Resources (Per-Parent Fetch): @@ +-GET /projects/:id/issues/:iid/discussions +-GET /projects/:id/merge_requests/:iid/discussions ++GET /projects/:id/issues/:iid/discussions?per_page=100&page=N ++GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N ++ ++**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent. +``` + +```diff +@@ Rate Limiting: @@ + - Default: 10 requests/second with exponential backoff + - Respect `Retry-After` headers on 429 responses + - Add jitter to avoid thundering herd on retry ++- **Separate concurrency limits:** ++ - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4) ++ - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s) ++ - Bound concurrency per-project to avoid one repo starving the other + - Initial sync estimate: 10-20 minutes depending on rate limits +``` + +```diff +@@ Configuration (MVP): @@ + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, + "heartbeatIntervalSeconds": 30, +- "cursorRewindSeconds": 2 ++ "cursorRewindSeconds": 2, ++ "primaryConcurrency": 4, ++ "dependentConcurrency": 2 + }, +``` + +--- + +## Change 4: Track last_seen_at for Eventual Consistency Debugging + +**Why this is valuable:** Even without implementing deletions, you want to know: (a) whether a record is actively refreshed under backfill/sync, (b) whether a sync run is "covering" the dataset, (c) whether a particular item hasn't been seen in months (helps diagnose missed updates). 
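Once `last_seen_at` exists, those coverage questions reduce to simple queries. An illustrative sketch (the 30-day cutoff is an arbitrary example, not a spec value):

```sql
-- Items no sync has refreshed in the last 30 days: candidates for
-- missed updates, or items genuinely outside the backfill window.
SELECT gitlab_id,
       datetime(updated_at, 'unixepoch')   AS updated,
       datetime(last_seen_at, 'unixepoch') AS last_seen
FROM issues
WHERE last_seen_at < strftime('%s', 'now') - 30 * 86400
ORDER BY last_seen_at ASC;
```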
+ +```diff +@@ Schema Preview - issues: @@ + CREATE TABLE issues ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, + author_username TEXT, + created_at INTEGER, + updated_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) + ); +``` + +```diff +@@ Schema Additions - merge_requests: @@ + CREATE TABLE merge_requests ( +@@ + updated_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + merged_at INTEGER, +@@ + ); +``` + +```diff +@@ Schema Additions - discussions: @@ + CREATE TABLE discussions ( +@@ + last_note_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + resolvable BOOLEAN, +@@ + ); +``` + +```diff +@@ Schema Additions - notes: @@ + CREATE TABLE notes ( +@@ + updated_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + position INTEGER, +@@ + ); +``` + +--- + +## Change 5: Raw Payload Compression + +**Why this is valuable:** At 50-100K documents plus threaded discussions, raw JSON is likely the largest storage consumer. Supporting gzip compression reduces DB size while preserving replay capability. 
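With Node's built-in `zlib`, the encode/decode paths honoring the `content_encoding` column are small. A minimal sketch, assuming synchronous compression is acceptable at ingest batch sizes (the function names are illustrative, not spec'd):

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

type Encoded = { contentEncoding: "identity" | "gzip"; payload: Buffer };

// Encode a raw JSON payload for storage in raw_payloads.
function encodePayload(json: string, compress: boolean): Encoded {
  const raw = Buffer.from(json, "utf8");
  if (!compress) return { contentEncoding: "identity", payload: raw };
  const gz = gzipSync(raw);
  // Tiny payloads can grow under gzip; keep whichever form is smaller.
  return gz.length < raw.length
    ? { contentEncoding: "gzip", payload: gz }
    : { contentEncoding: "identity", payload: raw };
}

// Decode a stored payload back to the original JSON string for replay.
function decodePayload(contentEncoding: string, payload: Buffer): string {
  const raw = contentEncoding === "gzip" ? gunzipSync(payload) : payload;
  return raw.toString("utf8");
}
```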
+ +```diff +@@ Schema (Checkpoint 0) - raw_payloads: @@ + CREATE TABLE raw_payloads ( + id INTEGER PRIMARY KEY, + source TEXT NOT NULL, -- 'gitlab' + project_id INTEGER REFERENCES projects(id), + resource_type TEXT NOT NULL, + gitlab_id INTEGER NOT NULL, + fetched_at INTEGER NOT NULL, +- json TEXT NOT NULL ++ content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip' ++ payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON + ); +``` + +```diff +@@ Configuration (MVP): @@ ++ "storage": { ++ "compressRawPayloads": true -- gzip raw payloads to reduce DB size ++ }, +``` + +--- + +## Change 6: Scope Discussions Unique by Project + +**Why this is important:** `gitlab_discussion_id TEXT UNIQUE` assumes global uniqueness across all projects. While likely true for GitLab, it's safer to scope by project_id. This avoids rare but painful collisions and makes it easier to support more repos later. + +```diff +@@ Schema Additions - discussions: @@ + CREATE TABLE discussions ( + id INTEGER PRIMARY KEY, +- gitlab_discussion_id TEXT UNIQUE NOT NULL, ++ gitlab_discussion_id TEXT NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), +@@ + ); ++CREATE UNIQUE INDEX uq_discussions_project_discussion_id ++ ON discussions(project_id, gitlab_discussion_id); +``` + +--- + +## Change 7: Dirty Queue for Document Regeneration + +**Why this is valuable:** The orchestration says "Regenerate documents for changed entities" but doesn't define how "changed" is computed without scanning large tables. A dirty queue populated during ingestion makes doc regen deterministic and fast. 
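The enqueue/drain lifecycle maps onto three statements. An illustrative sketch (batch size and the exact regeneration call are implementation details, not spec'd):

```sql
-- During ingestion: mark an upserted entity dirty (idempotent).
INSERT INTO dirty_sources (source_type, source_id, queued_at)
VALUES ('issue', :issue_id, strftime('%s', 'now'))
ON CONFLICT(source_type, source_id) DO NOTHING;

-- During doc regen: take a batch of dirty entities...
SELECT source_type, source_id FROM dirty_sources
ORDER BY queued_at ASC LIMIT 500;

-- ...and clear each entry only after its document is rewritten.
DELETE FROM dirty_sources
WHERE source_type = :source_type AND source_id = :source_id;
```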
+ +```diff +@@ Schema Additions (Checkpoint 3): @@ ++-- Track sources that require document regeneration (populated during ingestion) ++CREATE TABLE dirty_sources ( ++ source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' ++ source_id INTEGER NOT NULL, -- local DB id ++ queued_at INTEGER NOT NULL, ++ PRIMARY KEY(source_type, source_id) ++); +``` + +```diff +@@ Orchestration steps (in order): @@ + 1. Acquire app lock with heartbeat + 2. Ingest delta (issues, MRs, discussions) based on cursors ++ - During ingestion, INSERT into dirty_sources for each upserted entity + 3. Apply rolling backfill window +-4. Regenerate documents for changed entities ++4. Regenerate documents for entities in dirty_sources (process + delete from queue) + 5. Embed documents with changed content_hash + 6. FTS triggers auto-sync (no explicit step needed) + 7. Release lock, record sync_run as succeeded +``` + +--- + +## Change 8: document_paths + --path Filter + +**Why this is high value:** We're already capturing DiffNote file paths in CP2. Adding a `--path` filter now makes the MVP dramatically more compelling for engineers who search by file path constantly. 
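Populating `document_paths` from DiffNote positions is a few lines. A sketch, assuming the `new_path`/`old_path` shape of GitLab's DiffNote `position` payload (renames contribute both paths):

```typescript
// The subset of a GitLab DiffNote position we care about here.
interface DiffPosition {
  new_path?: string | null;
  old_path?: string | null;
}

// Collect the distinct file paths a document's diff notes reference,
// ready for insertion into document_paths.
function collectPaths(positions: DiffPosition[]): string[] {
  const paths = new Set<string>();
  for (const pos of positions) {
    if (pos.new_path) paths.add(pos.new_path);
    if (pos.old_path) paths.add(pos.old_path); // differs from new_path on renames
  }
  return [...paths].sort();
}
```

The MVP `--path` filter then reduces to a join against `document_paths` with a substring or exact match on `path`.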
+ +```diff +@@ Checkpoint 4 Scope: @@ +-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path` ++- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file` ++ - `--path` filters documents by referenced file paths (from DiffNote positions) ++ - MVP: substring/exact match; glob patterns deferred +``` + +```diff +@@ Schema Additions (Checkpoint 3): @@ ++-- Fast path filtering for documents (extracted from DiffNote positions) ++CREATE TABLE document_paths ( ++ document_id INTEGER NOT NULL REFERENCES documents(id), ++ path TEXT NOT NULL, ++ PRIMARY KEY(document_id, path) ++); ++CREATE INDEX idx_document_paths_path ON document_paths(path); +``` + +```diff +@@ CLI Interface: @@ + # Search within specific project + gi search "authentication" --project=group/project-one + ++# Search by file path (finds discussions/MRs touching this file) ++gi search "rate limit" --path=src/client.ts ++ + # Pure FTS search (fallback if embeddings unavailable) + gi search "redis" --mode=lexical +``` + +```diff +@@ Manual CLI Smoke Tests: @@ ++| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ | +``` + +--- + +## Change 9: Character-Based Truncation (Not Exact Tokens) + +**Why this is practical:** "8000 tokens" sounds precise, but tokenizers vary. Exact token counting adds dependency complexity. A conservative character budget is simpler and avoids false precision. 
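The middle-drop rule under a character budget can be sketched directly. Illustrative only: the 32000 default and the 4-chars-per-token divisor mirror the numbers proposed below, and the marker string is a hypothetical choice.

```typescript
interface TruncationResult {
  text: string;
  isTruncated: boolean;
  approxTokens: number; // ceil(chars / 4), for logging only, never enforced
}

// Keep the head and tail of oversized content and drop the middle,
// so the first and last notes survive for context.
function truncateMiddle(text: string, maxChars = 32000): TruncationResult {
  const approxTokens = Math.ceil(text.length / 4);
  if (text.length <= maxChars) return { text, isTruncated: false, approxTokens };
  const marker = "\n[... truncated ...]\n";
  const keep = maxChars - marker.length;
  const head = Math.ceil(keep / 2);
  const tail = keep - head;
  return {
    text: text.slice(0, head) + marker + text.slice(text.length - tail),
    isTruncated: true,
    approxTokens,
  };
}
```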
+ +```diff +@@ Document Extraction Rules: @@ + - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192) ++ - **Implementation:** Use character budget, not exact token count ++ - `maxChars = 32000` (conservative 4 chars/token estimate) ++ - `approxTokens = ceil(charCount / 4)` for reporting/logging only ++ - This avoids tokenizer dependency while preventing embedding failures +``` + +```diff +@@ Truncation: @@ + If content exceeds 8000 tokens: ++**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`. ++ + 1. Truncate from the middle (preserve first + last notes for context) + 2. Set `documents.is_truncated = 1` + 3. Set `documents.truncated_reason = 'token_limit_middle_drop'` + 4. Log a warning with document ID and original token count +``` + +--- + +## Change 10: Move Lexical Search to CP3 (Reorder, Not New Scope) + +**Why this is better:** Nothing "search-like" exists until CP4, but FTS5 is already a dependency for graceful degradation. Moving FTS setup to CP3 (when documents exist) gives an earlier usable artifact and better validation. CP4 becomes "hybrid ranking upgrade." 
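Reciprocal Rank Fusion, which the "hybrid ranking upgrade" in CP4 uses to merge the vector and FTS result lists, is small enough to show inline. A sketch; `k = 60` is the conventional RRF constant, not something the spec fixes:

```typescript
// Merge ranked result lists: each document scores sum(1 / (k + rank))
// over the lists it appears in, so items ranked well by either the
// vector side or the lexical side surface near the top.
function reciprocalRankFusion(lists: number[][], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const list of lists) {
    list.forEach((docId, index) => {
      // index is 0-based, so rank = index + 1.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

Because only ranks matter, the vector and FTS scores never need to be normalized against each other, which is the main reason RRF is a good fit here.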
+ +```diff +@@ Checkpoint 3: Embedding Generation @@ +-### Checkpoint 3: Embedding Generation +-**Deliverable:** Vector embeddings generated for all text content ++### Checkpoint 3: Document + Embedding Generation with Lexical Search ++**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end +``` + +```diff +@@ Checkpoint 3 Scope: @@ + - Ollama integration (nomic-embed-text model) + - Embedding generation pipeline: +@@ + - Fast label filtering via `document_labels` join table ++- FTS5 index for lexical search (moved from CP4) ++- `gi search --mode=lexical` CLI command (works without Ollama) +``` + +```diff +@@ Checkpoint 3 Manual CLI Smoke Tests: @@ ++| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required | +``` + +```diff +@@ Checkpoint 4: Semantic Search @@ +-### Checkpoint 4: Semantic Search +-**Deliverable:** Working semantic search across all indexed content ++### Checkpoint 4: Hybrid Search (Semantic + Lexical) ++**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content +``` + +```diff +@@ Checkpoint 4 Scope: @@ + **Scope:** + - Hybrid retrieval: + - Vector recall (sqlite-vss) + FTS lexical recall (fts5) + - Merge + rerank results using Reciprocal Rank Fusion (RRF) ++- Query embedding generation (same Ollama pipeline as documents) + - Result ranking and scoring (document-level) +-- Search filters: ... 
++- Filters work identically in hybrid and lexical modes +``` + +--- + +## Summary of All Changes (Round 2) + +| # | Change | Impact | +|---|--------|--------| +| 1 | **Cursor rewind + local filtering** | Fixes real correctness gap in tuple cursor implementation | +| 2 | **BEGIN IMMEDIATE CAS for lock** | Prevents race condition in lock acquisition | +| 3 | **Discussions pagination + concurrency** | Prevents silent data loss on large discussion threads | +| 4 | **last_seen_at columns** | Enables debugging of sync coverage without deletions | +| 5 | **Raw payload compression** | Reduces DB size significantly at scale | +| 6 | **Scope discussions unique by project** | Defensive uniqueness for multi-project safety | +| 7 | **Dirty queue for doc regen** | Makes document regeneration deterministic and fast | +| 8 | **document_paths + --path filter** | High-value file search with minimal scope | +| 9 | **Character-based truncation** | Practical implementation without tokenizer dependency | +| 10 | **Lexical search in CP3** | Earlier usable artifact; better checkpoint validation | + +**Net effect:** These changes fix several correctness gaps (cursor, lock, pagination) while adding high-value features (--path filter) and operational improvements (compression, dirty queue, last_seen_at). diff --git a/SPEC-REVISIONS-3.md b/SPEC-REVISIONS-3.md new file mode 100644 index 0000000..818ac87 --- /dev/null +++ b/SPEC-REVISIONS-3.md @@ -0,0 +1,427 @@ +# SPEC.md Revisions - First-Time User Experience + +**Date:** 2026-01-21 +**Purpose:** Document all changes adding installation, setup, and user flow documentation to SPEC.md + +--- + +## Summary of Changes + +| Change | Location | Description | +|--------|----------|-------------| +| 1. Quick Start | After Executive Summary | Prerequisites, installation, first-run walkthrough | +| 2. `gi init` Command | Checkpoint 0 | Interactive setup wizard with GitLab validation | +| 3. 
CLI Command Reference | Before Future Work | Unified table of all commands | +| 4. Error Handling | After CLI Reference | Common errors with recovery guidance | +| 5. Database Management | After Error Handling | Location, backup, reset, migrations | +| 6. Empty State Handling | Checkpoint 4 scope | Behavior when no data indexed | +| 7. Resolved Decisions | Resolved Decisions table | New decisions from this revision | + +--- + +## Change 1: Quick Start Section + +**Location:** Insert after line 6 (after Executive Summary), before Discovery Summary + +```diff + A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability. + + --- + ++## Quick Start ++ ++### Prerequisites ++ ++| Requirement | Version | Notes | ++|-------------|---------|-------| ++| Node.js | 20+ | LTS recommended | ++| npm | 10+ | Comes with Node.js | ++| Ollama | Latest | Optional for semantic search; lexical search works without it | ++ ++### Installation ++ ++```bash ++# Clone and install ++git clone https://github.com/your-org/gitlab-inbox.git ++cd gitlab-inbox ++npm install ++npm run build ++npm link # Makes `gi` available globally ++``` ++ ++### First Run ++ ++1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope): ++ ```bash ++ export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx" ++ ``` ++ ++2. **Run the setup wizard:** ++ ```bash ++ gi init ++ ``` ++ This creates `gi.config.json` with your GitLab URL and project paths. ++ ++3. 
**Verify your environment:** ++ ```bash ++ gi doctor ++ ``` ++ All checks should pass (Ollama warning is OK if you only need lexical search). ++ ++4. **Sync your data:** ++ ```bash ++ gi sync ++ ``` ++ Initial sync takes 10-20 minutes depending on repo size and rate limits. ++ ++5. **Search:** ++ ```bash ++ gi search "authentication redesign" ++ ``` ++ ++### Troubleshooting First Run ++ ++| Symptom | Solution | ++|---------|----------| ++| `Config file not found` | Run `gi init` first | ++| `GITLAB_TOKEN not set` | Export the environment variable | ++| `401 Unauthorized` | Check token has `read_api` scope | ++| `Project not found: group/project` | Verify project path in GitLab URL | ++| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search | ++ ++--- ++ + ## Discovery Summary +``` + +--- + +## Change 2: `gi init` Command in Checkpoint 0 + +**Location:** Insert in Checkpoint 0 Manual CLI Smoke Tests table and Scope section + +### 2a: Add to Manual CLI Smoke Tests table (after line 193) + +```diff + | `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure | ++| `gi init` | Interactive prompts | Creates valid gi.config.json | ++| `gi init` (config exists) | Confirmation prompt | Warns before overwriting | ++| `gi --help` | Command list | Shows all available commands | ++| `gi version` | Version number | Shows installed version | +``` + +### 2b: Add Automated Tests for init (after line 185) + +```diff + tests/integration/app-lock.test.ts + ✓ acquires lock successfully + ✓ updates heartbeat during operation + ✓ detects stale lock and recovers + ✓ refuses concurrent acquisition ++ ++tests/integration/init.test.ts ++ ✓ creates config file with valid structure ++ ✓ validates GitLab URL format ++ ✓ validates GitLab connection before writing config ++ ✓ validates each project path exists in GitLab ++ ✓ fails if token not set ++ ✓ fails if GitLab auth fails ++ ✓ fails if any project path not found ++ ✓ 
prompts before overwriting existing config ++ ✓ respects --force to skip confirmation +``` + +### 2c: Add to Checkpoint 0 Scope (after line 209) + +```diff + - Rate limit handling with exponential backoff + jitter ++- `gi init` command for guided setup: ++ - Prompts for GitLab base URL ++ - Prompts for project paths (comma-separated or multiple prompts) ++ - Prompts for token environment variable name (default: GITLAB_TOKEN) ++ - **Validates before writing config:** ++ - Token must be set in environment ++ - Tests auth with `GET /user` endpoint ++ - Validates each project path with `GET /projects/:path` ++ - Only writes config after all validations pass ++ - Generates `gi.config.json` with sensible defaults ++- `gi --help` shows all available commands ++- `gi <command> --help` shows command-specific help ++- `gi version` shows installed version ++- First-run detection: if no config exists, suggest `gi init` +``` + +--- + +## Change 3: CLI Command Reference Section + +**Location:** Insert before "## Future Work (Post-MVP)" (before line 1174) + +```diff ++## CLI Command Reference ++ ++All commands support `--help` for detailed usage information.
++ ++### Setup & Diagnostics ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi init` | 0 | Interactive setup wizard; creates gi.config.json | ++| `gi auth-test` | 0 | Verify GitLab authentication | ++| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) | ++| `gi doctor --json` | 0 | JSON output for scripting | ++| `gi version` | 0 | Show installed version | ++ ++### Data Ingestion ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi ingest --type=issues` | 1 | Fetch issues from GitLab | ++| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions | ++| `gi embed --all` | 3 | Generate embeddings for all documents | ++| `gi embed --retry-failed` | 3 | Retry failed embeddings | ++| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) | ++| `gi sync --full` | 5 | Force complete re-sync (reset cursors) | ++| `gi sync --force` | 5 | Override stale lock after operator review | ++| `gi sync --no-embed` | 5 | Sync without embedding (faster) | ++ ++### Data Inspection ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues | ++| `gi list mrs [--limit=N]` | 2 | List merge requests | ++| `gi count issues` | 1 | Count issues | ++| `gi count mrs` | 2 | Count merge requests | ++| `gi count discussions` | 2 | Count discussions | ++| `gi count notes` | 2 | Count notes | ++| `gi show issue ` | 1 | Show issue details | ++| `gi show mr ` | 2 | Show MR details with discussions | ++| `gi stats` | 3 | Embedding coverage statistics | ++| `gi stats --json` | 3 | JSON stats for scripting | ++| `gi sync-status` | 1 | Show cursor positions and last sync | ++ ++### Search ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi search "query"` | 4 | Hybrid semantic + lexical search | ++| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) | ++| `gi search "query" 
--type=issue\|mr\|discussion` | 4 | Filter by document type | ++| `gi search "query" --author=USERNAME` | 4 | Filter by author | ++| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date | ++| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) | ++| `gi search "query" --project=PATH` | 4 | Filter by project | ++| `gi search "query" --path=FILE` | 4 | Filter by file path | ++| `gi search "query" --json` | 4 | JSON output for scripting | ++| `gi search "query" --explain` | 4 | Show ranking breakdown | ++ ++### Database Management ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi backup` | 0 | Create timestamped database backup | ++| `gi reset --confirm` | 0 | Delete database and reset cursors | ++ ++--- ++ + ## Future Work (Post-MVP) +``` + +--- + +## Change 4: Error Handling Section + +**Location:** Insert after CLI Command Reference, before Future Work + +```diff ++## Error Handling ++ ++Common errors and their resolutions: ++ ++### Configuration Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `Config file not found` | No gi.config.json | Run `gi init` to create configuration | ++| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually | ++| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config | ++ ++### Authentication Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` | ++| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope | ++| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope | ++ ++### GitLab API Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) | ++| 
`429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry | ++| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity | ++ ++### Data Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `No documents indexed` | Sync not run | Run `gi sync` first | ++| `No results found` | Query too specific | Try broader search terms | ++| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale | ++ ++### Embedding Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` | ++| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` | ++| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` | ++ ++### Operational Behavior ++ ++| Scenario | Behavior | ++|----------|----------| ++| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. | ++| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. | ++| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. | ++| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. 
| ++ ++--- ++ + ## Future Work (Post-MVP) +``` + +--- + +## Change 5: Database Management Section + +**Location:** Insert after Error Handling, before Future Work + +```diff ++## Database Management ++ ++### Database Location ++ ++The SQLite database is stored at an XDG-compliant location: ++ ++``` ++~/.local/share/gi/data.db ++``` ++ ++This can be overridden in `gi.config.json`: ++ ++```json ++{ ++ "storage": { ++ "dbPath": "/custom/path/to/data.db" ++ } ++} ++``` ++ ++### Backup ++ ++Create a timestamped backup of the database: ++ ++```bash ++gi backup ++# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db ++``` ++ ++Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode). ++ ++### Reset ++ ++To completely reset the database and all sync cursors: ++ ++```bash ++gi reset --confirm ++``` ++ ++This deletes: ++- The database file ++- All sync cursors ++- All embeddings ++ ++You'll need to run `gi sync` again to repopulate. ++ ++### Schema Migrations ++ ++Database schema is version-tracked and migrations auto-apply on startup: ++ ++1. On first run, schema is created at latest version ++2. On subsequent runs, pending migrations are applied automatically ++3. Migration version is stored in `schema_version` table ++4. Migrations are idempotent and reversible where possible ++ ++**Manual migration check:** ++```bash ++gi doctor --json | jq '.checks.database' ++# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 } ++``` ++ ++--- ++ + ## Future Work (Post-MVP) +``` + +--- + +## Change 6: Empty State Handling in Checkpoint 4 + +**Location:** Add to Checkpoint 4 scope section (around line 885, after "Graceful degradation") + +```diff + - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning ++- Empty state handling: ++ - No documents indexed: `No data indexed. 
Run 'gi sync' first.` ++ - Query returns no results: `No results found for "query".` ++ - Filters exclude all results: `No results match the specified filters.` ++ - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search") +``` + +**Location:** Add to Manual CLI Smoke Tests table (after `gi search "xyznonexistent123"` row) + +```diff + | `gi search "xyznonexistent123"` | No results message | Graceful empty state | ++| `gi search "auth"` (no data synced) | No data message | Shows "Run gi sync first" | +``` + +--- + +## Change 7: Update Resolved Decisions Table + +**Location:** Add new rows to Resolved Decisions table (around line 1280) + +```diff + | JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | ++| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable | ++| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX | ++| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly | ++| Empty state UX | **Actionable messages** | Guide user to next step | +``` + +--- + +## Files Modified + +| File | Action | +|------|--------| +| `SPEC.md` | 7 changes applied | +| `SPEC-REVISIONS-3.md` | Created (this file) | + +--- + +## Verification Checklist + +After applying changes: + +- [ ] Quick Start section provides clear 5-step onboarding +- [ ] `gi init` fully specified with validation behavior +- [ ] All CLI commands documented in reference table +- [ ] Error scenarios have recovery guidance +- [ ] Database location and management documented +- [ ] Empty states have helpful messages +- [ ] Resolved Decisions updated with new choices +- [ ] No orphaned command references diff --git a/SPEC-REVISIONS.md b/SPEC-REVISIONS.md new file mode 100644 index 0000000..caa9573 --- /dev/null +++ b/SPEC-REVISIONS.md @@ -0,0 +1,716 @@ +# SPEC.md Revision Document + +This document provides git-diff style changes to integrate improvements 
from ChatGPT's review into the original SPEC.md. The goal is a "best of all worlds" hybrid that maintains the original architecture while adding production-grade hardening. + +--- + +## Change 1: Crash-safe Single-flight with Heartbeat Lock + +**Why this is better:** The original plan's single-flight protection is policy-based, not DB-enforced. A race condition exists where two processes could both start before either writes to `sync_runs`. The heartbeat approach provides DB-enforced atomicity, automatic crash recovery, and less manual intervention. + +```diff +@@ Schema (Checkpoint 0): @@ + CREATE TABLE sync_runs ( + id INTEGER PRIMARY KEY, + started_at INTEGER NOT NULL, ++ heartbeat_at INTEGER NOT NULL, + finished_at INTEGER, + status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' + command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. + error TEXT + ); + ++-- Crash-safe single-flight lock (DB-enforced) ++CREATE TABLE app_locks ( ++ name TEXT PRIMARY KEY, -- 'sync' ++ owner TEXT NOT NULL, -- random run token (UUIDv4) ++ acquired_at INTEGER NOT NULL, ++ heartbeat_at INTEGER NOT NULL ++); +``` + +```diff +@@ Checkpoint 0: Project Setup - Scope @@ + **Scope:** + - Project structure (TypeScript, ESLint, Vitest) + - GitLab API client with PAT authentication + - Environment and project configuration + - Basic CLI scaffold with `auth-test` command + - `doctor` command for environment verification +-- Projects table and initial sync ++- Projects table and initial project resolution (no issue/MR ingestion yet) ++- DB migrations + WAL + FK + app lock primitives ++- Crash-safe single-flight lock with heartbeat +``` + +```diff +@@ Reliability/Idempotency Rules: @@ + - Every ingest/sync creates a `sync_runs` row +-- Single-flight: refuse to start if an existing run is `running` (unless `--force`) ++- Single-flight: acquire `app_locks('sync')` before starting ++ - On start: INSERT OR REPLACE lock row with new owner token ++ - During run: update `heartbeat_at` every 
30 seconds ++ - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire ++ - `--force` remains as operator override for edge cases, but should rarely be needed + - Cursor advances only after successful transaction commit per page/batch + - Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC` + - Use explicit transactions for batch inserts +``` + +```diff +@@ Configuration (MVP): @@ + // gi.config.json + { + "gitlab": { + "baseUrl": "https://gitlab.example.com", + "tokenEnvVar": "GITLAB_TOKEN" + }, + "projects": [ + { "path": "group/project-one" }, + { "path": "group/project-two" } + ], ++ "sync": { ++ "backfillDays": 14, ++ "staleLockMinutes": 10, ++ "heartbeatIntervalSeconds": 30 ++ }, + "embedding": { + "provider": "ollama", + "model": "nomic-embed-text", +- "baseUrl": "http://localhost:11434" ++ "baseUrl": "http://localhost:11434", ++ "concurrency": 4 + } + } +``` + +--- + +## Change 2: Harden Cursor Semantics + Rolling Backfill Window + +**Why this is better:** The original plan's "critical assumption" that comments update parent `updated_at` is mostly true but the failure mode is catastrophic (silently missing new discussion content). The rolling backfill provides a safety net without requiring weekly full resyncs. + +```diff +@@ GitLab API Strategy - Critical Assumption @@ +-### Critical Assumption +- +-**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed. +- +-Mitigation: Periodic full re-sync (weekly) as a safety net. ++### Critical Assumption (Softened) ++ ++We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively. ++ ++**Mitigations (MVP):** ++1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. 
Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor. ++2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly. ++3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`). ++ ++The backfill window provides 80% of the safety of full resync at <5% of the API cost. +``` + +```diff +@@ Checkpoint 5: Incremental Sync - Scope @@ + **Scope:** +-- Delta sync based on stable cursor (updated_at + tie-breaker id) ++- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)` ++- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates + - Dependent resources sync strategy (discussions refetched when parent updates) + - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) + - Sync status reporting + - Recommended: run via cron every 10 minutes +``` + +```diff +@@ Correctness Rules (MVP): @@ +-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC` +-2. Cursor advances only after successful DB commit for that page ++1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC` ++2. Cursor is a stable tuple `(updated_at, gitlab_id)`: ++ - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)` ++ - Cursor advances only after successful DB commit for that page ++ - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)` + 3. Dependent resources: + - For each updated issue/MR, refetch ALL its discussions + - Discussion documents are regenerated and re-embedded if content_hash changes +-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` +-5. 
Sync run is marked 'failed' with error message if any page fails (can resume from cursor) ++4. Rolling backfill window: ++ - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays` ++ - This catches any items whose timestamps were updated without triggering our cursor ++5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` ++6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor) +``` + +--- + +## Change 3: Raw Payload Scoping + project_id + +**Why this is better:** The original `raw_payloads(resource_type, gitlab_id)` index could have collisions in edge cases (especially if later adding more projects or resource types). Adding `project_id` is defensive and enables project-scoped lookups. + +```diff +@@ Schema (Checkpoint 0) - raw_payloads @@ + CREATE TABLE raw_payloads ( + id INTEGER PRIMARY KEY, + source TEXT NOT NULL, -- 'gitlab' ++ project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources + resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion' + gitlab_id INTEGER NOT NULL, + fetched_at INTEGER NOT NULL, + json TEXT NOT NULL + ); +-CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); ++CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id); ++CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at); +``` + +--- + +## Change 4: Tighten Uniqueness Constraints (project_id + iid) + +**Why this is better:** Users think in terms of "issue 123 in project X," not global IDs. This enables O(1) `gi show issue 123 --project=X` and prevents subtle ingestion bugs from creating duplicate rows. 
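To make the intent concrete before the schema diff: with `(project_id, iid)` unique, ingestion becomes an idempotent upsert keyed by that pair. A minimal sketch, assuming a better-sqlite3-style named-parameter SQL shape; the `Map` below stands in for the SQLite table to show the keying behaviour, and all names are illustrative:

```typescript
// Sketch: (project_id, iid) as the natural upsert key for issues.
// The SQL string shows the intended ON CONFLICT form; the Map models
// the same keying in memory so the example is self-contained.
const UPSERT_ISSUE_SQL = `
  INSERT INTO issues (gitlab_id, project_id, iid, title, updated_at)
  VALUES (@gitlab_id, @project_id, @iid, @title, @updated_at)
  ON CONFLICT(project_id, iid) DO UPDATE SET
    title = excluded.title,
    updated_at = excluded.updated_at
`;

interface IssueRow {
  project_id: number;
  iid: number;
  title: string;
}

const issues = new Map<string, IssueRow>();

function upsertIssue(row: IssueRow): void {
  // Re-ingesting the same (project_id, iid) replaces the row, never duplicates it.
  issues.set(`${row.project_id}:${row.iid}`, row);
}

upsertIssue({ project_id: 1, iid: 123, title: "Auth redesign" });
upsertIssue({ project_id: 1, iid: 123, title: "Auth redesign (edited)" });
upsertIssue({ project_id: 2, iid: 123, title: "Same iid, other project" });
```

The same key also serves the `gi show issue 123 --project=X` lookup: one indexed equality probe instead of a scan.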
+ +```diff +@@ Schema Preview - issues @@ + CREATE TABLE issues ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, + author_username TEXT, + created_at INTEGER, + updated_at INTEGER, + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) + ); + CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); + CREATE INDEX idx_issues_author ON issues(author_username); ++CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid); +``` + +```diff +@@ Schema Additions - merge_requests @@ + CREATE TABLE merge_requests ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + ... + ); + CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); + CREATE INDEX idx_mrs_author ON merge_requests(author_username); ++CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid); +``` + +--- + +## Change 5: Store System Notes (Flagged) + Capture DiffNote Paths + +**Why this is better:** Two problems with dropping system notes entirely: (1) Some system notes carry decision trace context ("marked as resolved", "changed milestone"). (2) File/path search is disproportionately valuable for engineers. DiffNote positions already contain path metadata - capturing it now enables immediate filename search. 
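The flag-and-capture rule above can be sketched as a small transform. The payload shape follows GitLab's notes API (`system` boolean, optional `position` object on DiffNotes); the row field names mirror the schema proposed in this change, and the function name is illustrative:

```typescript
// Sketch of the note transform: flag system notes instead of dropping them,
// and lift DiffNote position paths/lines into dedicated columns.
interface GitLabNotePayload {
  id: number;
  system: boolean;
  type?: "DiscussionNote" | "DiffNote" | null;
  body: string;
  position?: {
    old_path?: string;
    new_path?: string;
    old_line?: number | null;
    new_line?: number | null;
  };
}

interface NoteRowFields {
  gitlab_id: number;
  is_system: 0 | 1;
  position_old_path: string | null;
  position_new_path: string | null;
  position_old_line: number | null;
  position_new_line: number | null;
}

function toNoteRow(p: GitLabNotePayload): NoteRowFields {
  return {
    gitlab_id: p.id,
    is_system: p.system ? 1 : 0, // stored and flagged, not filtered out
    position_old_path: p.position?.old_path ?? null,
    position_new_path: p.position?.new_path ?? null,
    position_old_line: p.position?.old_line ?? null,
    position_new_line: p.position?.new_line ?? null,
  };
}
```

Downstream, document extraction skips `is_system = 1` rows by default, while the `position_new_path` column feeds filename search directly.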
+ +```diff +@@ Checkpoint 2 Scope @@ + - Discussions fetcher (issue discussions + MR discussions) as a dependent resource: + - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions` + - During initial ingest: fetch discussions for every issue/MR + - During sync: refetch discussions only for issues/MRs updated since cursor +- - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise ++ - Preserve system notes but flag them with `is_system=1`; exclude from embeddings by default ++ - Capture DiffNote file path/line metadata from `position` field for immediate filename search value +``` + +```diff +@@ Schema Additions - notes @@ + CREATE TABLE notes ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + discussion_id INTEGER NOT NULL REFERENCES discussions(id), + project_id INTEGER NOT NULL REFERENCES projects(id), + type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API) ++ is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.) 
+ author_username TEXT, + body TEXT, + created_at INTEGER, + updated_at INTEGER, + position INTEGER, -- derived from array order in API response (0-indexed) + resolvable BOOLEAN, + resolved BOOLEAN, + resolved_by TEXT, + resolved_at INTEGER, ++ -- DiffNote position metadata (nullable, from GitLab API position object) ++ position_old_path TEXT, ++ position_new_path TEXT, ++ position_old_line INTEGER, ++ position_new_line INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id) + ); + CREATE INDEX idx_notes_discussion ON notes(discussion_id); + CREATE INDEX idx_notes_author ON notes(author_username); + CREATE INDEX idx_notes_type ON notes(type); ++CREATE INDEX idx_notes_system ON notes(is_system); ++CREATE INDEX idx_notes_new_path ON notes(position_new_path); +``` + +```diff +@@ Discussion Processing Rules @@ +-- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.) ++- System notes (`system: true`) are ingested with `notes.is_system=1` ++ - Excluded from document extraction/embeddings by default (reduces noise in semantic search) ++ - Preserved for audit trail, timeline views, and potential future decision-tracing features ++ - Can be toggled via `--include-system-notes` flag if needed ++- DiffNote position data is extracted and stored: ++ - `position.old_path`, `position.new_path` for file-level search ++ - `position.old_line`, `position.new_line` for line-level context + - Each discussion from the API becomes one row in `discussions` table + - All notes within a discussion are stored with their `discussion_id` foreign key + - `individual_note: true` discussions have exactly one note (standalone comment) + - `individual_note: false` discussions have multiple notes (threaded conversation) +``` + +```diff +@@ Checkpoint 2 Automated Tests @@ + tests/unit/discussion-transformer.test.ts + - transforms discussion payload to normalized schema + - extracts notes array from discussion + - sets 
individual_note flag correctly +- - filters out system notes (system: true) ++ - flags system notes with is_system=1 ++ - extracts DiffNote position metadata (paths and lines) + - preserves note order via position field + + tests/integration/discussion-ingestion.test.ts + - fetches discussions for each issue + - fetches discussions for each MR + - creates discussion rows with correct parent FK + - creates note rows linked to discussions +- - excludes system notes from storage ++ - stores system notes with is_system=1 flag ++ - extracts position_new_path from DiffNotes + - captures note-level resolution status + - captures note type (DiscussionNote, DiffNote) +``` + +```diff +@@ Checkpoint 2 Data Integrity Checks @@ + - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count + - [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments + - [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked) +-- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes) ++- [ ] System notes have `is_system=1` flag set correctly ++- [ ] DiffNotes have `position_new_path` populated when available + - [ ] Every discussion has at least one note + - [ ] `individual_note = true` discussions have exactly one note + - [ ] Discussion `first_note_at` <= `last_note_at` for all rows +``` + +--- + +## Change 6: Document Extraction Structured Header + Truncation Metadata + +**Why this is better:** Adding a deterministic header improves search snippets (more informative), embeddings (model gets stable context), and debuggability (see if/why truncation happened). 
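The header construction this change specifies is deterministic, so it is worth pinning down as code. A sketch assuming the generic header layout (`Files:` emitted only when DiffNote paths exist); field and function names are illustrative:

```typescript
// Sketch of the structured document header. Deterministic output means the
// same metadata always hashes to the same content_hash prefix.
interface DocMeta {
  sourceType: "Issue" | "MR" | "Discussion";
  title: string;
  projectPath: string;
  url: string;
  labels: string[];
  files: string[]; // paths from DiffNotes, if any
}

function buildHeader(m: DocMeta): string {
  const lines = [
    `[[${m.sourceType}]] ${m.title}`,
    `Project: ${m.projectPath}`,
    `URL: ${m.url}`,
    `Labels: ${JSON.stringify(m.labels)}`,
  ];
  if (m.files.length > 0) {
    lines.push(`Files: ${JSON.stringify(m.files)}`); // only when DiffNotes exist
  }
  lines.push("", "--- Content ---", "");
  return lines.join("\n");
}
```

Because the header is prepended to `content_text` before hashing, a label or file-path change alone is enough to trigger re-embedding via `content_hash`.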
+ +```diff +@@ Schema Additions - documents @@ + CREATE TABLE documents ( + id INTEGER PRIMARY KEY, + source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' + source_id INTEGER NOT NULL, -- local DB id in the source table + project_id INTEGER NOT NULL REFERENCES projects(id), + author_username TEXT, -- for discussions: first note author + label_names TEXT, -- JSON array (display/debug only) + created_at INTEGER, + updated_at INTEGER, + url TEXT, + title TEXT, -- null for discussions + content_text TEXT NOT NULL, -- canonical text for embedding/snippets + content_hash TEXT NOT NULL, -- SHA-256 for change detection ++ is_truncated BOOLEAN NOT NULL DEFAULT 0, ++ truncated_reason TEXT, -- 'token_limit_middle_drop' | null + UNIQUE(source_type, source_id) + ); +``` + +```diff +@@ Discussion Document Format @@ +-[Issue #234: Authentication redesign] Discussion ++[[Discussion]] Issue #234: Authentication redesign ++Project: group/project-one ++URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ++Labels: ["bug", "auth"] ++Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread ++ ++--- Thread --- + + @johndoe (2024-03-15): + I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients... + + @janedoe (2024-03-15): + Agreed. What about refresh token strategy? + + @johndoe (2024-03-16): + Short-lived access tokens (15min), longer refresh (7 days). Here's why... 
+``` + +```diff +@@ Document Extraction Rules @@ + | Source | content_text Construction | + |--------|--------------------------| +-| Issue | `title + "\n\n" + description` | +-| MR | `title + "\n\n" + description` | ++| Issue | Structured header + `title + "\n\n" + description` | ++| MR | Structured header + `title + "\n\n" + description` | + | Discussion | Full thread with context (see below) | + ++**Structured Header Format (all document types):** ++``` ++[[{SourceType}]] {Title} ++Project: {path_with_namespace} ++URL: {web_url} ++Labels: {JSON array of label names} ++Files: {JSON array of paths from DiffNotes, if any} ++ ++--- Content --- ++``` ++ ++This format provides: ++- Stable, parseable context for embeddings ++- Consistent snippet formatting in search results ++- File path context without full file-history feature +``` + +```diff +@@ Truncation @@ +-**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning. ++**Truncation:** ++If content exceeds 8000 tokens: ++1. Truncate from the middle (preserve first + last notes for context) ++2. Set `documents.is_truncated = 1` ++3. Set `documents.truncated_reason = 'token_limit_middle_drop'` ++4. Log a warning with document ID and original token count ++ ++This metadata enables: ++- Monitoring truncation frequency in production ++- Future investigation of high-value truncated documents ++- Debugging when search misses expected content +``` + +--- + +## Change 7: Embedding Pipeline Concurrency + Per-Document Error Tracking + +**Why this is better:** For 50-100K documents, embedding is the longest pole. Controlled concurrency (4-8 workers) saturates local inference without OOM. Per-document error tracking prevents single bad payloads from stalling "100% coverage" and enables targeted re-runs. 
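The worker-pool-plus-error-record shape described here can be sketched without the real Ollama client. `embed` is a stand-in for the embedding call; concurrency and attempt defaults mirror the proposed values (4 workers, 3 attempts), and exponential backoff between attempts is omitted for brevity:

```typescript
// Sketch of the bounded-concurrency embedding loop with per-document error
// capture: a bad document records its failure and never stalls the run.
type EmbedFn = (text: string) => Promise<number[]>;

interface AttemptRecord {
  attempt_count: number;
  last_error: string | null;
}

async function embedAll(
  docs: { id: number; text: string }[],
  embed: EmbedFn,
  concurrency = 4,
  maxAttempts = 3,
): Promise<Map<number, AttemptRecord>> {
  const records = new Map<number, AttemptRecord>();
  const queue = [...docs];

  async function worker(): Promise<void> {
    for (let doc = queue.shift(); doc; doc = queue.shift()) {
      const rec: AttemptRecord = { attempt_count: 0, last_error: null };
      records.set(doc.id, rec);
      while (rec.attempt_count < maxAttempts) {
        rec.attempt_count++;
        try {
          await embed(doc.text); // real code would also store the vector
          rec.last_error = null; // success clears any earlier error
          break;
        } catch (e) {
          rec.last_error = String(e); // recorded for `gi embed --retry-failed`
          // exponential backoff would sleep here before the next attempt
        }
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, worker));
  return records;
}
```

The returned records map onto `embedding_metadata.attempt_count` / `last_error`, which is what makes targeted re-runs possible.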
+ +```diff +@@ Checkpoint 3: Embedding Generation - Scope @@ + **Scope:** + - Ollama integration (nomic-embed-text model) +-- Embedding generation pipeline (batch processing, 32 documents per batch) ++- Embedding generation pipeline: ++ - Batch size: 32 documents per batch ++ - Concurrency: configurable (default 4 workers) ++ - Retry with exponential backoff for transient failures (max 3 attempts) ++ - Per-document failure recording to enable targeted re-runs + - Vector storage in SQLite (sqlite-vss extension) + - Progress tracking and resumability + - Document extraction layer: +``` + +```diff +@@ Schema Additions - embedding_metadata @@ + CREATE TABLE embedding_metadata ( + document_id INTEGER PRIMARY KEY REFERENCES documents(id), + model TEXT NOT NULL, -- 'nomic-embed-text' + dims INTEGER NOT NULL, -- 768 + content_hash TEXT NOT NULL, -- copied from documents.content_hash +- created_at INTEGER NOT NULL ++ created_at INTEGER NOT NULL, ++ -- Error tracking for resumable embedding ++ last_error TEXT, -- error message from last failed attempt ++ attempt_count INTEGER NOT NULL DEFAULT 0, ++ last_attempt_at INTEGER -- when last attempt occurred + ); ++ ++-- Index for finding failed embeddings to retry ++CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL; +``` + +```diff +@@ Checkpoint 3 Automated Tests @@ + tests/integration/embedding-storage.test.ts + - stores embedding in sqlite-vss + - embedding rowid matches document id + - creates embedding_metadata record + - skips re-embedding when content_hash unchanged + - re-embeds when content_hash changes ++ - records error in embedding_metadata on failure ++ - increments attempt_count on each retry ++ - clears last_error on successful embedding ++ - respects concurrency limit +``` + +```diff +@@ Checkpoint 3 Manual CLI Smoke Tests @@ + | Command | Expected Output | Pass Criteria | + |---------|-----------------|---------------| + | `gi embed --all` | Progress bar with 
ETA | Completes without error | + | `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs | ++| `gi embed --retry-failed` | Progress on failed docs | Re-attempts previously failed embeddings | + | `gi stats` | Embedding coverage stats | Shows 100% coverage | + | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts | + | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error | ++ ++**Stats output should include:** ++- Total documents ++- Successfully embedded ++- Failed (with error breakdown) ++- Pending (never attempted) +``` + +--- + +## Change 8: Search UX Improvements (--project, --explain, Stable JSON Schema) + +**Why this is better:** For day-to-day use, "search across everything" is less useful than "search within repo X." The `--explain` flag helps validate ranking during MVP. Stable JSON schema prevents accidental breaking changes for agent/MCP consumption. + +```diff +@@ Checkpoint 4 Scope @@ +-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name` ++- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path` ++- Debug: `--explain` returns rank contributions from vector + FTS + RRF + - Label filtering operates on `document_labels` (indexed, exact-match) + - Output formatting: ranked list with title, snippet, score, URL +-- JSON output mode for AI agent consumption ++- JSON output mode for AI/agent consumption (stable schema, documented) + - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning +``` + +```diff +@@ CLI Interface @@ + # Basic semantic search + gi search "why did we choose Redis" + ++# Search within specific project ++gi search "authentication" --project=group/project-one ++ + # Pure FTS search (fallback if embeddings unavailable) + gi search "redis" --mode=lexical + + # Filtered search + gi search "authentication" 
--type=mr --after=2024-01-01 + + # Filter by label + gi search "performance" --label=bug --label=critical + + # JSON output for programmatic use + gi search "payment processing" --json ++ ++# Debug ranking (shows how each retriever contributed) ++gi search "authentication" --explain +``` + +```diff +@@ JSON Output Schema (NEW SECTION) @@ ++**JSON Output Schema (Stable)** ++ ++For AI/agent consumption, `--json` output follows this stable schema: ++ ++```typescript ++interface SearchResult { ++ documentId: number; ++ sourceType: "issue" | "merge_request" | "discussion"; ++ title: string | null; ++ url: string; ++ projectPath: string; ++ author: string | null; ++ createdAt: string; // ISO 8601 ++ updatedAt: string; // ISO 8601 ++ score: number; // 0-1 normalized RRF score ++ snippet: string; // truncated content_text ++ labels: string[]; ++ // Only present with --explain flag ++ explain?: { ++ vectorRank?: number; // null if not in vector results ++ ftsRank?: number; // null if not in FTS results ++ rrfScore: number; ++ }; ++} ++ ++interface SearchResponse { ++ query: string; ++ mode: "hybrid" | "lexical" | "semantic"; ++ totalResults: number; ++ results: SearchResult[]; ++ warnings?: string[]; // e.g., "Embedding service unavailable" ++} ++``` ++ ++**Schema versioning:** Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed. 
+``` + +```diff +@@ Checkpoint 4 Manual CLI Smoke Tests @@ + | Command | Expected Output | Pass Criteria | + |---------|-----------------|---------------| + | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score | ++| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project | + | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output | + | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe | + | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date | + | `gi search "authentication" --label=bug` | Label filtered | All results have bug label | + | `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama | + | `gi search "authentication" --json` | JSON output | Valid JSON matching schema | ++| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions | + | `gi search "xyznonexistent123"` | No results message | Graceful empty state | + | `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results | +``` + +--- + +## Change 9: Make `gi sync` an Orchestrator + +**Why this is better:** Once CP3+ exist, operators want one command that does the right thing. The most common MVP failure is "I ingested but forgot to regenerate docs / embed / update FTS." 
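The orchestration reduces to a fixed step order with skip flags and a lock released on every exit path. A sketch with placeholder step functions (the names track the CLI flags proposed in this change, not a real implementation):

```typescript
// Sketch of `gi sync` as an orchestrator: fixed step order, --no-docs/--no-embed
// skip flags, lock always released with a terminal status.
interface SyncFlags {
  noDocs?: boolean;
  noEmbed?: boolean;
}

interface SyncSteps {
  acquireLock: () => Promise<void>;
  ingestDelta: () => Promise<void>;
  backfillWindow: () => Promise<void>;
  regenerateDocs: () => Promise<void>;
  embedChanged: () => Promise<void>;
  releaseLock: (status: "succeeded" | "failed") => Promise<void>;
}

async function runSync(steps: SyncSteps, flags: SyncFlags = {}): Promise<void> {
  await steps.acquireLock();
  try {
    await steps.ingestDelta();
    await steps.backfillWindow();
    if (!flags.noDocs) await steps.regenerateDocs();
    if (!flags.noEmbed) await steps.embedChanged();
    await steps.releaseLock("succeeded"); // records sync_run as succeeded
  } catch (e) {
    await steps.releaseLock("failed"); // lock never leaks on error
    throw e;
  }
}
```

FTS has no explicit step because triggers keep it synchronized with `documents`.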
+ +```diff +@@ Checkpoint 5 CLI Commands @@ + ```bash +-# Full sync (respects cursors, only fetches new/updated) +-gi sync ++# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced) ++gi sync # orchestrates all steps ++gi sync --no-embed # skip embedding step (fast ingest/debug) ++gi sync --no-docs # skip document regeneration (debug) + + # Force full re-sync (resets cursors) + gi sync --full + + # Override stale 'running' run after operator review + gi sync --force + + # Show sync status + gi sync-status + ``` ++ ++**Orchestration steps (in order):** ++1. Acquire app lock with heartbeat ++2. Ingest delta (issues, MRs, discussions) based on cursors ++3. Apply rolling backfill window ++4. Regenerate documents for changed entities ++5. Embed documents with changed content_hash ++6. FTS triggers auto-sync (no explicit step needed) ++7. Release lock, record sync_run as succeeded ++ ++Individual commands remain available for checkpoint testing and debugging: ++- `gi ingest --type=issues` ++- `gi ingest --type=merge_requests` ++- `gi embed --all` ++- `gi embed --retry-failed` +``` + +--- + +## Change 10: Checkpoint Focus Sharpening + +**Why this is better:** Makes each checkpoint's exit criteria crisper and reduces overlap. 
+ +```diff +@@ Checkpoint 0: Project Setup @@ +-**Deliverable:** Scaffolded project with GitLab API connection verified ++**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working + + **Scope:** + - Project structure (TypeScript, ESLint, Vitest) + - GitLab API client with PAT authentication + - Environment and project configuration + - Basic CLI scaffold with `auth-test` command + - `doctor` command for environment verification +-- Projects table and initial sync +-- Sync tracking for reliability ++- Projects table and initial project resolution (no issue/MR ingestion yet) ++- DB migrations + WAL + FK enforcement ++- Sync tracking with crash-safe single-flight lock ++- Rate limit handling with exponential backoff + jitter +``` + +```diff +@@ Checkpoint 1 Deliverable @@ +-**Deliverable:** All issues from target repos stored locally ++**Deliverable:** All issues + labels from target repos stored locally with resumable cursor-based sync +``` + +```diff +@@ Checkpoint 2 Deliverable @@ +-**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context ++**Deliverable:** All MRs + discussions + notes (including flagged system notes) stored locally with full thread context and DiffNote file paths captured +``` + +--- + +## Change 11: Risk Mitigation Updates + +```diff +@@ Risk Mitigation @@ + | Risk | Mitigation | + |------|------------| + | GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync | + | Embedding model quality | Start with nomic-embed-text; architecture allows model swap | + | SQLite scale limits | Monitor performance; Postgres migration path documented | + | Stale data | Incremental sync with change detection | +-| Mid-sync failures | Cursor-based resumption, sync_runs audit trail | ++| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery | ++| Missed updates | Rolling backfill window 
(14 days), tuple cursor semantics | + | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | +-| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | ++| Concurrent sync corruption | DB-enforced app lock with heartbeat, automatic stale lock recovery | ++| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs | +``` + +--- + +## Change 12: Resolved Decisions Updates + +```diff +@@ Resolved Decisions @@ + | Question | Decision | Rationale | + |----------|----------|-----------| + | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability | +-| System notes | **Exclude during ingestion** | System notes add noise without semantic value | ++| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise | ++| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature | + | MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature | + | Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering | + | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings | + | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient | ++| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates | + | Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates | + | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | + | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping | + | Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context | +-| Embedding batching | **32 documents per batch** | Balance 
throughput and memory | ++| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation | + | FTS5 tokenizer | **porter unicode61** | Stemming improves recall | + | Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching | ++| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | +``` + +--- + +## Summary of All Changes + +| # | Change | Impact | +|---|--------|--------| +| 1 | Crash-safe heartbeat lock | Prevents race conditions, auto-recovers from crashes | +| 2 | Tuple cursor + rolling backfill | Reduces risk of missed updates dramatically | +| 3 | project_id on raw_payloads | Defensive scoping for multi-project scenarios | +| 4 | Uniqueness on (project_id, iid) | Enables O(1) `gi show issue 123 --project=X` | +| 5 | Store system notes flagged + DiffNote paths | Preserves audit trail, enables immediate file search | +| 6 | Structured document header + truncation metadata | Better embeddings, debuggability | +| 7 | Embedding concurrency + per-doc errors | 50-100K docs becomes manageable | +| 8 | --project, --explain, stable JSON | Day-to-day UX and trust-building | +| 9 | `gi sync` orchestrator | Reduces human error | +| 10 | Checkpoint focus sharpening | Clearer exit criteria | +| 11-12 | Risk/Decisions updates | Documentation alignment | + +**Net effect:** Same MVP product (semantic search over issues/MRs/discussions), but with production-grade hardening that prevents the class of bugs that typically kill MVPs in real-world use. 
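As a closing sanity check on the tuple-cursor semantics from Change 2: the fetch predicate and cursor advance reduce to a two-field comparison. A minimal sketch, assuming epoch-second `updated_at` values:

```typescript
// Sketch of the tuple-cursor rule: an item is "new" iff (updated_at, gitlab_id)
// sorts strictly after the cursor. The same predicate filters locally when the
// API cannot express the tie-break server-side.
interface Cursor {
  updated_at: number;
  gitlab_id: number;
}

interface Item {
  updated_at: number;
  gitlab_id: number;
}

function isAfterCursor(item: Item, c: Cursor): boolean {
  return (
    item.updated_at > c.updated_at ||
    (item.updated_at === c.updated_at && item.gitlab_id > c.gitlab_id)
  );
}

// Called only after the page's DB transaction commits.
function advanceCursor(lastProcessed: Item): Cursor {
  return {
    updated_at: lastProcessed.updated_at,
    gitlab_id: lastProcessed.gitlab_id,
  };
}
```

Because the comparison is strict on both fields, dense timestamp buckets (many items sharing one `updated_at`) resume correctly mid-bucket after a crash.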
diff --git a/SPEC.md b/SPEC.md index 0d242bf..752d7bc 100644 --- a/SPEC.md +++ b/SPEC.md @@ -6,6 +6,69 @@ A self-hosted tool to extract, index, and semantically search 2+ years of GitLab --- +## Quick Start + +### Prerequisites + +| Requirement | Version | Notes | +|-------------|---------|-------| +| Node.js | 20+ | LTS recommended | +| npm | 10+ | Comes with Node.js | +| Ollama | Latest | Optional for semantic search; lexical search works without it | + +### Installation + +```bash +# Clone and install +git clone https://github.com/your-org/gitlab-inbox.git +cd gitlab-inbox +npm install +npm run build +npm link # Makes `gi` available globally +``` + +### First Run + +1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope): + ```bash + export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx" + ``` + +2. **Run the setup wizard:** + ```bash + gi init + ``` + This creates `gi.config.json` with your GitLab URL and project paths. + +3. **Verify your environment:** + ```bash + gi doctor + ``` + All checks should pass (Ollama warning is OK if you only need lexical search). + +4. **Sync your data:** + ```bash + gi sync + ``` + Initial sync takes 10-20 minutes depending on repo size and rate limits. + +5. **Search:** + ```bash + gi search "authentication redesign" + ``` + +### Troubleshooting First Run + +| Symptom | Solution | +|---------|----------| +| `Config file not found` | Run `gi init` first | +| `GITLAB_TOKEN not set` | Export the environment variable | +| `401 Unauthorized` | Check token has `read_api` scope | +| `Project not found: group/project` | Verify project path in GitLab URL | +| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search | + +--- + ## Discovery Summary ### Pain Points Identified @@ -105,10 +168,12 @@ GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&pe Discussions must be fetched per-issue and per-MR. 
There is no bulk endpoint: ``` -GET /projects/:id/issues/:iid/discussions -GET /projects/:id/merge_requests/:iid/discussions +GET /projects/:id/issues/:iid/discussions?per_page=100&page=N +GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N ``` +**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss. + ### Sync Pattern **Initial sync:** @@ -124,17 +189,26 @@ GET /projects/:id/merge_requests/:iid/discussions 3. Fetch MRs where `updated_after=cursor` (bulk) 4. For EACH updated MR → refetch ALL its discussions -### Critical Assumption +### Critical Assumption (Softened) -**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed. +We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively. -Mitigation: Periodic full re-sync (weekly) as a safety net. +**Mitigations (MVP):** +1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor. +2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly. +3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`). + +The backfill window provides 80% of the safety of full resync at <5% of the API cost. 
### Rate Limiting - Default: 10 requests/second with exponential backoff - Respect `Retry-After` headers on 429 responses - Add jitter to avoid thundering herd on retry +- **Separate concurrency limits:** + - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4) + - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s) + - Bound concurrency per-project to avoid one repo starving the other - Initial sync estimate: 10-20 minutes depending on rate limits --- @@ -144,7 +218,7 @@ Mitigation: Periodic full re-sync (weekly) as a safety net. Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding. ### Checkpoint 0: Project Setup -**Deliverable:** Scaffolded project with GitLab API connection verified +**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working **Automated Tests (Vitest):** ``` @@ -165,6 +239,23 @@ tests/integration/gitlab-client.test.ts ✓ returns 401 for invalid PAT ✓ fetches project by path ✓ handles rate limiting (429) with retry + +tests/integration/app-lock.test.ts + ✓ acquires lock successfully + ✓ updates heartbeat during operation + ✓ detects stale lock and recovers + ✓ refuses concurrent acquisition + +tests/integration/init.test.ts + ✓ creates config file with valid structure + ✓ validates GitLab URL format + ✓ validates GitLab connection before writing config + ✓ validates each project path exists in GitLab + ✓ fails if token not set + ✓ fails if GitLab auth fails + ✓ fails if any project path not found + ✓ prompts before overwriting existing config + ✓ respects --force to skip confirmation ``` **Manual CLI Smoke Tests:** @@ -174,6 +265,10 @@ tests/integration/gitlab-client.test.ts | `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) | | `gi doctor --json` | JSON object with check results | Valid JSON, 
`success: true` for required checks | | `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure | +| `gi init` | Interactive prompts | Creates valid gi.config.json | +| `gi init` (config exists) | Confirmation prompt | Warns before overwriting | +| `gi --help` | Command list | Shows all available commands | +| `gi version` | Version number | Shows installed version | **Data Integrity Checks:** - [ ] `projects` table contains rows for each configured project path @@ -186,7 +281,24 @@ tests/integration/gitlab-client.test.ts - Environment and project configuration - Basic CLI scaffold with `auth-test` command - `doctor` command for environment verification -- Projects table and initial sync +- Projects table and initial project resolution (no issue/MR ingestion yet) +- DB migrations + WAL + FK enforcement +- Sync tracking with crash-safe single-flight lock (heartbeat-based) +- Rate limit handling with exponential backoff + jitter +- `gi init` command for guided setup: + - Prompts for GitLab base URL + - Prompts for project paths (comma-separated or multiple prompts) + - Prompts for token environment variable name (default: GITLAB_TOKEN) + - **Validates before writing config:** + - Token must be set in environment + - Tests auth with `GET /user` endpoint + - Validates each project path with `GET /projects/:path` + - Only writes config after all validations pass + - Generates `gi.config.json` with sensible defaults +- `gi --help` shows all available commands +- `gi --help` shows command-specific help +- `gi version` shows installed version +- First-run detection: if no config exists, suggest `gi init` **Configuration (MVP):** ```json @@ -200,10 +312,22 @@ tests/integration/gitlab-client.test.ts { "path": "group/project-one" }, { "path": "group/project-two" } ], + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, + "heartbeatIntervalSeconds": 30, + "cursorRewindSeconds": 2, + "primaryConcurrency": 4, + 
"dependentConcurrency": 2 + }, + "storage": { + "compressRawPayloads": true + }, "embedding": { "provider": "ollama", "model": "nomic-embed-text", - "baseUrl": "http://localhost:11434" + "baseUrl": "http://localhost:11434", + "concurrency": 4 } } ``` @@ -232,12 +356,21 @@ CREATE INDEX idx_projects_path ON projects(path_with_namespace); CREATE TABLE sync_runs ( id INTEGER PRIMARY KEY, started_at INTEGER NOT NULL, + heartbeat_at INTEGER NOT NULL, finished_at INTEGER, status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. error TEXT ); +-- Crash-safe single-flight lock (DB-enforced) +CREATE TABLE app_locks ( + name TEXT PRIMARY KEY, -- 'sync' + owner TEXT NOT NULL, -- random run token (UUIDv4) + acquired_at INTEGER NOT NULL, + heartbeat_at INTEGER NOT NULL +); + -- Sync cursors for primary resources only -- Notes and MR changes are dependent resources (fetched via parent updates) CREATE TABLE sync_cursors ( @@ -252,18 +385,21 @@ CREATE TABLE sync_cursors ( CREATE TABLE raw_payloads ( id INTEGER PRIMARY KEY, source TEXT NOT NULL, -- 'gitlab' - resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' + project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources + resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion' gitlab_id INTEGER NOT NULL, fetched_at INTEGER NOT NULL, - json TEXT NOT NULL + content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip' + payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON ); -CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); +CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id); +CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at); ``` --- ### Checkpoint 1: Issue Ingestion -**Deliverable:** All issues from target repos stored locally +**Deliverable:** All issues + labels + 
issue discussions from target repos stored locally with resumable cursor-based sync **Automated Tests (Vitest):** ``` @@ -277,6 +413,13 @@ tests/unit/pagination.test.ts ✓ respects per_page parameter ✓ stops when empty page returned +tests/unit/discussion-transformer.test.ts + ✓ transforms discussion payload to normalized schema + ✓ extracts notes array from discussion + ✓ sets individual_note flag correctly + ✓ flags system notes with is_system=1 + ✓ preserves note order via position field + tests/integration/issue-ingestion.test.ts ✓ inserts issues into database ✓ creates labels from issue payloads @@ -285,6 +428,13 @@ tests/integration/issue-ingestion.test.ts ✓ updates cursor after successful page commit ✓ resumes from cursor on subsequent runs +tests/integration/issue-discussion-ingestion.test.ts + ✓ fetches discussions for each issue + ✓ creates discussion rows with correct issue FK + ✓ creates note rows linked to discussions + ✓ stores system notes with is_system=1 flag + ✓ handles individual_note=true discussions + tests/integration/sync-runs.test.ts ✓ creates sync_run record on start ✓ marks run as succeeded on completion @@ -300,7 +450,9 @@ tests/integration/sync-runs.test.ts | `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author | | `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project | | `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI | -| `gi show issue 123` | Issue detail view | Shows title, description, labels, URL | +| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL | +| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count | +| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count | | `gi sync-status` | Last sync time, cursor positions | Shows successful last run | **Data Integrity Checks:** @@ -309,6 +461,9 @@ 
tests/integration/sync-runs.test.ts - [ ] Labels in `issue_labels` junction all exist in `labels` table - [ ] `sync_cursors` has entry for each (project_id, 'issues') pair - [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current) +- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero +- [ ] Every discussion has at least one note +- [ ] `individual_note = true` discussions have exactly one note **Scope:** - Issue fetcher with pagination handling @@ -317,12 +472,26 @@ tests/integration/sync-runs.test.ts - Labels ingestion derived from issue payload: - Always persist label names from `labels: string[]` - Optionally request `with_labels_details=true` to capture color/description when available +- Issue discussions fetcher: + - Uses `GET /projects/:id/issues/:iid/discussions` + - Fetches all discussions for each issue during ingest + - Preserve system notes but flag them with `is_system=1` - Incremental sync support (run tracking + per-project cursor) - Basic list/count CLI commands **Reliability/Idempotency Rules:** - Every ingest/sync creates a `sync_runs` row -- Single-flight: refuse to start if an existing run is `running` (unless `--force`) +- Single-flight via DB-enforced app lock: + - On start: acquire lock via transactional compare-and-swap: + - `BEGIN IMMEDIATE` (acquires write lock immediately) + - If no row exists → INSERT new lock + - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps + - Else if `owner` matches current run → UPDATE heartbeat (re-entrant) + - Else → ROLLBACK and fail fast (another run is active) + - `COMMIT` + - During run: update `heartbeat_at` every 30 seconds + - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire + - `--force` remains as operator override for edge cases, but should rarely be needed - Cursor advances only after successful transaction commit per page/batch - Ordering: `updated_at ASC`, tie-breaker 
`gitlab_id ASC` - Use explicit transactions for batch inserts @@ -340,11 +509,13 @@ CREATE TABLE issues ( author_username TEXT, created_at INTEGER, updated_at INTEGER, + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync web_url TEXT, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); CREATE INDEX idx_issues_author ON issues(author_username); +CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid); -- Labels are derived from issue payloads (string array) -- Uniqueness is (project_id, name) since gitlab_id isn't always available @@ -365,12 +536,65 @@ CREATE TABLE issue_labels ( PRIMARY KEY(issue_id, label_id) ); CREATE INDEX idx_issue_labels_label ON issue_labels(label_id); + +-- Discussion threads for issues (MR discussions added in CP2) +CREATE TABLE discussions ( + id INTEGER PRIMARY KEY, + gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...") + project_id INTEGER NOT NULL REFERENCES projects(id), + issue_id INTEGER REFERENCES issues(id), + merge_request_id INTEGER REFERENCES merge_requests(id), + noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' + individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion + first_note_at INTEGER, -- for ordering discussions + last_note_at INTEGER, -- for "recently active" queries + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + resolvable BOOLEAN, -- MR discussions can be resolved + resolved BOOLEAN, + CHECK ( + (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR + (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) + ) +); +CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id); +CREATE INDEX idx_discussions_issue ON discussions(issue_id); +CREATE INDEX idx_discussions_mr ON discussions(merge_request_id); +CREATE INDEX 
idx_discussions_last_note ON discussions(last_note_at); + +-- Notes belong to discussions (preserving thread context) +CREATE TABLE notes ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + discussion_id INTEGER NOT NULL REFERENCES discussions(id), + project_id INTEGER NOT NULL REFERENCES projects(id), + type TEXT, -- 'DiscussionNote' | 'DiffNote' | null + is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.) + author_username TEXT, + body TEXT, + created_at INTEGER, + updated_at INTEGER, + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + position INTEGER, -- derived from array order in API response (0-indexed) + resolvable BOOLEAN, + resolved BOOLEAN, + resolved_by TEXT, + resolved_at INTEGER, + -- DiffNote position metadata (only populated for MR DiffNotes in CP2) + position_old_path TEXT, + position_new_path TEXT, + position_old_line INTEGER, + position_new_line INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_notes_discussion ON notes(discussion_id); +CREATE INDEX idx_notes_author ON notes(author_username); +CREATE INDEX idx_notes_system ON notes(is_system); ``` --- -### Checkpoint 2: MR + Discussions Ingestion -**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context +### Checkpoint 2: MR Ingestion +**Deliverable:** All MRs + MR discussions + notes with DiffNote paths captured **Automated Tests (Vitest):** ``` @@ -379,12 +603,9 @@ tests/unit/mr-transformer.test.ts ✓ extracts labels from MR payload ✓ handles missing optional fields gracefully -tests/unit/discussion-transformer.test.ts - ✓ transforms discussion payload to normalized schema - ✓ extracts notes array from discussion - ✓ sets individual_note flag correctly - ✓ filters out system notes (system: true) - ✓ preserves note order via position field +tests/unit/diffnote-transformer.test.ts + ✓ extracts DiffNote position metadata (paths and 
lines) + ✓ handles missing position fields gracefully tests/integration/mr-ingestion.test.ts ✓ inserts MRs into database @@ -392,12 +613,11 @@ tests/integration/mr-ingestion.test.ts ✓ links MRs to labels via junction table ✓ stores raw payload for each MR -tests/integration/discussion-ingestion.test.ts - ✓ fetches discussions for each issue +tests/integration/mr-discussion-ingestion.test.ts ✓ fetches discussions for each MR - ✓ creates discussion rows with correct parent FK + ✓ creates discussion rows with correct MR FK ✓ creates note rows linked to discussions - ✓ excludes system notes from storage + ✓ extracts position_new_path from DiffNotes ✓ captures note-level resolution status ✓ captures note type (DiscussionNote, DiffNote) ``` @@ -409,28 +629,25 @@ tests/integration/discussion-ingestion.test.ts | `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch | | `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI | | `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads | -| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads | -| `gi count discussions` | `Discussions: 12,345` | Non-zero count | -| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes | +| `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) | +| `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only | +| `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count -- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments -- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked) -- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... 
WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes) -- [ ] Every discussion has at least one note -- [ ] `individual_note = true` discussions have exactly one note +- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero +- [ ] DiffNotes have `position_new_path` populated when available - [ ] Discussion `first_note_at` <= `last_note_at` for all rows **Scope:** - MR fetcher with pagination -- Discussions fetcher (issue discussions + MR discussions) as a dependent resource: - - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions` - - During initial ingest: fetch discussions for every issue/MR - - During sync: refetch discussions only for issues/MRs updated since cursor - - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise -- Relationship linking (discussion → parent issue/MR, notes → discussion) -- Extended CLI commands for MR/issue display with threads +- MR discussions fetcher: + - Uses `GET /projects/:id/merge_requests/:iid/discussions` + - Fetches all discussions for each MR during ingest + - Capture DiffNote file path/line metadata from `position` field for filename search +- Relationship linking (discussion → MR, notes → discussion) +- Extended CLI commands for MR display with threads +- Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries **Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries. 
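+
+A minimal sketch of the DiffNote position capture described above (the helper and return shape are illustrative; the `position` field layout follows the GitLab notes payload):
+
+```typescript
+interface RawNote {
+  type?: string | null; // 'DiffNote' | 'DiscussionNote' | null
+  position?: {
+    old_path?: string | null;
+    new_path?: string | null;
+    old_line?: number | null;
+    new_line?: number | null;
+  } | null;
+}
+
+// Maps a raw note to the position_* columns; non-DiffNotes yield all NULLs.
+function extractDiffNotePosition(note: RawNote) {
+  const p = note.type === "DiffNote" ? note.position : null;
+  return {
+    position_old_path: p?.old_path ?? null,
+    position_new_path: p?.new_path ?? null,
+    position_old_line: p?.old_line ?? null,
+    position_new_line: p?.new_line ?? null,
+  };
+}
+```
+
+Extracting `position_new_path` at ingest time is what makes the `idx_notes_new_path` index and the later `--path` search filter possible without re-parsing raw payloads.
+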
@@ -449,77 +666,38 @@ CREATE TABLE merge_requests ( target_branch TEXT, created_at INTEGER, updated_at INTEGER, + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync merged_at INTEGER, web_url TEXT, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); CREATE INDEX idx_mrs_author ON merge_requests(author_username); +CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid); --- Discussion threads (the semantic unit for conversations) -CREATE TABLE discussions ( - id INTEGER PRIMARY KEY, - gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...") - project_id INTEGER NOT NULL REFERENCES projects(id), - issue_id INTEGER REFERENCES issues(id), - merge_request_id INTEGER REFERENCES merge_requests(id), - noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' - individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion - first_note_at INTEGER, -- for ordering discussions - last_note_at INTEGER, -- for "recently active" queries - resolvable BOOLEAN, -- MR discussions can be resolved - resolved BOOLEAN, - CHECK ( - (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR - (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) - ) -); -CREATE INDEX idx_discussions_issue ON discussions(issue_id); -CREATE INDEX idx_discussions_mr ON discussions(merge_request_id); -CREATE INDEX idx_discussions_last_note ON discussions(last_note_at); - --- Notes belong to discussions (preserving thread context) -CREATE TABLE notes ( - id INTEGER PRIMARY KEY, - gitlab_id INTEGER UNIQUE NOT NULL, - discussion_id INTEGER NOT NULL REFERENCES discussions(id), - project_id INTEGER NOT NULL REFERENCES projects(id), - type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API) - author_username TEXT, - body TEXT, - created_at INTEGER, - updated_at INTEGER, - position 
INTEGER, -- derived from array order in API response (0-indexed) - resolvable BOOLEAN, -- note-level resolvability (MR code comments) - resolved BOOLEAN, -- note-level resolution status - resolved_by TEXT, -- username who resolved - resolved_at INTEGER, -- when resolved - raw_payload_id INTEGER REFERENCES raw_payloads(id) -); -CREATE INDEX idx_notes_discussion ON notes(discussion_id); -CREATE INDEX idx_notes_author ON notes(author_username); -CREATE INDEX idx_notes_type ON notes(type); - --- MR labels (reuse same labels table) +-- MR labels (reuse same labels table from CP1) CREATE TABLE mr_labels ( merge_request_id INTEGER REFERENCES merge_requests(id), label_id INTEGER REFERENCES labels(id), PRIMARY KEY(merge_request_id, label_id) ); CREATE INDEX idx_mr_labels_label ON mr_labels(label_id); + +-- Additional indexes for DiffNote queries (tables created in CP1) +CREATE INDEX idx_notes_type ON notes(type); +CREATE INDEX idx_notes_new_path ON notes(position_new_path); ``` -**Discussion Processing Rules:** -- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.) 
-- Each discussion from the API becomes one row in `discussions` table -- All notes within a discussion are stored with their `discussion_id` foreign key -- `individual_note: true` discussions have exactly one note (standalone comment) -- `individual_note: false` discussions have multiple notes (threaded conversation) +**MR Discussion Processing Rules:** +- DiffNote position data is extracted and stored: + - `position.old_path`, `position.new_path` for file-level search + - `position.old_line`, `position.new_line` for line-level context +- MR discussions can be resolvable; resolution status is captured at note level --- -### Checkpoint 3: Embedding Generation -**Deliverable:** Vector embeddings generated for all text content +### Checkpoint 3: Document + Embedding Generation with Lexical Search +**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end **Automated Tests (Vitest):** ``` @@ -563,6 +741,7 @@ tests/integration/embedding-storage.test.ts | `gi stats` | Embedding coverage stats | Shows 100% coverage | | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts | | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error | +| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions @@ -574,7 +753,11 @@ tests/integration/embedding-storage.test.ts **Scope:** - Ollama integration (nomic-embed-text model) -- Embedding generation pipeline (batch processing, 32 documents per batch) +- Embedding generation pipeline: + - Batch size: 32 documents per batch + - Concurrency: configurable (default 4 workers) + - Retry with exponential backoff for transient failures (max 3 attempts) + - Per-document failure recording to enable targeted re-runs - Vector storage in SQLite (sqlite-vss extension) - Progress tracking and 
resumability - Document extraction layer: @@ -582,8 +765,14 @@ tests/integration/embedding-storage.test.ts - Stable content hashing for change detection (SHA-256 of content_text) - Single embedding per document (chunking deferred to post-MVP) - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192) + - **Implementation:** Use character budget, not exact token count + - `maxChars = 32000` (conservative 4 chars/token estimate) + - `approxTokens = ceil(charCount / 4)` for reporting/logging only + - This avoids tokenizer dependency while preventing embedding failures - Denormalized metadata for fast filtering (author, labels, dates) - Fast label filtering via `document_labels` join table +- FTS5 index for lexical search (enables `gi search --mode=lexical` without Ollama) +- `gi search --mode=lexical` CLI command (works without Ollama) **Schema Additions:** ```sql @@ -601,6 +790,8 @@ CREATE TABLE documents ( title TEXT, -- null for discussions content_text TEXT NOT NULL, -- canonical text for embedding/snippets content_hash TEXT NOT NULL, -- SHA-256 for change detection + is_truncated BOOLEAN NOT NULL DEFAULT 0, + truncated_reason TEXT, -- 'token_limit_middle_drop' | null UNIQUE(source_type, source_id) ); CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at); @@ -628,8 +819,31 @@ CREATE TABLE embedding_metadata ( model TEXT NOT NULL, -- 'nomic-embed-text' dims INTEGER NOT NULL, -- 768 content_hash TEXT NOT NULL, -- copied from documents.content_hash - created_at INTEGER NOT NULL + created_at INTEGER NOT NULL, + -- Error tracking for resumable embedding + last_error TEXT, -- error message from last failed attempt + attempt_count INTEGER NOT NULL DEFAULT 0, + last_attempt_at INTEGER -- when last attempt occurred ); + +-- Index for finding failed embeddings to retry +CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL; + +-- Track sources that require document 
regeneration (populated during ingestion) +CREATE TABLE dirty_sources ( + source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' + source_id INTEGER NOT NULL, -- local DB id + queued_at INTEGER NOT NULL, + PRIMARY KEY(source_type, source_id) +); + +-- Fast path filtering for documents (extracted from DiffNote positions) +CREATE TABLE document_paths ( + document_id INTEGER NOT NULL REFERENCES documents(id), + path TEXT NOT NULL, + PRIMARY KEY(document_id, path) +); +CREATE INDEX idx_document_paths_path ON document_paths(path); ``` **Storage Rule (MVP):** @@ -647,7 +861,13 @@ CREATE TABLE embedding_metadata ( **Discussion Document Format:** ``` -[Issue #234: Authentication redesign] Discussion +[[Discussion]] Issue #234: Authentication redesign +Project: group/project-one +URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345 +Labels: ["bug", "auth"] +Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread + +--- Thread --- @johndoe (2024-03-15): I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients... @@ -661,16 +881,32 @@ Short-lived access tokens (15min), longer refresh (7 days). Here's why... This format preserves: - Parent context (issue/MR title and number) +- Project path for scoped search +- Direct URL for navigation +- Labels for context +- File paths from DiffNotes (enables immediate file search) - Author attribution for each note - Temporal ordering of the conversation - Full thread semantics for decision traceability -**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning. +**Truncation:** +If content exceeds 8000 tokens: +**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`. + +1. Truncate from the middle (preserve first + last notes for context) +2. Set `documents.is_truncated = 1` +3. 
Set `documents.truncated_reason = 'token_limit_middle_drop'` +4. Log a warning with document ID and original token count + +This metadata enables: +- Monitoring truncation frequency in production +- Future investigation of high-value truncated documents +- Debugging when search misses expected content --- -### Checkpoint 4: Semantic Search -**Deliverable:** Working semantic search across all indexed content +### Checkpoint 4: Hybrid Search (Semantic + Lexical) +**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content **Automated Tests (Vitest):** ``` @@ -713,13 +949,18 @@ tests/e2e/golden-queries.test.ts | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score | +| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project | | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output | | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe | | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date | | `gi search "authentication" --label=bug` | Label filtered | All results have bug label | -| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama | -| `gi search "authentication" --json` | JSON output | Valid JSON array with schema | +| `gi search "redis" --mode=lexical` | FTS results only | Works without Ollama | +| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ | +| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema | +| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions | +| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results | | `gi search 
"xyznonexistent123"` | No results message | Graceful empty state | +| `gi search "auth"` (no data synced) | No data message | Shows "Run gi sync first" | | `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results | **Golden Query Test Suite:** @@ -746,12 +987,23 @@ Each query must have at least one expected URL appear in top 10 results. - Hybrid retrieval: - Vector recall (sqlite-vss) + FTS lexical recall (fts5) - Merge + rerank results using Reciprocal Rank Fusion (RRF) +- Query embedding generation (same Ollama pipeline as documents) - Result ranking and scoring (document-level) -- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name` +- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N` + - `--limit=N` controls result count (default: 20, max: 100) + - `--path` filters documents by referenced file paths (from DiffNote positions) + - MVP: substring/exact match; glob patterns deferred - Label filtering operates on `document_labels` (indexed, exact-match) + - Filters work identically in hybrid and lexical modes +- Debug: `--explain` returns rank contributions from vector + FTS + RRF - Output formatting: ranked list with title, snippet, score, URL -- JSON output mode for AI agent consumption +- JSON output mode for AI/agent consumption (stable documented schema) - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning +- Empty state handling: + - No documents indexed: `No data indexed. 
Run 'gi sync' first.` + - Query returns no results: `No results found for "query".` + - Filters exclude all results: `No results match the specified filters.` + - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search") **Schema Additions:** ```sql @@ -806,14 +1058,20 @@ END; - Well-established in information retrieval literature **Graceful Degradation:** -- If Ollama is unreachable during search, automatically fall back to FTS5-only +- If Ollama is unreachable during search, automatically fall back to FTS5-only search - Display warning: "Embedding service unavailable, using lexical search only" - `embed` command fails with actionable error if Ollama is down **CLI Interface:** ```bash # Basic semantic search -gi search "why did we choose Redis" +gi search "authentication redesign" + +# Search within specific project +gi search "authentication" --project=group/project-one + +# Search by file path (finds discussions/MRs touching this file) +gi search "rate limit" --path=src/client.ts # Pure FTS search (fallback if embeddings unavailable) gi search "redis" --mode=lexical @@ -826,11 +1084,14 @@ gi search "performance" --label=bug --label=critical # JSON output for programmatic use gi search "payment processing" --json + +# Explain search (shows RRF contributions) +gi search "auth" --explain ``` **CLI Output Example:** ``` -$ gi search "authentication redesign" +$ gi search "authentication" Found 23 results (hybrid search, 0.34s) @@ -850,6 +1111,42 @@ Found 23 results (hybrid search, 0.34s) https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ``` +**JSON Output Schema (Stable):** + +For AI/agent consumption, `--json` output follows this stable schema: + +```typescript +interface SearchResult { + documentId: number; + sourceType: "issue" | "merge_request" | "discussion"; + title: string | null; + url: string; + projectPath: string; + author: string | null; + createdAt: string; // ISO 8601 + updatedAt: string; // ISO 8601 + score: number; 
// 0-1 normalized RRF score
+  snippet: string;             // truncated content_text
+  labels: string[];
+  // Only present with --explain flag
+  explain?: {
+    vectorRank: number | null; // null when not in vector results
+    ftsRank: number | null;    // null when not in FTS results
+    rrfScore: number;
+  };
+}
+
+interface SearchResponse {
+  query: string;
+  mode: "hybrid" | "lexical" | "semantic";
+  totalResults: number;
+  results: SearchResult[];
+  warnings?: string[]; // e.g., "Embedding service unavailable"
+}
+```
+
+**Schema versioning:** Breaking changes require a major version bump of the CLI. Non-breaking additions (new optional fields) are allowed.
+
 ---

 ### Checkpoint 5: Incremental Sync

@@ -888,7 +1185,7 @@ tests/integration/sync-recovery.test.ts
 |---------|-----------------|---------------|
 | `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
 | `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
-| `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything |
+| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
 | `gi sync-status` | Cursor positions, last sync time | Shows current state |
 | `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
 | `gi search "new content"` (after sync) | Returns new content | New content is searchable |
@@ -913,20 +1210,33 @@ tests/integration/sync-recovery.test.ts
 - [ ] `sync_runs` has complete audit trail

 **Scope:**
-- Delta sync based on stable cursor (updated_at + tie-breaker id)
+- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
+- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
 - Dependent resources sync strategy (discussions refetched when parent updates)
 - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
 - Sync status reporting
 - Recommended: run via cron every 10 minutes

 **Correctness Rules (MVP):**
-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
-2. Cursor advances only after successful DB commit for that page
+1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
+2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
+   - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
+   - Use **cursor rewind + local filtering**:
+     - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
+     - Locally discard items where:
+       - `updated_at < cursor_updated_at`, OR
+       - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
+     - This makes the tuple cursor rule true in practice while keeping API calls simple.
+   - Cursor advances only after successful DB commit for that page
+   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
 3. Dependent resources:
    - For each updated issue/MR, refetch ALL its discussions
    - Discussion documents are regenerated and re-embedded if content_hash changes
-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
-5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
+4. Rolling backfill window:
+   - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
+   - This catches items the cursor-based delta fetch may have missed (e.g., updates whose timestamps landed out of order)
+5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
+6. 
Sync run is marked 'failed' with error message if any page fails (can resume from cursor) **Why Dependent Resource Model:** - GitLab Discussions API doesn't provide a global `updated_after` stream @@ -935,8 +1245,10 @@ tests/integration/sync-recovery.test.ts **CLI Commands:** ```bash -# Full sync (respects cursors, only fetches new/updated) -gi sync +# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced) +gi sync # orchestrates all steps +gi sync --no-embed # skip embedding step (fast ingest/debug) +gi sync --no-docs # skip document regeneration (debug) # Force full re-sync (resets cursors) gi sync --full @@ -948,6 +1260,211 @@ gi sync --force gi sync-status ``` +**Orchestration steps (in order):** +1. Acquire app lock with heartbeat +2. Ingest delta (issues, MRs, discussions) based on cursors + - During ingestion, INSERT into `dirty_sources` for each upserted entity +3. Apply rolling backfill window +4. Regenerate documents for entities in `dirty_sources` (process + delete from queue) +5. Embed documents with changed content_hash +6. FTS triggers auto-sync (no explicit step needed) +7. Release lock, record sync_run as succeeded + +Individual commands remain available for checkpoint testing and debugging: +- `gi ingest --type=issues` +- `gi ingest --type=merge_requests` +- `gi embed --all` +- `gi embed --retry-failed` + +--- + +## CLI Command Reference + +All commands support `--help` for detailed usage information. 
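+
+The cursor rewind + local-filtering rule in the Correctness Rules above can be sketched as follows. This is an illustrative TypeScript sketch, not the actual implementation: `CursorTuple`, `rewoundUpdatedAfter`, and `isAfterCursor` are example names, and it assumes `updated_at` values are normalized ISO 8601 UTC strings so lexicographic comparison matches chronological order.
+
+```typescript
+interface CursorTuple {
+  updatedAt: string; // ISO 8601 UTC, as returned by GitLab
+  gitlabId: number;
+}
+
+// `updated_after` value sent to GitLab: the cursor timestamp rewound
+// by rewindSeconds (default 2) so items sharing the cursor's
+// timestamp are refetched rather than skipped.
+function rewoundUpdatedAfter(cursor: CursorTuple, rewindSeconds = 2): string {
+  return new Date(Date.parse(cursor.updatedAt) - rewindSeconds * 1000).toISOString();
+}
+
+// Local filter: keep an item iff (updated_at, gitlab_id) is strictly
+// greater than the cursor tuple; everything else was already processed.
+function isAfterCursor(item: CursorTuple, cursor: CursorTuple): boolean {
+  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
+  return item.gitlabId > cursor.gitlabId;
+}
+```
+
+Refetched items rejected by `isAfterCursor` cost a little bandwidth but guarantee no item in a dense timestamp bucket is lost across crash/resume.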
+
+### Setup & Diagnostics
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
+| `gi auth-test` | 0 | Verify GitLab authentication |
+| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
+| `gi doctor --json` | 0 | JSON output for scripting |
+| `gi version` | 0 | Show installed version |
+
+### Data Ingestion
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
+| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
+| `gi embed --all` | 3 | Generate embeddings for all documents |
+| `gi embed --retry-failed` | 3 | Retry failed embeddings |
+| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
+| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
+| `gi sync --force` | 5 | Override stale lock after operator review |
+| `gi sync --no-embed` | 5 | Sync without embedding (faster) |
+
+### Data Inspection
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
+| `gi list mrs [--limit=N]` | 2 | List merge requests |
+| `gi count issues` | 1 | Count issues |
+| `gi count mrs` | 2 | Count merge requests |
+| `gi count discussions --type=issue` | 1 | Count issue discussions |
+| `gi count discussions` | 2 | Count all discussions |
+| `gi count discussions --type=mr` | 2 | Count MR discussions |
+| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
+| `gi count notes` | 2 | Count all notes (excluding system) |
+| `gi show issue <iid>` | 1 | Show issue details |
+| `gi show mr <iid>` | 2 | Show MR details with discussions |
+| `gi stats` | 3 | Embedding coverage statistics |
+| `gi stats --json` | 3 | JSON stats for scripting |
+| `gi sync-status` | 1 | Show cursor positions and last sync |
+
+### Search
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi search "query"` | 4 | Hybrid semantic + lexical search | +| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) | +| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type | +| `gi search "query" --author=USERNAME` | 4 | Filter by author | +| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date | +| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) | +| `gi search "query" --project=PATH` | 4 | Filter by project | +| `gi search "query" --path=FILE` | 4 | Filter by file path | +| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) | +| `gi search "query" --json` | 4 | JSON output for scripting | +| `gi search "query" --explain` | 4 | Show ranking breakdown | + +### Database Management + +| Command | CP | Description | +|---------|-----|-------------| +| `gi backup` | 0 | Create timestamped database backup | +| `gi reset --confirm` | 0 | Delete database and reset cursors | + +--- + +## Error Handling + +Common errors and their resolutions: + +### Configuration Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `Config file not found` | No gi.config.json | Run `gi init` to create configuration | +| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually | +| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config | + +### Authentication Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` | +| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope | +| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope | + +### GitLab API Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `Project not found: group/project` | Invalid project path | Verify 
path matches GitLab URL (case-sensitive) | +| `429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry | +| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity | + +### Data Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `No documents indexed` | Sync not run | Run `gi sync` first | +| `No results found` | Query too specific | Try broader search terms | +| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale | + +### Embedding Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` | +| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` | +| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` | + +### Operational Behavior + +| Scenario | Behavior | +|----------|----------| +| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. | +| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. | +| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. | +| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. 
| + +--- + +## Database Management + +### Database Location + +The SQLite database is stored at an XDG-compliant location: + +``` +~/.local/share/gi/data.db +``` + +This can be overridden in `gi.config.json`: + +```json +{ + "storage": { + "dbPath": "/custom/path/to/data.db" + } +} +``` + +### Backup + +Create a timestamped backup of the database: + +```bash +gi backup +# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db +``` + +Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode). + +### Reset + +To completely reset the database and all sync cursors: + +```bash +gi reset --confirm +``` + +This deletes: +- The database file +- All sync cursors +- All embeddings + +You'll need to run `gi sync` again to repopulate. + +### Schema Migrations + +Database schema is version-tracked and migrations auto-apply on startup: + +1. On first run, schema is created at latest version +2. On subsequent runs, pending migrations are applied automatically +3. Migration version is stored in `schema_version` table +4. Migrations are idempotent and reversible where possible + +**Manual migration check:** +```bash +gi doctor --json | jq '.checks.database' +# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 } +``` + --- ## Future Work (Post-MVP) @@ -989,7 +1506,7 @@ CREATE TABLE note_positions ( new_line INTEGER, position_type TEXT -- 'text' | 'image' | etc. 
); -CREATE INDEX idx_note_positions_new_path ON note_positions(new_path); +CREATE INDEX idx_note_positions_new_path ON note_positions(position_new_path); ``` --- @@ -1013,9 +1530,11 @@ Each checkpoint includes: | Embedding model quality | Start with nomic-embed-text; architecture allows model swap | | SQLite scale limits | Monitor performance; Postgres migration path documented | | Stale data | Incremental sync with change detection | -| Mid-sync failures | Cursor-based resumption, sync_runs audit trail | +| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery | +| Missed updates | Rolling backfill window (14 days), tuple cursor semantics | | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | -| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | +| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery | +| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs | **SQLite Performance Defaults (MVP):** - Enable `PRAGMA journal_mode=WAL;` on every connection @@ -1030,20 +1549,24 @@ Each checkpoint includes: | Table | Checkpoint | Purpose | |-------|------------|---------| | projects | 0 | Configured GitLab projects | -| sync_runs | 0 | Audit trail of sync operations | +| sync_runs | 0 | Audit trail of sync operations (with heartbeat) | +| app_locks | 0 | Crash-safe single-flight lock | | sync_cursors | 0 | Resumable sync state per primary resource | -| raw_payloads | 0 | Decoupled raw JSON storage | -| issues | 1 | Normalized issues | +| raw_payloads | 0 | Decoupled raw JSON storage (with project_id) | +| schema_version | 0 | Database migration version tracking | +| issues | 1 | Normalized issues (unique by project+iid) | | labels | 1 | Label definitions (unique by project + name) | | issue_labels | 1 | Issue-label junction | -| merge_requests | 2 | Normalized MRs | -| 
discussions | 2 | Discussion threads (the semantic unit for conversations) | -| notes | 2 | Individual comments within discussions | +| merge_requests | 2 | Normalized MRs (unique by project+iid) | +| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) | +| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) | | mr_labels | 2 | MR-label junction | -| documents | 3 | Unified searchable documents (issues, MRs, discussions) | +| documents | 3 | Unified searchable documents with truncation metadata | | document_labels | 3 | Document-label junction for fast filtering | +| document_paths | 3 | Fast path filtering for documents (DiffNote file paths) | +| dirty_sources | 3 | Queue for incremental document regeneration | | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | -| embedding_metadata | 3 | Embedding provenance + change detection | +| embedding_metadata | 3 | Embedding provenance + error tracking | | documents_fts | 4 | Full-text search index (fts5 with porter stemmer) | | mr_files | 6 | MR file changes (deferred to File History feature) | @@ -1053,19 +1576,26 @@ Each checkpoint includes: | Question | Decision | Rationale | |----------|----------|-----------| -| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread | -| System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value | -| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls | -| Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | -| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | -| Sync method | **Polling only for MVP** | Webhooks add complexity; 
polling every 10min is sufficient | -| Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates | +| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability | +| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise | +| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature | +| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature | +| Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering | +| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings | +| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 min is sufficient | +| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates | +| Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates | | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | -| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | -| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 | -| Embedding batching | **32 documents per batch** | Balance between throughput and memory | -| FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text | -| Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching | +| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping | +| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context | +| Embedding batching | **32 docs/batch, 4 
concurrent workers** | Balance throughput, memory, and error isolation | +| FTS5 tokenizer | **porter unicode61** | Stemming improves recall | +| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching | +| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | +| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable | +| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX | +| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly | +| Empty state UX | **Actionable messages** | Guide user to next step | ---
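The "RRF over weighted sums" decision in the summary above can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is a sketch, not the implementation: `rrfFuse` is an illustrative name, and `k = 60` is a common default from the RRF literature rather than a value fixed by this spec.

```typescript
// Minimal Reciprocal Rank Fusion (RRF) sketch.
// Each ranked list contributes 1 / (k + rank) per document; a document
// appearing in both the vector and FTS lists accumulates both terms,
// so agreement between retrievers pushes it toward the top.
function rrfFuse(
  vectorRanked: number[], // document ids, best match first
  ftsRanked: number[],    // document ids, best match first
  k = 60,
): Array<{ documentId: number; score: number }> {
  const scores = new Map<number, number>();
  for (const list of [vectorRanked, ftsRanked]) {
    list.forEach((id, index) => {
      // rank is 1-based: first item contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .map(([documentId, score]) => ({ documentId, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because each retriever contributes only rank positions, no cross-retriever score normalization is needed, which is the main reason the spec prefers RRF over weighted score sums.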