diff --git a/SPEC-REVISIONS-2.md b/SPEC-REVISIONS-2.md new file mode 100644 index 0000000..b134f13 --- /dev/null +++ b/SPEC-REVISIONS-2.md @@ -0,0 +1,373 @@ +# SPEC.md Revision Document - Round 2 + +This document provides git-diff style changes for the second round of improvements from ChatGPT's review. These are primarily correctness fixes and optimizations. + +--- + +## Change 1: Fix Tuple Cursor Correctness Gap (Cursor Rewind + Local Filtering) + +**Why this is critical:** The spec specifies tuple cursor semantics `(updated_at, gitlab_id)` but GitLab's API only supports `updated_after` which is strictly "after" - it cannot express `WHERE updated_at = X AND id > Y` server-side. This creates a real risk of missed items on crash/resume and on dense timestamp buckets. + +**Fix:** Cursor rewind + local filtering. Call GitLab with `updated_after = cursor_updated_at - rewindSeconds`, then locally discard items we've already processed. + +```diff +@@ Correctness Rules (MVP): @@ + 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC` + 2. Cursor is a stable tuple `(updated_at, gitlab_id)`: +- - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)` ++ - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.** ++ - Use **cursor rewind + local filtering**: ++ - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable) ++ - Locally discard items where: ++ - `updated_at < cursor_updated_at`, OR ++ - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id` ++ - This makes the tuple cursor rule true in practice while keeping API calls simple. 
+ - Cursor advances only after successful DB commit for that page + - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)` +``` + +```diff +@@ Configuration (MVP): @@ + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, +- "heartbeatIntervalSeconds": 30 ++ "heartbeatIntervalSeconds": 30, ++ "cursorRewindSeconds": 2 + }, +``` + +--- + +## Change 2: Make App Lock Actually Safe (BEGIN IMMEDIATE CAS) + +**Why this is critical:** INSERT OR REPLACE can overwrite an active lock if two processes start close together (both do "stale check" outside a write transaction, then both INSERT OR REPLACE). SQLite's BEGIN IMMEDIATE provides a proper compare-and-swap. + +```diff +@@ Reliability/Idempotency Rules: @@ + - Every ingest/sync creates a `sync_runs` row + - Single-flight via DB-enforced app lock: +- - On start: INSERT OR REPLACE lock row with new owner token ++ - On start: acquire lock via transactional compare-and-swap: ++ - `BEGIN IMMEDIATE` (acquires write lock immediately) ++ - If no row exists → INSERT new lock ++ - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps ++ - Else if `owner` matches current run → UPDATE heartbeat (re-entrant) ++ - Else → ROLLBACK and fail fast (another run is active) ++ - `COMMIT` + - During run: update `heartbeat_at` every 30 seconds + - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire + - `--force` remains as operator override for edge cases, but should rarely be needed +``` + +--- + +## Change 3: Dependent Resource Pagination + Bounded Concurrency + +**Why this is important:** Discussions endpoints are paginated on many GitLab instances. Without pagination, we silently lose data. Without bounded concurrency, initial sync can become unstable (429s, long tail retries). 
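The pagination and bounded-concurrency rules above can be sketched in a few lines of TypeScript. This is an illustrative sketch only, not part of the spec: `fetchPage` stands in for a hypothetical GitLab client call, and a short page (fewer than `perPage` items) is taken to signal the last page.

```typescript
// Drain every page of a paginated dependent-resource endpoint.
// A short page (fewer than perPage items) means we hit the last page.
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<T[]>,
  perPage = 100,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; ; page++) {
    const items = await fetchPage(page);
    all.push(...items);
    if (items.length < perPage) return all;
  }
}

// Run `fn` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const workers = Array.from(
    { length: Math.min(limit, items.length) },
    async () => {
      while (next < items.length) {
        // Claiming the index is synchronous, so single-threaded JS
        // guarantees no two workers take the same item.
        const i = next++;
        results[i] = await fn(items[i]);
      }
    },
  );
  await Promise.all(workers);
  return results;
}
```

With `dependentConcurrency = 2`, discussion ingestion then becomes roughly `mapWithConcurrency(parents, 2, p => fetchAllPages(page => client.discussions(p, page)))`, where `client.discussions` is a placeholder for the real client method.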
+ +```diff +@@ Dependent Resources (Per-Parent Fetch): @@ +-GET /projects/:id/issues/:iid/discussions +-GET /projects/:id/merge_requests/:iid/discussions ++GET /projects/:id/issues/:iid/discussions?per_page=100&page=N ++GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N ++ ++**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent. +``` + +```diff +@@ Rate Limiting: @@ + - Default: 10 requests/second with exponential backoff + - Respect `Retry-After` headers on 429 responses + - Add jitter to avoid thundering herd on retry ++- **Separate concurrency limits:** ++ - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4) ++ - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s) ++ - Bound concurrency per-project to avoid one repo starving the other + - Initial sync estimate: 10-20 minutes depending on rate limits +``` + +```diff +@@ Configuration (MVP): @@ + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, + "heartbeatIntervalSeconds": 30, +- "cursorRewindSeconds": 2 ++ "cursorRewindSeconds": 2, ++ "primaryConcurrency": 4, ++ "dependentConcurrency": 2 + }, +``` + +--- + +## Change 4: Track last_seen_at for Eventual Consistency Debugging + +**Why this is valuable:** Even without implementing deletions, you want to know: (a) whether a record is actively refreshed under backfill/sync, (b) whether a sync run is "covering" the dataset, (c) whether a particular item hasn't been seen in months (helps diagnose missed updates). 
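Once `last_seen_at` exists, those coverage questions reduce to simple queries. An illustrative sketch (the 30-day cutoff is an arbitrary example, not a spec value):

```sql
-- Items no sync has refreshed in the last 30 days: candidates for
-- missed updates, or items genuinely outside the backfill window.
SELECT gitlab_id,
       datetime(updated_at, 'unixepoch')   AS updated,
       datetime(last_seen_at, 'unixepoch') AS last_seen
FROM issues
WHERE last_seen_at < strftime('%s', 'now') - 30 * 86400
ORDER BY last_seen_at ASC;
```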
+ +```diff +@@ Schema Preview - issues: @@ + CREATE TABLE issues ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, + author_username TEXT, + created_at INTEGER, + updated_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) + ); +``` + +```diff +@@ Schema Additions - merge_requests: @@ + CREATE TABLE merge_requests ( +@@ + updated_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + merged_at INTEGER, +@@ + ); +``` + +```diff +@@ Schema Additions - discussions: @@ + CREATE TABLE discussions ( +@@ + last_note_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + resolvable BOOLEAN, +@@ + ); +``` + +```diff +@@ Schema Additions - notes: @@ + CREATE TABLE notes ( +@@ + updated_at INTEGER, ++ last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + position INTEGER, +@@ + ); +``` + +--- + +## Change 5: Raw Payload Compression + +**Why this is valuable:** At 50-100K documents plus threaded discussions, raw JSON is likely the largest storage consumer. Supporting gzip compression reduces DB size while preserving replay capability. 
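With Node's built-in `zlib`, the encode/decode paths honoring the `content_encoding` column are small. A minimal sketch, assuming synchronous compression is acceptable at ingest batch sizes (the function names are illustrative, not spec'd):

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

type Encoded = { contentEncoding: "identity" | "gzip"; payload: Buffer };

// Encode a raw JSON payload for storage in raw_payloads.
function encodePayload(json: string, compress: boolean): Encoded {
  const raw = Buffer.from(json, "utf8");
  if (!compress) return { contentEncoding: "identity", payload: raw };
  const gz = gzipSync(raw);
  // Tiny payloads can grow under gzip; keep whichever form is smaller.
  return gz.length < raw.length
    ? { contentEncoding: "gzip", payload: gz }
    : { contentEncoding: "identity", payload: raw };
}

// Decode a stored payload back to the original JSON string for replay.
function decodePayload(contentEncoding: string, payload: Buffer): string {
  const raw = contentEncoding === "gzip" ? gunzipSync(payload) : payload;
  return raw.toString("utf8");
}
```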
+ +```diff +@@ Schema (Checkpoint 0) - raw_payloads: @@ + CREATE TABLE raw_payloads ( + id INTEGER PRIMARY KEY, + source TEXT NOT NULL, -- 'gitlab' + project_id INTEGER REFERENCES projects(id), + resource_type TEXT NOT NULL, + gitlab_id INTEGER NOT NULL, + fetched_at INTEGER NOT NULL, +- json TEXT NOT NULL ++ content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip' ++ payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON + ); +``` + +```diff +@@ Configuration (MVP): @@ ++ "storage": { ++ "compressRawPayloads": true -- gzip raw payloads to reduce DB size ++ }, +``` + +--- + +## Change 6: Scope Discussions Unique by Project + +**Why this is important:** `gitlab_discussion_id TEXT UNIQUE` assumes global uniqueness across all projects. While likely true for GitLab, it's safer to scope by project_id. This avoids rare but painful collisions and makes it easier to support more repos later. + +```diff +@@ Schema Additions - discussions: @@ + CREATE TABLE discussions ( + id INTEGER PRIMARY KEY, +- gitlab_discussion_id TEXT UNIQUE NOT NULL, ++ gitlab_discussion_id TEXT NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), +@@ + ); ++CREATE UNIQUE INDEX uq_discussions_project_discussion_id ++ ON discussions(project_id, gitlab_discussion_id); +``` + +--- + +## Change 7: Dirty Queue for Document Regeneration + +**Why this is valuable:** The orchestration says "Regenerate documents for changed entities" but doesn't define how "changed" is computed without scanning large tables. A dirty queue populated during ingestion makes doc regen deterministic and fast. 
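The enqueue/drain lifecycle maps onto three statements. An illustrative sketch (batch size and the exact regeneration call are implementation details, not spec'd):

```sql
-- During ingestion: mark an upserted entity dirty (idempotent).
INSERT INTO dirty_sources (source_type, source_id, queued_at)
VALUES ('issue', :issue_id, strftime('%s', 'now'))
ON CONFLICT(source_type, source_id) DO NOTHING;

-- During doc regen: take a batch of dirty entities...
SELECT source_type, source_id FROM dirty_sources
ORDER BY queued_at ASC LIMIT 500;

-- ...and clear each entry only after its document is rewritten.
DELETE FROM dirty_sources
WHERE source_type = :source_type AND source_id = :source_id;
```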
+ +```diff +@@ Schema Additions (Checkpoint 3): @@ ++-- Track sources that require document regeneration (populated during ingestion) ++CREATE TABLE dirty_sources ( ++ source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' ++ source_id INTEGER NOT NULL, -- local DB id ++ queued_at INTEGER NOT NULL, ++ PRIMARY KEY(source_type, source_id) ++); +``` + +```diff +@@ Orchestration steps (in order): @@ + 1. Acquire app lock with heartbeat + 2. Ingest delta (issues, MRs, discussions) based on cursors ++ - During ingestion, INSERT into dirty_sources for each upserted entity + 3. Apply rolling backfill window +-4. Regenerate documents for changed entities ++4. Regenerate documents for entities in dirty_sources (process + delete from queue) + 5. Embed documents with changed content_hash + 6. FTS triggers auto-sync (no explicit step needed) + 7. Release lock, record sync_run as succeeded +``` + +--- + +## Change 8: document_paths + --path Filter + +**Why this is high value:** We're already capturing DiffNote file paths in CP2. Adding a `--path` filter now makes the MVP dramatically more compelling for engineers who search by file path constantly. 
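Populating `document_paths` from DiffNote positions is a few lines. A sketch, assuming the `new_path`/`old_path` shape of GitLab's DiffNote `position` payload (renames contribute both paths):

```typescript
// The subset of a GitLab DiffNote position we care about here.
interface DiffPosition {
  new_path?: string | null;
  old_path?: string | null;
}

// Collect the distinct file paths a document's diff notes reference,
// ready for insertion into document_paths.
function collectPaths(positions: DiffPosition[]): string[] {
  const paths = new Set<string>();
  for (const pos of positions) {
    if (pos.new_path) paths.add(pos.new_path);
    if (pos.old_path) paths.add(pos.old_path); // differs from new_path on renames
  }
  return [...paths].sort();
}
```

The MVP `--path` filter then reduces to a join against `document_paths` with a substring or exact match on `path`.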
+ +```diff +@@ Checkpoint 4 Scope: @@ +-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path` ++- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file` ++ - `--path` filters documents by referenced file paths (from DiffNote positions) ++ - MVP: substring/exact match; glob patterns deferred +``` + +```diff +@@ Schema Additions (Checkpoint 3): @@ ++-- Fast path filtering for documents (extracted from DiffNote positions) ++CREATE TABLE document_paths ( ++ document_id INTEGER NOT NULL REFERENCES documents(id), ++ path TEXT NOT NULL, ++ PRIMARY KEY(document_id, path) ++); ++CREATE INDEX idx_document_paths_path ON document_paths(path); +``` + +```diff +@@ CLI Interface: @@ + # Search within specific project + gi search "authentication" --project=group/project-one + ++# Search by file path (finds discussions/MRs touching this file) ++gi search "rate limit" --path=src/client.ts ++ + # Pure FTS search (fallback if embeddings unavailable) + gi search "redis" --mode=lexical +``` + +```diff +@@ Manual CLI Smoke Tests: @@ ++| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ | +``` + +--- + +## Change 9: Character-Based Truncation (Not Exact Tokens) + +**Why this is practical:** "8000 tokens" sounds precise, but tokenizers vary. Exact token counting adds dependency complexity. A conservative character budget is simpler and avoids false precision. 
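The middle-drop rule under a character budget can be sketched directly. Illustrative only: the 32000 default and the 4-chars-per-token divisor mirror the numbers proposed below, and the marker string is a hypothetical choice.

```typescript
interface TruncationResult {
  text: string;
  isTruncated: boolean;
  approxTokens: number; // ceil(chars / 4), for logging only, never enforced
}

// Keep the head and tail of oversized content and drop the middle,
// so the first and last notes survive for context.
function truncateMiddle(text: string, maxChars = 32000): TruncationResult {
  const approxTokens = Math.ceil(text.length / 4);
  if (text.length <= maxChars) return { text, isTruncated: false, approxTokens };
  const marker = "\n[... truncated ...]\n";
  const keep = maxChars - marker.length;
  const head = Math.ceil(keep / 2);
  const tail = keep - head;
  return {
    text: text.slice(0, head) + marker + text.slice(text.length - tail),
    isTruncated: true,
    approxTokens,
  };
}
```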
+ +```diff +@@ Document Extraction Rules: @@ + - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192) ++ - **Implementation:** Use character budget, not exact token count ++ - `maxChars = 32000` (conservative 4 chars/token estimate) ++ - `approxTokens = ceil(charCount / 4)` for reporting/logging only ++ - This avoids tokenizer dependency while preventing embedding failures +``` + +```diff +@@ Truncation: @@ + If content exceeds 8000 tokens: ++**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`. ++ + 1. Truncate from the middle (preserve first + last notes for context) + 2. Set `documents.is_truncated = 1` + 3. Set `documents.truncated_reason = 'token_limit_middle_drop'` + 4. Log a warning with document ID and original token count +``` + +--- + +## Change 10: Move Lexical Search to CP3 (Reorder, Not New Scope) + +**Why this is better:** Nothing "search-like" exists until CP4, but FTS5 is already a dependency for graceful degradation. Moving FTS setup to CP3 (when documents exist) gives an earlier usable artifact and better validation. CP4 becomes "hybrid ranking upgrade." 
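Reciprocal Rank Fusion, which the "hybrid ranking upgrade" in CP4 uses to merge the vector and FTS result lists, is small enough to show inline. A sketch; `k = 60` is the conventional RRF constant, not something the spec fixes:

```typescript
// Merge ranked result lists: each document scores sum(1 / (k + rank))
// over the lists it appears in, so items ranked well by either the
// vector side or the lexical side surface near the top.
function reciprocalRankFusion(lists: number[][], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const list of lists) {
    list.forEach((docId, index) => {
      // index is 0-based, so rank = index + 1.
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

Because only ranks matter, the vector and FTS scores never need to be normalized against each other, which is the main reason RRF is a good fit here.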
+ +```diff +@@ Checkpoint 3: Embedding Generation @@ +-### Checkpoint 3: Embedding Generation +-**Deliverable:** Vector embeddings generated for all text content ++### Checkpoint 3: Document + Embedding Generation with Lexical Search ++**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end +``` + +```diff +@@ Checkpoint 3 Scope: @@ + - Ollama integration (nomic-embed-text model) + - Embedding generation pipeline: +@@ + - Fast label filtering via `document_labels` join table ++- FTS5 index for lexical search (moved from CP4) ++- `gi search --mode=lexical` CLI command (works without Ollama) +``` + +```diff +@@ Checkpoint 3 Manual CLI Smoke Tests: @@ ++| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required | +``` + +```diff +@@ Checkpoint 4: Semantic Search @@ +-### Checkpoint 4: Semantic Search +-**Deliverable:** Working semantic search across all indexed content ++### Checkpoint 4: Hybrid Search (Semantic + Lexical) ++**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content +``` + +```diff +@@ Checkpoint 4 Scope: @@ + **Scope:** + - Hybrid retrieval: + - Vector recall (sqlite-vss) + FTS lexical recall (fts5) + - Merge + rerank results using Reciprocal Rank Fusion (RRF) ++- Query embedding generation (same Ollama pipeline as documents) + - Result ranking and scoring (document-level) +-- Search filters: ... 
++- Filters work identically in hybrid and lexical modes +``` + +--- + +## Summary of All Changes (Round 2) + +| # | Change | Impact | +|---|--------|--------| +| 1 | **Cursor rewind + local filtering** | Fixes real correctness gap in tuple cursor implementation | +| 2 | **BEGIN IMMEDIATE CAS for lock** | Prevents race condition in lock acquisition | +| 3 | **Discussions pagination + concurrency** | Prevents silent data loss on large discussion threads | +| 4 | **last_seen_at columns** | Enables debugging of sync coverage without deletions | +| 5 | **Raw payload compression** | Reduces DB size significantly at scale | +| 6 | **Scope discussions unique by project** | Defensive uniqueness for multi-project safety | +| 7 | **Dirty queue for doc regen** | Makes document regeneration deterministic and fast | +| 8 | **document_paths + --path filter** | High-value file search with minimal scope | +| 9 | **Character-based truncation** | Practical implementation without tokenizer dependency | +| 10 | **Lexical search in CP3** | Earlier usable artifact; better checkpoint validation | + +**Net effect:** These changes fix several correctness gaps (cursor, lock, pagination) while adding high-value features (--path filter) and operational improvements (compression, dirty queue, last_seen_at). diff --git a/SPEC-REVISIONS-3.md b/SPEC-REVISIONS-3.md new file mode 100644 index 0000000..818ac87 --- /dev/null +++ b/SPEC-REVISIONS-3.md @@ -0,0 +1,427 @@ +# SPEC.md Revisions - First-Time User Experience + +**Date:** 2026-01-21 +**Purpose:** Document all changes adding installation, setup, and user flow documentation to SPEC.md + +--- + +## Summary of Changes + +| Change | Location | Description | +|--------|----------|-------------| +| 1. Quick Start | After Executive Summary | Prerequisites, installation, first-run walkthrough | +| 2. `gi init` Command | Checkpoint 0 | Interactive setup wizard with GitLab validation | +| 3. 
CLI Command Reference | Before Future Work | Unified table of all commands | +| 4. Error Handling | After CLI Reference | Common errors with recovery guidance | +| 5. Database Management | After Error Handling | Location, backup, reset, migrations | +| 6. Empty State Handling | Checkpoint 4 scope | Behavior when no data indexed | +| 7. Resolved Decisions | Resolved Decisions table | New decisions from this revision | + +--- + +## Change 1: Quick Start Section + +**Location:** Insert after line 6 (after Executive Summary), before Discovery Summary + +```diff + A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability. + + --- + ++## Quick Start ++ ++### Prerequisites ++ ++| Requirement | Version | Notes | ++|-------------|---------|-------| ++| Node.js | 20+ | LTS recommended | ++| npm | 10+ | Comes with Node.js | ++| Ollama | Latest | Optional for semantic search; lexical search works without it | ++ ++### Installation ++ ++```bash ++# Clone and install ++git clone https://github.com/your-org/gitlab-inbox.git ++cd gitlab-inbox ++npm install ++npm run build ++npm link # Makes `gi` available globally ++``` ++ ++### First Run ++ ++1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope): ++ ```bash ++ export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx" ++ ``` ++ ++2. **Run the setup wizard:** ++ ```bash ++ gi init ++ ``` ++ This creates `gi.config.json` with your GitLab URL and project paths. ++ ++3. 
**Verify your environment:** ++ ```bash ++ gi doctor ++ ``` ++ All checks should pass (Ollama warning is OK if you only need lexical search). ++ ++4. **Sync your data:** ++ ```bash ++ gi sync ++ ``` ++ Initial sync takes 10-20 minutes depending on repo size and rate limits. ++ ++5. **Search:** ++ ```bash ++ gi search "authentication redesign" ++ ``` ++ ++### Troubleshooting First Run ++ ++| Symptom | Solution | ++|---------|----------| ++| `Config file not found` | Run `gi init` first | ++| `GITLAB_TOKEN not set` | Export the environment variable | ++| `401 Unauthorized` | Check token has `read_api` scope | ++| `Project not found: group/project` | Verify project path in GitLab URL | ++| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search | ++ ++--- ++ + ## Discovery Summary +``` + +--- + +## Change 2: `gi init` Command in Checkpoint 0 + +**Location:** Insert in Checkpoint 0 Manual CLI Smoke Tests table and Scope section + +### 2a: Add to Manual CLI Smoke Tests table (after line 193) + +```diff + | `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure | ++| `gi init` | Interactive prompts | Creates valid gi.config.json | ++| `gi init` (config exists) | Confirmation prompt | Warns before overwriting | ++| `gi --help` | Command list | Shows all available commands | ++| `gi version` | Version number | Shows installed version | +``` + +### 2b: Add Automated Tests for init (after line 185) + +```diff + tests/integration/app-lock.test.ts + ✓ acquires lock successfully + ✓ updates heartbeat during operation + ✓ detects stale lock and recovers + ✓ refuses concurrent acquisition ++ ++tests/integration/init.test.ts ++ ✓ creates config file with valid structure ++ ✓ validates GitLab URL format ++ ✓ validates GitLab connection before writing config ++ ✓ validates each project path exists in GitLab ++ ✓ fails if token not set ++ ✓ fails if GitLab auth fails ++ ✓ fails if any project path not found ++ ✓ 
prompts before overwriting existing config ++ ✓ respects --force to skip confirmation +``` + +### 2c: Add to Checkpoint 0 Scope (after line 209) + +```diff + - Rate limit handling with exponential backoff + jitter ++- `gi init` command for guided setup: ++ - Prompts for GitLab base URL ++ - Prompts for project paths (comma-separated or multiple prompts) ++ - Prompts for token environment variable name (default: GITLAB_TOKEN) ++ - **Validates before writing config:** ++ - Token must be set in environment ++ - Tests auth with `GET /user` endpoint ++ - Validates each project path with `GET /projects/:path` ++ - Only writes config after all validations pass ++ - Generates `gi.config.json` with sensible defaults ++- `gi --help` shows all available commands ++- `gi <command> --help` shows command-specific help ++- `gi version` shows installed version ++- First-run detection: if no config exists, suggest `gi init` +``` + +--- + +## Change 3: CLI Command Reference Section + +**Location:** Insert before "## Future Work (Post-MVP)" (before line 1174) + +```diff ++## CLI Command Reference ++ ++All commands support `--help` for detailed usage information.
++ ++### Setup & Diagnostics ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi init` | 0 | Interactive setup wizard; creates gi.config.json | ++| `gi auth-test` | 0 | Verify GitLab authentication | ++| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) | ++| `gi doctor --json` | 0 | JSON output for scripting | ++| `gi version` | 0 | Show installed version | ++ ++### Data Ingestion ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi ingest --type=issues` | 1 | Fetch issues from GitLab | ++| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions | ++| `gi embed --all` | 3 | Generate embeddings for all documents | ++| `gi embed --retry-failed` | 3 | Retry failed embeddings | ++| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) | ++| `gi sync --full` | 5 | Force complete re-sync (reset cursors) | ++| `gi sync --force` | 5 | Override stale lock after operator review | ++| `gi sync --no-embed` | 5 | Sync without embedding (faster) | ++ ++### Data Inspection ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues | ++| `gi list mrs [--limit=N]` | 2 | List merge requests | ++| `gi count issues` | 1 | Count issues | ++| `gi count mrs` | 2 | Count merge requests | ++| `gi count discussions` | 2 | Count discussions | ++| `gi count notes` | 2 | Count notes | ++| `gi show issue ` | 1 | Show issue details | ++| `gi show mr ` | 2 | Show MR details with discussions | ++| `gi stats` | 3 | Embedding coverage statistics | ++| `gi stats --json` | 3 | JSON stats for scripting | ++| `gi sync-status` | 1 | Show cursor positions and last sync | ++ ++### Search ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi search "query"` | 4 | Hybrid semantic + lexical search | ++| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) | ++| `gi search "query" 
--type=issue\|mr\|discussion` | 4 | Filter by document type | ++| `gi search "query" --author=USERNAME` | 4 | Filter by author | ++| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date | ++| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) | ++| `gi search "query" --project=PATH` | 4 | Filter by project | ++| `gi search "query" --path=FILE` | 4 | Filter by file path | ++| `gi search "query" --json` | 4 | JSON output for scripting | ++| `gi search "query" --explain` | 4 | Show ranking breakdown | ++ ++### Database Management ++ ++| Command | CP | Description | ++|---------|-----|-------------| ++| `gi backup` | 0 | Create timestamped database backup | ++| `gi reset --confirm` | 0 | Delete database and reset cursors | ++ ++--- ++ + ## Future Work (Post-MVP) +``` + +--- + +## Change 4: Error Handling Section + +**Location:** Insert after CLI Command Reference, before Future Work + +```diff ++## Error Handling ++ ++Common errors and their resolutions: ++ ++### Configuration Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `Config file not found` | No gi.config.json | Run `gi init` to create configuration | ++| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually | ++| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config | ++ ++### Authentication Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` | ++| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope | ++| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope | ++ ++### GitLab API Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) | ++| 
`429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry | ++| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity | ++ ++### Data Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `No documents indexed` | Sync not run | Run `gi sync` first | ++| `No results found` | Query too specific | Try broader search terms | ++| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale | ++ ++### Embedding Errors ++ ++| Error | Cause | Resolution | ++|-------|-------|------------| ++| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` | ++| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` | ++| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` | ++ ++### Operational Behavior ++ ++| Scenario | Behavior | ++|----------|----------| ++| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. | ++| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. | ++| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. | ++| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. 
| ++ ++--- ++ + ## Future Work (Post-MVP) +``` + +--- + +## Change 5: Database Management Section + +**Location:** Insert after Error Handling, before Future Work + +```diff ++## Database Management ++ ++### Database Location ++ ++The SQLite database is stored at an XDG-compliant location: ++ ++``` ++~/.local/share/gi/data.db ++``` ++ ++This can be overridden in `gi.config.json`: ++ ++```json ++{ ++ "storage": { ++ "dbPath": "/custom/path/to/data.db" ++ } ++} ++``` ++ ++### Backup ++ ++Create a timestamped backup of the database: ++ ++```bash ++gi backup ++# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db ++``` ++ ++Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode). ++ ++### Reset ++ ++To completely reset the database and all sync cursors: ++ ++```bash ++gi reset --confirm ++``` ++ ++This deletes: ++- The database file ++- All sync cursors ++- All embeddings ++ ++You'll need to run `gi sync` again to repopulate. ++ ++### Schema Migrations ++ ++Database schema is version-tracked and migrations auto-apply on startup: ++ ++1. On first run, schema is created at latest version ++2. On subsequent runs, pending migrations are applied automatically ++3. Migration version is stored in `schema_version` table ++4. Migrations are idempotent and reversible where possible ++ ++**Manual migration check:** ++```bash ++gi doctor --json | jq '.checks.database' ++# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 } ++``` ++ ++--- ++ + ## Future Work (Post-MVP) +``` + +--- + +## Change 6: Empty State Handling in Checkpoint 4 + +**Location:** Add to Checkpoint 4 scope section (around line 885, after "Graceful degradation") + +```diff + - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning ++- Empty state handling: ++ - No documents indexed: `No data indexed. 
Run 'gi sync' first.` ++ - Query returns no results: `No results found for "query".` ++ - Filters exclude all results: `No results match the specified filters.` ++ - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search") +``` + +**Location:** Add to Manual CLI Smoke Tests table (after `gi search "xyznonexistent123"` row) + +```diff + | `gi search "xyznonexistent123"` | No results message | Graceful empty state | ++| `gi search "auth"` (no data synced) | No data message | Shows "Run gi sync first" | +``` + +--- + +## Change 7: Update Resolved Decisions Table + +**Location:** Add new rows to Resolved Decisions table (around line 1280) + +```diff + | JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | ++| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable | ++| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX | ++| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly | ++| Empty state UX | **Actionable messages** | Guide user to next step | +``` + +--- + +## Files Modified + +| File | Action | +|------|--------| +| `SPEC.md` | 7 changes applied | +| `SPEC-REVISIONS-3.md` | Created (this file) | + +--- + +## Verification Checklist + +After applying changes: + +- [ ] Quick Start section provides clear 5-step onboarding +- [ ] `gi init` fully specified with validation behavior +- [ ] All CLI commands documented in reference table +- [ ] Error scenarios have recovery guidance +- [ ] Database location and management documented +- [ ] Empty states have helpful messages +- [ ] Resolved Decisions updated with new choices +- [ ] No orphaned command references diff --git a/SPEC-REVISIONS.md b/SPEC-REVISIONS.md new file mode 100644 index 0000000..caa9573 --- /dev/null +++ b/SPEC-REVISIONS.md @@ -0,0 +1,716 @@ +# SPEC.md Revision Document + +This document provides git-diff style changes to integrate improvements 
from ChatGPT's review into the original SPEC.md. The goal is a "best of all worlds" hybrid that maintains the original architecture while adding production-grade hardening. + +--- + +## Change 1: Crash-safe Single-flight with Heartbeat Lock + +**Why this is better:** The original plan's single-flight protection is policy-based, not DB-enforced. A race condition exists where two processes could both start before either writes to `sync_runs`. The heartbeat approach provides DB-enforced atomicity, automatic crash recovery, and less manual intervention. + +```diff +@@ Schema (Checkpoint 0): @@ + CREATE TABLE sync_runs ( + id INTEGER PRIMARY KEY, + started_at INTEGER NOT NULL, ++ heartbeat_at INTEGER NOT NULL, + finished_at INTEGER, + status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' + command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. + error TEXT + ); + ++-- Crash-safe single-flight lock (DB-enforced) ++CREATE TABLE app_locks ( ++ name TEXT PRIMARY KEY, -- 'sync' ++ owner TEXT NOT NULL, -- random run token (UUIDv4) ++ acquired_at INTEGER NOT NULL, ++ heartbeat_at INTEGER NOT NULL ++); +``` + +```diff +@@ Checkpoint 0: Project Setup - Scope @@ + **Scope:** + - Project structure (TypeScript, ESLint, Vitest) + - GitLab API client with PAT authentication + - Environment and project configuration + - Basic CLI scaffold with `auth-test` command + - `doctor` command for environment verification +-- Projects table and initial sync ++- Projects table and initial project resolution (no issue/MR ingestion yet) ++- DB migrations + WAL + FK + app lock primitives ++- Crash-safe single-flight lock with heartbeat +``` + +```diff +@@ Reliability/Idempotency Rules: @@ + - Every ingest/sync creates a `sync_runs` row +-- Single-flight: refuse to start if an existing run is `running` (unless `--force`) ++- Single-flight: acquire `app_locks('sync')` before starting ++ - On start: INSERT OR REPLACE lock row with new owner token ++ - During run: update `heartbeat_at` every 
30 seconds ++ - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire ++ - `--force` remains as operator override for edge cases, but should rarely be needed + - Cursor advances only after successful transaction commit per page/batch + - Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC` + - Use explicit transactions for batch inserts +``` + +```diff +@@ Configuration (MVP): @@ + // gi.config.json + { + "gitlab": { + "baseUrl": "https://gitlab.example.com", + "tokenEnvVar": "GITLAB_TOKEN" + }, + "projects": [ + { "path": "group/project-one" }, + { "path": "group/project-two" } + ], ++ "sync": { ++ "backfillDays": 14, ++ "staleLockMinutes": 10, ++ "heartbeatIntervalSeconds": 30 ++ }, + "embedding": { + "provider": "ollama", + "model": "nomic-embed-text", +- "baseUrl": "http://localhost:11434" ++ "baseUrl": "http://localhost:11434", ++ "concurrency": 4 + } + } +``` + +--- + +## Change 2: Harden Cursor Semantics + Rolling Backfill Window + +**Why this is better:** The original plan's "critical assumption" that comments update parent `updated_at` is mostly true but the failure mode is catastrophic (silently missing new discussion content). The rolling backfill provides a safety net without requiring weekly full resyncs. + +```diff +@@ GitLab API Strategy - Critical Assumption @@ +-### Critical Assumption +- +-**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed. +- +-Mitigation: Periodic full re-sync (weekly) as a safety net. ++### Critical Assumption (Softened) ++ ++We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively. ++ ++**Mitigations (MVP):** ++1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. 
Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor. ++2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly. ++3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`). ++ ++The backfill window provides 80% of the safety of full resync at <5% of the API cost. +``` + +```diff +@@ Checkpoint 5: Incremental Sync - Scope @@ + **Scope:** +-- Delta sync based on stable cursor (updated_at + tie-breaker id) ++- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)` ++- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates + - Dependent resources sync strategy (discussions refetched when parent updates) + - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) + - Sync status reporting + - Recommended: run via cron every 10 minutes +``` + +```diff +@@ Correctness Rules (MVP): @@ +-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC` +-2. Cursor advances only after successful DB commit for that page ++1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC` ++2. Cursor is a stable tuple `(updated_at, gitlab_id)`: ++ - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)` ++ - Cursor advances only after successful DB commit for that page ++ - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)` + 3. Dependent resources: + - For each updated issue/MR, refetch ALL its discussions + - Discussion documents are regenerated and re-embedded if content_hash changes +-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` +-5. 
Sync run is marked 'failed' with error message if any page fails (can resume from cursor) ++4. Rolling backfill window: ++ - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays` ++ - This catches any items whose timestamps were updated without triggering our cursor ++5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` ++6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor) +``` + +--- + +## Change 3: Raw Payload Scoping + project_id + +**Why this is better:** The original `raw_payloads(resource_type, gitlab_id)` index could have collisions in edge cases (especially if later adding more projects or resource types). Adding `project_id` is defensive and enables project-scoped lookups. + +```diff +@@ Schema (Checkpoint 0) - raw_payloads @@ + CREATE TABLE raw_payloads ( + id INTEGER PRIMARY KEY, + source TEXT NOT NULL, -- 'gitlab' ++ project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources + resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion' + gitlab_id INTEGER NOT NULL, + fetched_at INTEGER NOT NULL, + json TEXT NOT NULL + ); +-CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); ++CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id); ++CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at); +``` + +--- + +## Change 4: Tighten Uniqueness Constraints (project_id + iid) + +**Why this is better:** Users think in terms of "issue 123 in project X," not global IDs. This enables O(1) `gi show issue 123 --project=X` and prevents subtle ingestion bugs from creating duplicate rows. 
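To make the intent concrete before the schema diff: with `(project_id, iid)` unique, ingestion becomes an idempotent upsert keyed by that pair. A minimal sketch, assuming a better-sqlite3-style named-parameter SQL shape; the `Map` below stands in for the SQLite table to show the keying behaviour, and all names are illustrative:

```typescript
// Sketch: (project_id, iid) as the natural upsert key for issues.
// The SQL string shows the intended ON CONFLICT form; the Map models
// the same keying in memory so the example is self-contained.
const UPSERT_ISSUE_SQL = `
  INSERT INTO issues (gitlab_id, project_id, iid, title, updated_at)
  VALUES (@gitlab_id, @project_id, @iid, @title, @updated_at)
  ON CONFLICT(project_id, iid) DO UPDATE SET
    title = excluded.title,
    updated_at = excluded.updated_at
`;

interface IssueRow {
  project_id: number;
  iid: number;
  title: string;
}

const issues = new Map<string, IssueRow>();

function upsertIssue(row: IssueRow): void {
  // Re-ingesting the same (project_id, iid) replaces the row, never duplicates it.
  issues.set(`${row.project_id}:${row.iid}`, row);
}

upsertIssue({ project_id: 1, iid: 123, title: "Auth redesign" });
upsertIssue({ project_id: 1, iid: 123, title: "Auth redesign (edited)" });
upsertIssue({ project_id: 2, iid: 123, title: "Same iid, other project" });
```

The same key also serves the `gi show issue 123 --project=X` lookup: one indexed equality probe instead of a scan.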
+ +```diff +@@ Schema Preview - issues @@ + CREATE TABLE issues ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, + author_username TEXT, + created_at INTEGER, + updated_at INTEGER, + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) + ); + CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); + CREATE INDEX idx_issues_author ON issues(author_username); ++CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid); +``` + +```diff +@@ Schema Additions - merge_requests @@ + CREATE TABLE merge_requests ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + ... + ); + CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); + CREATE INDEX idx_mrs_author ON merge_requests(author_username); ++CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid); +``` + +--- + +## Change 5: Store System Notes (Flagged) + Capture DiffNote Paths + +**Why this is better:** Two problems with dropping system notes entirely: (1) Some system notes carry decision trace context ("marked as resolved", "changed milestone"). (2) File/path search is disproportionately valuable for engineers. DiffNote positions already contain path metadata - capturing it now enables immediate filename search. 
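The flag-and-capture rule above can be sketched as a small transform. The payload shape follows GitLab's notes API (`system` boolean, optional `position` object on DiffNotes); the row field names mirror the schema proposed in this change, and the function name is illustrative:

```typescript
// Sketch of the note transform: flag system notes instead of dropping them,
// and lift DiffNote position paths/lines into dedicated columns.
interface GitLabNotePayload {
  id: number;
  system: boolean;
  type?: "DiscussionNote" | "DiffNote" | null;
  body: string;
  position?: {
    old_path?: string;
    new_path?: string;
    old_line?: number | null;
    new_line?: number | null;
  };
}

interface NoteRowFields {
  gitlab_id: number;
  is_system: 0 | 1;
  position_old_path: string | null;
  position_new_path: string | null;
  position_old_line: number | null;
  position_new_line: number | null;
}

function toNoteRow(p: GitLabNotePayload): NoteRowFields {
  return {
    gitlab_id: p.id,
    is_system: p.system ? 1 : 0, // stored and flagged, not filtered out
    position_old_path: p.position?.old_path ?? null,
    position_new_path: p.position?.new_path ?? null,
    position_old_line: p.position?.old_line ?? null,
    position_new_line: p.position?.new_line ?? null,
  };
}
```

Downstream, document extraction skips `is_system = 1` rows by default, while the `position_new_path` column feeds filename search directly.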
+ +```diff +@@ Checkpoint 2 Scope @@ + - Discussions fetcher (issue discussions + MR discussions) as a dependent resource: + - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions` + - During initial ingest: fetch discussions for every issue/MR + - During sync: refetch discussions only for issues/MRs updated since cursor +- - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise ++ - Preserve system notes but flag them with `is_system=1`; exclude from embeddings by default ++ - Capture DiffNote file path/line metadata from `position` field for immediate filename search value +``` + +```diff +@@ Schema Additions - notes @@ + CREATE TABLE notes ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + discussion_id INTEGER NOT NULL REFERENCES discussions(id), + project_id INTEGER NOT NULL REFERENCES projects(id), + type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API) ++ is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.) 
+ author_username TEXT, + body TEXT, + created_at INTEGER, + updated_at INTEGER, + position INTEGER, -- derived from array order in API response (0-indexed) + resolvable BOOLEAN, + resolved BOOLEAN, + resolved_by TEXT, + resolved_at INTEGER, ++ -- DiffNote position metadata (nullable, from GitLab API position object) ++ position_old_path TEXT, ++ position_new_path TEXT, ++ position_old_line INTEGER, ++ position_new_line INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id) + ); + CREATE INDEX idx_notes_discussion ON notes(discussion_id); + CREATE INDEX idx_notes_author ON notes(author_username); + CREATE INDEX idx_notes_type ON notes(type); ++CREATE INDEX idx_notes_system ON notes(is_system); ++CREATE INDEX idx_notes_new_path ON notes(position_new_path); +``` + +```diff +@@ Discussion Processing Rules @@ +-- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.) ++- System notes (`system: true`) are ingested with `notes.is_system=1` ++ - Excluded from document extraction/embeddings by default (reduces noise in semantic search) ++ - Preserved for audit trail, timeline views, and potential future decision-tracing features ++ - Can be toggled via `--include-system-notes` flag if needed ++- DiffNote position data is extracted and stored: ++ - `position.old_path`, `position.new_path` for file-level search ++ - `position.old_line`, `position.new_line` for line-level context + - Each discussion from the API becomes one row in `discussions` table + - All notes within a discussion are stored with their `discussion_id` foreign key + - `individual_note: true` discussions have exactly one note (standalone comment) + - `individual_note: false` discussions have multiple notes (threaded conversation) +``` + +```diff +@@ Checkpoint 2 Automated Tests @@ + tests/unit/discussion-transformer.test.ts + - transforms discussion payload to normalized schema + - extracts notes array from discussion + - sets 
individual_note flag correctly +- - filters out system notes (system: true) ++ - flags system notes with is_system=1 ++ - extracts DiffNote position metadata (paths and lines) + - preserves note order via position field + + tests/integration/discussion-ingestion.test.ts + - fetches discussions for each issue + - fetches discussions for each MR + - creates discussion rows with correct parent FK + - creates note rows linked to discussions +- - excludes system notes from storage ++ - stores system notes with is_system=1 flag ++ - extracts position_new_path from DiffNotes + - captures note-level resolution status + - captures note type (DiscussionNote, DiffNote) +``` + +```diff +@@ Checkpoint 2 Data Integrity Checks @@ + - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count + - [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments + - [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked) +-- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes) ++- [ ] System notes have `is_system=1` flag set correctly ++- [ ] DiffNotes have `position_new_path` populated when available + - [ ] Every discussion has at least one note + - [ ] `individual_note = true` discussions have exactly one note + - [ ] Discussion `first_note_at` <= `last_note_at` for all rows +``` + +--- + +## Change 6: Document Extraction Structured Header + Truncation Metadata + +**Why this is better:** Adding a deterministic header improves search snippets (more informative), embeddings (model gets stable context), and debuggability (see if/why truncation happened). 
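The header construction this change specifies is deterministic, so it is worth pinning down as code. A sketch assuming the generic header layout (`Files:` emitted only when DiffNote paths exist); field and function names are illustrative:

```typescript
// Sketch of the structured document header. Deterministic output means the
// same metadata always hashes to the same content_hash prefix.
interface DocMeta {
  sourceType: "Issue" | "MR" | "Discussion";
  title: string;
  projectPath: string;
  url: string;
  labels: string[];
  files: string[]; // paths from DiffNotes, if any
}

function buildHeader(m: DocMeta): string {
  const lines = [
    `[[${m.sourceType}]] ${m.title}`,
    `Project: ${m.projectPath}`,
    `URL: ${m.url}`,
    `Labels: ${JSON.stringify(m.labels)}`,
  ];
  if (m.files.length > 0) {
    lines.push(`Files: ${JSON.stringify(m.files)}`); // only when DiffNotes exist
  }
  lines.push("", "--- Content ---", "");
  return lines.join("\n");
}
```

Because the header is prepended to `content_text` before hashing, a label or file-path change alone is enough to trigger re-embedding via `content_hash`.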
+ +```diff +@@ Schema Additions - documents @@ + CREATE TABLE documents ( + id INTEGER PRIMARY KEY, + source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' + source_id INTEGER NOT NULL, -- local DB id in the source table + project_id INTEGER NOT NULL REFERENCES projects(id), + author_username TEXT, -- for discussions: first note author + label_names TEXT, -- JSON array (display/debug only) + created_at INTEGER, + updated_at INTEGER, + url TEXT, + title TEXT, -- null for discussions + content_text TEXT NOT NULL, -- canonical text for embedding/snippets + content_hash TEXT NOT NULL, -- SHA-256 for change detection ++ is_truncated BOOLEAN NOT NULL DEFAULT 0, ++ truncated_reason TEXT, -- 'token_limit_middle_drop' | null + UNIQUE(source_type, source_id) + ); +``` + +```diff +@@ Discussion Document Format @@ +-[Issue #234: Authentication redesign] Discussion ++[[Discussion]] Issue #234: Authentication redesign ++Project: group/project-one ++URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ++Labels: ["bug", "auth"] ++Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread ++ ++--- Thread --- + + @johndoe (2024-03-15): + I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients... + + @janedoe (2024-03-15): + Agreed. What about refresh token strategy? + + @johndoe (2024-03-16): + Short-lived access tokens (15min), longer refresh (7 days). Here's why... 
+``` + +```diff +@@ Document Extraction Rules @@ + | Source | content_text Construction | + |--------|--------------------------| +-| Issue | `title + "\n\n" + description` | +-| MR | `title + "\n\n" + description` | ++| Issue | Structured header + `title + "\n\n" + description` | ++| MR | Structured header + `title + "\n\n" + description` | + | Discussion | Full thread with context (see below) | + ++**Structured Header Format (all document types):** ++``` ++[[{SourceType}]] {Title} ++Project: {path_with_namespace} ++URL: {web_url} ++Labels: {JSON array of label names} ++Files: {JSON array of paths from DiffNotes, if any} ++ ++--- Content --- ++``` ++ ++This format provides: ++- Stable, parseable context for embeddings ++- Consistent snippet formatting in search results ++- File path context without full file-history feature +``` + +```diff +@@ Truncation @@ +-**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning. ++**Truncation:** ++If content exceeds 8000 tokens: ++1. Truncate from the middle (preserve first + last notes for context) ++2. Set `documents.is_truncated = 1` ++3. Set `documents.truncated_reason = 'token_limit_middle_drop'` ++4. Log a warning with document ID and original token count ++ ++This metadata enables: ++- Monitoring truncation frequency in production ++- Future investigation of high-value truncated documents ++- Debugging when search misses expected content +``` + +--- + +## Change 7: Embedding Pipeline Concurrency + Per-Document Error Tracking + +**Why this is better:** For 50-100K documents, embedding is the longest pole. Controlled concurrency (4-8 workers) saturates local inference without OOM. Per-document error tracking prevents single bad payloads from stalling "100% coverage" and enables targeted re-runs. 
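The worker-pool-plus-error-record shape described here can be sketched without the real Ollama client. `embed` is a stand-in for the embedding call; concurrency and attempt defaults mirror the proposed values (4 workers, 3 attempts), and exponential backoff between attempts is omitted for brevity:

```typescript
// Sketch of the bounded-concurrency embedding loop with per-document error
// capture: a bad document records its failure and never stalls the run.
type EmbedFn = (text: string) => Promise<number[]>;

interface AttemptRecord {
  attempt_count: number;
  last_error: string | null;
}

async function embedAll(
  docs: { id: number; text: string }[],
  embed: EmbedFn,
  concurrency = 4,
  maxAttempts = 3,
): Promise<Map<number, AttemptRecord>> {
  const records = new Map<number, AttemptRecord>();
  const queue = [...docs];

  async function worker(): Promise<void> {
    for (let doc = queue.shift(); doc; doc = queue.shift()) {
      const rec: AttemptRecord = { attempt_count: 0, last_error: null };
      records.set(doc.id, rec);
      while (rec.attempt_count < maxAttempts) {
        rec.attempt_count++;
        try {
          await embed(doc.text); // real code would also store the vector
          rec.last_error = null; // success clears any earlier error
          break;
        } catch (e) {
          rec.last_error = String(e); // recorded for `gi embed --retry-failed`
          // exponential backoff would sleep here before the next attempt
        }
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, worker));
  return records;
}
```

The returned records map onto `embedding_metadata.attempt_count` / `last_error`, which is what makes targeted re-runs possible.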
+ +```diff +@@ Checkpoint 3: Embedding Generation - Scope @@ + **Scope:** + - Ollama integration (nomic-embed-text model) +-- Embedding generation pipeline (batch processing, 32 documents per batch) ++- Embedding generation pipeline: ++ - Batch size: 32 documents per batch ++ - Concurrency: configurable (default 4 workers) ++ - Retry with exponential backoff for transient failures (max 3 attempts) ++ - Per-document failure recording to enable targeted re-runs + - Vector storage in SQLite (sqlite-vss extension) + - Progress tracking and resumability + - Document extraction layer: +``` + +```diff +@@ Schema Additions - embedding_metadata @@ + CREATE TABLE embedding_metadata ( + document_id INTEGER PRIMARY KEY REFERENCES documents(id), + model TEXT NOT NULL, -- 'nomic-embed-text' + dims INTEGER NOT NULL, -- 768 + content_hash TEXT NOT NULL, -- copied from documents.content_hash +- created_at INTEGER NOT NULL ++ created_at INTEGER NOT NULL, ++ -- Error tracking for resumable embedding ++ last_error TEXT, -- error message from last failed attempt ++ attempt_count INTEGER NOT NULL DEFAULT 0, ++ last_attempt_at INTEGER -- when last attempt occurred + ); ++ ++-- Index for finding failed embeddings to retry ++CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL; +``` + +```diff +@@ Checkpoint 3 Automated Tests @@ + tests/integration/embedding-storage.test.ts + - stores embedding in sqlite-vss + - embedding rowid matches document id + - creates embedding_metadata record + - skips re-embedding when content_hash unchanged + - re-embeds when content_hash changes ++ - records error in embedding_metadata on failure ++ - increments attempt_count on each retry ++ - clears last_error on successful embedding ++ - respects concurrency limit +``` + +```diff +@@ Checkpoint 3 Manual CLI Smoke Tests @@ + | Command | Expected Output | Pass Criteria | + |---------|-----------------|---------------| + | `gi embed --all` | Progress bar with 
ETA | Completes without error | + | `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs | ++| `gi embed --retry-failed` | Progress on failed docs | Re-attempts previously failed embeddings | + | `gi stats` | Embedding coverage stats | Shows 100% coverage | + | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts | + | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error | ++ ++**Stats output should include:** ++- Total documents ++- Successfully embedded ++- Failed (with error breakdown) ++- Pending (never attempted) +``` + +--- + +## Change 8: Search UX Improvements (--project, --explain, Stable JSON Schema) + +**Why this is better:** For day-to-day use, "search across everything" is less useful than "search within repo X." The `--explain` flag helps validate ranking during MVP. Stable JSON schema prevents accidental breaking changes for agent/MCP consumption. + +```diff +@@ Checkpoint 4 Scope @@ +-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name` ++- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path` ++- Debug: `--explain` returns rank contributions from vector + FTS + RRF + - Label filtering operates on `document_labels` (indexed, exact-match) + - Output formatting: ranked list with title, snippet, score, URL +-- JSON output mode for AI agent consumption ++- JSON output mode for AI/agent consumption (stable schema, documented) + - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning +``` + +```diff +@@ CLI Interface @@ + # Basic semantic search + gi search "why did we choose Redis" + ++# Search within specific project ++gi search "authentication" --project=group/project-one ++ + # Pure FTS search (fallback if embeddings unavailable) + gi search "redis" --mode=lexical + + # Filtered search + gi search "authentication" 
--type=mr --after=2024-01-01 + + # Filter by label + gi search "performance" --label=bug --label=critical + + # JSON output for programmatic use + gi search "payment processing" --json ++ ++# Debug ranking (shows how each retriever contributed) ++gi search "authentication" --explain +``` + +```diff +@@ JSON Output Schema (NEW SECTION) @@ ++**JSON Output Schema (Stable)** ++ ++For AI/agent consumption, `--json` output follows this stable schema: ++ ++```typescript ++interface SearchResult { ++ documentId: number; ++ sourceType: "issue" | "merge_request" | "discussion"; ++ title: string | null; ++ url: string; ++ projectPath: string; ++ author: string | null; ++ createdAt: string; // ISO 8601 ++ updatedAt: string; // ISO 8601 ++ score: number; // 0-1 normalized RRF score ++ snippet: string; // truncated content_text ++ labels: string[]; ++ // Only present with --explain flag ++ explain?: { ++ vectorRank?: number; // null if not in vector results ++ ftsRank?: number; // null if not in FTS results ++ rrfScore: number; ++ }; ++} ++ ++interface SearchResponse { ++ query: string; ++ mode: "hybrid" | "lexical" | "semantic"; ++ totalResults: number; ++ results: SearchResult[]; ++ warnings?: string[]; // e.g., "Embedding service unavailable" ++} ++``` ++ ++**Schema versioning:** Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed. 
+``` + +```diff +@@ Checkpoint 4 Manual CLI Smoke Tests @@ + | Command | Expected Output | Pass Criteria | + |---------|-----------------|---------------| + | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score | ++| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project | + | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output | + | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe | + | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date | + | `gi search "authentication" --label=bug` | Label filtered | All results have bug label | + | `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama | + | `gi search "authentication" --json` | JSON output | Valid JSON matching schema | ++| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions | + | `gi search "xyznonexistent123"` | No results message | Graceful empty state | + | `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results | +``` + +--- + +## Change 9: Make `gi sync` an Orchestrator + +**Why this is better:** Once CP3+ exist, operators want one command that does the right thing. The most common MVP failure is "I ingested but forgot to regenerate docs / embed / update FTS." 
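The orchestration reduces to a fixed step order with skip flags and a lock released on every exit path. A sketch with placeholder step functions (the names track the CLI flags proposed in this change, not a real implementation):

```typescript
// Sketch of `gi sync` as an orchestrator: fixed step order, --no-docs/--no-embed
// skip flags, lock always released with a terminal status.
interface SyncFlags {
  noDocs?: boolean;
  noEmbed?: boolean;
}

interface SyncSteps {
  acquireLock: () => Promise<void>;
  ingestDelta: () => Promise<void>;
  backfillWindow: () => Promise<void>;
  regenerateDocs: () => Promise<void>;
  embedChanged: () => Promise<void>;
  releaseLock: (status: "succeeded" | "failed") => Promise<void>;
}

async function runSync(steps: SyncSteps, flags: SyncFlags = {}): Promise<void> {
  await steps.acquireLock();
  try {
    await steps.ingestDelta();
    await steps.backfillWindow();
    if (!flags.noDocs) await steps.regenerateDocs();
    if (!flags.noEmbed) await steps.embedChanged();
    await steps.releaseLock("succeeded"); // records sync_run as succeeded
  } catch (e) {
    await steps.releaseLock("failed"); // lock never leaks on error
    throw e;
  }
}
```

FTS has no explicit step because triggers keep it synchronized with `documents`.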
+ +```diff +@@ Checkpoint 5 CLI Commands @@ + ```bash +-# Full sync (respects cursors, only fetches new/updated) +-gi sync ++# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced) ++gi sync # orchestrates all steps ++gi sync --no-embed # skip embedding step (fast ingest/debug) ++gi sync --no-docs # skip document regeneration (debug) + + # Force full re-sync (resets cursors) + gi sync --full + + # Override stale 'running' run after operator review + gi sync --force + + # Show sync status + gi sync-status + ``` ++ ++**Orchestration steps (in order):** ++1. Acquire app lock with heartbeat ++2. Ingest delta (issues, MRs, discussions) based on cursors ++3. Apply rolling backfill window ++4. Regenerate documents for changed entities ++5. Embed documents with changed content_hash ++6. FTS triggers auto-sync (no explicit step needed) ++7. Release lock, record sync_run as succeeded ++ ++Individual commands remain available for checkpoint testing and debugging: ++- `gi ingest --type=issues` ++- `gi ingest --type=merge_requests` ++- `gi embed --all` ++- `gi embed --retry-failed` +``` + +--- + +## Change 10: Checkpoint Focus Sharpening + +**Why this is better:** Makes each checkpoint's exit criteria crisper and reduces overlap. 
+ +```diff +@@ Checkpoint 0: Project Setup @@ +-**Deliverable:** Scaffolded project with GitLab API connection verified ++**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working + + **Scope:** + - Project structure (TypeScript, ESLint, Vitest) + - GitLab API client with PAT authentication + - Environment and project configuration + - Basic CLI scaffold with `auth-test` command + - `doctor` command for environment verification +-- Projects table and initial sync +-- Sync tracking for reliability ++- Projects table and initial project resolution (no issue/MR ingestion yet) ++- DB migrations + WAL + FK enforcement ++- Sync tracking with crash-safe single-flight lock ++- Rate limit handling with exponential backoff + jitter +``` + +```diff +@@ Checkpoint 1 Deliverable @@ +-**Deliverable:** All issues from target repos stored locally ++**Deliverable:** All issues + labels from target repos stored locally with resumable cursor-based sync +``` + +```diff +@@ Checkpoint 2 Deliverable @@ +-**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context ++**Deliverable:** All MRs + discussions + notes (including flagged system notes) stored locally with full thread context and DiffNote file paths captured +``` + +--- + +## Change 11: Risk Mitigation Updates + +```diff +@@ Risk Mitigation @@ + | Risk | Mitigation | + |------|------------| + | GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync | + | Embedding model quality | Start with nomic-embed-text; architecture allows model swap | + | SQLite scale limits | Monitor performance; Postgres migration path documented | + | Stale data | Incremental sync with change detection | +-| Mid-sync failures | Cursor-based resumption, sync_runs audit trail | ++| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery | ++| Missed updates | Rolling backfill window 
(14 days), tuple cursor semantics | + | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | +-| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | ++| Concurrent sync corruption | DB-enforced app lock with heartbeat, automatic stale lock recovery | ++| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs | +``` + +--- + +## Change 12: Resolved Decisions Updates + +```diff +@@ Resolved Decisions @@ + | Question | Decision | Rationale | + |----------|----------|-----------| + | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability | +-| System notes | **Exclude during ingestion** | System notes add noise without semantic value | ++| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise | ++| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature | + | MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature | + | Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering | + | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings | + | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient | ++| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates | + | Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates | + | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | + | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping | + | Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context | +-| Embedding batching | **32 documents per batch** | Balance 
throughput and memory | ++| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation | + | FTS5 tokenizer | **porter unicode61** | Stemming improves recall | + | Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching | ++| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | +``` + +--- + +## Summary of All Changes + +| # | Change | Impact | +|---|--------|--------| +| 1 | Crash-safe heartbeat lock | Prevents race conditions, auto-recovers from crashes | +| 2 | Tuple cursor + rolling backfill | Reduces risk of missed updates dramatically | +| 3 | project_id on raw_payloads | Defensive scoping for multi-project scenarios | +| 4 | Uniqueness on (project_id, iid) | Enables O(1) `gi show issue 123 --project=X` | +| 5 | Store system notes flagged + DiffNote paths | Preserves audit trail, enables immediate file search | +| 6 | Structured document header + truncation metadata | Better embeddings, debuggability | +| 7 | Embedding concurrency + per-doc errors | 50-100K docs becomes manageable | +| 8 | --project, --explain, stable JSON | Day-to-day UX and trust-building | +| 9 | `gi sync` orchestrator | Reduces human error | +| 10 | Checkpoint focus sharpening | Clearer exit criteria | +| 11-12 | Risk/Decisions updates | Documentation alignment | + +**Net effect:** Same MVP product (semantic search over issues/MRs/discussions), but with production-grade hardening that prevents the class of bugs that typically kill MVPs in real-world use. 
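As a closing sanity check on the tuple-cursor semantics from Change 2: the fetch predicate and cursor advance reduce to a two-field comparison. A minimal sketch, assuming epoch-second `updated_at` values:

```typescript
// Sketch of the tuple-cursor rule: an item is "new" iff (updated_at, gitlab_id)
// sorts strictly after the cursor. The same predicate filters locally when the
// API cannot express the tie-break server-side.
interface Cursor {
  updated_at: number;
  gitlab_id: number;
}

interface Item {
  updated_at: number;
  gitlab_id: number;
}

function isAfterCursor(item: Item, c: Cursor): boolean {
  return (
    item.updated_at > c.updated_at ||
    (item.updated_at === c.updated_at && item.gitlab_id > c.gitlab_id)
  );
}

// Called only after the page's DB transaction commits.
function advanceCursor(lastProcessed: Item): Cursor {
  return {
    updated_at: lastProcessed.updated_at,
    gitlab_id: lastProcessed.gitlab_id,
  };
}
```

Because the comparison is strict on both fields, dense timestamp buckets (many items sharing one `updated_at`) resume correctly mid-bucket after a crash.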
diff --git a/SPEC.md b/SPEC.md index 0d242bf..752d7bc 100644 --- a/SPEC.md +++ b/SPEC.md @@ -6,6 +6,69 @@ A self-hosted tool to extract, index, and semantically search 2+ years of GitLab --- +## Quick Start + +### Prerequisites + +| Requirement | Version | Notes | +|-------------|---------|-------| +| Node.js | 20+ | LTS recommended | +| npm | 10+ | Comes with Node.js | +| Ollama | Latest | Optional for semantic search; lexical search works without it | + +### Installation + +```bash +# Clone and install +git clone https://github.com/your-org/gitlab-inbox.git +cd gitlab-inbox +npm install +npm run build +npm link # Makes `gi` available globally +``` + +### First Run + +1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope): + ```bash + export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx" + ``` + +2. **Run the setup wizard:** + ```bash + gi init + ``` + This creates `gi.config.json` with your GitLab URL and project paths. + +3. **Verify your environment:** + ```bash + gi doctor + ``` + All checks should pass (Ollama warning is OK if you only need lexical search). + +4. **Sync your data:** + ```bash + gi sync + ``` + Initial sync takes 10-20 minutes depending on repo size and rate limits. + +5. **Search:** + ```bash + gi search "authentication redesign" + ``` + +### Troubleshooting First Run + +| Symptom | Solution | +|---------|----------| +| `Config file not found` | Run `gi init` first | +| `GITLAB_TOKEN not set` | Export the environment variable | +| `401 Unauthorized` | Check token has `read_api` scope | +| `Project not found: group/project` | Verify project path in GitLab URL | +| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search | + +--- + ## Discovery Summary ### Pain Points Identified @@ -105,10 +168,12 @@ GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&pe Discussions must be fetched per-issue and per-MR. 
There is no bulk endpoint: ``` -GET /projects/:id/issues/:iid/discussions -GET /projects/:id/merge_requests/:iid/discussions +GET /projects/:id/issues/:iid/discussions?per_page=100&page=N +GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N ``` +**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss. + ### Sync Pattern **Initial sync:** @@ -124,17 +189,26 @@ GET /projects/:id/merge_requests/:iid/discussions 3. Fetch MRs where `updated_after=cursor` (bulk) 4. For EACH updated MR → refetch ALL its discussions -### Critical Assumption +### Critical Assumption (Softened) -**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed. +We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively. -Mitigation: Periodic full re-sync (weekly) as a safety net. +**Mitigations (MVP):** +1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor. +2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly. +3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`). + +The backfill window provides 80% of the safety of full resync at <5% of the API cost. 
### Rate Limiting - Default: 10 requests/second with exponential backoff - Respect `Retry-After` headers on 429 responses - Add jitter to avoid thundering herd on retry +- **Separate concurrency limits:** + - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4) + - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s) + - Bound concurrency per-project to avoid one repo starving the other - Initial sync estimate: 10-20 minutes depending on rate limits --- @@ -144,7 +218,7 @@ Mitigation: Periodic full re-sync (weekly) as a safety net. Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding. ### Checkpoint 0: Project Setup -**Deliverable:** Scaffolded project with GitLab API connection verified +**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working **Automated Tests (Vitest):** ``` @@ -165,6 +239,23 @@ tests/integration/gitlab-client.test.ts ✓ returns 401 for invalid PAT ✓ fetches project by path ✓ handles rate limiting (429) with retry + +tests/integration/app-lock.test.ts + ✓ acquires lock successfully + ✓ updates heartbeat during operation + ✓ detects stale lock and recovers + ✓ refuses concurrent acquisition + +tests/integration/init.test.ts + ✓ creates config file with valid structure + ✓ validates GitLab URL format + ✓ validates GitLab connection before writing config + ✓ validates each project path exists in GitLab + ✓ fails if token not set + ✓ fails if GitLab auth fails + ✓ fails if any project path not found + ✓ prompts before overwriting existing config + ✓ respects --force to skip confirmation ``` **Manual CLI Smoke Tests:** @@ -174,6 +265,10 @@ tests/integration/gitlab-client.test.ts | `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) | | `gi doctor --json` | JSON object with check results | Valid JSON, 
`success: true` for required checks | | `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure | +| `gi init` | Interactive prompts | Creates valid gi.config.json | +| `gi init` (config exists) | Confirmation prompt | Warns before overwriting | +| `gi --help` | Command list | Shows all available commands | +| `gi version` | Version number | Shows installed version | **Data Integrity Checks:** - [ ] `projects` table contains rows for each configured project path @@ -186,7 +281,24 @@ tests/integration/gitlab-client.test.ts - Environment and project configuration - Basic CLI scaffold with `auth-test` command - `doctor` command for environment verification -- Projects table and initial sync +- Projects table and initial project resolution (no issue/MR ingestion yet) +- DB migrations + WAL + FK enforcement +- Sync tracking with crash-safe single-flight lock (heartbeat-based) +- Rate limit handling with exponential backoff + jitter +- `gi init` command for guided setup: + - Prompts for GitLab base URL + - Prompts for project paths (comma-separated or multiple prompts) + - Prompts for token environment variable name (default: GITLAB_TOKEN) + - **Validates before writing config:** + - Token must be set in environment + - Tests auth with `GET /user` endpoint + - Validates each project path with `GET /projects/:path` + - Only writes config after all validations pass + - Generates `gi.config.json` with sensible defaults +- `gi --help` shows all available commands +- `gi --help` shows command-specific help +- `gi version` shows installed version +- First-run detection: if no config exists, suggest `gi init` **Configuration (MVP):** ```json @@ -200,10 +312,22 @@ tests/integration/gitlab-client.test.ts { "path": "group/project-one" }, { "path": "group/project-two" } ], + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, + "heartbeatIntervalSeconds": 30, + "cursorRewindSeconds": 2, + "primaryConcurrency": 4, + 
"dependentConcurrency": 2 + }, + "storage": { + "compressRawPayloads": true + }, "embedding": { "provider": "ollama", "model": "nomic-embed-text", - "baseUrl": "http://localhost:11434" + "baseUrl": "http://localhost:11434", + "concurrency": 4 } } ``` @@ -232,12 +356,21 @@ CREATE INDEX idx_projects_path ON projects(path_with_namespace); CREATE TABLE sync_runs ( id INTEGER PRIMARY KEY, started_at INTEGER NOT NULL, + heartbeat_at INTEGER NOT NULL, finished_at INTEGER, status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. error TEXT ); +-- Crash-safe single-flight lock (DB-enforced) +CREATE TABLE app_locks ( + name TEXT PRIMARY KEY, -- 'sync' + owner TEXT NOT NULL, -- random run token (UUIDv4) + acquired_at INTEGER NOT NULL, + heartbeat_at INTEGER NOT NULL +); + -- Sync cursors for primary resources only -- Notes and MR changes are dependent resources (fetched via parent updates) CREATE TABLE sync_cursors ( @@ -252,18 +385,21 @@ CREATE TABLE sync_cursors ( CREATE TABLE raw_payloads ( id INTEGER PRIMARY KEY, source TEXT NOT NULL, -- 'gitlab' - resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' + project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources + resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion' gitlab_id INTEGER NOT NULL, fetched_at INTEGER NOT NULL, - json TEXT NOT NULL + content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip' + payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON ); -CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); +CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id); +CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at); ``` --- ### Checkpoint 1: Issue Ingestion -**Deliverable:** All issues from target repos stored locally +**Deliverable:** All issues + labels + 
issue discussions from target repos stored locally with resumable cursor-based sync **Automated Tests (Vitest):** ``` @@ -277,6 +413,13 @@ tests/unit/pagination.test.ts ✓ respects per_page parameter ✓ stops when empty page returned +tests/unit/discussion-transformer.test.ts + ✓ transforms discussion payload to normalized schema + ✓ extracts notes array from discussion + ✓ sets individual_note flag correctly + ✓ flags system notes with is_system=1 + ✓ preserves note order via position field + tests/integration/issue-ingestion.test.ts ✓ inserts issues into database ✓ creates labels from issue payloads @@ -285,6 +428,13 @@ tests/integration/issue-ingestion.test.ts ✓ updates cursor after successful page commit ✓ resumes from cursor on subsequent runs +tests/integration/issue-discussion-ingestion.test.ts + ✓ fetches discussions for each issue + ✓ creates discussion rows with correct issue FK + ✓ creates note rows linked to discussions + ✓ stores system notes with is_system=1 flag + ✓ handles individual_note=true discussions + tests/integration/sync-runs.test.ts ✓ creates sync_run record on start ✓ marks run as succeeded on completion @@ -300,7 +450,9 @@ tests/integration/sync-runs.test.ts | `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author | | `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project | | `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI | -| `gi show issue 123` | Issue detail view | Shows title, description, labels, URL | +| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL | +| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count | +| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count | | `gi sync-status` | Last sync time, cursor positions | Shows successful last run | **Data Integrity Checks:** @@ -309,6 +461,9 @@ 
tests/integration/sync-runs.test.ts - [ ] Labels in `issue_labels` junction all exist in `labels` table - [ ] `sync_cursors` has entry for each (project_id, 'issues') pair - [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current) +- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero +- [ ] Every discussion has at least one note +- [ ] `individual_note = true` discussions have exactly one note **Scope:** - Issue fetcher with pagination handling @@ -317,12 +472,26 @@ tests/integration/sync-runs.test.ts - Labels ingestion derived from issue payload: - Always persist label names from `labels: string[]` - Optionally request `with_labels_details=true` to capture color/description when available +- Issue discussions fetcher: + - Uses `GET /projects/:id/issues/:iid/discussions` + - Fetches all discussions for each issue during ingest + - Preserve system notes but flag them with `is_system=1` - Incremental sync support (run tracking + per-project cursor) - Basic list/count CLI commands **Reliability/Idempotency Rules:** - Every ingest/sync creates a `sync_runs` row -- Single-flight: refuse to start if an existing run is `running` (unless `--force`) +- Single-flight via DB-enforced app lock: + - On start: acquire lock via transactional compare-and-swap: + - `BEGIN IMMEDIATE` (acquires write lock immediately) + - If no row exists → INSERT new lock + - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps + - Else if `owner` matches current run → UPDATE heartbeat (re-entrant) + - Else → ROLLBACK and fail fast (another run is active) + - `COMMIT` + - During run: update `heartbeat_at` every 30 seconds + - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire + - `--force` remains as operator override for edge cases, but should rarely be needed - Cursor advances only after successful transaction commit per page/batch - Ordering: `updated_at ASC`, tie-breaker 
`gitlab_id ASC` - Use explicit transactions for batch inserts @@ -340,11 +509,13 @@ CREATE TABLE issues ( author_username TEXT, created_at INTEGER, updated_at INTEGER, + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync web_url TEXT, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); CREATE INDEX idx_issues_author ON issues(author_username); +CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid); -- Labels are derived from issue payloads (string array) -- Uniqueness is (project_id, name) since gitlab_id isn't always available @@ -365,12 +536,65 @@ CREATE TABLE issue_labels ( PRIMARY KEY(issue_id, label_id) ); CREATE INDEX idx_issue_labels_label ON issue_labels(label_id); + +-- Discussion threads for issues (MR discussions added in CP2) +CREATE TABLE discussions ( + id INTEGER PRIMARY KEY, + gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...") + project_id INTEGER NOT NULL REFERENCES projects(id), + issue_id INTEGER REFERENCES issues(id), + merge_request_id INTEGER REFERENCES merge_requests(id), + noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' + individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion + first_note_at INTEGER, -- for ordering discussions + last_note_at INTEGER, -- for "recently active" queries + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + resolvable BOOLEAN, -- MR discussions can be resolved + resolved BOOLEAN, + CHECK ( + (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR + (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) + ) +); +CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id); +CREATE INDEX idx_discussions_issue ON discussions(issue_id); +CREATE INDEX idx_discussions_mr ON discussions(merge_request_id); +CREATE INDEX 
idx_discussions_last_note ON discussions(last_note_at); + +-- Notes belong to discussions (preserving thread context) +CREATE TABLE notes ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + discussion_id INTEGER NOT NULL REFERENCES discussions(id), + project_id INTEGER NOT NULL REFERENCES projects(id), + type TEXT, -- 'DiscussionNote' | 'DiffNote' | null + is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.) + author_username TEXT, + body TEXT, + created_at INTEGER, + updated_at INTEGER, + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync + position INTEGER, -- derived from array order in API response (0-indexed) + resolvable BOOLEAN, + resolved BOOLEAN, + resolved_by TEXT, + resolved_at INTEGER, + -- DiffNote position metadata (only populated for MR DiffNotes in CP2) + position_old_path TEXT, + position_new_path TEXT, + position_old_line INTEGER, + position_new_line INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_notes_discussion ON notes(discussion_id); +CREATE INDEX idx_notes_author ON notes(author_username); +CREATE INDEX idx_notes_system ON notes(is_system); ``` --- -### Checkpoint 2: MR + Discussions Ingestion -**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context +### Checkpoint 2: MR Ingestion +**Deliverable:** All MRs + MR discussions + notes with DiffNote paths captured **Automated Tests (Vitest):** ``` @@ -379,12 +603,9 @@ tests/unit/mr-transformer.test.ts ✓ extracts labels from MR payload ✓ handles missing optional fields gracefully -tests/unit/discussion-transformer.test.ts - ✓ transforms discussion payload to normalized schema - ✓ extracts notes array from discussion - ✓ sets individual_note flag correctly - ✓ filters out system notes (system: true) - ✓ preserves note order via position field +tests/unit/diffnote-transformer.test.ts + ✓ extracts DiffNote position metadata (paths and 
lines) + ✓ handles missing position fields gracefully tests/integration/mr-ingestion.test.ts ✓ inserts MRs into database @@ -392,12 +613,11 @@ tests/integration/mr-ingestion.test.ts ✓ links MRs to labels via junction table ✓ stores raw payload for each MR -tests/integration/discussion-ingestion.test.ts - ✓ fetches discussions for each issue +tests/integration/mr-discussion-ingestion.test.ts ✓ fetches discussions for each MR - ✓ creates discussion rows with correct parent FK + ✓ creates discussion rows with correct MR FK ✓ creates note rows linked to discussions - ✓ excludes system notes from storage + ✓ extracts position_new_path from DiffNotes ✓ captures note-level resolution status ✓ captures note type (DiscussionNote, DiffNote) ``` @@ -409,28 +629,25 @@ tests/integration/discussion-ingestion.test.ts | `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch | | `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI | | `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads | -| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads | -| `gi count discussions` | `Discussions: 12,345` | Non-zero count | -| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes | +| `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) | +| `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only | +| `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count -- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments -- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked) -- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... 
WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes) -- [ ] Every discussion has at least one note -- [ ] `individual_note = true` discussions have exactly one note +- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero +- [ ] DiffNotes have `position_new_path` populated when available - [ ] Discussion `first_note_at` <= `last_note_at` for all rows **Scope:** - MR fetcher with pagination -- Discussions fetcher (issue discussions + MR discussions) as a dependent resource: - - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions` - - During initial ingest: fetch discussions for every issue/MR - - During sync: refetch discussions only for issues/MRs updated since cursor - - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise -- Relationship linking (discussion → parent issue/MR, notes → discussion) -- Extended CLI commands for MR/issue display with threads +- MR discussions fetcher: + - Uses `GET /projects/:id/merge_requests/:iid/discussions` + - Fetches all discussions for each MR during ingest + - Capture DiffNote file path/line metadata from `position` field for filename search +- Relationship linking (discussion → MR, notes → discussion) +- Extended CLI commands for MR display with threads +- Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries **Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries. 
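+
+A minimal sketch of the DiffNote position capture described above (the helper and return shape are illustrative; the `position` field layout follows the GitLab notes payload):
+
+```typescript
+interface RawNote {
+  type?: string | null; // 'DiffNote' | 'DiscussionNote' | null
+  position?: {
+    old_path?: string | null;
+    new_path?: string | null;
+    old_line?: number | null;
+    new_line?: number | null;
+  } | null;
+}
+
+// Maps a raw note to the position_* columns; non-DiffNotes yield all NULLs.
+function extractDiffNotePosition(note: RawNote) {
+  const p = note.type === "DiffNote" ? note.position : null;
+  return {
+    position_old_path: p?.old_path ?? null,
+    position_new_path: p?.new_path ?? null,
+    position_old_line: p?.old_line ?? null,
+    position_new_line: p?.new_line ?? null,
+  };
+}
+```
+
+Extracting `position_new_path` at ingest time is what makes the `idx_notes_new_path` index and the later `--path` search filter possible without re-parsing raw payloads.
+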
@@ -449,77 +666,38 @@ CREATE TABLE merge_requests ( target_branch TEXT, created_at INTEGER, updated_at INTEGER, + last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync merged_at INTEGER, web_url TEXT, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); CREATE INDEX idx_mrs_author ON merge_requests(author_username); +CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid); --- Discussion threads (the semantic unit for conversations) -CREATE TABLE discussions ( - id INTEGER PRIMARY KEY, - gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...") - project_id INTEGER NOT NULL REFERENCES projects(id), - issue_id INTEGER REFERENCES issues(id), - merge_request_id INTEGER REFERENCES merge_requests(id), - noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' - individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion - first_note_at INTEGER, -- for ordering discussions - last_note_at INTEGER, -- for "recently active" queries - resolvable BOOLEAN, -- MR discussions can be resolved - resolved BOOLEAN, - CHECK ( - (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR - (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) - ) -); -CREATE INDEX idx_discussions_issue ON discussions(issue_id); -CREATE INDEX idx_discussions_mr ON discussions(merge_request_id); -CREATE INDEX idx_discussions_last_note ON discussions(last_note_at); - --- Notes belong to discussions (preserving thread context) -CREATE TABLE notes ( - id INTEGER PRIMARY KEY, - gitlab_id INTEGER UNIQUE NOT NULL, - discussion_id INTEGER NOT NULL REFERENCES discussions(id), - project_id INTEGER NOT NULL REFERENCES projects(id), - type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API) - author_username TEXT, - body TEXT, - created_at INTEGER, - updated_at INTEGER, - position 
INTEGER, -- derived from array order in API response (0-indexed) - resolvable BOOLEAN, -- note-level resolvability (MR code comments) - resolved BOOLEAN, -- note-level resolution status - resolved_by TEXT, -- username who resolved - resolved_at INTEGER, -- when resolved - raw_payload_id INTEGER REFERENCES raw_payloads(id) -); -CREATE INDEX idx_notes_discussion ON notes(discussion_id); -CREATE INDEX idx_notes_author ON notes(author_username); -CREATE INDEX idx_notes_type ON notes(type); - --- MR labels (reuse same labels table) +-- MR labels (reuse same labels table from CP1) CREATE TABLE mr_labels ( merge_request_id INTEGER REFERENCES merge_requests(id), label_id INTEGER REFERENCES labels(id), PRIMARY KEY(merge_request_id, label_id) ); CREATE INDEX idx_mr_labels_label ON mr_labels(label_id); + +-- Additional indexes for DiffNote queries (tables created in CP1) +CREATE INDEX idx_notes_type ON notes(type); +CREATE INDEX idx_notes_new_path ON notes(position_new_path); ``` -**Discussion Processing Rules:** -- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.) 
-- Each discussion from the API becomes one row in `discussions` table -- All notes within a discussion are stored with their `discussion_id` foreign key -- `individual_note: true` discussions have exactly one note (standalone comment) -- `individual_note: false` discussions have multiple notes (threaded conversation) +**MR Discussion Processing Rules:** +- DiffNote position data is extracted and stored: + - `position.old_path`, `position.new_path` for file-level search + - `position.old_line`, `position.new_line` for line-level context +- MR discussions can be resolvable; resolution status is captured at note level --- -### Checkpoint 3: Embedding Generation -**Deliverable:** Vector embeddings generated for all text content +### Checkpoint 3: Document + Embedding Generation with Lexical Search +**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end **Automated Tests (Vitest):** ``` @@ -563,6 +741,7 @@ tests/integration/embedding-storage.test.ts | `gi stats` | Embedding coverage stats | Shows 100% coverage | | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts | | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error | +| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions @@ -574,7 +753,11 @@ tests/integration/embedding-storage.test.ts **Scope:** - Ollama integration (nomic-embed-text model) -- Embedding generation pipeline (batch processing, 32 documents per batch) +- Embedding generation pipeline: + - Batch size: 32 documents per batch + - Concurrency: configurable (default 4 workers) + - Retry with exponential backoff for transient failures (max 3 attempts) + - Per-document failure recording to enable targeted re-runs - Vector storage in SQLite (sqlite-vss extension) - Progress tracking and 
resumability - Document extraction layer: @@ -582,8 +765,14 @@ tests/integration/embedding-storage.test.ts - Stable content hashing for change detection (SHA-256 of content_text) - Single embedding per document (chunking deferred to post-MVP) - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192) + - **Implementation:** Use character budget, not exact token count + - `maxChars = 32000` (conservative 4 chars/token estimate) + - `approxTokens = ceil(charCount / 4)` for reporting/logging only + - This avoids tokenizer dependency while preventing embedding failures - Denormalized metadata for fast filtering (author, labels, dates) - Fast label filtering via `document_labels` join table +- FTS5 index for lexical search (enables `gi search --mode=lexical` without Ollama) +- `gi search --mode=lexical` CLI command (works without Ollama) **Schema Additions:** ```sql @@ -601,6 +790,8 @@ CREATE TABLE documents ( title TEXT, -- null for discussions content_text TEXT NOT NULL, -- canonical text for embedding/snippets content_hash TEXT NOT NULL, -- SHA-256 for change detection + is_truncated BOOLEAN NOT NULL DEFAULT 0, + truncated_reason TEXT, -- 'token_limit_middle_drop' | null UNIQUE(source_type, source_id) ); CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at); @@ -628,8 +819,31 @@ CREATE TABLE embedding_metadata ( model TEXT NOT NULL, -- 'nomic-embed-text' dims INTEGER NOT NULL, -- 768 content_hash TEXT NOT NULL, -- copied from documents.content_hash - created_at INTEGER NOT NULL + created_at INTEGER NOT NULL, + -- Error tracking for resumable embedding + last_error TEXT, -- error message from last failed attempt + attempt_count INTEGER NOT NULL DEFAULT 0, + last_attempt_at INTEGER -- when last attempt occurred ); + +-- Index for finding failed embeddings to retry +CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL; + +-- Track sources that require document 
regeneration (populated during ingestion) +CREATE TABLE dirty_sources ( + source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' + source_id INTEGER NOT NULL, -- local DB id + queued_at INTEGER NOT NULL, + PRIMARY KEY(source_type, source_id) +); + +-- Fast path filtering for documents (extracted from DiffNote positions) +CREATE TABLE document_paths ( + document_id INTEGER NOT NULL REFERENCES documents(id), + path TEXT NOT NULL, + PRIMARY KEY(document_id, path) +); +CREATE INDEX idx_document_paths_path ON document_paths(path); ``` **Storage Rule (MVP):** @@ -647,7 +861,13 @@ CREATE TABLE embedding_metadata ( **Discussion Document Format:** ``` -[Issue #234: Authentication redesign] Discussion +[[Discussion]] Issue #234: Authentication redesign +Project: group/project-one +URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345 +Labels: ["bug", "auth"] +Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread + +--- Thread --- @johndoe (2024-03-15): I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients... @@ -661,16 +881,32 @@ Short-lived access tokens (15min), longer refresh (7 days). Here's why... This format preserves: - Parent context (issue/MR title and number) +- Project path for scoped search +- Direct URL for navigation +- Labels for context +- File paths from DiffNotes (enables immediate file search) - Author attribution for each note - Temporal ordering of the conversation - Full thread semantics for decision traceability -**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning. +**Truncation:** +If content exceeds 8000 tokens: +**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`. + +1. Truncate from the middle (preserve first + last notes for context) +2. Set `documents.is_truncated = 1` +3. 
Set `documents.truncated_reason = 'token_limit_middle_drop'` +4. Log a warning with document ID and original token count + +This metadata enables: +- Monitoring truncation frequency in production +- Future investigation of high-value truncated documents +- Debugging when search misses expected content --- -### Checkpoint 4: Semantic Search -**Deliverable:** Working semantic search across all indexed content +### Checkpoint 4: Hybrid Search (Semantic + Lexical) +**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content **Automated Tests (Vitest):** ``` @@ -713,13 +949,18 @@ tests/e2e/golden-queries.test.ts | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score | +| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project | | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output | | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe | | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date | | `gi search "authentication" --label=bug` | Label filtered | All results have bug label | -| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama | -| `gi search "authentication" --json` | JSON output | Valid JSON array with schema | +| `gi search "redis" --mode=lexical` | FTS results only | Works without Ollama | +| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ | +| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema | +| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions | +| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results | | `gi search 
"xyznonexistent123"` | No results message | Graceful empty state | +| `gi search "auth"` (no data synced) | No data message | Shows "Run gi sync first" | | `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results | **Golden Query Test Suite:** @@ -746,12 +987,23 @@ Each query must have at least one expected URL appear in top 10 results. - Hybrid retrieval: - Vector recall (sqlite-vss) + FTS lexical recall (fts5) - Merge + rerank results using Reciprocal Rank Fusion (RRF) +- Query embedding generation (same Ollama pipeline as documents) - Result ranking and scoring (document-level) -- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name` +- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N` + - `--limit=N` controls result count (default: 20, max: 100) + - `--path` filters documents by referenced file paths (from DiffNote positions) + - MVP: substring/exact match; glob patterns deferred - Label filtering operates on `document_labels` (indexed, exact-match) + - Filters work identically in hybrid and lexical modes +- Debug: `--explain` returns rank contributions from vector + FTS + RRF - Output formatting: ranked list with title, snippet, score, URL -- JSON output mode for AI agent consumption +- JSON output mode for AI/agent consumption (stable documented schema) - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning +- Empty state handling: + - No documents indexed: `No data indexed. 
Run 'gi sync' first.` + - Query returns no results: `No results found for "query".` + - Filters exclude all results: `No results match the specified filters.` + - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search") **Schema Additions:** ```sql @@ -806,14 +1058,20 @@ END; - Well-established in information retrieval literature **Graceful Degradation:** -- If Ollama is unreachable during search, automatically fall back to FTS5-only +- If Ollama is unreachable during search, automatically fall back to FTS5-only search - Display warning: "Embedding service unavailable, using lexical search only" - `embed` command fails with actionable error if Ollama is down **CLI Interface:** ```bash # Basic semantic search -gi search "why did we choose Redis" +gi search "authentication redesign" + +# Search within specific project +gi search "authentication" --project=group/project-one + +# Search by file path (finds discussions/MRs touching this file) +gi search "rate limit" --path=src/client.ts # Pure FTS search (fallback if embeddings unavailable) gi search "redis" --mode=lexical @@ -826,11 +1084,14 @@ gi search "performance" --label=bug --label=critical # JSON output for programmatic use gi search "payment processing" --json + +# Explain search (shows RRF contributions) +gi search "auth" --explain ``` **CLI Output Example:** ``` -$ gi search "authentication redesign" +$ gi search "authentication" Found 23 results (hybrid search, 0.34s) @@ -850,6 +1111,42 @@ Found 23 results (hybrid search, 0.34s) https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ``` +**JSON Output Schema (Stable):** + +For AI/agent consumption, `--json` output follows this stable schema: + +```typescript +interface SearchResult { + documentId: number; + sourceType: "issue" | "merge_request" | "discussion"; + title: string | null; + url: string; + projectPath: string; + author: string | null; + createdAt: string; // ISO 8601 + updatedAt: string; // ISO 8601 + score: number; 
// 0-1 normalized RRF score
+  snippet: string;             // truncated content_text
+  labels: string[];
+  // Only present with --explain flag
+  explain?: {
+    vectorRank: number | null; // null when not in vector results
+    ftsRank: number | null;    // null when not in FTS results
+    rrfScore: number;
+  };
+}
+
+interface SearchResponse {
+  query: string;
+  mode: "hybrid" | "lexical" | "semantic";
+  totalResults: number;
+  results: SearchResult[];
+  warnings?: string[]; // e.g., "Embedding service unavailable"
+}
+```
+
+**Schema versioning:** Breaking changes require a major version bump of the CLI. Non-breaking additions (new optional fields) are allowed.
+
 ---

 ### Checkpoint 5: Incremental Sync

@@ -888,7 +1185,7 @@ tests/integration/sync-recovery.test.ts
 |---------|-----------------|---------------|
 | `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
 | `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
-| `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything |
+| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
 | `gi sync-status` | Cursor positions, last sync time | Shows current state |
 | `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
 | `gi search "new content"` (after sync) | Returns new content | New content is searchable |
@@ -913,20 +1210,33 @@ tests/integration/sync-recovery.test.ts
 - [ ] `sync_runs` has complete audit trail

 **Scope:**
-- Delta sync based on stable cursor (updated_at + tie-breaker id)
+- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
+- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
 - Dependent resources sync strategy (discussions refetched when parent updates)
 - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
 - Sync status reporting
 - Recommended: run via cron every 10 minutes

 **Correctness Rules (MVP):**
-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
-2. Cursor advances only after successful DB commit for that page
+1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
+2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
+   - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
+   - Use **cursor rewind + local filtering**:
+     - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
+     - Locally discard items where:
+       - `updated_at < cursor_updated_at`, OR
+       - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
+     - This makes the tuple cursor rule true in practice while keeping API calls simple.
+   - Cursor advances only after successful DB commit for that page
+   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
 3. Dependent resources:
    - For each updated issue/MR, refetch ALL its discussions
    - Discussion documents are regenerated and re-embedded if content_hash changes
-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
-5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
+4. Rolling backfill window:
+   - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
+   - This catches items the cursor-based delta fetch may have missed (e.g., updates whose timestamps landed out of order)
+5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
+6. 
Sync run is marked 'failed' with error message if any page fails (can resume from cursor) **Why Dependent Resource Model:** - GitLab Discussions API doesn't provide a global `updated_after` stream @@ -935,8 +1245,10 @@ tests/integration/sync-recovery.test.ts **CLI Commands:** ```bash -# Full sync (respects cursors, only fetches new/updated) -gi sync +# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced) +gi sync # orchestrates all steps +gi sync --no-embed # skip embedding step (fast ingest/debug) +gi sync --no-docs # skip document regeneration (debug) # Force full re-sync (resets cursors) gi sync --full @@ -948,6 +1260,211 @@ gi sync --force gi sync-status ``` +**Orchestration steps (in order):** +1. Acquire app lock with heartbeat +2. Ingest delta (issues, MRs, discussions) based on cursors + - During ingestion, INSERT into `dirty_sources` for each upserted entity +3. Apply rolling backfill window +4. Regenerate documents for entities in `dirty_sources` (process + delete from queue) +5. Embed documents with changed content_hash +6. FTS triggers auto-sync (no explicit step needed) +7. Release lock, record sync_run as succeeded + +Individual commands remain available for checkpoint testing and debugging: +- `gi ingest --type=issues` +- `gi ingest --type=merge_requests` +- `gi embed --all` +- `gi embed --retry-failed` + +--- + +## CLI Command Reference + +All commands support `--help` for detailed usage information. 
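+
+The cursor rewind + local-filtering rule in the Correctness Rules above can be sketched as follows. This is an illustrative TypeScript sketch, not the actual implementation: `CursorTuple`, `rewoundUpdatedAfter`, and `isAfterCursor` are example names, and it assumes `updated_at` values are normalized ISO 8601 UTC strings so lexicographic comparison matches chronological order.
+
+```typescript
+interface CursorTuple {
+  updatedAt: string; // ISO 8601 UTC, as returned by GitLab
+  gitlabId: number;
+}
+
+// `updated_after` value sent to GitLab: the cursor timestamp rewound
+// by rewindSeconds (default 2) so items sharing the cursor's
+// timestamp are refetched rather than skipped.
+function rewoundUpdatedAfter(cursor: CursorTuple, rewindSeconds = 2): string {
+  return new Date(Date.parse(cursor.updatedAt) - rewindSeconds * 1000).toISOString();
+}
+
+// Local filter: keep an item iff (updated_at, gitlab_id) is strictly
+// greater than the cursor tuple; everything else was already processed.
+function isAfterCursor(item: CursorTuple, cursor: CursorTuple): boolean {
+  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
+  return item.gitlabId > cursor.gitlabId;
+}
+```
+
+Refetched items rejected by `isAfterCursor` cost a little bandwidth but guarantee no item in a dense timestamp bucket is lost across crash/resume.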
+
+### Setup & Diagnostics
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
+| `gi auth-test` | 0 | Verify GitLab authentication |
+| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
+| `gi doctor --json` | 0 | JSON output for scripting |
+| `gi version` | 0 | Show installed version |
+
+### Data Ingestion
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
+| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
+| `gi embed --all` | 3 | Generate embeddings for all documents |
+| `gi embed --retry-failed` | 3 | Retry failed embeddings |
+| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
+| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
+| `gi sync --force` | 5 | Override stale lock after operator review |
+| `gi sync --no-embed` | 5 | Sync without embedding (faster) |
+
+### Data Inspection
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
+| `gi list mrs [--limit=N]` | 2 | List merge requests |
+| `gi count issues` | 1 | Count issues |
+| `gi count mrs` | 2 | Count merge requests |
+| `gi count discussions --type=issue` | 1 | Count issue discussions |
+| `gi count discussions` | 2 | Count all discussions |
+| `gi count discussions --type=mr` | 2 | Count MR discussions |
+| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
+| `gi count notes` | 2 | Count all notes (excluding system) |
+| `gi show issue <iid>` | 1 | Show issue details |
+| `gi show mr <iid>` | 2 | Show MR details with discussions |
+| `gi stats` | 3 | Embedding coverage statistics |
+| `gi stats --json` | 3 | JSON stats for scripting |
+| `gi sync-status` | 1 | Show cursor positions and last sync |
+
+### Search
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi search "query"` | 4 | Hybrid semantic + lexical search | +| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) | +| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type | +| `gi search "query" --author=USERNAME` | 4 | Filter by author | +| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date | +| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) | +| `gi search "query" --project=PATH` | 4 | Filter by project | +| `gi search "query" --path=FILE` | 4 | Filter by file path | +| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) | +| `gi search "query" --json` | 4 | JSON output for scripting | +| `gi search "query" --explain` | 4 | Show ranking breakdown | + +### Database Management + +| Command | CP | Description | +|---------|-----|-------------| +| `gi backup` | 0 | Create timestamped database backup | +| `gi reset --confirm` | 0 | Delete database and reset cursors | + +--- + +## Error Handling + +Common errors and their resolutions: + +### Configuration Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `Config file not found` | No gi.config.json | Run `gi init` to create configuration | +| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually | +| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config | + +### Authentication Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` | +| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope | +| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope | + +### GitLab API Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `Project not found: group/project` | Invalid project path | Verify 
path matches GitLab URL (case-sensitive) | +| `429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry | +| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity | + +### Data Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `No documents indexed` | Sync not run | Run `gi sync` first | +| `No results found` | Query too specific | Try broader search terms | +| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale | + +### Embedding Errors + +| Error | Cause | Resolution | +|-------|-------|------------| +| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` | +| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` | +| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` | + +### Operational Behavior + +| Scenario | Behavior | +|----------|----------| +| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. | +| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. | +| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. | +| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. 
| + +--- + +## Database Management + +### Database Location + +The SQLite database is stored at an XDG-compliant location: + +``` +~/.local/share/gi/data.db +``` + +This can be overridden in `gi.config.json`: + +```json +{ + "storage": { + "dbPath": "/custom/path/to/data.db" + } +} +``` + +### Backup + +Create a timestamped backup of the database: + +```bash +gi backup +# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db +``` + +Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode). + +### Reset + +To completely reset the database and all sync cursors: + +```bash +gi reset --confirm +``` + +This deletes: +- The database file +- All sync cursors +- All embeddings + +You'll need to run `gi sync` again to repopulate. + +### Schema Migrations + +Database schema is version-tracked and migrations auto-apply on startup: + +1. On first run, schema is created at latest version +2. On subsequent runs, pending migrations are applied automatically +3. Migration version is stored in `schema_version` table +4. Migrations are idempotent and reversible where possible + +**Manual migration check:** +```bash +gi doctor --json | jq '.checks.database' +# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 } +``` + --- ## Future Work (Post-MVP) @@ -989,7 +1506,7 @@ CREATE TABLE note_positions ( new_line INTEGER, position_type TEXT -- 'text' | 'image' | etc. 
); -CREATE INDEX idx_note_positions_new_path ON note_positions(new_path); +CREATE INDEX idx_note_positions_new_path ON note_positions(position_new_path); ``` --- @@ -1013,9 +1530,11 @@ Each checkpoint includes: | Embedding model quality | Start with nomic-embed-text; architecture allows model swap | | SQLite scale limits | Monitor performance; Postgres migration path documented | | Stale data | Incremental sync with change detection | -| Mid-sync failures | Cursor-based resumption, sync_runs audit trail | +| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery | +| Missed updates | Rolling backfill window (14 days), tuple cursor semantics | | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | -| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | +| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery | +| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs | **SQLite Performance Defaults (MVP):** - Enable `PRAGMA journal_mode=WAL;` on every connection @@ -1030,20 +1549,24 @@ Each checkpoint includes: | Table | Checkpoint | Purpose | |-------|------------|---------| | projects | 0 | Configured GitLab projects | -| sync_runs | 0 | Audit trail of sync operations | +| sync_runs | 0 | Audit trail of sync operations (with heartbeat) | +| app_locks | 0 | Crash-safe single-flight lock | | sync_cursors | 0 | Resumable sync state per primary resource | -| raw_payloads | 0 | Decoupled raw JSON storage | -| issues | 1 | Normalized issues | +| raw_payloads | 0 | Decoupled raw JSON storage (with project_id) | +| schema_version | 0 | Database migration version tracking | +| issues | 1 | Normalized issues (unique by project+iid) | | labels | 1 | Label definitions (unique by project + name) | | issue_labels | 1 | Issue-label junction | -| merge_requests | 2 | Normalized MRs | -| 
discussions | 2 | Discussion threads (the semantic unit for conversations) | -| notes | 2 | Individual comments within discussions | +| merge_requests | 2 | Normalized MRs (unique by project+iid) | +| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) | +| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) | | mr_labels | 2 | MR-label junction | -| documents | 3 | Unified searchable documents (issues, MRs, discussions) | +| documents | 3 | Unified searchable documents with truncation metadata | | document_labels | 3 | Document-label junction for fast filtering | +| document_paths | 3 | Fast path filtering for documents (DiffNote file paths) | +| dirty_sources | 3 | Queue for incremental document regeneration | | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | -| embedding_metadata | 3 | Embedding provenance + change detection | +| embedding_metadata | 3 | Embedding provenance + error tracking | | documents_fts | 4 | Full-text search index (fts5 with porter stemmer) | | mr_files | 6 | MR file changes (deferred to File History feature) | @@ -1053,19 +1576,26 @@ Each checkpoint includes: | Question | Decision | Rationale | |----------|----------|-----------| -| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread | -| System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value | -| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls | -| Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | -| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | -| Sync method | **Polling only for MVP** | Webhooks add complexity; 
polling every 10min is sufficient | -| Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates | +| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability | +| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise | +| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature | +| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature | +| Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering | +| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings | +| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 min is sufficient | +| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates | +| Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates | | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | -| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | -| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 | -| Embedding batching | **32 documents per batch** | Balance between throughput and memory | -| FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text | -| Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching | +| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping | +| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context | +| Embedding batching | **32 docs/batch, 4 
concurrent workers** | Balance throughput, memory, and error isolation | +| FTS5 tokenizer | **porter unicode61** | Stemming improves recall | +| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching | +| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | +| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable | +| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX | +| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly | +| Empty state UX | **Actionable messages** | Guide user to next step | ---
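The "RRF over weighted sums" decision in the summary above can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is a sketch, not the implementation: `rrfFuse` is an illustrative name, and `k = 60` is a common default from the RRF literature rather than a value fixed by this spec.

```typescript
// Minimal Reciprocal Rank Fusion (RRF) sketch.
// Each ranked list contributes 1 / (k + rank) per document; a document
// appearing in both the vector and FTS lists accumulates both terms,
// so agreement between retrievers pushes it toward the top.
function rrfFuse(
  vectorRanked: number[], // document ids, best match first
  ftsRanked: number[],    // document ids, best match first
  k = 60,
): Array<{ documentId: number; score: number }> {
  const scores = new Map<number, number>();
  for (const list of [vectorRanked, ftsRanked]) {
    list.forEach((id, index) => {
      // rank is 1-based: first item contributes 1 / (k + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .map(([documentId, score]) => ({ documentId, score }))
    .sort((a, b) => b.score - a.score);
}
```

Because each retriever contributes only rank positions, no cross-retriever score normalization is needed, which is the main reason the spec prefers RRF over weighted score sums.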