Spec iterations

2026-01-20 16:26:27 -05:00
parent 7702d2a493
commit 97a303eca9
1 changed files with 541 additions and 104 deletions
--- a/SPEC.md
+++ b/SPEC.md
@@ -2,7 +2,7 @@

 ## Executive Summary

-A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP.
+A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.

 ---

@@ -89,6 +89,56 @@ A self-hosted tool to extract, index, and semantically search 2+ years of GitLab

 ---

+## GitLab API Strategy
+
+### Primary Resources (Bulk Fetch)
+
+Issues and MRs support efficient bulk fetching with incremental sync:
+
+```
+GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
+GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100
+```
+
+### Dependent Resources (Per-Parent Fetch)
+
+There is no bulk endpoint for discussions, so they must be fetched per issue and per MR:
+
+```
+GET /projects/:id/issues/:iid/discussions
+GET /projects/:id/merge_requests/:iid/discussions
+```
+
+### Sync Pattern
+
+**Initial sync:**
+1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
+2. For EACH issue → fetch all discussions (~3K calls)
+3. Fetch all MRs (paginated, ~60 calls)
+4. For EACH MR → fetch all discussions (~3K calls)
+5. Total: ~6,100+ API calls for initial sync
+
+**Incremental sync:**
+1. Fetch issues where `updated_after=cursor` (bulk)
+2. For EACH updated issue → refetch ALL its discussions
+3. Fetch MRs where `updated_after=cursor` (bulk)
+4. For EACH updated MR → refetch ALL its discussions
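
The sync pattern above can be sketched as a pair of path builders: one for the bulk list call and one for the per-parent discussion calls. This is a minimal sketch, not the project's client: the function names are hypothetical, and the `page` parameter and the URL-encoded project path used for `:id` are assumptions about how the endpoints would be called.

```typescript
type ParentKind = "issues" | "merge_requests";

// Hypothetical helper: path for one page of issues or MRs updated after the cursor.
function bulkListPath(
  projectPath: string,
  kind: ParentKind,
  updatedAfterIso: string,
  page: number,
): string {
  // Assumption: the project is addressed by its URL-encoded path.
  const id = encodeURIComponent(projectPath);
  const query = [
    `updated_after=${encodeURIComponent(updatedAfterIso)}`,
    "order_by=updated_at",
    "sort=asc",
    "per_page=100",
    `page=${page}`,
  ].join("&");
  return `/projects/${id}/${kind}?${query}`;
}

// Hypothetical helper: one discussions request per updated parent (no bulk endpoint).
function discussionPaths(
  projectPath: string,
  kind: ParentKind,
  updatedIids: number[],
): string[] {
  const id = encodeURIComponent(projectPath);
  return updatedIids.map((iid) => `/projects/${id}/${kind}/${iid}/discussions`);
}
```

A sync run would request `bulkListPath(...)` page by page, collect the `iid` of every returned item, and then request each entry of `discussionPaths(...)`.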
+
+### Critical Assumption
+
+**Adding a comment/discussion updates the parent's `updated_at` timestamp.** Incremental sync relies on this assumption to detect new discussions. If it does not hold, new comments on otherwise unchanged items would be missed until the next full re-sync.
+
+Mitigation: Periodic full re-sync (weekly) as a safety net.
+
+### Rate Limiting
+
+- Default: 10 requests/second with exponential backoff
+- Respect `Retry-After` headers on 429 responses
+- Add jitter to avoid thundering herd on retry
+- Initial sync estimate: 10-20 minutes depending on rate limits (~6,100 calls at 10 requests/second is roughly 10 minutes before any backoff)
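
The retry rules above could be combined into a single delay calculation. The following is a sketch under stated assumptions: the function name, the 500 ms base, the 60 s cap, and the "full jitter" strategy are illustrative choices, not values from this spec. It also assumes `Retry-After` arrives as a number of seconds, not an HTTP date.

```typescript
// Hypothetical helper: how long to wait before retrying a failed request.
function retryDelayMs(
  attempt: number,                  // 0 for the first retry, 1 for the second, ...
  retryAfterHeader: string | null,  // raw Retry-After value from a 429 response, if any
  random: () => number = Math.random,
  baseMs: number = 500,             // assumed base delay
  capMs: number = 60000,            // assumed upper bound
): number {
  // Prefer the server's hint when it parses as a non-negative number of seconds.
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    const seconds = Number(retryAfterHeader);
    if (isFinite(seconds) && seconds >= 0) {
      // A small jitter keeps parallel workers from retrying at the same instant.
      return seconds * 1000 + Math.floor(random() * 250);
    }
  }
  // Otherwise use exponential backoff with full jitter:
  // a uniform delay in [0, min(cap, base * 2^attempt)).
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Passing `random` as a parameter keeps the function deterministic in unit tests, which fits the `handles rate limiting (429) with retry` test listed for the GitLab client.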
+
+---
+
 ## Checkpoint Structure

 Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
@@ -96,13 +146,39 @@ Each checkpoint is a **testable milestone** where a human can validate the syste
 ### Checkpoint 0: Project Setup
 **Deliverable:** Scaffolded project with GitLab API connection verified

-**Tests:**
-1. Run `gitlab-engine auth-test` → returns authenticated user info
-2. Run `gitlab-engine doctor` → verifies:
-   - Can reach GitLab baseUrl
-   - PAT is present and can read configured projects
-   - SQLite opens DB and migrations apply
-   - Ollama reachable OR embedding disabled with clear warning
+**Automated Tests (Vitest):**
+```
+tests/unit/config.test.ts
+  ✓ loads config from gi.config.json
+  ✓ throws if config file missing
+  ✓ throws if required fields missing (baseUrl, projects)
+  ✓ validates project paths are non-empty strings
+
+tests/unit/db.test.ts
+  ✓ creates database file if not exists
+  ✓ applies migrations in order
+  ✓ sets WAL journal mode
+  ✓ enables foreign keys
+
+tests/integration/gitlab-client.test.ts
+  ✓ authenticates with valid PAT
+  ✓ returns 401 for invalid PAT
+  ✓ fetches project by path
+  ✓ handles rate limiting (429) with retry
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
+| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
+| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
+| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
+
+**Data Integrity Checks:**
+- [ ] `projects` table contains rows for each configured project path
+- [ ] `gitlab_project_id` matches actual GitLab project IDs
+- [ ] `raw_payloads` contains project JSON for each synced project

 **Scope:**
 - Project structure (TypeScript, ESLint, Vitest)
@@ -114,7 +190,7 @@ Each checkpoint is a **testable milestone** where a human can validate the syste

 **Configuration (MVP):**
 ```json
-// gitlab-engine.config.json
+// gi.config.json
 {
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
@@ -189,7 +265,50 @@ CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
 ### Checkpoint 1: Issue Ingestion
 **Deliverable:** All issues from target repos stored locally

-**Test:** Run `gitlab-engine ingest --type=issues` → count matches GitLab; run `gitlab-engine list issues --limit=10` → displays issues correctly
+**Automated Tests (Vitest):**
+```
+tests/unit/issue-transformer.test.ts
+  ✓ transforms GitLab issue payload to normalized schema
+  ✓ extracts labels from issue payload
+  ✓ handles missing optional fields gracefully
+
+tests/unit/pagination.test.ts
+  ✓ fetches all pages when multiple exist
+  ✓ respects per_page parameter
+  ✓ stops when empty page returned
+
+tests/integration/issue-ingestion.test.ts
+  ✓ inserts issues into database
+  ✓ creates labels from issue payloads
+  ✓ links issues to labels via junction table
+  ✓ stores raw payload for each issue
+  ✓ updates cursor after successful page commit
+  ✓ resumes from cursor on subsequent runs
+
+tests/integration/sync-runs.test.ts
+  ✓ creates sync_run record on start
+  ✓ marks run as succeeded on completion
+  ✓ marks run as failed with error message on failure
+  ✓ refuses concurrent run (single-flight)
+  ✓ allows --force to override stale running status
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
+| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
+| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
+| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
+| `gi show issue 123` | Issue detail view | Shows title, description, labels, URL |
+| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
+
+**Data Integrity Checks:**
+- [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects
+- [ ] Every issue has a corresponding `raw_payloads` row
+- [ ] Labels in `issue_labels` junction all exist in `labels` table
+- [ ] `sync_cursors` has entry for each (project_id, 'issues') pair
+- [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)

 **Scope:**
 - Issue fetcher with pagination handling
@@ -250,21 +369,70 @@ CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);

 ---

-### Checkpoint 2: MR + Comments + File Links Ingestion
-**Deliverable:** All MRs, discussion threads, and file-change links stored locally
+### Checkpoint 2: MR + Discussions Ingestion
+**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context

-**Test:** Run `gitlab-engine ingest --type=merge_requests` → count matches; run `gitlab-engine show mr 1234` → displays MR with comments and files changed
+**Automated Tests (Vitest):**
+```
+tests/unit/mr-transformer.test.ts
+  ✓ transforms GitLab MR payload to normalized schema
+  ✓ extracts labels from MR payload
+  ✓ handles missing optional fields gracefully
+
+tests/unit/discussion-transformer.test.ts
+  ✓ transforms discussion payload to normalized schema
+  ✓ extracts notes array from discussion
+  ✓ sets individual_note flag correctly
+  ✓ filters out system notes (system: true)
+  ✓ preserves note order via position field
+
+tests/integration/mr-ingestion.test.ts
+  ✓ inserts MRs into database
+  ✓ creates labels from MR payloads
+  ✓ links MRs to labels via junction table
+  ✓ stores raw payload for each MR
+
+tests/integration/discussion-ingestion.test.ts
+  ✓ fetches discussions for each issue
+  ✓ fetches discussions for each MR
+  ✓ creates discussion rows with correct parent FK
+  ✓ creates note rows linked to discussions
+  ✓ excludes system notes from storage
+  ✓ captures note-level resolution status
+  ✓ captures note type (DiscussionNote, DiffNote)
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
+| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
+| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
+| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
+| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads |
+| `gi count discussions` | `Discussions: 12,345` | Non-zero count |
+| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes |
+
+**Data Integrity Checks:**
+- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
+- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
+- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
+- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
+- [ ] Every discussion has at least one note
+- [ ] `individual_note = true` discussions have exactly one note
+- [ ] Discussion `first_note_at` <= `last_note_at` for all rows

 **Scope:**
 - MR fetcher with pagination
- Notes fetcher (issue notes + MR notes) as a dependent resource:
-  - During initial ingest: fetch notes for every issue/MR
-  - During sync: refetch notes only for issues/MRs updated since cursor
- MR changes/diffs fetcher as a dependent resource:
-  - During initial ingest: fetch changes for every MR
-  - During sync: refetch changes only for MRs updated since cursor
- Relationship linking (note → parent issue/MR via foreign keys, MR → files)
- Extended CLI commands for MR display
+- Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
+  - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
+  - During initial ingest: fetch discussions for every issue/MR
+  - During sync: refetch discussions only for issues/MRs updated since cursor
+  - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
+- Relationship linking (discussion → parent issue/MR, notes → discussion)
+- Extended CLI commands for MR/issue display with threads
+
+**Note:** MR file changes (`mr_files`) are deferred to the post-MVP File History feature, since they are only needed for "what MRs touched this file?" queries.

 **Schema Additions:**
 ```sql
@@ -288,44 +456,49 @@ CREATE TABLE merge_requests (
 CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
 CREATE INDEX idx_mrs_author ON merge_requests(author_username);

-- Notes with explicit parent foreign keys for referential integrity
-CREATE TABLE notes (
+-- Discussion threads (the semantic unit for conversations)
+CREATE TABLE discussions (
  id INTEGER PRIMARY KEY,
-  gitlab_id INTEGER UNIQUE NOT NULL,
+  gitlab_discussion_id TEXT UNIQUE NOT NULL,  -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER REFERENCES merge_requests(id),
  noteable_type TEXT NOT NULL,                -- 'Issue' | 'MergeRequest'
-  noteable_iid INTEGER NOT NULL,    -- parent IID (from API path)
-  author_username TEXT,
-  body TEXT,
-  created_at INTEGER,
-  updated_at INTEGER,
-  system BOOLEAN,
-  raw_payload_id INTEGER REFERENCES raw_payloads(id),
-  -- Exactly one parent FK must be set
+  individual_note BOOLEAN NOT NULL,           -- standalone comment vs threaded discussion
+  first_note_at INTEGER,                      -- for ordering discussions
+  last_note_at INTEGER,                       -- for "recently active" queries
+  resolvable BOOLEAN,                         -- MR discussions can be resolved
+  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
 );
-CREATE INDEX idx_notes_issue ON notes(issue_id);
-CREATE INDEX idx_notes_mr ON notes(merge_request_id);
-CREATE INDEX idx_notes_author ON notes(author_username);
+CREATE INDEX idx_discussions_issue ON discussions(issue_id);
+CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
+CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- File linkage for "what MRs touched this file?" queries (with rename support)
-CREATE TABLE mr_files (
+-- Notes belong to discussions (preserving thread context)
+CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
-  merge_request_id INTEGER REFERENCES merge_requests(id),
-  old_path TEXT,
-  new_path TEXT,
-  new_file BOOLEAN,
-  deleted_file BOOLEAN,
-  renamed_file BOOLEAN,
-  UNIQUE(merge_request_id, old_path, new_path)
+  gitlab_id INTEGER UNIQUE NOT NULL,
+  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
+  project_id INTEGER NOT NULL REFERENCES projects(id),
+  type TEXT,                                  -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
+  author_username TEXT,
+  body TEXT,
+  created_at INTEGER,
+  updated_at INTEGER,
+  position INTEGER,                           -- derived from array order in API response (0-indexed)
+  resolvable BOOLEAN,                         -- note-level resolvability (MR code comments)
+  resolved BOOLEAN,                           -- note-level resolution status
+  resolved_by TEXT,                           -- username who resolved
+  resolved_at INTEGER,                        -- when resolved
+  raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
-CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
-CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);
+CREATE INDEX idx_notes_discussion ON notes(discussion_id);
+CREATE INDEX idx_notes_author ON notes(author_username);
+CREATE INDEX idx_notes_type ON notes(type);

 -- MR labels (reuse same labels table)
 CREATE TABLE mr_labels (
@@ -336,39 +509,96 @@ CREATE TABLE mr_labels (
 CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
 ```

+**Discussion Processing Rules:**
+- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
+- Each discussion from the API becomes one row in `discussions` table
+- All notes within a discussion are stored with their `discussion_id` foreign key
+- `individual_note: true` discussions have exactly one note (standalone comment)
+- `individual_note: false` discussions are threaded conversations with one or more notes (a thread may not have replies yet)
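
The processing rules above could be applied by a small transformer like the one sketched here. The type and function names are hypothetical, and the input shape covers only the fields the sketch needs. Two assumptions are made that the spec does not state: timestamps are stored as epoch milliseconds, and a discussion whose notes are all system notes is skipped entirely so that every stored discussion keeps at least one note.

```typescript
// Subset of the API payload used by this sketch.
interface ApiNote {
  id: number;
  system: boolean;
  body: string;
  created_at: string; // ISO 8601 timestamp
  author: { username: string };
}
interface ApiDiscussion {
  id: string;
  individual_note: boolean;
  notes: ApiNote[];
}

interface NoteRow {
  gitlab_id: number;
  author_username: string;
  body: string;
  created_at: number; // assumption: epoch milliseconds
  position: number;   // index in the API response array (0-indexed)
}
interface DiscussionRow {
  gitlab_discussion_id: string;
  individual_note: boolean;
  first_note_at: number;
  last_note_at: number;
  notes: NoteRow[];
}

// Hypothetical transformer: drops system notes and derives the thread timestamps.
function toDiscussionRow(discussion: ApiDiscussion): DiscussionRow | null {
  const notes: NoteRow[] = [];
  discussion.notes.forEach((note, index) => {
    if (note.system) return; // system notes are excluded during ingestion
    notes.push({
      gitlab_id: note.id,
      author_username: note.author.username,
      body: note.body,
      created_at: Date.parse(note.created_at),
      position: index,
    });
  });
  if (notes.length === 0) return null; // assumption: nothing left to store
  const times = notes.map((n) => n.created_at);
  return {
    gitlab_discussion_id: discussion.id,
    individual_note: discussion.individual_note,
    first_note_at: Math.min.apply(null, times),
    last_note_at: Math.max.apply(null, times),
    notes,
  };
}
```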
+
 ---

 ### Checkpoint 3: Embedding Generation
 **Deliverable:** Vector embeddings generated for all text content

-**Test:** Run `gitlab-engine embed --all` → progress indicator; run `gitlab-engine stats` → shows embedding coverage percentage
+**Automated Tests (Vitest):**
+```
+tests/unit/document-extractor.test.ts
+  ✓ extracts issue document (title + description)
+  ✓ extracts MR document (title + description)
+  ✓ extracts discussion document with full thread context
+  ✓ includes parent issue/MR title in discussion header
+  ✓ formats notes with author and timestamp
+  ✓ truncates content exceeding 8000 tokens
+  ✓ preserves first and last notes when truncating middle
+  ✓ computes SHA-256 content hash consistently
+
+tests/unit/embedding-client.test.ts
+  ✓ connects to Ollama API
+  ✓ generates embedding for text input
+  ✓ returns 768-dimension vector
+  ✓ handles Ollama connection failure gracefully
+  ✓ batches requests (32 documents per batch)
+
+tests/integration/document-creation.test.ts
+  ✓ creates document for each issue
+  ✓ creates document for each MR
+  ✓ creates document for each discussion
+  ✓ populates document_labels junction table
+  ✓ computes content_hash for each document
+
+tests/integration/embedding-storage.test.ts
+  ✓ stores embedding in sqlite-vss
+  ✓ embedding rowid matches document id
+  ✓ creates embedding_metadata record
+  ✓ skips re-embedding when content_hash unchanged
+  ✓ re-embeds when content_hash changes
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi embed --all` | Progress bar with ETA | Completes without error |
+| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
+| `gi stats` | Embedding coverage stats | Shows 100% coverage |
+| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
+| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
+
+**Data Integrity Checks:**
+- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
+- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
+- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
+- [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash`
+- [ ] Every document matched by `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` has a corresponding truncation warning in the logs
+- [ ] Discussion documents include parent title in content_text

 **Scope:**
 - Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing)
+- Embedding generation pipeline (batch processing, 32 documents per batch)
 - Vector storage in SQLite (sqlite-vss extension)
 - Progress tracking and resumability
 - Document extraction layer:
-  - Canonical "search documents" derived from issues/MRs/notes
+  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Single embedding per document (chunking deferred to post-MVP)
+  - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
 - Denormalized metadata for fast filtering (author, labels, dates)
 - Fast label filtering via `document_labels` join table

 **Schema Additions:**
 ```sql
-- Unified searchable documents (derived from issues/MRs/notes)
+-- Unified searchable documents (derived from issues/MRs/discussions)
 CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
-  source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'note'
+  source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL,    -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
-  author_username TEXT,
+  author_username TEXT,          -- for discussions: first note author
  label_names TEXT,              -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
-  title TEXT,                    -- null for notes
+  title TEXT,                    -- null for discussions
  content_text TEXT NOT NULL,    -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL,    -- SHA-256 for change detection
  UNIQUE(source_type, source_id)
@@ -408,38 +638,131 @@ CREATE TABLE embedding_metadata (
 - This alignment simplifies joins and eliminates rowid mapping fragility

 **Document Extraction Rules:**
- Issue → title + "\n\n" + description
- MR → title + "\n\n" + description
- Note → body (skip system notes unless they contain meaningful content)
+
+| Source | content_text Construction |
+|--------|--------------------------|
+| Issue | `title + "\n\n" + description` |
+| MR | `title + "\n\n" + description` |
+| Discussion | Full thread with context (see below) |
+
+**Discussion Document Format:**
+```
+[Issue #234: Authentication redesign] Discussion
+
+@johndoe (2024-03-15):
+I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
+
+@janedoe (2024-03-15):
+Agreed. What about refresh token strategy?
+
+@johndoe (2024-03-16):
+Short-lived access tokens (15min), longer refresh (7 days). Here's why...
+```
+
+This format preserves:
+- Parent context (issue/MR title and number)
+- Author attribution for each note
+- Temporal ordering of the conversation
+- Full thread semantics for decision traceability
+
+**Truncation:** If the concatenated discussion text exceeds 8000 tokens, truncate from the middle (preserving the first and last notes for context) and log a warning.
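
The discussion format and the middle-out truncation could be implemented along these lines. This is a sketch only: the names are hypothetical, and the token count is approximated as four characters per token, which is an assumption and not the tokenizer used by nomic-embed-text.

```typescript
interface ThreadNote {
  author: string; // username without the leading "@"
  date: string;   // already formatted as YYYY-MM-DD
  body: string;
}

// Assumption: a rough estimate of ~4 characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical builder for a discussion document's content_text.
function buildDiscussionText(
  parentLabel: string, // e.g. "Issue #234: Authentication redesign"
  notes: ThreadNote[],
  maxTokens: number = 8000,
): { text: string; truncated: boolean } {
  const header = `[${parentLabel}] Discussion`;
  const render = (blocks: string[]): string => [header].concat(blocks).join("\n\n");
  const kept = notes.map((n) => `@${n.author} (${n.date}):\n${n.body}`);
  let truncated = false;
  // Drop notes from the middle until the text fits; the first and last notes are never removed.
  while (kept.length > 2 && estimateTokens(render(kept)) > maxTokens) {
    kept.splice(Math.floor(kept.length / 2), 1);
    truncated = true;
  }
  // Not handled here: a thread whose first and last notes alone exceed the limit.
  return { text: render(kept), truncated };
}
```

The caller would log a warning whenever `truncated` is `true`, then hash and embed `text`.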

 ---

 ### Checkpoint 4: Semantic Search
 **Deliverable:** Working semantic search across all indexed content

-**Tests:**
-1. Run `gitlab-engine search "authentication redesign"` → returns ranked results with snippets
-2. Golden queries: curated list of 10 queries with expected result *containment* (e.g., "at least one of these 3 known URLs appears in top 10")
-3. `gitlab-engine search "..." --json` validates against JSON schema (stable fields present)
+**Automated Tests (Vitest):**
+```
+tests/unit/search-query.test.ts
+  ✓ parses filter flags (--type, --author, --after, --label)
+  ✓ validates date format for --after
+  ✓ handles multiple --label flags
+
+tests/unit/rrf-ranking.test.ts
+  ✓ computes RRF score correctly
+  ✓ merges results from vector and FTS retrievers
+  ✓ handles documents appearing in only one retriever
+  ✓ respects k=60 parameter
+
+tests/integration/vector-search.test.ts
+  ✓ returns results for semantic query
+  ✓ ranks similar content higher
+  ✓ returns empty for nonsense query
+
+tests/integration/fts-search.test.ts
+  ✓ returns exact keyword matches
+  ✓ handles porter stemming (search/searching)
+  ✓ returns empty for non-matching query
+
+tests/integration/hybrid-search.test.ts
+  ✓ combines vector and FTS results
+  ✓ applies type filter correctly
+  ✓ applies author filter correctly
+  ✓ applies date filter correctly
+  ✓ applies label filter correctly
+  ✓ falls back to FTS when Ollama unavailable
+
+tests/e2e/golden-queries.test.ts
+  ✓ "authentication redesign" returns known auth-related items
+  ✓ "database migration" returns known migration items
+  ✓ [8 more domain-specific golden queries]
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
+| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
+| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
+| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
+| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
+| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
+| `gi search "authentication" --json` | JSON output | Valid JSON array with schema |
+| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
+| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
+
+**Golden Query Test Suite:**
+Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:
+```json
+[
+  {
+    "query": "authentication redesign",
+    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
+    "minResults": 1,
+    "maxRank": 10
+  }
+]
+```
+Each query must have at least one expected URL appear in top 10 results.
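
A containment check over this fixture might look like the sketch below. The function name is hypothetical. It assumes that `minResults` is the number of expected URLs that must appear within the top `maxRank` results, and that a leading `...` in an expected URL means "match by suffix"; neither interpretation is confirmed by the fixture itself.

```typescript
interface GoldenQuery {
  query: string;
  expectedUrls: string[];
  minResults: number;
  maxRank: number;
}

// Hypothetical checker: resultUrls are the search result URLs in rank order.
function passesGoldenQuery(golden: GoldenQuery, resultUrls: string[]): boolean {
  const top = resultUrls.slice(0, golden.maxRank);
  let hits = 0;
  golden.expectedUrls.forEach((expected) => {
    // Assumption: "..." stands for the host and project prefix, so compare suffixes.
    const suffix = expected.replace(/^\.\.\./, "");
    if (suffix.length === 0) return;
    const found = top.some((url) => url.slice(-suffix.length) === suffix);
    if (found) hits += 1;
  });
  return hits >= golden.minResults;
}
```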
+
+**Data Integrity Checks:**
+- [ ] `documents_fts` row count matches `documents` row count
+- [ ] Search returns results for known content (not empty)
+- [ ] JSON output validates against defined schema
+- [ ] All result URLs are valid GitLab URLs

 **Scope:**
 - Hybrid retrieval:
  - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
 - Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|note`, `--author=username`, `--after=date`, `--label=name`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
  - Label filtering operates on `document_labels` (indexed, exact-match)
 - Output formatting: ranked list with title, snippet, score, URL
 - JSON output mode for AI agent consumption
+- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning

 **Schema Additions:**
 ```sql
 -- Full-text search for hybrid retrieval
+-- Using porter stemmer for better matching of word variants
 CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
-  content_rowid='id'
+  content_rowid='id',
+  tokenize='porter unicode61'
 );

 -- Triggers to keep FTS in sync
@@ -461,6 +784,11 @@ CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
 END;
 ```

+**FTS5 Tokenizer Notes:**
+- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
+- `unicode61` tokenizes Unicode text, folding case and (by default) removing diacritics from Latin-script characters
+- Code identifiers (snake_case, camelCase, file paths) may not tokenize well; a custom tokenizer is a post-MVP consideration
+
 **Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**
 1. Query both vector index (top 50) and FTS5 (top 50)
 2. Merge results by document_id
@@ -477,22 +805,49 @@ END;
 - RRF operates on ranks, not scores, making it robust to scale differences
 - Well-established in information retrieval literature
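
As an illustration of the fusion step, here is a minimal RRF sketch. It uses the standard formulation, where each retriever contributes `1 / (k + rank)` with 1-indexed ranks and `k = 60`. The function name and the tie-break by document id are assumptions made for the example.

```typescript
interface FusedResult {
  id: number;
  score: number;
}

// Each inner array holds document ids in rank order from one retriever
// (for example, vector search first, FTS5 second).
function rrfFuse(rankedLists: number[][], k: number = 60): FusedResult[] {
  const scores: Record<number, number> = {};
  rankedLists.forEach((list) => {
    list.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-indexed
      scores[docId] = (scores[docId] || 0) + 1 / (k + rank);
    });
  });
  return Object.keys(scores)
    .map((key) => ({ id: Number(key), score: scores[Number(key)] || 0 }))
    // Highest fused score first; ties broken by id to keep the output deterministic.
    .sort((a, b) => b.score - a.score || a.id - b.id);
}
```

A document that appears in only one list still receives a score from that list, so it is kept but ranks below documents that both retrievers agree on at similar positions.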

+**Graceful Degradation:**
+- If Ollama is unreachable during search, automatically fall back to FTS5-only
+- Display warning: "Embedding service unavailable, using lexical search only"
+- `embed` command fails with actionable error if Ollama is down
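
The fallback decision can be kept separate from the search itself. In the sketch below, the function and type names are hypothetical, and `"hybrid"` is assumed to be the name of the default mode.

```typescript
type SearchMode = "hybrid" | "lexical";

interface SearchPlan {
  mode: SearchMode;
  warning: string | null;
}

// Hypothetical helper: decides which retrievers to use for one search invocation.
function planSearch(requested: SearchMode, embeddingServiceReachable: boolean): SearchPlan {
  if (requested === "lexical") {
    // An explicit --mode=lexical never needs Ollama, so there is nothing to warn about.
    return { mode: "lexical", warning: null };
  }
  if (!embeddingServiceReachable) {
    return {
      mode: "lexical",
      warning: "Embedding service unavailable, using lexical search only",
    };
  }
  return { mode: "hybrid", warning: null };
}
```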
+
 **CLI Interface:**
 ```bash
 # Basic semantic search
-gitlab-engine search "why did we choose Redis"
+gi search "why did we choose Redis"

 # Pure FTS search (fallback if embeddings unavailable)
-gitlab-engine search "redis" --mode=lexical
+gi search "redis" --mode=lexical

 # Filtered search
-gitlab-engine search "authentication" --type=mr --after=2024-01-01
+gi search "authentication" --type=mr --after=2024-01-01

 # Filter by label
-gitlab-engine search "performance" --label=bug --label=critical
+gi search "performance" --label=bug --label=critical

 # JSON output for programmatic use
-gitlab-engine search "payment processing" --json
+gi search "payment processing" --json
+```
+
+**CLI Output Example:**
+```
+$ gi search "authentication redesign"
+
+Found 23 results (hybrid search, 0.34s)
+
+[1] MR !847 - Refactor auth to use JWT tokens (0.82)
+    @johndoe · 2024-03-15 · group/project-one
+    "...moving away from session cookies to JWT for authentication..."
+    https://gitlab.example.com/group/project-one/-/merge_requests/847
+
+[2] Issue #234 - Authentication redesign discussion (0.79)
+    @janedoe · 2024-02-28 · group/project-one
+    "...we need to redesign the authentication flow because..."
+    https://gitlab.example.com/group/project-one/-/issues/234
+
+[3] Discussion on Issue #234 (0.76)
+    @johndoe · 2024-03-01 · group/project-one
+    "I think we should move to JWT-based auth because the session..."
+    https://gitlab.example.com/group/project-one/-/issues/234#note_12345
 ```

 ---
@@ -500,66 +855,142 @@ gitlab-engine search "payment processing" --json
 ### Checkpoint 5: Incremental Sync
 **Deliverable:** Efficient ongoing synchronization with GitLab

-**Test:** Make a change in GitLab; run `gitlab-engine sync` → only fetches changed items; verify change appears in search
+**Automated Tests (Vitest):**
+```
+tests/unit/cursor-management.test.ts
+  ✓ advances cursor after successful page commit
+  ✓ uses tie-breaker id for identical timestamps
+  ✓ does not advance cursor on failure
+  ✓ resets cursor on --full flag
+
+tests/unit/change-detection.test.ts
+  ✓ detects content_hash mismatch
+  ✓ queues document for re-embedding on change
+  ✓ skips re-embedding when hash unchanged
+
+tests/integration/incremental-sync.test.ts
+  ✓ fetches only items updated after cursor
+  ✓ refetches discussions for updated issues
+  ✓ refetches discussions for updated MRs
+  ✓ updates existing records (not duplicates)
+  ✓ creates new records for new items
+  ✓ re-embeds documents with changed content
+
+tests/integration/sync-recovery.test.ts
+  ✓ resumes from cursor after interrupted sync
+  ✓ marks failed run with error message
+  ✓ handles rate limiting (429) with backoff
+  ✓ respects Retry-After header
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
+| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
+| `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything |
+| `gi sync-status` | Cursor positions, last sync time | Shows current state |
+| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
+| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
+
+**End-to-End Sync Verification:**
+1. Note the current `sync_cursors` values
+2. Create a new comment on an issue in GitLab
+3. Run `gi sync`
+4. Verify:
+   - [ ] Issue's `updated_at` in DB matches GitLab
+   - [ ] New discussion row exists
+   - [ ] New note row exists
+   - [ ] New document row exists for discussion
+   - [ ] New embedding exists for document
+   - [ ] `gi search "new comment text"` returns the new discussion
+   - [ ] Cursor advanced past the updated issue
+
+**Data Integrity Checks:**
+- [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table
+- [ ] No orphaned documents (all have valid source_id)
+- [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows
+- [ ] `sync_runs` has complete audit trail

 **Scope:**
 - Delta sync based on stable cursor (updated_at + tie-breaker id)
- Dependent resources sync strategy (notes, MR changes)
- Webhook handler (optional, if webhook access granted)
+- Dependent resources sync strategy (discussions refetched when parent updates)
 - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
 - Sync status reporting
+- Recommended: run via cron every 10 minutes

 **Correctness Rules (MVP):**
 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
 2. Cursor advances only after successful DB commit for that page
 3. Dependent resources:
-   - For each updated issue/MR, refetch its notes (sorted by `updated_at`)
-   - For each updated MR, refetch its file changes
+   - For each updated issue/MR, refetch ALL its discussions
+   - Discussion documents are regenerated and re-embedded if content_hash changes
 4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
 5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
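
Rules 1, 2, and 5 can be illustrated with a small in-memory sketch. It assumes timestamps are ISO-8601 UTC strings in one fixed format (so string order matches chronological order) and uses hypothetical names (`Cursor`, `sync_pages`, `commit_page`); a real implementation would persist the cursor in the same DB transaction as the page.

```python
from typing import NamedTuple


class Cursor(NamedTuple):
    updated_at: str  # assumed ISO-8601 UTC string in one fixed format
    gitlab_id: int   # tie-breaker within identical timestamps


def items_after(items, cursor):
    """Items strictly after the cursor, ordered by (updated_at, id)."""
    key = lambda it: (it["updated_at"], it["id"])
    return sorted((it for it in items if key(it) > tuple(cursor)), key=key)


def sync_pages(items, cursor, page_size, commit_page):
    """Advance the cursor only after each page commits successfully."""
    pending = items_after(items, cursor)
    for start in range(0, len(pending), page_size):
        page = pending[start:start + page_size]
        try:
            commit_page(page)
        except Exception as exc:
            # Run is marked failed; the cursor stays at the last committed page.
            return cursor, f"failed: {exc}"
        cursor = Cursor(page[-1]["updated_at"], page[-1]["id"])
    return cursor, "ok"


items = [
    {"id": 3, "updated_at": "2026-01-01T00:00:00Z"},
    {"id": 1, "updated_at": "2026-01-01T00:00:00Z"},
    {"id": 2, "updated_at": "2026-01-02T00:00:00Z"},
]
committed = []


def flaky_commit(page):
    # Simulates a failure on the page that contains item 2.
    if any(it["id"] == 2 for it in page):
        raise RuntimeError("boom")
    committed.extend(it["id"] for it in page)


cursor, status = sync_pages(items, Cursor("", 0), 2, flaky_commit)
resumed_cursor, resumed_status = sync_pages(
    items, cursor, 2, lambda page: committed.extend(it["id"] for it in page)
)
```

Because the cursor is a `(updated_at, gitlab_id)` pair, the resumed run picks up exactly at the item after the last committed one, even when several items share a timestamp.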

 **Why Dependent Resource Model:**
- GitLab Notes API doesn't provide a clean global `updated_after` stream
- Notes are listed per-issue or per-MR, not as a top-level resource
- Treating notes as dependent resources (refetch when parent updates) is simpler and more correct
- Same applies to MR changes/diffs
+- GitLab Discussions API doesn't provide a global `updated_after` stream
+- Discussions are listed per-issue or per-MR, not as a top-level resource
+- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct

 **CLI Commands:**
 ```bash
 # Incremental sync (respects cursors, only fetches new/updated)
-gitlab-engine sync
+gi sync

 # Force full re-sync (resets cursors)
-gitlab-engine sync --full
+gi sync --full

 # Override stale 'running' run after operator review
-gitlab-engine sync --force
+gi sync --force

 # Show sync status
-gitlab-engine sync-status
+gi sync-status
 ```

 ---

-## Future Checkpoints (Post-MVP)
+## Future Work (Post-MVP)

-### Checkpoint 6: File/Feature History View
- Map commits to MRs to discussions
- Query: "Show decision history for src/auth/login.ts"
- Ship `gitlab-engine file-history <path>` as a first-class feature here
- This command is deferred from MVP to sharpen checkpoint focus
+The following features are explicitly deferred to keep MVP scope focused:

-### Checkpoint 7: Personal Dashboard
- Filter by assigned/mentioned
- Integrate with existing gitlab-inbox tool
+| Feature | Description | Depends On |
+|---------|-------------|------------|
+| **File History** | Query "what decisions were made about src/auth/login.ts?". Requires the `mr_files` table (MR→file linkage) and commit-level indexing | MVP complete |
+| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
+| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
+| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
+| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
+| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |

-### Checkpoint 8: Person Context
- Aggregate contributions by author
- Expertise inference from activity
+**Checkpoint 6 (File History) Schema Preview:**
+```sql
+-- Deferred from MVP; added when file-history feature is built
+CREATE TABLE mr_files (
+  id INTEGER PRIMARY KEY,
+  merge_request_id INTEGER REFERENCES merge_requests(id),
+  old_path TEXT,
+  new_path TEXT,
+  new_file BOOLEAN,
+  deleted_file BOOLEAN,
+  renamed_file BOOLEAN,
+  UNIQUE(merge_request_id, old_path, new_path)
+);
+CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
+CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-### Checkpoint 9: Decision Graph
- Extract decisions from discussions (LLM-assisted)
- Visualize decision relationships
+-- DiffNote position data (for "show me comments on this file" queries)
+-- Populated from notes.type='DiffNote' position object in GitLab API
+CREATE TABLE note_positions (
+  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
+  old_path TEXT,
+  new_path TEXT,
+  old_line INTEGER,
+  new_line INTEGER,
+  position_type TEXT                          -- 'text' | 'image' | etc.
+);
+CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
+```
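
As an illustration of how the previewed `mr_files` table could answer "which MRs touched this file?", here is a hedged sketch. The `merge_requests` columns (`iid`, `title`) and the helper name `mrs_touching` are assumptions for the example; only the `mr_files` columns come from the schema preview.

```python
import sqlite3


def mrs_touching(conn, path):
    """Return (iid, title) of MRs whose changes reference the given path."""
    return conn.execute(
        """
        SELECT DISTINCT mr.iid, mr.title
        FROM mr_files f
        JOIN merge_requests mr ON mr.id = f.merge_request_id
        WHERE f.old_path = ? OR f.new_path = ?
        ORDER BY mr.iid
        """,
        (path, path),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- merge_requests columns are assumed; mr_files follows the preview above.
    CREATE TABLE merge_requests (id INTEGER PRIMARY KEY, iid INTEGER, title TEXT);
    CREATE TABLE mr_files (
      id INTEGER PRIMARY KEY,
      merge_request_id INTEGER REFERENCES merge_requests(id),
      old_path TEXT,
      new_path TEXT,
      new_file BOOLEAN,
      deleted_file BOOLEAN,
      renamed_file BOOLEAN,
      UNIQUE(merge_request_id, old_path, new_path)
    );
    INSERT INTO merge_requests VALUES
      (1, 101, 'Add login'), (2, 102, 'Rename login'), (3, 103, 'Docs');
    INSERT INTO mr_files
      (merge_request_id, old_path, new_path, new_file, deleted_file, renamed_file)
    VALUES
      (1, 'src/auth/login.ts', 'src/auth/login.ts', 1, 0, 0),
      (2, 'src/auth/login.ts', 'src/auth/signin.ts', 0, 0, 1),
      (3, 'README.md', 'README.md', 0, 0, 0);
    """
)
history = mrs_touching(conn, "src/auth/login.ts")
```

Matching on both `old_path` and `new_path` is what lets a rename show up in the history of the original path, which is why the preview indexes both columns.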

 ---

@@ -606,14 +1037,15 @@ Each checkpoint includes:
 | labels | 1 | Label definitions (unique by project + name) |
 | issue_labels | 1 | Issue-label junction |
 | merge_requests | 2 | Normalized MRs |
-| notes | 2 | Issue and MR comments (with parent FKs) |
-| mr_files | 2 | MR file changes (with rename tracking) |
+| discussions | 2 | Discussion threads (the semantic unit for conversations) |
+| notes | 2 | Individual comments within discussions |
 | mr_labels | 2 | MR-label junction |
-| documents | 3 | Unified searchable documents |
+| documents | 3 | Unified searchable documents (issues, MRs, discussions) |
 | document_labels | 3 | Document-label junction for fast filtering |
 | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) |
 | embedding_metadata | 3 | Embedding provenance + change detection |
-| documents_fts | 4 | Full-text search index (fts5) |
+| documents_fts | 4 | Full-text search index (fts5 with porter stemmer) |
+| mr_files | 6 | MR file changes (deferred to File History feature) |

 ---

@@ -621,14 +1053,19 @@ Each checkpoint includes:

 | Question | Decision | Rationale |
 |----------|----------|-----------|
-| Commit/file linkage | **Include MR→file links** | Enables "what MRs touched this file?" without full commit history |
+| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread |
+| System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value |
+| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls |
 | Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering |
 | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available |
-| Sync method | **Polling for MVP** | Decide on webhooks after using the system |
-| Notes sync | **Dependent resource** | Notes API is per-parent, not global; refetch on parent update |
+| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 minutes is sufficient |
+| Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates |
 | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
 | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts |
-| file-history CLI | **Post-MVP (CP6)** | Sharpens MVP checkpoint focus |
+| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 tokens |
+| Embedding batching | **32 documents per batch** | Balance between throughput and memory |
+| FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text |
+| Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching |
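
To make the "RRF over weighted sums" decision concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document ids. The function name `rrf_fuse` is hypothetical, and the constant `k=60` is a commonly used default rather than a value specified by this document.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


fts_results = [1, 2, 3]     # e.g. ids ranked by full-text search
vector_results = [3, 1, 4]  # e.g. ids ranked by embedding similarity
fused = rrf_fuse([fts_results, vector_results])
```

Only ranks enter the formula, so the FTS5 and vector scores never need to be normalized onto a common scale. Passing a single ranking returns it unchanged, which matches the FTS5-only fallback when Ollama is unavailable.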

 ---