# GitLab Knowledge Engine - Spec Document

## Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain the conversational context essential for decision traceability.

---

## Discovery Summary

### Pain Points Identified

1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find
2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions

### Constraints

| Constraint | Detail |
|------------|--------|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |

### Target Use Cases (Priority Order)

1. **MVP: Semantic Search** - "Find discussions about authentication redesign"
2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?"
3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?"
4. **Future: Person Context** - "What's @johndoe's background in this project?"
---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                           GitLab API                            │
│                      (Issues, MRs, Notes)                       │
└─────────────────────────────────────────────────────────────────┘
              (Commit-level indexing explicitly post-MVP)
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Ingestion Layer                       │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                      │
│  - Raw JSON storage for replay                                  │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Processing Layer                      │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - Build relationship graph (issue↔MR↔note↔file)                │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Storage Layer                          │
│  - SQLite + sqlite-vss + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                     │
│  - Vector embeddings for semantic search                        │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Query Interface                         │
│  - CLI for human testing                                        │
│  - JSON API for AI agent testing                                │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘
```

### Technology Choices

| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |

### Alternative Considered: Postgres + pgvector

- Pros: more scalable, better for production multi-user use
- Cons: requires running Postgres, heavier setup
- Decision: start with SQLite for simplicity; a migration path exists if needed

---

## GitLab API Strategy

### Primary Resources (Bulk Fetch)

Issues and MRs support efficient bulk fetching with incremental sync:

```
GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100
```

### Dependent Resources (Per-Parent Fetch)

Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:

```
GET /projects/:id/issues/:iid/discussions
GET /projects/:id/merge_requests/:iid/discussions
```

### Sync Pattern

**Initial sync:**

1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
2. For EACH issue → fetch all discussions (~3K calls)
3. Fetch all MRs (paginated, ~60 calls)
4. For EACH MR → fetch all discussions (~3K calls)
5. Total: ~6,100+ API calls for initial sync

**Incremental sync:**

1. Fetch issues where `updated_after=cursor` (bulk)
2. For EACH updated issue → refetch ALL its discussions
3. Fetch MRs where `updated_after=cursor` (bulk)
4. For EACH updated MR → refetch ALL its discussions

### Critical Assumption

**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If it is incorrect, new comments on otherwise-stale items would be missed. Mitigation: a periodic full re-sync (weekly) as a safety net.

### Rate Limiting

- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid a thundering herd on retry
- Initial sync estimate: 10-20 minutes depending on rate limits

---

## Checkpoint Structure

Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
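The retry policy in the rate-limiting section above (exponential backoff with jitter, honoring `Retry-After`) can be sketched as a pure delay function. This is an illustrative sketch, not part of the planned API surface: the function name and the base/cap constants are assumptions.

```typescript
// Sketch of the retry-delay policy: exponential backoff with full jitter,
// honoring a Retry-After header (seconds) when the server sends one.
// Names and constants are illustrative, not part of the spec.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

function retryDelayMs(
  attempt: number,            // 0-based retry attempt
  retryAfterSeconds?: number, // parsed Retry-After header, if present
  random: () => number = Math.random,
): number {
  if (retryAfterSeconds !== undefined) {
    // Server knows best: wait exactly as long as it asked.
    return retryAfterSeconds * 1000;
  }
  const exp = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  // Full jitter: pick uniformly in [0, exp) to avoid a thundering herd.
  return Math.floor(random() * exp);
}
```

Injecting the `random` source keeps the policy deterministic under test, which matters once the ingestion layer's 429 handling is covered by Vitest.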
### Checkpoint 0: Project Setup

**Deliverable:** Scaffolded project with GitLab API connection verified

**Automated Tests (Vitest):**

```
tests/unit/config.test.ts
  ✓ loads config from gi.config.json
  ✓ throws if config file missing
  ✓ throws if required fields missing (baseUrl, projects)
  ✓ validates project paths are non-empty strings

tests/unit/db.test.ts
  ✓ creates database file if not exists
  ✓ applies migrations in order
  ✓ sets WAL journal mode
  ✓ enables foreign keys

tests/integration/gitlab-client.test.ts
  ✓ authenticates with valid PAT
  ✓ returns 401 for invalid PAT
  ✓ fetches project by path
  ✓ handles rate limiting (429) with retry
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |

**Data Integrity Checks:**

- [ ] `projects` table contains rows for each configured project path
- [ ] `gitlab_project_id` matches actual GitLab project IDs
- [ ] `raw_payloads` contains project JSON for each synced project

**Scope:**

- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
- Projects table and initial sync

**Configuration (MVP):**

```json
// gi.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl":
"http://localhost:11434" } } ``` **DB Runtime Defaults (Checkpoint 0):** - On every connection: - `PRAGMA journal_mode=WAL;` - `PRAGMA foreign_keys=ON;` **Schema (Checkpoint 0):** ```sql -- Projects table (configured targets) CREATE TABLE projects ( id INTEGER PRIMARY KEY, gitlab_project_id INTEGER UNIQUE NOT NULL, path_with_namespace TEXT NOT NULL, default_branch TEXT, web_url TEXT, created_at INTEGER, updated_at INTEGER, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_projects_path ON projects(path_with_namespace); -- Sync tracking for reliability CREATE TABLE sync_runs ( id INTEGER PRIMARY KEY, started_at INTEGER NOT NULL, finished_at INTEGER, status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. error TEXT ); -- Sync cursors for primary resources only -- Notes and MR changes are dependent resources (fetched via parent updates) CREATE TABLE sync_cursors ( project_id INTEGER NOT NULL REFERENCES projects(id), resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests' updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch) tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering) PRIMARY KEY(project_id, resource_type) ); -- Raw payload storage (decoupled from entity tables) CREATE TABLE raw_payloads ( id INTEGER PRIMARY KEY, source TEXT NOT NULL, -- 'gitlab' resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' gitlab_id INTEGER NOT NULL, fetched_at INTEGER NOT NULL, json TEXT NOT NULL ); CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); ``` --- ### Checkpoint 1: Issue Ingestion **Deliverable:** All issues from target repos stored locally **Automated Tests (Vitest):** ``` tests/unit/issue-transformer.test.ts ✓ transforms GitLab issue payload to normalized schema ✓ extracts labels from issue payload ✓ handles missing optional fields gracefully tests/unit/pagination.test.ts ✓ fetches all pages when 
multiple exist ✓ respects per_page parameter ✓ stops when empty page returned tests/integration/issue-ingestion.test.ts ✓ inserts issues into database ✓ creates labels from issue payloads ✓ links issues to labels via junction table ✓ stores raw payload for each issue ✓ updates cursor after successful page commit ✓ resumes from cursor on subsequent runs tests/integration/sync-runs.test.ts ✓ creates sync_run record on start ✓ marks run as succeeded on completion ✓ marks run as failed with error message on failure ✓ refuses concurrent run (single-flight) ✓ allows --force to override stale running status ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi ingest --type=issues` | Progress bar, final count | Completes without error | | `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author | | `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project | | `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI | | `gi show issue 123` | Issue detail view | Shows title, description, labels, URL | | `gi sync-status` | Last sync time, cursor positions | Shows successful last run | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects - [ ] Every issue has a corresponding `raw_payloads` row - [ ] Labels in `issue_labels` junction all exist in `labels` table - [ ] `sync_cursors` has entry for each (project_id, 'issues') pair - [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current) **Scope:** - Issue fetcher with pagination handling - Raw JSON storage in raw_payloads table - Normalized issue schema in SQLite - Labels ingestion derived from issue payload: - Always persist label names from `labels: string[]` - Optionally request `with_labels_details=true` to capture color/description when available - Incremental sync support (run 
  tracking + per-project cursor)
- Basic list/count CLI commands

**Reliability/Idempotency Rules:**

- Every ingest/sync creates a `sync_runs` row
- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
- Cursor advances only after a successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts

**Schema Preview:**

```sql
CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER, -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
```

---

### Checkpoint 2: MR + Discussions Ingestion

**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context

**Automated Tests (Vitest):**

```
tests/unit/mr-transformer.test.ts
  ✓ transforms GitLab MR payload to normalized schema
  ✓ extracts labels from MR payload
  ✓ handles missing optional fields gracefully

tests/unit/discussion-transformer.test.ts
  ✓ transforms discussion payload to normalized schema
  ✓ extracts notes array from discussion
  ✓ sets individual_note flag correctly
  ✓ filters out system notes (system: true)
  ✓ preserves note order via position field

tests/integration/mr-ingestion.test.ts
  ✓ inserts MRs into database
  ✓ creates labels from MR payloads
  ✓ links MRs to labels via junction table
  ✓ stores raw payload for each MR

tests/integration/discussion-ingestion.test.ts
  ✓ fetches discussions for each issue
  ✓ fetches discussions for each MR
  ✓ creates discussion rows with correct parent FK
  ✓ creates note rows linked to discussions
  ✓ excludes system notes from storage
  ✓ captures note-level resolution status
  ✓ captures note type (DiscussionNote, DiffNote)
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Non-zero count |
| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ...
  WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
- [ ] Every discussion has at least one note
- [ ] `individual_note = true` discussions have exactly one note
- [ ] Discussion `first_note_at` <= `last_note_at` for all rows

**Scope:**

- MR fetcher with pagination
- Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
  - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
  - During initial ingest: fetch discussions for every issue/MR
  - During sync: refetch discussions only for issues/MRs updated since the cursor
- Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
- Relationship linking (discussion → parent issue/MR, notes → discussion)
- Extended CLI commands for MR/issue display with threads

**Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.

**Schema Additions:**

```sql
CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);

-- Discussion threads (the semantic unit for conversations)
CREATE TABLE discussions (
  id INTEGER PRIMARY KEY,
  gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER REFERENCES merge_requests(id),
  noteable_type TEXT NOT NULL,      -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
  first_note_at INTEGER, -- for ordering discussions
  last_note_at INTEGER,  -- for "recently active" queries
  resolvable BOOLEAN,    -- MR discussions can be resolved
  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL)
    OR (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  position INTEGER,    -- derived from array order in API response (0-indexed)
  resolvable BOOLEAN,  -- note-level resolvability (MR code comments)
  resolved BOOLEAN,    -- note-level resolution status
  resolved_by TEXT,    -- username who resolved
  resolved_at INTEGER, -- when resolved
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_type ON notes(type);

-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
```
**Discussion Processing Rules:**

- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
- Each discussion from the API becomes one row in the `discussions` table
- All notes within a discussion are stored with their `discussion_id` foreign key
- `individual_note: true` discussions have exactly one note (standalone comment)
- `individual_note: false` discussions have multiple notes (threaded conversation)

---

### Checkpoint 3: Embedding Generation

**Deliverable:** Vector embeddings generated for all text content

**Automated Tests (Vitest):**

```
tests/unit/document-extractor.test.ts
  ✓ extracts issue document (title + description)
  ✓ extracts MR document (title + description)
  ✓ extracts discussion document with full thread context
  ✓ includes parent issue/MR title in discussion header
  ✓ formats notes with author and timestamp
  ✓ truncates content exceeding 8000 tokens
  ✓ preserves first and last notes when truncating middle
  ✓ computes SHA-256 content hash consistently

tests/unit/embedding-client.test.ts
  ✓ connects to Ollama API
  ✓ generates embedding for text input
  ✓ returns 768-dimension vector
  ✓ handles Ollama connection failure gracefully
  ✓ batches requests (32 documents per batch)

tests/integration/document-creation.test.ts
  ✓ creates document for each issue
  ✓ creates document for each MR
  ✓ creates document for each discussion
  ✓ populates document_labels junction table
  ✓ computes content_hash for each document

tests/integration/embedding-storage.test.ts
  ✓ stores embedding in sqlite-vss
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
  ✓ re-embeds when content_hash changes
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- [ ] All `embedding_metadata.content_hash` values match the corresponding `documents.content_hash`
- [ ] Documents matched by `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` have corresponding logged truncation warnings
- [ ] Discussion documents include the parent title in content_text

**Scope:**

- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing, 32 documents per batch)
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Single embedding per document (chunking deferred to post-MVP)
  - Truncation: content_text capped at 8000 tokens (the nomic-embed-text limit is 8192)
  - Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table

**Schema Additions:**

```sql
-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,  -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL, -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT, -- for discussions: first note author
  label_names TEXT,     -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT, -- null for discussions
  content_text TEXT NOT NULL, -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL, -- SHA-256 for change detection
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL,        -- 'nomic-embed-text'
  dims INTEGER NOT NULL,      -- 768
  content_hash TEXT NOT NULL, -- copied from documents.content_hash
  created_at INTEGER NOT NULL
);
```

**Storage Rule (MVP):**

- Insert each embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid-mapping fragility

**Document Extraction Rules:**

| Source | content_text Construction |
|--------|--------------------------|
| Issue | `title + "\n\n" + description` |
| MR | `title + "\n\n" + description` |
| Discussion | Full thread with context (see below) |

**Discussion Document Format:**

```
[Issue #234: Authentication redesign] Discussion

@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session
cookies are causing issues with our mobile clients...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```

This format preserves:

- Parent context (issue/MR title and number)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability

**Truncation:** If a concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve the first and last notes for context) and log a warning.

---

### Checkpoint 4: Semantic Search

**Deliverable:** Working semantic search across all indexed content

**Automated Tests (Vitest):**

```
tests/unit/search-query.test.ts
  ✓ parses filter flags (--type, --author, --after, --label)
  ✓ validates date format for --after
  ✓ handles multiple --label flags

tests/unit/rrf-ranking.test.ts
  ✓ computes RRF score correctly
  ✓ merges results from vector and FTS retrievers
  ✓ handles documents appearing in only one retriever
  ✓ respects k=60 parameter

tests/integration/vector-search.test.ts
  ✓ returns results for semantic query
  ✓ ranks similar content higher
  ✓ returns empty for nonsense query

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ handles porter stemming (search/searching)
  ✓ returns empty for non-matching query

tests/integration/hybrid-search.test.ts
  ✓ combines vector and FTS results
  ✓ applies type filter correctly
  ✓ applies author filter correctly
  ✓ applies date filter correctly
  ✓ applies label filter correctly
  ✓ falls back to FTS when Ollama unavailable

tests/e2e/golden-queries.test.ts
  ✓ "authentication redesign" returns known auth-related items
  ✓ "database migration" returns known migration items
  ✓ [8 more domain-specific golden queries]
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
| `gi search "authentication" --json` | JSON output | Valid JSON array with schema |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |

**Golden Query Test Suite:**

Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:

```json
[
  {
    "query": "authentication redesign",
    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
    "minResults": 1,
    "maxRank": 10
  }
]
```

Each query must have at least one expected URL appear in the top 10 results.

**Data Integrity Checks:**

- [ ] `documents_fts` row count matches `documents` row count
- [ ] Search returns results for known content (not empty)
- [ ] JSON output validates against the defined schema
- [ ] All result URLs are valid GitLab URLs

**Scope:**

- Hybrid retrieval:
  - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
- Label filtering operates on `document_labels` (indexed, exact-match)
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI agent consumption
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with a warning

**Schema Additions:**

```sql
-- Full-text search for hybrid retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER
documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;
```

**FTS5 Tokenizer Notes:**

- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; a custom tokenizer is a post-MVP consideration

**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**

1. Query both the vector index (top 50) and FTS5 (top 50)
2. Merge results by document_id
3. Combine with Reciprocal Rank Fusion (RRF):
   - For each retriever list, assign ranks (1..N)
   - `rrf_score = Σ 1 / (k + rank)` with k=60 (tunable)
   - RRF is simpler than weighted sums and doesn't require score normalization
4. Apply filters (type, author, date, label)
5. Return top K

**Why RRF over Weighted Sums:**

- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in the information retrieval literature

**Graceful Degradation:**

- If Ollama is unreachable during search, automatically fall back to FTS5-only
- Display warning: "Embedding service unavailable, using lexical search only"
- The `embed` command fails with an actionable error if Ollama is down

**CLI Interface:**

```bash
# Basic semantic search
gi search "why did we choose Redis"

# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical

# Filtered search
gi search "authentication" --type=mr --after=2024-01-01

# Filter by label
gi search "performance" --label=bug --label=critical

# JSON output for programmatic use
gi search "payment processing" --json
```

**CLI Output Example:**

```
$ gi search "authentication redesign"

Found 23 results (hybrid search, 0.34s)

[1] MR !847 - Refactor auth to use JWT tokens (0.82)
    @johndoe · 2024-03-15 · group/project-one
    "...moving away from session cookies to JWT for authentication..."
    https://gitlab.example.com/group/project-one/-/merge_requests/847

[2] Issue #234 - Authentication redesign discussion (0.79)
    @janedoe · 2024-02-28 · group/project-one
    "...we need to redesign the authentication flow because..."
    https://gitlab.example.com/group/project-one/-/issues/234

[3] Discussion on Issue #234 (0.76)
    @johndoe · 2024-03-01 · group/project-one
    "I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ``` --- ### Checkpoint 5: Incremental Sync **Deliverable:** Efficient ongoing synchronization with GitLab **Automated Tests (Vitest):** ``` tests/unit/cursor-management.test.ts ✓ advances cursor after successful page commit ✓ uses tie-breaker id for identical timestamps ✓ does not advance cursor on failure ✓ resets cursor on --full flag tests/unit/change-detection.test.ts ✓ detects content_hash mismatch ✓ queues document for re-embedding on change ✓ skips re-embedding when hash unchanged tests/integration/incremental-sync.test.ts ✓ fetches only items updated after cursor ✓ refetches discussions for updated issues ✓ refetches discussions for updated MRs ✓ updates existing records (not duplicates) ✓ creates new records for new items ✓ re-embeds documents with changed content tests/integration/sync-recovery.test.ts ✓ resumes from cursor after interrupted sync ✓ marks failed run with error message ✓ handles rate limiting (429) with backoff ✓ respects Retry-After header ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check | | `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change | | `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything | | `gi sync-status` | Cursor positions, last sync time | Shows current state | | `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually | | `gi search "new content"` (after sync) | Returns new content | New content is searchable | **End-to-End Sync Verification:** 1. Note the current `sync_cursors` values 2. Create a new comment on an issue in GitLab 3. Run `gi sync` 4. 
Verify: - [ ] Issue's `updated_at` in DB matches GitLab - [ ] New discussion row exists - [ ] New note row exists - [ ] New document row exists for discussion - [ ] New embedding exists for document - [ ] `gi search "new comment text"` returns the new discussion - [ ] Cursor advanced past the updated issue **Data Integrity Checks:** - [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table - [ ] No orphaned documents (all have valid source_id) - [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows - [ ] `sync_runs` has complete audit trail **Scope:** - Delta sync based on stable cursor (updated_at + tie-breaker id) - Dependent resources sync strategy (discussions refetched when parent updates) - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) - Sync status reporting - Recommended: run via cron every 10 minutes **Correctness Rules (MVP):** 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC` 2. Cursor advances only after successful DB commit for that page 3. Dependent resources: - For each updated issue/MR, refetch ALL its discussions - Discussion documents are regenerated and re-embedded if content_hash changes 4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` 5. 
Sync run is marked 'failed' with an error message if any page fails (can resume from cursor)

**Why Dependent Resource Model:**

- GitLab Discussions API doesn't provide a global `updated_after` stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more reliable than trying to detect discussion changes directly

**CLI Commands:**

```bash
# Full sync (respects cursors, only fetches new/updated)
gi sync

# Force full re-sync (resets cursors)
gi sync --full

# Override stale 'running' run after operator review
gi sync --force

# Show sync status
gi sync-status
```

---

## Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:

| Feature | Description | Depends On |
|---------|-------------|------------|
| **File History** | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |

**Checkpoint 6 (File History) Schema Preview:**

```sql
-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);

CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on
this file" queries) -- Populated from notes.type='DiffNote' position object in GitLab API CREATE TABLE note_positions ( note_id INTEGER PRIMARY KEY REFERENCES notes(id), old_path TEXT, new_path TEXT, old_line INTEGER, new_line INTEGER, position_type TEXT -- 'text' | 'image' | etc. ); CREATE INDEX idx_note_positions_new_path ON note_positions(new_path); ``` --- ## Verification Strategy Each checkpoint includes: 1. **Automated tests** - Unit tests for data transformations, integration tests for API calls 2. **CLI smoke tests** - Manual commands with expected outputs documented 3. **Data integrity checks** - Count verification against GitLab, schema validation 4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+) --- ## Risk Mitigation | Risk | Mitigation | |------|------------| | GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync | | Embedding model quality | Start with nomic-embed-text; architecture allows model swap | | SQLite scale limits | Monitor performance; Postgres migration path documented | | Stale data | Incremental sync with change detection | | Mid-sync failures | Cursor-based resumption, sync_runs audit trail | | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | | Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | **SQLite Performance Defaults (MVP):** - Enable `PRAGMA journal_mode=WAL;` on every connection - Enable `PRAGMA foreign_keys=ON;` on every connection - Use explicit transactions for page/batch inserts - Targeted indexes on `(project_id, updated_at)` for primary resources --- ## Schema Summary | Table | Checkpoint | Purpose | |-------|------------|---------| | projects | 0 | Configured GitLab projects | | sync_runs | 0 | Audit trail of sync operations | | sync_cursors | 0 | Resumable sync state per primary resource | | raw_payloads | 0 | Decoupled raw JSON storage | | issues | 1 | Normalized 
issues | | labels | 1 | Label definitions (unique by project + name) | | issue_labels | 1 | Issue-label junction | | merge_requests | 2 | Normalized MRs | | discussions | 2 | Discussion threads (the semantic unit for conversations) | | notes | 2 | Individual comments within discussions | | mr_labels | 2 | MR-label junction | | documents | 3 | Unified searchable documents (issues, MRs, discussions) | | document_labels | 3 | Document-label junction for fast filtering | | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | | embedding_metadata | 3 | Embedding provenance + change detection | | documents_fts | 4 | Full-text search index (fts5 with porter stemmer) | | mr_files | 6 | MR file changes (deferred to File History feature) | --- ## Resolved Decisions | Question | Decision | Rationale | |----------|----------|-----------| | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread | | System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value | | MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls | | Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient | | Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates | | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | | Embedding truncation | **8000 tokens, truncate 
middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 | | Embedding batching | **32 documents per batch** | Balance between throughput and memory | | FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text | | Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching | --- ## Next Steps 1. User approves this spec 2. Generate Checkpoint 0 PRD for project setup 3. Implement Checkpoint 0 4. Human validates → proceed to Checkpoint 1 5. Repeat for each checkpoint
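As an illustration of the RRF decision above, the rank-based merge described under Checkpoint 4 can be sketched in a few lines of TypeScript. This is a minimal sketch, not part of the MVP deliverable: the function name `rrfMerge`, the string document ids, and the smoothing constant `k = 60` (the conventional default in the RRF literature) are assumptions, not spec requirements.

```typescript
// Reciprocal Rank Fusion: merge two ranked result lists by rank, not score.
// `rrfMerge`, string ids, and k = 60 are illustrative assumptions.
function rrfMerge(
  vectorRanked: string[],
  ftsRanked: string[],
  k = 60,
  topK = 10,
): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, ftsRanked]) {
    list.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score descending and return the top K document ids.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([docId]) => docId);
}
```

Because only ranks enter the formula, a document found by both retrievers accumulates score from each list, and BM25 scores from FTS5 never need to be normalized against vector distances from sqlite-vss.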
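Correctness rule 1 under Checkpoint 5 (pages ordered by `updated_at ASC`, tie-broken by `gitlab_id ASC`) amounts to a lexicographic comparison. A minimal sketch, assuming ISO-8601 UTC timestamp strings (which sort correctly as plain strings); the `Cursor` shape and the name `isAfterCursor` are illustrative, not part of the spec:

```typescript
// Cursor ordering sketch for Checkpoint 5, correctness rule 1.
// Assumes ISO-8601 UTC timestamps, which compare correctly as plain strings.
interface Cursor {
  updatedAt: string; // e.g. "2024-03-15T10:00:00Z"
  gitlabId: number;  // tie-breaker within identical timestamps
}

function isAfterCursor(item: Cursor, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) {
    return item.updatedAt > cursor.updatedAt;
  }
  return item.gitlabId > cursor.gitlabId;
}
```

Combined with rule 2 (advance the cursor only after the page's DB commit), every item strictly "after" the cursor under this ordering is guaranteed to be unprocessed, which is what makes resumption after an interrupted sync safe.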
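Correctness rule 4 (a document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`) can be sketched as below. SHA-256 is an assumption — the spec only requires a stable content hash — and the in-memory shapes are illustrative:

```typescript
import { createHash } from "node:crypto";

// Change-detection sketch for Checkpoint 5, correctness rule 4.
// SHA-256 is an assumed algorithm; any stable content hash would do.
function contentHash(canonicalText: string): string {
  return createHash("sha256").update(canonicalText, "utf8").digest("hex");
}

// Queue a document for re-embedding iff its current hash differs from the
// hash recorded in embedding_metadata; a missing row means "never embedded".
function needsReembedding(
  canonicalText: string,
  embeddedHash: string | null,
): boolean {
  return embeddedHash === null || contentHash(canonicalText) !== embeddedHash;
}
```

This is what lets an incremental sync skip re-embedding when GitLab bumps an item's `updated_at` without changing its searchable text (e.g. a metadata-only edit).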
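The "truncate middle" decision (preserve the first and last notes of an over-long thread within the 8000-token budget) can be sketched as follows. Real token counting depends on the embedding model's tokenizer, so this sketch operates on caller-supplied units (e.g. per-note texts); the function name and elision marker are illustrative:

```typescript
// Truncate-middle sketch for the embedding-input decision: when a thread
// exceeds the budget, keep the head and tail and elide the middle, so the
// opening statement and the most recent replies both survive.
function truncateMiddle(
  units: string[],
  limit: number,
  marker = "[...]",
): string[] {
  if (units.length <= limit) return units;
  const keep = limit - 1;            // reserve one slot for the elision marker
  const head = Math.ceil(keep / 2);  // favor the head on odd budgets
  const tail = Math.floor(keep / 2);
  return [...units.slice(0, head), marker, ...units.slice(units.length - tail)];
}
```

Truncating the middle rather than the tail matters for discussion documents in particular: the opening note usually states the problem, and the closing notes usually record the resolution.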