# GitLab Knowledge Engine - Spec Document

## Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain the conversational context essential for decision traceability.

---

## Discovery Summary

### Pain Points Identified

1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find
2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions

### Constraints

| Constraint | Detail |
|------------|--------|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |

### Target Use Cases (Priority Order)

1. **MVP: Semantic Search** - "Find discussions about authentication redesign"
2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?"
3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?"
4. **Future: Person Context** - "What's @johndoe's background in this project?"
---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                           GitLab API                            │
│                      (Issues, MRs, Notes)                       │
└─────────────────────────────────────────────────────────────────┘
              (Commit-level indexing explicitly post-MVP)
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Ingestion Layer                       │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                      │
│  - Raw JSON storage for replay                                  │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Processing Layer                      │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - Build relationship graph (issue↔MR↔note↔file)                │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Storage Layer                          │
│  - SQLite + sqlite-vss + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                     │
│  - Vector embeddings for semantic search                        │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Query Interface                         │
│  - CLI for human testing                                        │
│  - JSON API for AI agent testing                                │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘
```

### Technology Choices

| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |

### Alternative Considered: Postgres + pgvector

- Pros: more scalable, better for production multi-user use
- Cons: requires running Postgres, heavier setup
- Decision: start with SQLite for simplicity; a migration path exists if needed

---

## GitLab API Strategy

### Primary Resources (Bulk Fetch)

Issues and MRs support efficient bulk fetching with incremental sync:

```
GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100
```

### Dependent Resources (Per-Parent Fetch)

Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:

```
GET /projects/:id/issues/:iid/discussions
GET /projects/:id/merge_requests/:iid/discussions
```

### Sync Pattern

**Initial sync:**

1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
2. For EACH issue → fetch all discussions (~3K calls)
3. Fetch all MRs (paginated, ~60 calls)
4. For EACH MR → fetch all discussions (~3K calls)
5. Total: ~6,100+ API calls for initial sync

**Incremental sync:**

1. Fetch issues where `updated_after=cursor` (bulk)
2. For EACH updated issue → refetch ALL its discussions
3. Fetch MRs where `updated_after=cursor` (bulk)
4. For EACH updated MR → refetch ALL its discussions

### Critical Assumption

**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If it is incorrect, new comments on otherwise-stale items would be missed. Mitigation: a periodic full re-sync (weekly) as a safety net.

### Rate Limiting

- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid a thundering herd on retry
- Initial sync estimate: 10-20 minutes depending on rate limits

---

## Checkpoint Structure

Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
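The retry policy in the rate-limiting section above (exponential backoff with jitter, honoring `Retry-After`) can be sketched as a pure delay function. This is an illustrative sketch, not part of the planned API surface: the function name and the base/cap constants are assumptions.

```typescript
// Sketch of the retry-delay policy: exponential backoff with full jitter,
// honoring a Retry-After header (seconds) when the server sends one.
// Names and constants are illustrative, not part of the spec.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30_000;

function retryDelayMs(
  attempt: number,            // 0-based retry attempt
  retryAfterSeconds?: number, // parsed Retry-After header, if present
  random: () => number = Math.random,
): number {
  if (retryAfterSeconds !== undefined) {
    // Server knows best: wait exactly as long as it asked.
    return retryAfterSeconds * 1000;
  }
  const exp = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  // Full jitter: pick uniformly in [0, exp) to avoid a thundering herd.
  return Math.floor(random() * exp);
}
```

Injecting the `random` source keeps the policy deterministic under test, which matters once the ingestion layer's 429 handling is covered by Vitest.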
### Checkpoint 0: Project Setup

**Deliverable:** Scaffolded project with GitLab API connection verified

**Automated Tests (Vitest):**

```
tests/unit/config.test.ts
  ✓ loads config from gi.config.json
  ✓ throws if config file missing
  ✓ throws if required fields missing (baseUrl, projects)
  ✓ validates project paths are non-empty strings

tests/unit/db.test.ts
  ✓ creates database file if not exists
  ✓ applies migrations in order
  ✓ sets WAL journal mode
  ✓ enables foreign keys

tests/integration/gitlab-client.test.ts
  ✓ authenticates with valid PAT
  ✓ returns 401 for invalid PAT
  ✓ fetches project by path
  ✓ handles rate limiting (429) with retry
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |

**Data Integrity Checks:**

- [ ] `projects` table contains rows for each configured project path
- [ ] `gitlab_project_id` matches actual GitLab project IDs
- [ ] `raw_payloads` contains project JSON for each synced project

**Scope:**

- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
- Projects table and initial sync

**Configuration (MVP):**

```json
// gi.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl":
"http://localhost:11434" } } ``` **DB Runtime Defaults (Checkpoint 0):** - On every connection: - `PRAGMA journal_mode=WAL;` - `PRAGMA foreign_keys=ON;` **Schema (Checkpoint 0):** ```sql -- Projects table (configured targets) CREATE TABLE projects ( id INTEGER PRIMARY KEY, gitlab_project_id INTEGER UNIQUE NOT NULL, path_with_namespace TEXT NOT NULL, default_branch TEXT, web_url TEXT, created_at INTEGER, updated_at INTEGER, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_projects_path ON projects(path_with_namespace); -- Sync tracking for reliability CREATE TABLE sync_runs ( id INTEGER PRIMARY KEY, started_at INTEGER NOT NULL, finished_at INTEGER, status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. error TEXT ); -- Sync cursors for primary resources only -- Notes and MR changes are dependent resources (fetched via parent updates) CREATE TABLE sync_cursors ( project_id INTEGER NOT NULL REFERENCES projects(id), resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests' updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch) tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering) PRIMARY KEY(project_id, resource_type) ); -- Raw payload storage (decoupled from entity tables) CREATE TABLE raw_payloads ( id INTEGER PRIMARY KEY, source TEXT NOT NULL, -- 'gitlab' resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' gitlab_id INTEGER NOT NULL, fetched_at INTEGER NOT NULL, json TEXT NOT NULL ); CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); ``` --- ### Checkpoint 1: Issue Ingestion **Deliverable:** All issues from target repos stored locally **Automated Tests (Vitest):** ``` tests/unit/issue-transformer.test.ts ✓ transforms GitLab issue payload to normalized schema ✓ extracts labels from issue payload ✓ handles missing optional fields gracefully tests/unit/pagination.test.ts ✓ fetches all pages when 
multiple exist ✓ respects per_page parameter ✓ stops when empty page returned tests/integration/issue-ingestion.test.ts ✓ inserts issues into database ✓ creates labels from issue payloads ✓ links issues to labels via junction table ✓ stores raw payload for each issue ✓ updates cursor after successful page commit ✓ resumes from cursor on subsequent runs tests/integration/sync-runs.test.ts ✓ creates sync_run record on start ✓ marks run as succeeded on completion ✓ marks run as failed with error message on failure ✓ refuses concurrent run (single-flight) ✓ allows --force to override stale running status ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi ingest --type=issues` | Progress bar, final count | Completes without error | | `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author | | `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project | | `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI | | `gi show issue 123` | Issue detail view | Shows title, description, labels, URL | | `gi sync-status` | Last sync time, cursor positions | Shows successful last run | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects - [ ] Every issue has a corresponding `raw_payloads` row - [ ] Labels in `issue_labels` junction all exist in `labels` table - [ ] `sync_cursors` has entry for each (project_id, 'issues') pair - [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current) **Scope:** - Issue fetcher with pagination handling - Raw JSON storage in raw_payloads table - Normalized issue schema in SQLite - Labels ingestion derived from issue payload: - Always persist label names from `labels: string[]` - Optionally request `with_labels_details=true` to capture color/description when available - Incremental sync support (run 
  tracking + per-project cursor)
- Basic list/count CLI commands

**Reliability/Idempotency Rules:**

- Every ingest/sync creates a `sync_runs` row
- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
- Cursor advances only after a successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts

**Schema Preview:**

```sql
CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER, -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
```

---

### Checkpoint 2: MR + Discussions Ingestion

**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context

**Automated Tests (Vitest):**

```
tests/unit/mr-transformer.test.ts
  ✓ transforms GitLab MR payload to normalized schema
  ✓ extracts labels from MR payload
  ✓ handles missing optional fields gracefully

tests/unit/discussion-transformer.test.ts
  ✓ transforms discussion payload to normalized schema
  ✓ extracts notes array from discussion
  ✓ sets individual_note flag correctly
  ✓ filters out system notes (system: true)
  ✓ preserves note order via position field

tests/integration/mr-ingestion.test.ts
  ✓ inserts MRs into database
  ✓ creates labels from MR payloads
  ✓ links MRs to labels via junction table
  ✓ stores raw payload for each MR

tests/integration/discussion-ingestion.test.ts
  ✓ fetches discussions for each issue
  ✓ fetches discussions for each MR
  ✓ creates discussion rows with correct parent FK
  ✓ creates note rows linked to discussions
  ✓ excludes system notes from storage
  ✓ captures note-level resolution status
  ✓ captures note type (DiscussionNote, DiffNote)
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Non-zero count |
| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ...
  WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
- [ ] Every discussion has at least one note
- [ ] `individual_note = true` discussions have exactly one note
- [ ] Discussion `first_note_at` <= `last_note_at` for all rows

**Scope:**

- MR fetcher with pagination
- Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
  - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
  - During initial ingest: fetch discussions for every issue/MR
  - During sync: refetch discussions only for issues/MRs updated since the cursor
- Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
- Relationship linking (discussion → parent issue/MR, notes → discussion)
- Extended CLI commands for MR/issue display with threads

**Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.

**Schema Additions:**

```sql
CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);

-- Discussion threads (the semantic unit for conversations)
CREATE TABLE discussions (
  id INTEGER PRIMARY KEY,
  gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER REFERENCES merge_requests(id),
  noteable_type TEXT NOT NULL,      -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
  first_note_at INTEGER, -- for ordering discussions
  last_note_at INTEGER,  -- for "recently active" queries
  resolvable BOOLEAN,    -- MR discussions can be resolved
  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL)
    OR (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  position INTEGER,    -- derived from array order in API response (0-indexed)
  resolvable BOOLEAN,  -- note-level resolvability (MR code comments)
  resolved BOOLEAN,    -- note-level resolution status
  resolved_by TEXT,    -- username who resolved
  resolved_at INTEGER, -- when resolved
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_type ON notes(type);

-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
```
**Discussion Processing Rules:**

- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
- Each discussion from the API becomes one row in the `discussions` table
- All notes within a discussion are stored with their `discussion_id` foreign key
- `individual_note: true` discussions have exactly one note (standalone comment)
- `individual_note: false` discussions have multiple notes (threaded conversation)

---

### Checkpoint 3: Embedding Generation

**Deliverable:** Vector embeddings generated for all text content

**Automated Tests (Vitest):**

```
tests/unit/document-extractor.test.ts
  ✓ extracts issue document (title + description)
  ✓ extracts MR document (title + description)
  ✓ extracts discussion document with full thread context
  ✓ includes parent issue/MR title in discussion header
  ✓ formats notes with author and timestamp
  ✓ truncates content exceeding 8000 tokens
  ✓ preserves first and last notes when truncating middle
  ✓ computes SHA-256 content hash consistently

tests/unit/embedding-client.test.ts
  ✓ connects to Ollama API
  ✓ generates embedding for text input
  ✓ returns 768-dimension vector
  ✓ handles Ollama connection failure gracefully
  ✓ batches requests (32 documents per batch)

tests/integration/document-creation.test.ts
  ✓ creates document for each issue
  ✓ creates document for each MR
  ✓ creates document for each discussion
  ✓ populates document_labels junction table
  ✓ computes content_hash for each document

tests/integration/embedding-storage.test.ts
  ✓ stores embedding in sqlite-vss
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
  ✓ re-embeds when content_hash changes
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- [ ] All `embedding_metadata.content_hash` values match the corresponding `documents.content_hash`
- [ ] Documents matched by `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` have corresponding logged truncation warnings
- [ ] Discussion documents include the parent title in content_text

**Scope:**

- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing, 32 documents per batch)
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Single embedding per document (chunking deferred to post-MVP)
  - Truncation: content_text capped at 8000 tokens (the nomic-embed-text limit is 8192)
  - Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table

**Schema Additions:**

```sql
-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,  -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL, -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT, -- for discussions: first note author
  label_names TEXT,     -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT, -- null for discussions
  content_text TEXT NOT NULL, -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL, -- SHA-256 for change detection
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL,        -- 'nomic-embed-text'
  dims INTEGER NOT NULL,      -- 768
  content_hash TEXT NOT NULL, -- copied from documents.content_hash
  created_at INTEGER NOT NULL
);
```

**Storage Rule (MVP):**

- Insert each embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid-mapping fragility

**Document Extraction Rules:**

| Source | content_text Construction |
|--------|--------------------------|
| Issue | `title + "\n\n" + description` |
| MR | `title + "\n\n" + description` |
| Discussion | Full thread with context (see below) |

**Discussion Document Format:**

```
[Issue #234: Authentication redesign] Discussion

@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session
cookies are causing issues with our mobile clients...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```

This format preserves:

- Parent context (issue/MR title and number)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability

**Truncation:** If a concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve the first and last notes for context) and log a warning.

---

### Checkpoint 4: Semantic Search

**Deliverable:** Working semantic search across all indexed content

**Automated Tests (Vitest):**

```
tests/unit/search-query.test.ts
  ✓ parses filter flags (--type, --author, --after, --label)
  ✓ validates date format for --after
  ✓ handles multiple --label flags

tests/unit/rrf-ranking.test.ts
  ✓ computes RRF score correctly
  ✓ merges results from vector and FTS retrievers
  ✓ handles documents appearing in only one retriever
  ✓ respects k=60 parameter

tests/integration/vector-search.test.ts
  ✓ returns results for semantic query
  ✓ ranks similar content higher
  ✓ returns empty for nonsense query

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ handles porter stemming (search/searching)
  ✓ returns empty for non-matching query

tests/integration/hybrid-search.test.ts
  ✓ combines vector and FTS results
  ✓ applies type filter correctly
  ✓ applies author filter correctly
  ✓ applies date filter correctly
  ✓ applies label filter correctly
  ✓ falls back to FTS when Ollama unavailable

tests/e2e/golden-queries.test.ts
  ✓ "authentication redesign" returns known auth-related items
  ✓ "database migration" returns known migration items
  ✓ [8 more domain-specific golden queries]
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
| `gi search "authentication" --json` | JSON output | Valid JSON array with schema |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |

**Golden Query Test Suite:**

Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:

```json
[
  {
    "query": "authentication redesign",
    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
    "minResults": 1,
    "maxRank": 10
  }
]
```

Each query must have at least one expected URL appear in the top 10 results.

**Data Integrity Checks:**

- [ ] `documents_fts` row count matches `documents` row count
- [ ] Search returns results for known content (not empty)
- [ ] JSON output validates against the defined schema
- [ ] All result URLs are valid GitLab URLs

**Scope:**

- Hybrid retrieval:
  - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
- Label filtering operates on `document_labels` (indexed, exact-match)
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI agent consumption
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with a warning

**Schema Additions:**

```sql
-- Full-text search for hybrid retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER
documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;
```

**FTS5 Tokenizer Notes:**

- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; a custom tokenizer is a post-MVP consideration

**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**

1. Query both the vector index (top 50) and FTS5 (top 50)
2. Merge results by document_id
3. Combine with Reciprocal Rank Fusion (RRF):
   - For each retriever list, assign ranks (1..N)
   - `rrf_score = Σ 1 / (k + rank)` with k=60 (tunable)
   - RRF is simpler than weighted sums and doesn't require score normalization
4. Apply filters (type, author, date, label)
5. Return top K

**Why RRF over Weighted Sums:**

- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in the information retrieval literature

**Graceful Degradation:**

- If Ollama is unreachable during search, automatically fall back to FTS5-only
- Display warning: "Embedding service unavailable, using lexical search only"
- The `embed` command fails with an actionable error if Ollama is down

**CLI Interface:**

```bash
# Basic semantic search
gi search "why did we choose Redis"

# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical

# Filtered search
gi search "authentication" --type=mr --after=2024-01-01

# Filter by label
gi search "performance" --label=bug --label=critical

# JSON output for programmatic use
gi search "payment processing" --json
```

**CLI Output Example:**

```
$ gi search "authentication redesign"

Found 23 results (hybrid search, 0.34s)

[1] MR !847 - Refactor auth to use JWT tokens (0.82)
    @johndoe · 2024-03-15 · group/project-one
    "...moving away from session cookies to JWT for authentication..."
    https://gitlab.example.com/group/project-one/-/merge_requests/847

[2] Issue #234 - Authentication redesign discussion (0.79)
    @janedoe · 2024-02-28 · group/project-one
    "...we need to redesign the authentication flow because..."
    https://gitlab.example.com/group/project-one/-/issues/234

[3] Discussion on Issue #234 (0.76)
    @johndoe · 2024-03-01 · group/project-one
    "I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ``` --- ### Checkpoint 5: Incremental Sync **Deliverable:** Efficient ongoing synchronization with GitLab **Automated Tests (Vitest):** ``` tests/unit/cursor-management.test.ts ✓ advances cursor after successful page commit ✓ uses tie-breaker id for identical timestamps ✓ does not advance cursor on failure ✓ resets cursor on --full flag tests/unit/change-detection.test.ts ✓ detects content_hash mismatch ✓ queues document for re-embedding on change ✓ skips re-embedding when hash unchanged tests/integration/incremental-sync.test.ts ✓ fetches only items updated after cursor ✓ refetches discussions for updated issues ✓ refetches discussions for updated MRs ✓ updates existing records (not duplicates) ✓ creates new records for new items ✓ re-embeds documents with changed content tests/integration/sync-recovery.test.ts ✓ resumes from cursor after interrupted sync ✓ marks failed run with error message ✓ handles rate limiting (429) with backoff ✓ respects Retry-After header ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check | | `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change | | `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything | | `gi sync-status` | Cursor positions, last sync time | Shows current state | | `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually | | `gi search "new content"` (after sync) | Returns new content | New content is searchable | **End-to-End Sync Verification:** 1. Note the current `sync_cursors` values 2. Create a new comment on an issue in GitLab 3. Run `gi sync` 4. 
Verify: - [ ] Issue's `updated_at` in DB matches GitLab - [ ] New discussion row exists - [ ] New note row exists - [ ] New document row exists for discussion - [ ] New embedding exists for document - [ ] `gi search "new comment text"` returns the new discussion - [ ] Cursor advanced past the updated issue **Data Integrity Checks:** - [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table - [ ] No orphaned documents (all have valid source_id) - [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows - [ ] `sync_runs` has complete audit trail **Scope:** - Delta sync based on stable cursor (updated_at + tie-breaker id) - Dependent resources sync strategy (discussions refetched when parent updates) - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) - Sync status reporting - Recommended: run via cron every 10 minutes **Correctness Rules (MVP):** 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC` 2. Cursor advances only after successful DB commit for that page 3. Dependent resources: - For each updated issue/MR, refetch ALL its discussions - Discussion documents are regenerated and re-embedded if content_hash changes 4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` 5. 
Sync run is marked 'failed' with an error message if any page fails (can resume from cursor)

**Why Dependent Resource Model:**

- GitLab Discussions API doesn't provide a global `updated_after` stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more reliable than trying to detect discussion changes directly

**CLI Commands:**

```bash
# Full sync (respects cursors, only fetches new/updated)
gi sync

# Force full re-sync (resets cursors)
gi sync --full

# Override stale 'running' run after operator review
gi sync --force

# Show sync status
gi sync-status
```

---

## Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:

| Feature | Description | Depends On |
|---------|-------------|------------|
| **File History** | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |

**Checkpoint 6 (File History) Schema Preview:**

```sql
-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);

CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on
this file" queries) -- Populated from notes.type='DiffNote' position object in GitLab API CREATE TABLE note_positions ( note_id INTEGER PRIMARY KEY REFERENCES notes(id), old_path TEXT, new_path TEXT, old_line INTEGER, new_line INTEGER, position_type TEXT -- 'text' | 'image' | etc. ); CREATE INDEX idx_note_positions_new_path ON note_positions(new_path); ``` --- ## Verification Strategy Each checkpoint includes: 1. **Automated tests** - Unit tests for data transformations, integration tests for API calls 2. **CLI smoke tests** - Manual commands with expected outputs documented 3. **Data integrity checks** - Count verification against GitLab, schema validation 4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+) --- ## Risk Mitigation | Risk | Mitigation | |------|------------| | GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync | | Embedding model quality | Start with nomic-embed-text; architecture allows model swap | | SQLite scale limits | Monitor performance; Postgres migration path documented | | Stale data | Incremental sync with change detection | | Mid-sync failures | Cursor-based resumption, sync_runs audit trail | | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | | Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | **SQLite Performance Defaults (MVP):** - Enable `PRAGMA journal_mode=WAL;` on every connection - Enable `PRAGMA foreign_keys=ON;` on every connection - Use explicit transactions for page/batch inserts - Targeted indexes on `(project_id, updated_at)` for primary resources --- ## Schema Summary | Table | Checkpoint | Purpose | |-------|------------|---------| | projects | 0 | Configured GitLab projects | | sync_runs | 0 | Audit trail of sync operations | | sync_cursors | 0 | Resumable sync state per primary resource | | raw_payloads | 0 | Decoupled raw JSON storage | | issues | 1 | Normalized 
issues | | labels | 1 | Label definitions (unique by project + name) | | issue_labels | 1 | Issue-label junction | | merge_requests | 2 | Normalized MRs | | discussions | 2 | Discussion threads (the semantic unit for conversations) | | notes | 2 | Individual comments within discussions | | mr_labels | 2 | MR-label junction | | documents | 3 | Unified searchable documents (issues, MRs, discussions) | | document_labels | 3 | Document-label junction for fast filtering | | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | | embedding_metadata | 3 | Embedding provenance + change detection | | documents_fts | 4 | Full-text search index (fts5 with porter stemmer) | | mr_files | 6 | MR file changes (deferred to File History feature) | --- ## Resolved Decisions | Question | Decision | Rationale | |----------|----------|-----------| | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread | | System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value | | MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls | | Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient | | Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates | | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | | Embedding truncation | **8000 tokens, truncate 
middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 | | Embedding batching | **32 documents per batch** | Balance between throughput and memory | | FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text | | Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching | --- ## Next Steps 1. User approves this spec 2. Generate Checkpoint 0 PRD for project setup 3. Implement Checkpoint 0 4. Human validates → proceed to Checkpoint 1 5. Repeat for each checkpoint
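As an illustration of the RRF decision above, the rank-based merge described under Checkpoint 4 can be sketched in a few lines of TypeScript. This is a minimal sketch, not part of the MVP deliverable: the function name `rrfMerge`, the string document ids, and the smoothing constant `k = 60` (the conventional default in the RRF literature) are assumptions, not spec requirements.

```typescript
// Reciprocal Rank Fusion: merge two ranked result lists by rank, not score.
// `rrfMerge`, string ids, and k = 60 are illustrative assumptions.
function rrfMerge(
  vectorRanked: string[],
  ftsRanked: string[],
  k = 60,
  topK = 10,
): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, ftsRanked]) {
    list.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Sort by fused score descending and return the top K document ids.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([docId]) => docId);
}
```

Because only ranks enter the formula, a document found by both retrievers accumulates score from each list, and BM25 scores from FTS5 never need to be normalized against vector distances from sqlite-vss.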
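Correctness rule 1 under Checkpoint 5 (pages ordered by `updated_at ASC`, tie-broken by `gitlab_id ASC`) amounts to a lexicographic comparison. A minimal sketch, assuming ISO-8601 UTC timestamp strings (which sort correctly as plain strings); the `Cursor` shape and the name `isAfterCursor` are illustrative, not part of the spec:

```typescript
// Cursor ordering sketch for Checkpoint 5, correctness rule 1.
// Assumes ISO-8601 UTC timestamps, which compare correctly as plain strings.
interface Cursor {
  updatedAt: string; // e.g. "2024-03-15T10:00:00Z"
  gitlabId: number;  // tie-breaker within identical timestamps
}

function isAfterCursor(item: Cursor, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) {
    return item.updatedAt > cursor.updatedAt;
  }
  return item.gitlabId > cursor.gitlabId;
}
```

Combined with rule 2 (advance the cursor only after the page's DB commit), every item strictly "after" the cursor under this ordering is guaranteed to be unprocessed, which is what makes resumption after an interrupted sync safe.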
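Correctness rule 4 (a document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`) can be sketched as below. SHA-256 is an assumption — the spec only requires a stable content hash — and the in-memory shapes are illustrative:

```typescript
import { createHash } from "node:crypto";

// Change-detection sketch for Checkpoint 5, correctness rule 4.
// SHA-256 is an assumed algorithm; any stable content hash would do.
function contentHash(canonicalText: string): string {
  return createHash("sha256").update(canonicalText, "utf8").digest("hex");
}

// Queue a document for re-embedding iff its current hash differs from the
// hash recorded in embedding_metadata; a missing row means "never embedded".
function needsReembedding(
  canonicalText: string,
  embeddedHash: string | null,
): boolean {
  return embeddedHash === null || contentHash(canonicalText) !== embeddedHash;
}
```

This is what lets an incremental sync skip re-embedding when GitLab bumps an item's `updated_at` without changing its searchable text (e.g. a metadata-only edit).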
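The "truncate middle" decision (preserve the first and last notes of an over-long thread within the 8000-token budget) can be sketched as follows. Real token counting depends on the embedding model's tokenizer, so this sketch operates on caller-supplied units (e.g. per-note texts); the function name and elision marker are illustrative:

```typescript
// Truncate-middle sketch for the embedding-input decision: when a thread
// exceeds the budget, keep the head and tail and elide the middle, so the
// opening statement and the most recent replies both survive.
function truncateMiddle(
  units: string[],
  limit: number,
  marker = "[...]",
): string[] {
  if (units.length <= limit) return units;
  const keep = limit - 1;            // reserve one slot for the elision marker
  const head = Math.ceil(keep / 2);  // favor the head on odd budgets
  const tail = Math.floor(keep / 2);
  return [...units.slice(0, head), marker, ...units.slice(units.length - tail)];
}
```

Truncating the middle rather than the tail matters for discussion documents in particular: the opening note usually states the problem, and the closing notes usually record the resolution.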