gitlore/SPEC.md
2026-01-20 16:43:39 -05:00

GitLab Knowledge Engine - Spec Document

Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.


Discovery Summary

Pain Points Identified

  1. Knowledge discovery - Tribal knowledge buried in old MRs/issues that nobody can find
  2. Decision traceability - Hard to find why decisions were made; context scattered across issue comments and MR discussions

Constraints

  • Hosting: Self-hosted only, no external APIs
  • Compute: Local dev machine (M-series Mac assumed)
  • GitLab Access: Self-hosted instance, PAT access, no webhooks (could request)
  • Build Method: AI agents will implement; user is TypeScript expert for review

Target Use Cases (Priority Order)

  1. MVP: Semantic Search - "Find discussions about authentication redesign"
  2. Future: File/Feature History - "What decisions were made about src/auth/login.ts?"
  3. Future: Personal Tracking - "What am I assigned to or mentioned in?"
  4. Future: Person Context - "What's @johndoe's background in this project?"

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        GitLab API                                │
│                    (Issues, MRs, Notes)                          │
└─────────────────────────────────────────────────────────────────┘
  (Commit-level indexing is explicitly post-MVP)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Data Ingestion Layer                         │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                       │
│  - Raw JSON storage for replay                                   │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Data Processing Layer                        │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - Build relationship graph (issue↔MR↔note↔file)               │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Storage Layer                               │
│  - SQLite + sqlite-vss + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                      │
│  - Vector embeddings for semantic search                         │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Query Interface                             │
│  - CLI for human testing                                         │
│  - JSON API for AI agent testing                                 │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘

Technology Choices

  • Language: TypeScript/Node.js (user expertise, good GitLab libs, AI agent friendly)
  • Database: SQLite + sqlite-vss (zero-config, portable, vector search built-in)
  • Embeddings: Ollama + nomic-embed-text (self-hosted, runs well on Apple Silicon, 768-dim vectors)
  • CLI Framework: Commander.js or oclif (standard, well-documented)

Alternative Considered: Postgres + pgvector

  • Pros: More scalable, better for production multi-user
  • Cons: Requires running Postgres, heavier setup
  • Decision: Start with SQLite for simplicity; migration path exists if needed

GitLab API Strategy

Primary Resources (Bulk Fetch)

Issues and MRs support efficient bulk fetching with incremental sync:

GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100

Dependent Resources (Per-Parent Fetch)

Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:

GET /projects/:id/issues/:iid/discussions
GET /projects/:id/merge_requests/:iid/discussions

Sync Pattern

Initial sync:

  1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
  2. For EACH issue → fetch all discussions (~3K calls)
  3. Fetch all MRs (paginated, ~60 calls)
  4. For EACH MR → fetch all discussions (~3K calls)
  5. Total: ~6,100+ API calls for initial sync

Incremental sync:

  1. Fetch issues where updated_after=cursor (bulk)
  2. For EACH updated issue → refetch ALL its discussions
  3. Fetch MRs where updated_after=cursor (bulk)
  4. For EACH updated MR → refetch ALL its discussions

Critical Assumption

Adding a comment/discussion updates the parent's updated_at timestamp. This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.

Mitigation: Periodic full re-sync (weekly) as a safety net.

Rate Limiting

  • Default: 10 requests/second with exponential backoff
  • Respect Retry-After headers on 429 responses
  • Add jitter to avoid thundering herd on retry
  • Initial sync estimate: 10-20 minutes depending on rate limits
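As a sketch of this retry policy (function names and defaults here are illustrative, not part of the spec): honor Retry-After when the server sends one, otherwise back off exponentially with full jitter, capped at 30 seconds.

```typescript
// Illustrative sketch of the rate-limit retry policy above;
// names and defaults are assumptions, not spec'd.
function backoffDelayMs(
  attempt: number,                      // 1-based retry attempt
  retryAfterSeconds?: number,           // parsed Retry-After header, if any
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random,   // injectable for tests
): number {
  // A 429 with Retry-After: the server's value wins.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  // Exponential backoff with full jitter to avoid a thundering herd.
  const exp = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return Math.floor(random() * exp);
}

async function getWithRetry(
  url: string,
  headers: Record<string, string>,
  maxAttempts = 5,
): Promise<Response> {
  for (let attempt = 1; ; attempt++) {
    const res = await fetch(url, { headers });
    if (res.status !== 429 || attempt === maxAttempts) return res;
    const header = res.headers.get('Retry-After');
    const retryAfter = header === null ? undefined : Number(header);
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, retryAfter)));
  }
}
```

With full jitter the delay is drawn uniformly from [0, 2^(attempt-1) seconds], so concurrent clients retry at different times.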

Checkpoint Structure

Each checkpoint is a testable milestone where a human can validate the system works before proceeding.

Checkpoint 0: Project Setup

Deliverable: Scaffolded project with GitLab API connection verified

Automated Tests (Vitest):

tests/unit/config.test.ts
  ✓ loads config from gi.config.json
  ✓ throws if config file missing
  ✓ throws if required fields missing (baseUrl, projects)
  ✓ validates project paths are non-empty strings

tests/unit/db.test.ts
  ✓ creates database file if not exists
  ✓ applies migrations in order
  ✓ sets WAL journal mode
  ✓ enables foreign keys

tests/integration/gitlab-client.test.ts
  ✓ authenticates with valid PAT
  ✓ returns 401 for invalid PAT
  ✓ fetches project by path
  ✓ handles rate limiting (429) with retry

Manual CLI Smoke Tests:

  • gi auth-test
    Expected: Authenticated as @username (User Name)
    Pass: Shows GitLab username and display name
  • gi doctor
    Expected: Status table with ✓/✗ for each check
    Pass: All checks pass (or Ollama shows warning if not running)
  • gi doctor --json
    Expected: JSON object with check results
    Pass: Valid JSON, success: true for required checks
  • GITLAB_TOKEN=invalid gi auth-test
    Expected: Error message
    Pass: Non-zero exit code, clear error about auth failure

Data Integrity Checks:

  • projects table contains rows for each configured project path
  • gitlab_project_id matches actual GitLab project IDs
  • raw_payloads contains project JSON for each synced project

Scope:

  • Project structure (TypeScript, ESLint, Vitest)
  • GitLab API client with PAT authentication
  • Environment and project configuration
  • Basic CLI scaffold with auth-test command
  • doctor command for environment verification
  • Projects table and initial sync

Configuration (MVP):

// gi.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434"
  }
}
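The Checkpoint 0 config tests above suggest a small loader/validator along these lines; the GiConfig type and function names are illustrative, not spec'd.

```typescript
// Illustrative config loader matching the Checkpoint 0 unit tests.
// The shape mirrors gi.config.json; names are assumptions.
import { readFileSync } from 'node:fs';

interface GiConfig {
  gitlab: { baseUrl: string; tokenEnvVar: string };
  projects: { path: string }[];
  embedding?: { provider: string; model: string; baseUrl: string };
}

function validateConfig(raw: unknown): GiConfig {
  const cfg = raw as Partial<GiConfig>;
  if (!cfg?.gitlab?.baseUrl) throw new Error('config: gitlab.baseUrl is required');
  if (!Array.isArray(cfg.projects) || cfg.projects.length === 0) {
    throw new Error('config: at least one project is required');
  }
  for (const p of cfg.projects) {
    if (typeof p.path !== 'string' || p.path.trim() === '') {
      throw new Error('config: project paths must be non-empty strings');
    }
  }
  return cfg as GiConfig;
}

function loadConfig(file = 'gi.config.json'): GiConfig {
  let text: string;
  try {
    text = readFileSync(file, 'utf8');
  } catch {
    throw new Error(`config file not found: ${file}`);
  }
  return validateConfig(JSON.parse(text));
}
```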

DB Runtime Defaults (Checkpoint 0):

  • On every connection:
    • PRAGMA journal_mode=WAL;
    • PRAGMA foreign_keys=ON;

Schema (Checkpoint 0):

-- Projects table (configured targets)
CREATE TABLE projects (
  id INTEGER PRIMARY KEY,
  gitlab_project_id INTEGER UNIQUE NOT NULL,
  path_with_namespace TEXT NOT NULL,
  default_branch TEXT,
  web_url TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);

-- Sync tracking for reliability
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL,          -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL,         -- 'ingest issues' | 'sync' | etc.
  error TEXT
);

-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  resource_type TEXT NOT NULL,   -- 'issues' | 'merge_requests'
  updated_at_cursor INTEGER,     -- last fully processed updated_at (ms epoch)
  tie_breaker_id INTEGER,        -- last fully processed gitlab_id (for stable ordering)
  PRIMARY KEY(project_id, resource_type)
);

-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL,          -- 'gitlab'
  resource_type TEXT NOT NULL,   -- 'project' | 'issue' | 'mr' | 'note'
  gitlab_id INTEGER NOT NULL,
  fetched_at INTEGER NOT NULL,
  json TEXT NOT NULL
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);

Checkpoint 1: Issue Ingestion

Deliverable: All issues from target repos stored locally

Automated Tests (Vitest):

tests/unit/issue-transformer.test.ts
  ✓ transforms GitLab issue payload to normalized schema
  ✓ extracts labels from issue payload
  ✓ handles missing optional fields gracefully

tests/unit/pagination.test.ts
  ✓ fetches all pages when multiple exist
  ✓ respects per_page parameter
  ✓ stops when empty page returned

tests/integration/issue-ingestion.test.ts
  ✓ inserts issues into database
  ✓ creates labels from issue payloads
  ✓ links issues to labels via junction table
  ✓ stores raw payload for each issue
  ✓ updates cursor after successful page commit
  ✓ resumes from cursor on subsequent runs

tests/integration/sync-runs.test.ts
  ✓ creates sync_run record on start
  ✓ marks run as succeeded on completion
  ✓ marks run as failed with error message on failure
  ✓ refuses concurrent run (single-flight)
  ✓ allows --force to override stale running status

Manual CLI Smoke Tests:

  • gi ingest --type=issues
    Expected: Progress bar, final count
    Pass: Completes without error
  • gi list issues --limit=10
    Expected: Table of 10 issues
    Pass: Shows iid, title, state, author
  • gi list issues --project=group/project-one
    Expected: Filtered list
    Pass: Only shows issues from that project
  • gi count issues
    Expected: Issues: 1,234 (example)
    Pass: Count matches GitLab UI
  • gi show issue 123
    Expected: Issue detail view
    Pass: Shows title, description, labels, URL
  • gi sync-status
    Expected: Last sync time, cursor positions
    Pass: Shows successful last run

Data Integrity Checks:

  • SELECT COUNT(*) FROM issues matches GitLab issue count for configured projects
  • Every issue has a corresponding raw_payloads row
  • Labels in issue_labels junction all exist in labels table
  • sync_cursors has entry for each (project_id, 'issues') pair
  • Re-running gi ingest --type=issues fetches 0 new items (cursor is current)

Scope:

  • Issue fetcher with pagination handling
  • Raw JSON storage in raw_payloads table
  • Normalized issue schema in SQLite
  • Labels ingestion derived from issue payload:
    • Always persist label names from labels: string[]
    • Optionally request with_labels_details=true to capture color/description when available
  • Incremental sync support (run tracking + per-project cursor)
  • Basic list/count CLI commands

Reliability/Idempotency Rules:

  • Every ingest/sync creates a sync_runs row
  • Single-flight: refuse to start if an existing run is running (unless --force)
  • Cursor advances only after successful transaction commit per page/batch
  • Ordering: updated_at ASC, tie-breaker gitlab_id ASC
  • Use explicit transactions for batch inserts
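The cursor rules above can be sketched as two small functions (types and names are illustrative): one that advances the cursor only from a committed page, and one that decides, on resume, whether an item is already covered by the cursor, using the gitlab_id tie-breaker for identical timestamps.

```typescript
// Sketch of the cursor-advance and resume rules; names are illustrative.
// Items arrive ordered by updated_at ASC, gitlab_id ASC.
interface Cursor { updatedAtMs: number | null; tieBreakerId: number | null }
interface SyncedItem { updatedAtMs: number; gitlabId: number }

// Advance only after the page's transaction has committed.
function advanceCursor(cursor: Cursor, committedPage: SyncedItem[]): Cursor {
  if (committedPage.length === 0) return cursor; // nothing committed, nothing moves
  const last = committedPage[committedPage.length - 1];
  return { updatedAtMs: last.updatedAtMs, tieBreakerId: last.gitlabId };
}

// On resume, skip items already covered by the cursor (handles timestamp ties).
function isAfterCursor(item: SyncedItem, cursor: Cursor): boolean {
  if (cursor.updatedAtMs === null) return true; // no cursor yet: take everything
  if (item.updatedAtMs !== cursor.updatedAtMs) return item.updatedAtMs > cursor.updatedAtMs;
  return cursor.tieBreakerId === null || item.gitlabId > cursor.tieBreakerId;
}
```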

Schema Preview:

CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER,                  -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);

Checkpoint 2: MR + Discussions Ingestion

Deliverable: All MRs and discussion threads (for both issues and MRs) stored locally with full thread context

Automated Tests (Vitest):

tests/unit/mr-transformer.test.ts
  ✓ transforms GitLab MR payload to normalized schema
  ✓ extracts labels from MR payload
  ✓ handles missing optional fields gracefully

tests/unit/discussion-transformer.test.ts
  ✓ transforms discussion payload to normalized schema
  ✓ extracts notes array from discussion
  ✓ sets individual_note flag correctly
  ✓ filters out system notes (system: true)
  ✓ preserves note order via position field

tests/integration/mr-ingestion.test.ts
  ✓ inserts MRs into database
  ✓ creates labels from MR payloads
  ✓ links MRs to labels via junction table
  ✓ stores raw payload for each MR

tests/integration/discussion-ingestion.test.ts
  ✓ fetches discussions for each issue
  ✓ fetches discussions for each MR
  ✓ creates discussion rows with correct parent FK
  ✓ creates note rows linked to discussions
  ✓ excludes system notes from storage
  ✓ captures note-level resolution status
  ✓ captures note type (DiscussionNote, DiffNote)

Manual CLI Smoke Tests:

  • gi ingest --type=merge_requests
    Expected: Progress bar, final count
    Pass: Completes without error
  • gi list mrs --limit=10
    Expected: Table of 10 MRs
    Pass: Shows iid, title, state, author, branch
  • gi count mrs
    Expected: Merge Requests: 567 (example)
    Pass: Count matches GitLab UI
  • gi show mr 123
    Expected: MR detail with discussions
    Pass: Shows title, description, discussion threads
  • gi show issue 456
    Expected: Issue detail with discussions
    Pass: Shows title, description, discussion threads
  • gi count discussions
    Expected: Discussions: 12,345
    Pass: Non-zero count
  • gi count notes
    Expected: Notes: 45,678
    Pass: Non-zero count, no system notes

Data Integrity Checks:

  • SELECT COUNT(*) FROM merge_requests matches GitLab MR count
  • SELECT COUNT(*) FROM discussions is non-zero for projects with comments
  • SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL = 0 (all notes linked)
  • SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = 1 returns 0 (no system notes stored)
  • Every discussion has at least one note
  • individual_note = true discussions have exactly one note
  • Discussion first_note_at <= last_note_at for all rows

Scope:

  • MR fetcher with pagination
  • Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
    • Uses GET /projects/:id/issues/:iid/discussions and GET /projects/:id/merge_requests/:iid/discussions
    • During initial ingest: fetch discussions for every issue/MR
    • During sync: refetch discussions only for issues/MRs updated since cursor
    • Filter out system notes (system: true) - these are automated messages (assignments, label changes) that add noise
  • Relationship linking (discussion → parent issue/MR, notes → discussion)
  • Extended CLI commands for MR/issue display with threads

Note: MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.

Schema Additions:

CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);

-- Discussion threads (the semantic unit for conversations)
CREATE TABLE discussions (
  id INTEGER PRIMARY KEY,
  gitlab_discussion_id TEXT UNIQUE NOT NULL,  -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER REFERENCES merge_requests(id),
  noteable_type TEXT NOT NULL,                -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL,           -- standalone comment vs threaded discussion
  first_note_at INTEGER,                      -- for ordering discussions
  last_note_at INTEGER,                       -- for "recently active" queries
  resolvable BOOLEAN,                         -- MR discussions can be resolved
  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT,                                  -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  position INTEGER,                           -- derived from array order in API response (0-indexed)
  resolvable BOOLEAN,                         -- note-level resolvability (MR code comments)
  resolved BOOLEAN,                           -- note-level resolution status
  resolved_by TEXT,                           -- username who resolved
  resolved_at INTEGER,                        -- when resolved
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_type ON notes(type);

-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);

Discussion Processing Rules:

  • System notes (system: true) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
  • Each discussion from the API becomes one row in discussions table
  • All notes within a discussion are stored with their discussion_id foreign key
  • individual_note: true discussions have exactly one note (standalone comment)
  • individual_note: false discussions have multiple notes (threaded conversation)
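The processing rules above can be sketched as a transformer over the discussions payload; the payload shape is abbreviated from the GitLab API response, and all row/field names here are illustrative.

```typescript
// Sketch of the discussion processing rules; payload shape abbreviated.
interface ApiNote {
  id: number;
  system: boolean;                // true for automated messages (excluded)
  body: string;
  author: { username: string };
  created_at: string;             // ISO timestamp
}
interface ApiDiscussion { id: string; individual_note: boolean; notes: ApiNote[] }

interface DiscussionRow {
  gitlabDiscussionId: string;
  individualNote: boolean;
  firstNoteAt: number;
  lastNoteAt: number;
  notes: { gitlabId: number; body: string; authorUsername: string; position: number }[];
}

function transformDiscussion(d: ApiDiscussion): DiscussionRow | null {
  // System notes (assignments, label changes, ...) are excluded.
  const userNotes = d.notes.filter((n) => !n.system);
  if (userNotes.length === 0) return null; // nothing but system noise: skip the thread

  const times = userNotes.map((n) => Date.parse(n.created_at));
  return {
    gitlabDiscussionId: d.id,
    individualNote: d.individual_note,
    firstNoteAt: Math.min(...times),
    lastNoteAt: Math.max(...times),
    // position is derived from array order in the API response (0-indexed)
    notes: userNotes.map((n, i) => ({
      gitlabId: n.id,
      body: n.body,
      authorUsername: n.author.username,
      position: i,
    })),
  };
}
```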

Checkpoint 3: Embedding Generation

Deliverable: Vector embeddings generated for all text content

Automated Tests (Vitest):

tests/unit/document-extractor.test.ts
  ✓ extracts issue document (title + description)
  ✓ extracts MR document (title + description)
  ✓ extracts discussion document with full thread context
  ✓ includes parent issue/MR title in discussion header
  ✓ formats notes with author and timestamp
  ✓ truncates content exceeding 8000 tokens
  ✓ preserves first and last notes when truncating middle
  ✓ computes SHA-256 content hash consistently

tests/unit/embedding-client.test.ts
  ✓ connects to Ollama API
  ✓ generates embedding for text input
  ✓ returns 768-dimension vector
  ✓ handles Ollama connection failure gracefully
  ✓ batches requests (32 documents per batch)

tests/integration/document-creation.test.ts
  ✓ creates document for each issue
  ✓ creates document for each MR
  ✓ creates document for each discussion
  ✓ populates document_labels junction table
  ✓ computes content_hash for each document

tests/integration/embedding-storage.test.ts
  ✓ stores embedding in sqlite-vss
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
  ✓ re-embeds when content_hash changes

Manual CLI Smoke Tests:

  • gi embed --all
    Expected: Progress bar with ETA
    Pass: Completes without error
  • gi embed --all (re-run)
    Expected: 0 documents to embed
    Pass: Skips already-embedded docs
  • gi stats
    Expected: Embedding coverage stats
    Pass: Shows 100% coverage
  • gi stats --json
    Expected: JSON stats object
    Pass: Valid JSON with document/embedding counts
  • gi embed --all (Ollama stopped)
    Expected: Clear error message
    Pass: Non-zero exit, actionable error

Data Integrity Checks:

  • SELECT COUNT(*) FROM documents = issues + MRs + discussions
  • SELECT COUNT(*) FROM embeddings = SELECT COUNT(*) FROM documents
  • SELECT COUNT(*) FROM embedding_metadata = SELECT COUNT(*) FROM documents
  • All embedding_metadata.content_hash matches corresponding documents.content_hash
  • Rows where LENGTH(content_text) > 32000 (~8000 tokens at roughly 4 chars/token) should each have a corresponding truncation warning in the logs
  • Discussion documents include parent title in content_text

Scope:

  • Ollama integration (nomic-embed-text model)
  • Embedding generation pipeline (batch processing, 32 documents per batch)
  • Vector storage in SQLite (sqlite-vss extension)
  • Progress tracking and resumability
  • Document extraction layer:
    • Canonical "search documents" derived from issues/MRs/discussions
    • Stable content hashing for change detection (SHA-256 of content_text)
    • Single embedding per document (chunking deferred to post-MVP)
    • Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
  • Denormalized metadata for fast filtering (author, labels, dates)
  • Fast label filtering via document_labels join table
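A minimal sketch of the issue/MR extraction and hash-based change detection described above (function names are assumptions; node:crypto provides SHA-256):

```typescript
import { createHash } from 'node:crypto';

// Canonical text for an issue or MR document: title + blank line + description.
function extractContentText(title: string, description: string | null): string {
  return description ? `${title}\n\n${description}` : title;
}

// Stable change-detection hash: SHA-256 hex digest of content_text.
function contentHash(contentText: string): string {
  return createHash('sha256').update(contentText, 'utf8').digest('hex');
}

// Re-embed only when the stored hash differs from the document's hash.
function needsReembedding(docHash: string, embeddedHash: string | null): boolean {
  return embeddedHash === null || embeddedHash !== docHash;
}
```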

Schema Additions:

-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL,    -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT,          -- for discussions: first note author
  label_names TEXT,              -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT,                    -- null for discussions
  content_text TEXT NOT NULL,    -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL,    -- SHA-256 for change detection
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL,           -- 'nomic-embed-text'
  dims INTEGER NOT NULL,         -- 768
  content_hash TEXT NOT NULL,    -- copied from documents.content_hash
  created_at INTEGER NOT NULL
);

Storage Rule (MVP):

  • Insert embedding with rowid = documents.id
  • Upsert embedding_metadata by document_id
  • This alignment simplifies joins and eliminates rowid mapping fragility

Document Extraction Rules:

  • Issue: title + "\n\n" + description
  • MR: title + "\n\n" + description
  • Discussion: full thread with context (see below)

Discussion Document Format:

[Issue #234: Authentication redesign] Discussion

@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...

This format preserves:

  • Parent context (issue/MR title and number)
  • Author attribution for each note
  • Temporal ordering of the conversation
  • Full thread semantics for decision traceability

Truncation: If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning.
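A sketch of that middle-truncation rule, assuming token counts are approximated as roughly 4 characters per token (an assumption; the real implementation would use the model tokenizer or a better proxy):

```typescript
// Drop notes closest to the center until the thread fits the budget,
// always preserving the first and last notes for context.
function truncateMiddle(
  notes: string[],
  maxTokens: number,
  countTokens: (s: string) => number = (s) => Math.ceil(s.length / 4),
): string[] {
  const tokens = notes.map(countTokens);
  let total = tokens.reduce((a, b) => a + b, 0);
  if (total <= maxTokens || notes.length <= 2) return notes;

  const kept = notes.map((body, i) => ({ body, t: tokens[i] }));
  while (total > maxTokens && kept.length > 2) {
    const mid = Math.floor(kept.length / 2); // never index 0 or the last element
    total -= kept[mid].t;
    kept.splice(mid, 1);
  }
  return kept.map((n) => n.body); // caller logs the truncation warning
}
```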


Checkpoint 4: Hybrid Semantic Search

Deliverable: Working semantic search across all indexed content

Automated Tests (Vitest):

tests/unit/search-query.test.ts
  ✓ parses filter flags (--type, --author, --after, --label)
  ✓ validates date format for --after
  ✓ handles multiple --label flags

tests/unit/rrf-ranking.test.ts
  ✓ computes RRF score correctly
  ✓ merges results from vector and FTS retrievers
  ✓ handles documents appearing in only one retriever
  ✓ respects k=60 parameter

tests/integration/vector-search.test.ts
  ✓ returns results for semantic query
  ✓ ranks similar content higher
  ✓ returns empty for nonsense query

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ handles porter stemming (search/searching)
  ✓ returns empty for non-matching query

tests/integration/hybrid-search.test.ts
  ✓ combines vector and FTS results
  ✓ applies type filter correctly
  ✓ applies author filter correctly
  ✓ applies date filter correctly
  ✓ applies label filter correctly
  ✓ falls back to FTS when Ollama unavailable

tests/e2e/golden-queries.test.ts
  ✓ "authentication redesign" returns known auth-related items
  ✓ "database migration" returns known migration items
  ✓ [8 more domain-specific golden queries]

Manual CLI Smoke Tests:

  • gi search "authentication"
    Expected: Ranked results with snippets
    Pass: Returns relevant items, shows score
  • gi search "authentication" --type=mr
    Expected: Only MR results
    Pass: No issues or discussions in output
  • gi search "authentication" --author=johndoe
    Expected: Filtered by author
    Pass: All results have @johndoe
  • gi search "authentication" --after=2024-01-01
    Expected: Date filtered
    Pass: All results after date
  • gi search "authentication" --label=bug
    Expected: Label filtered
    Pass: All results have bug label
  • gi search "redis" --mode=lexical
    Expected: FTS-only results
    Pass: Works without Ollama
  • gi search "authentication" --json
    Expected: JSON output
    Pass: Valid JSON array with schema
  • gi search "xyznonexistent123"
    Expected: No results message
    Pass: Graceful empty state
  • gi search "auth" (Ollama stopped)
    Expected: FTS results + warning
    Pass: Shows warning, still returns results

Golden Query Test Suite: Create tests/fixtures/golden-queries.json with 10 queries and expected URLs:

[
  {
    "query": "authentication redesign",
    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
    "minResults": 1,
    "maxRank": 10
  }
]

Each query must have at least one expected URL appear in top 10 results.

Data Integrity Checks:

  • documents_fts row count matches documents row count
  • Search returns results for known content (not empty)
  • JSON output validates against defined schema
  • All result URLs are valid GitLab URLs

Scope:

  • Hybrid retrieval:
    • Vector recall (sqlite-vss) + FTS lexical recall (fts5)
    • Merge + rerank results using Reciprocal Rank Fusion (RRF)
  • Result ranking and scoring (document-level)
  • Search filters: --type=issue|mr|discussion, --author=username, --after=date, --label=name
    • Label filtering operates on document_labels (indexed, exact-match)
  • Output formatting: ranked list with title, snippet, score, URL
  • JSON output mode for AI agent consumption
  • Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning

Schema Additions:

-- Full-text search for hybrid retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

FTS5 Tokenizer Notes:

  • porter enables stemming (searching "authentication" matches "authenticating", "authenticated")
  • unicode61 handles Unicode properly
  • Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer

Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:

  1. Query both vector index (top 50) and FTS5 (top 50)
  2. Merge results by document_id
  3. Combine with Reciprocal Rank Fusion (RRF):
    • For each retriever list, assign ranks (1..N)
    • rrf_score = Σ 1 / (k + rank) with k=60 (tunable)
    • RRF is simpler than weighted sums and doesn't require score normalization
  4. Apply filters (type, author, date, label)
  5. Return top K
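The RRF step above translates almost directly into code; a minimal sketch (function name and list shapes are illustrative):

```typescript
// Reciprocal Rank Fusion: each retriever contributes 1 / (k + rank)
// for every document it returned; a document found by only one
// retriever simply gets a single term.
function rrfMerge(
  rankedLists: number[][],      // each list: document ids, best first
  k = 60,
): { docId: number; score: number }[] {
  const scores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((docId, idx) => {
      const rank = idx + 1;     // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([docId, score]) => ({ docId, score }))
    .sort((a, b) => b.score - a.score);
}
```

A document ranked first by both retrievers scores 2/61, beating any document returned by only one of them, which is exactly the behavior we want from the fusion.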

Why RRF over Weighted Sums:

  • FTS5 BM25 scores and vector distances use different scales
  • Weighted sums (0.7 * vector + 0.3 * fts) require careful normalization
  • RRF operates on ranks, not scores, making it robust to scale differences
  • Well-established in information retrieval literature

Graceful Degradation:

  • If Ollama is unreachable during search, automatically fall back to FTS5-only
  • Display warning: "Embedding service unavailable, using lexical search only"
  • embed command fails with actionable error if Ollama is down
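A sketch of this search-time fallback; the retriever signatures are assumptions for illustration, and the merge step (which would apply RRF) is elided:

```typescript
// If the embedding path throws (e.g. Ollama is down), degrade to
// FTS-only results and surface a warning instead of failing.
interface SearchResult { docId: number; score: number }

async function searchWithFallback(
  query: string,
  vectorSearch: (q: string) => Promise<SearchResult[]>,
  ftsSearch: (q: string) => Promise<SearchResult[]>,
): Promise<{ results: SearchResult[]; warning?: string }> {
  let vector: SearchResult[] = [];
  let warning: string | undefined;
  try {
    vector = await vectorSearch(query);
  } catch {
    warning = 'Embedding service unavailable, using lexical search only';
  }
  const fts = await ftsSearch(query);
  // Real implementation merges via RRF; with the vector list empty
  // this degenerates to FTS-only results.
  const merged = warning ? fts : [...vector, ...fts];
  return { results: merged, warning };
}
```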

CLI Interface:

# Basic semantic search
gi search "why did we choose Redis"

# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical

# Filtered search
gi search "authentication" --type=mr --after=2024-01-01

# Filter by label
gi search "performance" --label=bug --label=critical

# JSON output for programmatic use
gi search "payment processing" --json

CLI Output Example:

$ gi search "authentication redesign"

Found 23 results (hybrid search, 0.34s)

[1] MR !847 - Refactor auth to use JWT tokens (0.82)
    @johndoe · 2024-03-15 · group/project-one
    "...moving away from session cookies to JWT for authentication..."
    https://gitlab.example.com/group/project-one/-/merge_requests/847

[2] Issue #234 - Authentication redesign discussion (0.79)
    @janedoe · 2024-02-28 · group/project-one
    "...we need to redesign the authentication flow because..."
    https://gitlab.example.com/group/project-one/-/issues/234

[3] Discussion on Issue #234 (0.76)
    @johndoe · 2024-03-01 · group/project-one
    "I think we should move to JWT-based auth because the session..."
    https://gitlab.example.com/group/project-one/-/issues/234#note_12345

Checkpoint 5: Incremental Sync

Deliverable: Efficient ongoing synchronization with GitLab

Automated Tests (Vitest):

tests/unit/cursor-management.test.ts
  ✓ advances cursor after successful page commit
  ✓ uses tie-breaker id for identical timestamps
  ✓ does not advance cursor on failure
  ✓ resets cursor on --full flag

tests/unit/change-detection.test.ts
  ✓ detects content_hash mismatch
  ✓ queues document for re-embedding on change
  ✓ skips re-embedding when hash unchanged

tests/integration/incremental-sync.test.ts
  ✓ fetches only items updated after cursor
  ✓ refetches discussions for updated issues
  ✓ refetches discussions for updated MRs
  ✓ updates existing records (not duplicates)
  ✓ creates new records for new items
  ✓ re-embeds documents with changed content

tests/integration/sync-recovery.test.ts
  ✓ resumes from cursor after interrupted sync
  ✓ marks failed run with error message
  ✓ handles rate limiting (429) with backoff
  ✓ respects Retry-After header
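The rate-limit behavior these tests exercise can be sketched as a small delay policy. The base delay, cap, and function name are assumptions, and the HTTP-date form of Retry-After is not handled in this sketch:

```typescript
// Delay policy for 429 responses: honor a numeric Retry-After header when
// present, otherwise exponential backoff doubling from 1s, capped at 60s.

function retryDelayMs(attempt: number, retryAfterHeader?: string): number {
  if (retryAfterHeader !== undefined) {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }
  const baseMs = 1_000; // first retry waits 1s
  const capMs = 60_000; // never wait longer than 60s
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

So `retryDelayMs(0)` waits 1s, `retryDelayMs(3)` waits 8s, and a `Retry-After: 30` header overrides the computed backoff.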

Manual CLI Smoke Tests:

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi sync` (no changes) | 0 issues, 0 MRs updated | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | 1 issue updated, 3 discussions refetched | Detects and syncs the change |
| `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Cursor positions, last sync time | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |

End-to-End Sync Verification:

  1. Note the current sync_cursors values
  2. Create a new comment on an issue in GitLab
  3. Run gi sync
  4. Verify:
    • Issue's updated_at in DB matches GitLab
    • New discussion row exists
    • New note row exists
    • New document row exists for discussion
    • New embedding exists for document
    • gi search "new comment text" returns the new discussion
    • Cursor advanced past the updated issue

Data Integrity Checks:

  • sync_cursors timestamp <= max updated_at in corresponding table
  • No orphaned documents (all have valid source_id)
  • embedding_metadata.content_hash = documents.content_hash for all rows
  • sync_runs has complete audit trail

Scope:

  • Delta sync based on stable cursor (updated_at + tie-breaker id)
  • Dependent resources sync strategy (discussions refetched when parent updates)
  • Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
  • Sync status reporting
  • Recommended: run via cron every 10 minutes
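The content_hash comparison that gates re-embedding can be sketched as below. The exact hash input (title plus content text, SHA-256) is an assumption; the spec only requires that documents.content_hash and embedding_metadata.content_hash be comparable:

```typescript
// Change detection: a document is queued for re-embedding iff it has never
// been embedded or its content hash differs from the recorded one.
import { createHash } from "node:crypto";

function contentHash(title: string, contentText: string): string {
  return createHash("sha256").update(`${title}\n${contentText}`, "utf8").digest("hex");
}

function needsReembedding(currentHash: string, embeddedHash: string | null): boolean {
  return embeddedHash === null || currentHash !== embeddedHash;
}

const before = contentHash("Auth redesign", "move to JWT");
const after = contentHash("Auth redesign", "move to JWT (edited comment)");
```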

Correctness Rules (MVP):

  1. Fetch pages ordered by updated_at ASC, within identical timestamps advance by gitlab_id ASC
  2. Cursor advances only after successful DB commit for that page
  3. Dependent resources:
    • For each updated issue/MR, refetch ALL its discussions
    • Discussion documents are regenerated and re-embedded if content_hash changes
  4. A document is queued for embedding iff documents.content_hash != embedding_metadata.content_hash
  5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
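Rules 1 and 2 can be sketched as a pure cursor comparison (ISO-8601 UTC timestamps compare correctly as plain strings; names are hypothetical):

```typescript
// Cursor ordering: process items in (updated_at ASC, gitlab_id ASC) order,
// and advance the cursor only past pages that have been committed.

interface Cursor {
  updatedAt: string; // ISO-8601 UTC, e.g. "2024-03-01T10:00:00Z"
  gitlabId: number;  // tie-breaker within identical timestamps
}

function isAfterCursor(item: Cursor, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.gitlabId > cursor.gitlabId;
}

// After a page commits, the cursor becomes that page's last item;
// a failed or empty page leaves the cursor untouched (rule 2).
function advanceCursor(cursor: Cursor, committedPage: Cursor[]): Cursor {
  return committedPage.length > 0 ? committedPage[committedPage.length - 1] : cursor;
}

const cursor: Cursor = { updatedAt: "2024-03-01T10:00:00Z", gitlabId: 41 };
const sameTimestampLater: Cursor = { updatedAt: "2024-03-01T10:00:00Z", gitlabId: 42 };
```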

Why Dependent Resource Model:

  • GitLab Discussions API doesn't provide a global updated_after stream
  • Discussions are listed per-issue or per-MR, not as a top-level resource
  • Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct
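The dependent-resource refetch can be sketched with hypothetical stand-ins for the real (async) GitLab client and DB layer:

```typescript
// For every updated parent (issue or MR), refetch ALL of its discussions,
// because the Discussions API offers no per-discussion updated_after filter.

interface Parent {
  iid: number;
  kind: "issue" | "merge_request";
}

interface Discussion {
  parentIid: number;
  discussionId: string;
}

function syncDependents(
  updatedParents: Parent[],
  fetchDiscussions: (parent: Parent) => Discussion[],
  upsertDiscussion: (d: Discussion) => void, // idempotent; re-embedding stays hash-gated
): number {
  let refetched = 0;
  for (const parent of updatedParents) {
    for (const discussion of fetchDiscussions(parent)) {
      upsertDiscussion(discussion);
      refetched++;
    }
  }
  return refetched;
}
```

Refetching everything per parent costs extra API calls but cannot miss an edited note, and the hash gate keeps unchanged discussions from being re-embedded.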

CLI Commands:

# Full sync (respects cursors, only fetches new/updated)
gi sync

# Force full re-sync (resets cursors)
gi sync --full

# Override stale 'running' run after operator review
gi sync --force

# Show sync status
gi sync-status

Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:

| Feature | Description | Depends On |
|---------|-------------|------------|
| File History Query | "What decisions were made about `src/auth/login.ts`?"; requires `mr_files` table (MR→file linkage), commit-level indexing | MVP complete |
| Personal Dashboard | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| Person Context | Aggregate contributions by author, expertise inference | MVP complete |
| Decision Graph | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| MCP Server | Expose search as an MCP tool for Claude Code integration | Checkpoint 4 |
| Custom Tokenizer | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |

Checkpoint 6 (File History) Schema Preview:

-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
  old_path TEXT,
  new_path TEXT,
  old_line INTEGER,
  new_line INTEGER,
  position_type TEXT                          -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);

Verification Strategy

Each checkpoint includes:

  1. Automated tests - Unit tests for data transformations, integration tests for API calls
  2. CLI smoke tests - Manual commands with expected outputs documented
  3. Data integrity checks - Count verification against GitLab, schema validation
  4. Search quality tests - Known queries with expected results (for Checkpoint 4+)

Risk Mitigation

| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | Single-flight protection (refuse to start if an existing run is still running) |

SQLite Performance Defaults (MVP):

  • Enable PRAGMA journal_mode=WAL; on every connection
  • Enable PRAGMA foreign_keys=ON; on every connection
  • Use explicit transactions for page/batch inserts
  • Targeted indexes on (project_id, updated_at) for primary resources

Schema Summary

| Table | Checkpoint | Purpose |
|-------|------------|---------|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage |
| issues | 1 | Normalized issues |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| merge_requests | 2 | Normalized MRs |
| discussions | 2 | Discussion threads (the semantic unit for conversations) |
| notes | 2 | Individual comments within discussions |
| mr_labels | 2 | MR-label junction |
| documents | 3 | Unified searchable documents (issues, MRs, discussions) |
| document_labels | 3 | Document-label junction for fast filtering |
| embeddings | 3 | Vector embeddings (sqlite-vss, rowid = document_id) |
| embedding_metadata | 3 | Embedding provenance + change detection |
| documents_fts | 4 | Full-text search index (FTS5 with porter stemmer) |
| mr_files | 6 | MR file changes (deferred to File History feature) |

Resolved Decisions

| Question | Decision | Rationale |
|----------|----------|-----------|
| Comments structure | Discussions as first-class entities | Thread context is essential for decision traceability; individual notes are meaningless without their thread |
| System notes | Exclude during ingestion | System notes (assignments, label changes) add noise without semantic value |
| MR file linkage | Deferred to post-MVP (CP6) | Only needed for the file-history feature; reduces initial API calls |
| Labels | Index as filters | Labels are well-used; document_labels table enables fast --label=X filtering |
| Labels uniqueness | By (project_id, name) | GitLab API returns labels as strings; gitlab_id isn't always available |
| Sync method | Polling only for MVP | Webhooks add complexity; polling every 10 minutes is sufficient |
| Discussions sync | Dependent resource model | Discussions API is per-parent, not global; refetch all discussions when the parent updates |
| Hybrid ranking | RRF over weighted sums | Simpler, no score normalization needed |
| Embedding rowid | rowid = documents.id | Eliminates fragile rowid mapping during upserts |
| Embedding truncation | 8000 tokens, truncate middle | Preserve first/last notes for context; nomic-embed-text limit is 8192 |
| Embedding batching | 32 documents per batch | Balance between throughput and memory |
| FTS5 tokenizer | porter unicode61 | Stemming improves recall; unicode61 handles international text |
| Ollama unavailable | Graceful degradation to FTS5 | Search still works, just without semantic matching |
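The middle-truncation decision above can be sketched as a character-level approximation; the real implementation would count tokens against the 8192-token nomic-embed-text limit, and the names here are hypothetical:

```typescript
// Keep the head and tail of an over-long discussion document and drop the
// middle, so the opening context and latest notes both survive truncation.

function truncateMiddle(
  text: string,
  maxLength: number,
  marker = "\n[...truncated...]\n",
): string {
  if (text.length <= maxLength) return text;
  const keep = maxLength - marker.length;
  const headLength = Math.ceil(keep / 2); // first notes survive here
  const tailLength = keep - headLength;   // last notes survive here
  return text.slice(0, headLength) + marker + text.slice(text.length - tailLength);
}

const thread = "a".repeat(60) + "b".repeat(60); // stand-in for a long discussion
const truncated = truncateMiddle(thread, 50);
```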

Next Steps

  1. User approves this spec
  2. Generate Checkpoint 0 PRD for project setup
  3. Implement Checkpoint 0
  4. Human validates → proceed to Checkpoint 1
  5. Repeat for each checkpoint