GitLab Knowledge Engine - Spec Document
Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
Discovery Summary
Pain Points Identified
- Knowledge discovery - Tribal knowledge buried in old MRs/issues that nobody can find
- Decision traceability - Hard to find why decisions were made; context scattered across issue comments and MR discussions
Constraints
| Constraint | Detail |
|---|---|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |
Target Use Cases (Priority Order)
- MVP: Semantic Search - "Find discussions about authentication redesign"
- Future: File/Feature History - "What decisions were made about src/auth/login.ts?"
- Future: Personal Tracking - "What am I assigned to or mentioned in?"
- Future: Person Context - "What's @johndoe's background in this project?"
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ GitLab API │
│ (Issues, MRs, Notes) │
└─────────────────────────────────────────────────────────────────┘
(Commit-level indexing explicitly post-MVP)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Ingestion Layer │
│ - Incremental sync (PAT-based polling) │
│ - Rate limiting / backoff │
│ - Raw JSON storage for replay │
│ - Dependent resource fetching (notes, MR changes) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Processing Layer │
│ - Normalize artifacts to unified schema │
│ - Extract searchable documents (canonical text + metadata) │
│ - Content hashing for change detection │
│ - Build relationship graph (issue↔MR↔note↔file) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ - SQLite + sqlite-vss + FTS5 (hybrid search) │
│ - Structured metadata in relational tables │
│ - Vector embeddings for semantic search │
│ - Full-text index for lexical search fallback │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Query Interface │
│ - CLI for human testing │
│ - JSON API for AI agent testing │
│ - Semantic search with filters (author, date, type, label) │
└─────────────────────────────────────────────────────────────────┘
Technology Choices
| Component | Recommendation | Rationale |
|---|---|---|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |
Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
- Cons: Requires running Postgres, heavier setup
- Decision: Start with SQLite for simplicity; migration path exists if needed
GitLab API Strategy
Primary Resources (Bulk Fetch)
Issues and MRs support efficient bulk fetching with incremental sync:
GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100
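A minimal sketch of how these list URLs might be assembled. The helper name buildListUrl and its option names are hypothetical; the query parameters are the ones shown above, and the /api/v4 prefix is GitLab's REST API root.

```typescript
type PrimaryResource = "issues" | "merge_requests";

interface ListUrlOptions {
  baseUrl: string; // e.g. "https://gitlab.example.com"
  projectPath: string; // e.g. "group/project-one"
  resource: PrimaryResource;
  updatedAfter?: string; // ISO 8601 timestamp; omitted on the very first sync
  page?: number;
}

// Hypothetical helper: builds the paginated list URL for one primary resource.
function buildListUrl(opts: ListUrlOptions): string {
  const params: Array<[string, string]> = [];
  if (opts.updatedAfter !== undefined) {
    params.push(["updated_after", opts.updatedAfter]);
  }
  params.push(
    ["order_by", "updated_at"],
    ["sort", "asc"],
    ["per_page", "100"],
    ["page", String(opts.page ?? 1)],
  );
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  // GitLab accepts the URL-encoded project path in place of the numeric :id.
  const id = encodeURIComponent(opts.projectPath);
  const base = opts.baseUrl.replace(/\/+$/, "");
  return `${base}/api/v4/projects/${id}/${opts.resource}?${query}`;
}
```

Sorting ascending by updated_at matters here: it lets the sync persist a cursor after each page and resume from it.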
Dependent Resources (Per-Parent Fetch)
Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
GET /projects/:id/issues/:iid/discussions
GET /projects/:id/merge_requests/:iid/discussions
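Because there is no bulk endpoint, the fetcher has to derive one request path per parent item. A small sketch, with hypothetical names (Parent, discussionsPath, planDiscussionFetches):

```typescript
type Parent = { kind: "issue" | "merge_request"; iid: number };

// Hypothetical helper: maps a parent item to its discussions endpoint path.
function discussionsPath(projectId: number, parent: Parent): string {
  const collection = parent.kind === "issue" ? "issues" : "merge_requests";
  return `/projects/${projectId}/${collection}/${parent.iid}/discussions`;
}

// One path per parent; the caller issues the requests under the rate limiter.
function planDiscussionFetches(projectId: number, parents: Parent[]): string[] {
  return parents.map((parent) => discussionsPath(projectId, parent));
}
```

Each path is a paginated list in its own right, so long threads may need more than one request per parent.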
Sync Pattern
Initial sync:
- Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
- For EACH issue → fetch all discussions (~3K calls)
- Fetch all MRs (paginated, ~60 calls)
- For EACH MR → fetch all discussions (~3K calls)
- Total: ~6,100+ API calls for initial sync
Incremental sync:
- Fetch issues where updated_after=cursor (bulk)
- For EACH updated issue → refetch ALL its discussions
- Fetch MRs where updated_after=cursor (bulk)
- For EACH updated MR → refetch ALL its discussions
Critical Assumption
Adding a comment/discussion updates the parent's updated_at timestamp. This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.
Mitigation: Periodic full re-sync (weekly) as a safety net.
Rate Limiting
- Default: 10 requests/second with exponential backoff
- Respect Retry-After headers on 429 responses
- Add jitter to avoid thundering herd on retry
- Initial sync estimate: 10-20 minutes depending on rate limits
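The retry rules above could be sketched as a single delay function. The name retryDelayMs, the 500 ms base, and the 60 s ceiling are assumptions chosen for illustration; only the seconds form of Retry-After is handled here.

```typescript
interface BackoffOptions {
  baseMs?: number; // first retry ceiling (assumed default: 500 ms)
  maxMs?: number; // cap on the computed backoff (assumed default: 60 s)
  random?: () => number; // injectable for tests; defaults to Math.random
}

// Returns the delay in ms before retry number `attempt` (0-based).
// A valid Retry-After value in seconds takes precedence over computed backoff.
function retryDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  opts: BackoffOptions = {},
): number {
  const { baseMs = 500, maxMs = 60000, random = Math.random } = opts;
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000;
    }
  }
  // Exponential growth, capped at maxMs.
  const ceiling = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling);
}
```

Injecting the random source keeps the function deterministic under test while production code uses Math.random.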
Checkpoint Structure
Each checkpoint is a testable milestone where a human can validate the system works before proceeding.
Checkpoint 0: Project Setup
Deliverable: Scaffolded project with GitLab API connection verified
Automated Tests (Vitest):
tests/unit/config.test.ts
✓ loads config from gi.config.json
✓ throws if config file missing
✓ throws if required fields missing (baseUrl, projects)
✓ validates project paths are non-empty strings
tests/unit/db.test.ts
✓ creates database file if not exists
✓ applies migrations in order
✓ sets WAL journal mode
✓ enables foreign keys
tests/integration/gitlab-client.test.ts
✓ authenticates with valid PAT
✓ returns 401 for invalid PAT
✓ fetches project by path
✓ handles rate limiting (429) with retry
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| gi auth-test | Authenticated as @username (User Name) | Shows GitLab username and display name |
| gi doctor | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
| gi doctor --json | JSON object with check results | Valid JSON, success: true for required checks |
| GITLAB_TOKEN=invalid gi auth-test | Error message | Non-zero exit code, clear error about auth failure |
Data Integrity Checks:
- projects table contains rows for each configured project path
- gitlab_project_id matches actual GitLab project IDs
- raw_payloads contains project JSON for each synced project
Scope:
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with auth-test command
- doctor command for environment verification
- Projects table and initial sync
Configuration (MVP):
// gi.config.json
{
"gitlab": {
"baseUrl": "https://gitlab.example.com",
"tokenEnvVar": "GITLAB_TOKEN"
},
"projects": [
{ "path": "group/project-one" },
{ "path": "group/project-two" }
],
"embedding": {
"provider": "ollama",
"model": "nomic-embed-text",
"baseUrl": "http://localhost:11434"
}
}
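A minimal sketch of the validation that the config tests in this checkpoint describe, applied to the already-parsed JSON. The names GiConfig and validateConfig are hypothetical, the GITLAB_TOKEN fallback for tokenEnvVar is an assumption, and the embedding section is left out of this sketch.

```typescript
interface GiConfig {
  gitlab: { baseUrl: string; tokenEnvVar: string };
  projects: Array<{ path: string }>;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical validator for the parsed contents of gi.config.json.
function validateConfig(raw: unknown): GiConfig {
  if (!isRecord(raw)) throw new Error("config must be a JSON object");

  const gitlab = raw.gitlab;
  if (!isRecord(gitlab) || typeof gitlab.baseUrl !== "string" || gitlab.baseUrl === "") {
    throw new Error("gitlab.baseUrl is required");
  }

  const projects = raw.projects;
  if (!Array.isArray(projects) || projects.length === 0) {
    throw new Error("projects must be a non-empty array");
  }
  const validated = projects.map((project: unknown, index: number) => {
    if (!isRecord(project) || typeof project.path !== "string" || project.path.trim() === "") {
      throw new Error(`projects[${index}].path must be a non-empty string`);
    }
    return { path: project.path };
  });

  // Assumption: fall back to GITLAB_TOKEN when tokenEnvVar is not given.
  const tokenEnvVar =
    typeof gitlab.tokenEnvVar === "string" && gitlab.tokenEnvVar !== ""
      ? gitlab.tokenEnvVar
      : "GITLAB_TOKEN";

  return { gitlab: { baseUrl: gitlab.baseUrl, tokenEnvVar }, projects: validated };
}
```

Taking unknown as input forces every field to be checked before use, which suits a file that is edited by hand.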
DB Runtime Defaults (Checkpoint 0):
- On every connection: PRAGMA journal_mode=WAL; PRAGMA foreign_keys=ON;
Schema (Checkpoint 0):
-- Projects table (configured targets)
CREATE TABLE projects (
id INTEGER PRIMARY KEY,
gitlab_project_id INTEGER UNIQUE NOT NULL,
path_with_namespace TEXT NOT NULL,
default_branch TEXT,
web_url TEXT,
created_at INTEGER,
updated_at INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);
-- Sync tracking for reliability
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
started_at INTEGER NOT NULL,
finished_at INTEGER,
status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc.
error TEXT
);
-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
project_id INTEGER NOT NULL REFERENCES projects(id),
resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests'
updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch)
tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering)
PRIMARY KEY(project_id, resource_type)
);
-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
id INTEGER PRIMARY KEY,
source TEXT NOT NULL, -- 'gitlab'
resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note'
gitlab_id INTEGER NOT NULL,
fetched_at INTEGER NOT NULL,
json TEXT NOT NULL
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
Checkpoint 1: Issue Ingestion
Deliverable: All issues from target repos stored locally
Automated Tests (Vitest):
tests/unit/issue-transformer.test.ts
✓ transforms GitLab issue payload to normalized schema
✓ extracts labels from issue payload
✓ handles missing optional fields gracefully
tests/unit/pagination.test.ts
✓ fetches all pages when multiple exist
✓ respects per_page parameter
✓ stops when empty page returned
tests/integration/issue-ingestion.test.ts
✓ inserts issues into database
✓ creates labels from issue payloads
✓ links issues to labels via junction table
✓ stores raw payload for each issue
✓ updates cursor after successful page commit
✓ resumes from cursor on subsequent runs
tests/integration/sync-runs.test.ts
✓ creates sync_run record on start
✓ marks run as succeeded on completion
✓ marks run as failed with error message on failure
✓ refuses concurrent run (single-flight)
✓ allows --force to override stale running status
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| gi ingest --type=issues | Progress bar, final count | Completes without error |
| gi list issues --limit=10 | Table of 10 issues | Shows iid, title, state, author |
| gi list issues --project=group/project-one | Filtered list | Only shows issues from that project |
| gi count issues | Issues: 1,234 (example) | Count matches GitLab UI |
| gi show issue 123 | Issue detail view | Shows title, description, labels, URL |
| gi sync-status | Last sync time, cursor positions | Shows successful last run |
Data Integrity Checks:
- SELECT COUNT(*) FROM issues matches GitLab issue count for configured projects
- Every issue has a corresponding raw_payloads row
- Labels in issue_labels junction all exist in labels table
- sync_cursors has entry for each (project_id, 'issues') pair
- Re-running gi ingest --type=issues fetches 0 new items (cursor is current)
Scope:
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payload:
  - Always persist label names from labels: string[]
  - Optionally request with_labels_details=true to capture color/description when available
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands
Reliability/Idempotency Rules:
- Every ingest/sync creates a sync_runs row
- Single-flight: refuse to start if an existing run is running (unless --force)
- Cursor advances only after successful transaction commit per page/batch
- Ordering: updated_at ASC, tie-breaker gitlab_id ASC
- Use explicit transactions for batch inserts
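The single-flight rule can be sketched as a pure decision over existing sync_runs rows. The names SyncRun and decideStart are hypothetical, and marking superseded runs as failed under --force is an assumption this spec does not state.

```typescript
type RunStatus = "running" | "succeeded" | "failed";

interface SyncRun {
  id: number;
  status: RunStatus;
  startedAt: number; // ms epoch
}

type StartDecision =
  | { ok: true; supersede: number[] } // ids of stale runs the caller should close out
  | { ok: false; reason: string };

// Decides whether a new ingest/sync may start given the current sync_runs rows.
function decideStart(runs: SyncRun[], force: boolean): StartDecision {
  const active = runs.filter((run) => run.status === "running");
  if (active.length === 0) {
    return { ok: true, supersede: [] };
  }
  const ids = active.map((run) => run.id);
  if (force) {
    // Assumption: --force treats every still-running row as stale.
    return { ok: true, supersede: ids };
  }
  return {
    ok: false,
    reason: `sync run ${ids.join(", ")} is still running (use --force to override)`,
  };
}
```

In a real implementation the check and the insert of the new sync_runs row would need to share one transaction, otherwise two processes could both pass the check.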
Schema Preview:
CREATE TABLE issues (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
created_at INTEGER,
updated_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER, -- optional (only if available)
project_id INTEGER NOT NULL REFERENCES projects(id),
name TEXT NOT NULL,
color TEXT,
description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);
CREATE TABLE issue_labels (
issue_id INTEGER REFERENCES issues(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
Checkpoint 2: MR + Discussions Ingestion
Deliverable: All MRs and discussion threads (for both issues and MRs) stored locally with full thread context
Automated Tests (Vitest):
tests/unit/mr-transformer.test.ts
✓ transforms GitLab MR payload to normalized schema
✓ extracts labels from MR payload
✓ handles missing optional fields gracefully
tests/unit/discussion-transformer.test.ts
✓ transforms discussion payload to normalized schema
✓ extracts notes array from discussion
✓ sets individual_note flag correctly
✓ filters out system notes (system: true)
✓ preserves note order via position field
tests/integration/mr-ingestion.test.ts
✓ inserts MRs into database
✓ creates labels from MR payloads
✓ links MRs to labels via junction table
✓ stores raw payload for each MR
tests/integration/discussion-ingestion.test.ts
✓ fetches discussions for each issue
✓ fetches discussions for each MR
✓ creates discussion rows with correct parent FK
✓ creates note rows linked to discussions
✓ excludes system notes from storage
✓ captures note-level resolution status
✓ captures note type (DiscussionNote, DiffNote)
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| gi ingest --type=merge_requests | Progress bar, final count | Completes without error |
| gi list mrs --limit=10 | Table of 10 MRs | Shows iid, title, state, author, branch |
| gi count mrs | Merge Requests: 567 (example) | Count matches GitLab UI |
| gi show mr 123 | MR detail with discussions | Shows title, description, discussion threads |
| gi show issue 456 | Issue detail with discussions | Shows title, description, discussion threads |
| gi count discussions | Discussions: 12,345 | Non-zero count |
| gi count notes | Notes: 45,678 | Non-zero count, no system notes |
Data Integrity Checks:
- SELECT COUNT(*) FROM merge_requests matches GitLab MR count
- SELECT COUNT(*) FROM discussions is non-zero for projects with comments
- SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL = 0 (all notes linked)
- SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true = 0 (no system notes)
- Every discussion has at least one note
- individual_note = true discussions have exactly one note
- Discussion first_note_at <= last_note_at for all rows
Scope:
- MR fetcher with pagination
- Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
  - Uses GET /projects/:id/issues/:iid/discussions and GET /projects/:id/merge_requests/:iid/discussions
  - During initial ingest: fetch discussions for every issue/MR
  - During sync: refetch discussions only for issues/MRs updated since cursor
  - Filter out system notes (system: true) - these are automated messages (assignments, label changes) that add noise
- Relationship linking (discussion → parent issue/MR, notes → discussion)
- Extended CLI commands for MR/issue display with threads
Note: MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.
Schema Additions:
CREATE TABLE merge_requests (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
source_branch TEXT,
target_branch TEXT,
created_at INTEGER,
updated_at INTEGER,
merged_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
-- Discussion threads (the semantic unit for conversations)
CREATE TABLE discussions (
id INTEGER PRIMARY KEY,
gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
project_id INTEGER NOT NULL REFERENCES projects(id),
issue_id INTEGER REFERENCES issues(id),
merge_request_id INTEGER REFERENCES merge_requests(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
first_note_at INTEGER, -- for ordering discussions
last_note_at INTEGER, -- for "recently active" queries
resolvable BOOLEAN, -- MR discussions can be resolved
resolved BOOLEAN,
CHECK (
(noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
(noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
)
);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);
-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
discussion_id INTEGER NOT NULL REFERENCES discussions(id),
project_id INTEGER NOT NULL REFERENCES projects(id),
type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
author_username TEXT,
body TEXT,
created_at INTEGER,
updated_at INTEGER,
position INTEGER, -- derived from array order in API response (0-indexed)
resolvable BOOLEAN, -- note-level resolvability (MR code comments)
resolved BOOLEAN, -- note-level resolution status
resolved_by TEXT, -- username who resolved
resolved_at INTEGER, -- when resolved
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_type ON notes(type);
-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
merge_request_id INTEGER REFERENCES merge_requests(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
Discussion Processing Rules:
- System notes (system: true) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
- Each discussion from the API becomes one row in discussions table
- All notes within a discussion are stored with their discussion_id foreign key
- individual_note: true discussions have exactly one note (standalone comment)
- individual_note: false discussions can contain multiple notes (threaded conversation)
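These rules could be sketched as one transformer over a discussion payload. The row types and the name transformDiscussion are hypothetical. Two behaviours are assumptions: position is the index in the raw API array before system notes are filtered, and a discussion made up only of system notes is skipped so that no empty discussion row is stored.

```typescript
interface ApiNote {
  id: number;
  type: string | null;
  body: string;
  system: boolean;
  created_at: string; // ISO 8601
  author: { username: string };
}

interface ApiDiscussion {
  id: string;
  individual_note: boolean;
  notes: ApiNote[];
}

interface NoteRow {
  gitlabId: number;
  type: string | null;
  authorUsername: string;
  body: string;
  createdAt: number; // ms epoch
  position: number;
}

interface DiscussionRow {
  gitlabDiscussionId: string;
  individualNote: boolean;
  firstNoteAt: number;
  lastNoteAt: number;
  notes: NoteRow[];
}

// Returns null when every note is a system note.
function transformDiscussion(discussion: ApiDiscussion): DiscussionRow | null {
  const notes: NoteRow[] = discussion.notes
    .map((note, index) => ({ note, index }))
    .filter(({ note }) => !note.system)
    .map(({ note, index }) => ({
      gitlabId: note.id,
      type: note.type,
      authorUsername: note.author.username,
      body: note.body,
      createdAt: Date.parse(note.created_at),
      position: index, // assumption: index in the raw array, before filtering
    }));
  if (notes.length === 0) return null;

  const times = notes.map((note) => note.createdAt);
  return {
    gitlabDiscussionId: discussion.id,
    individualNote: discussion.individual_note,
    firstNoteAt: Math.min(...times),
    lastNoteAt: Math.max(...times),
    notes,
  };
}
```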
Checkpoint 3: Embedding Generation
Deliverable: Vector embeddings generated for all text content
Automated Tests (Vitest):
tests/unit/document-extractor.test.ts
✓ extracts issue document (title + description)
✓ extracts MR document (title + description)
✓ extracts discussion document with full thread context
✓ includes parent issue/MR title in discussion header
✓ formats notes with author and timestamp
✓ truncates content exceeding 8000 tokens
✓ preserves first and last notes when truncating middle
✓ computes SHA-256 content hash consistently
tests/unit/embedding-client.test.ts
✓ connects to Ollama API
✓ generates embedding for text input
✓ returns 768-dimension vector
✓ handles Ollama connection failure gracefully
✓ batches requests (32 documents per batch)
tests/integration/document-creation.test.ts
✓ creates document for each issue
✓ creates document for each MR
✓ creates document for each discussion
✓ populates document_labels junction table
✓ computes content_hash for each document
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vss
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
✓ re-embeds when content_hash changes
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| gi embed --all | Progress bar with ETA | Completes without error |
| gi embed --all (re-run) | 0 documents to embed | Skips already-embedded docs |
| gi stats | Embedding coverage stats | Shows 100% coverage |
| gi stats --json | JSON stats object | Valid JSON with document/embedding counts |
| gi embed --all (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
Data Integrity Checks:
- SELECT COUNT(*) FROM documents = issues + MRs + discussions
- SELECT COUNT(*) FROM embeddings = SELECT COUNT(*) FROM documents
- SELECT COUNT(*) FROM embedding_metadata = SELECT COUNT(*) FROM documents
- All embedding_metadata.content_hash matches corresponding documents.content_hash
- SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000 logs truncation warnings
- Discussion documents include parent title in content_text
Scope:
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing, 32 documents per batch)
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- Document extraction layer:
- Canonical "search documents" derived from issues/MRs/discussions
- Stable content hashing for change detection (SHA-256 of content_text)
- Single embedding per document (chunking deferred to post-MVP)
- Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
- Denormalized metadata for fast filtering (author, labels, dates)
- Fast label filtering via document_labels join table
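The batch size of 32 in the scope above amounts to a simple chunking step before each embedding call. A sketch, with the hypothetical helper name toBatches:

```typescript
// Splits items into consecutive batches of at most `size` elements.
function toBatches<T>(items: readonly T[], size = 32): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size));
  }
  return batches;
}
```

Committing embeddings and embedding_metadata once per batch would give the resumability listed above: an interrupted run loses at most one batch of work.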
Schema Additions:
-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT, -- for discussions: first note author
label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER,
updated_at INTEGER,
url TEXT,
title TEXT, -- null for discussions
content_text TEXT NOT NULL, -- canonical text for embedding/snippets
content_hash TEXT NOT NULL, -- SHA-256 for change detection
UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id),
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
embedding(768)
);
-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
document_id INTEGER PRIMARY KEY REFERENCES documents(id),
model TEXT NOT NULL, -- 'nomic-embed-text'
dims INTEGER NOT NULL, -- 768
content_hash TEXT NOT NULL, -- copied from documents.content_hash
created_at INTEGER NOT NULL
);
Storage Rule (MVP):
- Insert embedding with rowid = documents.id
- Upsert embedding_metadata by document_id
- This alignment simplifies joins and eliminates rowid mapping fragility
Document Extraction Rules:
| Source | content_text Construction |
|---|---|
| Issue | title + "\n\n" + description |
| MR | title + "\n\n" + description |
| Discussion | Full thread with context (see below) |
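For the issue and MR rows of this table, content_text and its hash could be computed as in the sketch below. The function names are hypothetical, and treating a missing description as an empty string is an assumption.

```typescript
import { createHash } from "node:crypto";

interface TitledSource {
  title: string;
  description: string | null;
}

// Issue/MR rule: title + "\n\n" + description.
// Assumption: a null description contributes an empty string.
function buildContentText(source: TitledSource): string {
  return `${source.title}\n\n${source.description ?? ""}`;
}

// SHA-256 over the UTF-8 bytes of content_text, hex-encoded (64 characters).
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```

Hashing the exact string that gets embedded means any change to the extraction format also changes the hash and triggers re-embedding.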
Discussion Document Format:
[Issue #234: Authentication redesign] Discussion
@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
This format preserves:
- Parent context (issue/MR title and number)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
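A sketch of a renderer for this format. The names are hypothetical. The blank line between blocks, the UTC date, and the "!" prefix for MR numbers (borrowed from the CLI output style) are assumptions, since the example above only shows an issue thread.

```typescript
interface ThreadNote {
  authorUsername: string;
  createdAt: number; // ms epoch
  body: string;
}

interface ThreadParent {
  kind: "Issue" | "MR";
  iid: number;
  title: string;
}

// Formats a ms-epoch timestamp as YYYY-MM-DD in UTC.
function isoDate(ms: number): string {
  return new Date(ms).toISOString().slice(0, 10);
}

// Renders the parent header followed by one block per note, oldest first.
function renderDiscussion(parent: ThreadParent, notes: ThreadNote[]): string {
  const sigil = parent.kind === "Issue" ? "#" : "!";
  const header = `[${parent.kind} ${sigil}${parent.iid}: ${parent.title}] Discussion`;
  const ordered = notes.slice().sort((a, b) => a.createdAt - b.createdAt);
  const blocks = ordered.map(
    (note) => `@${note.authorUsername} (${isoDate(note.createdAt)}):\n${note.body}`,
  );
  return [header, ...blocks].join("\n\n");
}
```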
Truncation: If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning.
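One way to truncate from the middle, sketched under stated assumptions: tokens are estimated at roughly 4 characters each rather than with a real tokenizer, notes are kept alternately from the head and the tail until the budget is used up, and the omission marker text is invented for illustration.

```typescript
// Rough token estimate; assumes ~4 characters per token (not a real tokenizer).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface Truncated {
  blocks: string[];
  omitted: number; // caller logs a warning when this is > 0
}

// Keeps notes from both ends of the thread and drops the middle once the
// budget is exhausted. The first and last notes are always kept.
function truncateMiddle(blocks: string[], maxTokens = 8000): Truncated {
  const total = blocks.reduce((sum, block) => sum + estimateTokens(block), 0);
  // Threads of one or two oversized notes are returned unchanged in this sketch.
  if (total <= maxTokens || blocks.length <= 2) {
    return { blocks: blocks.slice(), omitted: 0 };
  }

  const first = blocks[0];
  const last = blocks[blocks.length - 1];
  const head: string[] = [first];
  const tail: string[] = [last];
  let used = estimateTokens(first) + estimateTokens(last);
  let lo = 1;
  let hi = blocks.length - 2;
  let takeFromHead = true;

  while (lo <= hi) {
    const candidate = takeFromHead ? blocks[lo] : blocks[hi];
    const cost = estimateTokens(candidate);
    if (used + cost > maxTokens) break;
    used += cost;
    if (takeFromHead) {
      head.push(candidate);
      lo += 1;
    } else {
      tail.unshift(candidate);
      hi -= 1;
    }
    takeFromHead = !takeFromHead;
  }

  const omitted = hi - lo + 1;
  const marker = `[... ${omitted} notes omitted ...]`; // hypothetical marker text
  return { blocks: head.concat([marker], tail), omitted };
}
```

The marker itself is not counted against the budget here, which the 192-token gap between the 8000 cap and the model's 8192 limit would absorb.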
Checkpoint 4: Semantic Search
Deliverable: Working semantic search across all indexed content
Automated Tests (Vitest):
tests/unit/search-query.test.ts
✓ parses filter flags (--type, --author, --after, --label)
✓ validates date format for --after
✓ handles multiple --label flags
tests/unit/rrf-ranking.test.ts
✓ computes RRF score correctly
✓ merges results from vector and FTS retrievers
✓ handles documents appearing in only one retriever
✓ respects k=60 parameter
tests/integration/vector-search.test.ts
✓ returns results for semantic query
✓ ranks similar content higher
✓ returns empty for nonsense query
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ handles porter stemming (search/searching)
✓ returns empty for non-matching query
tests/integration/hybrid-search.test.ts
✓ combines vector and FTS results
✓ applies type filter correctly
✓ applies author filter correctly
✓ applies date filter correctly
✓ applies label filter correctly
✓ falls back to FTS when Ollama unavailable
tests/e2e/golden-queries.test.ts
✓ "authentication redesign" returns known auth-related items
✓ "database migration" returns known migration items
✓ [8 more domain-specific golden queries]
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| gi search "authentication" | Ranked results with snippets | Returns relevant items, shows score |
| gi search "authentication" --type=mr | Only MR results | No issues or discussions in output |
| gi search "authentication" --author=johndoe | Filtered by author | All results have @johndoe |
| gi search "authentication" --after=2024-01-01 | Date filtered | All results after date |
| gi search "authentication" --label=bug | Label filtered | All results have bug label |
| gi search "redis" --mode=lexical | FTS-only results | Works without Ollama |
| gi search "authentication" --json | JSON output | Valid JSON array with schema |
| gi search "xyznonexistent123" | No results message | Graceful empty state |
| gi search "auth" (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
Golden Query Test Suite:
Create tests/fixtures/golden-queries.json with 10 queries and expected URLs:
[
{
"query": "authentication redesign",
"expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
"minResults": 1,
"maxRank": 10
}
]
Each query must have at least one expected URL appear in top 10 results.
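The pass rule could be checked with a small predicate over the ranked result URLs. The name passesGoldenQuery is hypothetical, and the sketch compares URLs by exact equality.

```typescript
interface GoldenQuery {
  query: string;
  expectedUrls: string[];
  minResults: number;
  maxRank: number;
}

// rankedUrls: result URLs in rank order (index 0 is rank 1).
function passesGoldenQuery(golden: GoldenQuery, rankedUrls: string[]): boolean {
  if (rankedUrls.length < golden.minResults) return false;
  const top = rankedUrls.slice(0, golden.maxRank);
  return golden.expectedUrls.some((url) => top.includes(url));
}
```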
Data Integrity Checks:
- documents_fts row count matches documents row count
- Search returns results for known content (not empty)
- JSON output validates against defined schema
- All result URLs are valid GitLab URLs
Scope:
- Hybrid retrieval:
- Vector recall (sqlite-vss) + FTS lexical recall (fts5)
- Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Result ranking and scoring (document-level)
- Search filters: --type=issue|mr|discussion, --author=username, --after=date, --label=name
  - Label filtering operates on document_labels (indexed, exact-match)
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI agent consumption
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
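The filter semantics in the scope above could be sketched as an in-memory predicate. The names are hypothetical, and three behaviours are assumptions the spec leaves open: --after compares against created_at, repeated --label flags must all match, and the CLI value mr has already been mapped to merge_request.

```typescript
type SourceType = "issue" | "merge_request" | "discussion";

interface SearchDoc {
  sourceType: SourceType;
  authorUsername: string | null;
  createdAt: number; // ms epoch
  labels: string[];
}

interface SearchFilters {
  type?: SourceType;
  author?: string;
  after?: number; // ms epoch, inclusive lower bound
  labels?: string[];
}

// True when the document satisfies every filter that is set.
function matchesFilters(doc: SearchDoc, filters: SearchFilters): boolean {
  if (filters.type !== undefined && doc.sourceType !== filters.type) return false;
  if (filters.author !== undefined && doc.authorUsername !== filters.author) return false;
  if (filters.after !== undefined && doc.createdAt < filters.after) return false;
  if (filters.labels !== undefined) {
    // Assumption: every requested label must be present (AND semantics).
    const wanted = filters.labels;
    if (!wanted.every((label) => doc.labels.includes(label))) return false;
  }
  return true;
}
```

In the actual query these checks would more likely be SQL predicates on documents and document_labels; the predicate form is mainly useful for unit tests.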
Schema Additions:
-- Full-text search for hybrid retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
FTS5 Tokenizer Notes:
- porter enables stemming (searching "authentication" matches "authenticating", "authenticated")
- unicode61 handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer
Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:
- Query both vector index (top 50) and FTS5 (top 50)
- Merge results by document_id
- Combine with Reciprocal Rank Fusion (RRF):
  - For each retriever list, assign ranks (1..N)
  - rrf_score = Σ 1 / (k + rank) with k=60 (tunable)
  - RRF is simpler than weighted sums and doesn't require score normalization
- Apply filters (type, author, date, label)
- Return top K
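The fusion step can be sketched in a few lines. The function name reciprocalRankFusion is hypothetical; each input list is assumed to hold unique document ids ordered best match first, and ties are broken by id only to keep the output deterministic.

```typescript
interface FusedResult {
  id: number;
  score: number;
}

// rankings: one array of document ids per retriever, best match first.
function reciprocalRankFusion(rankings: number[][], k = 60): FusedResult[] {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id - b.id);
}
```

With k=60 a document ranked first by both retrievers scores 2/61, roughly 0.033, so raw RRF scores are small; any score shown to users would need rescaling if a 0 to 1 range is wanted.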
Why RRF over Weighted Sums:
- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (0.7 * vector + 0.3 * fts) require careful normalization
- RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in information retrieval literature
Graceful Degradation:
- If Ollama is unreachable during search, automatically fall back to FTS5-only
- Display warning: "Embedding service unavailable, using lexical search only"
- embed command fails with actionable error if Ollama is down
CLI Interface:
# Basic semantic search
gi search "why did we choose Redis"
# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical
# Filtered search
gi search "authentication" --type=mr --after=2024-01-01
# Filter by label
gi search "performance" --label=bug --label=critical
# JSON output for programmatic use
gi search "payment processing" --json
CLI Output Example:
$ gi search "authentication redesign"
Found 23 results (hybrid search, 0.34s)
[1] MR !847 - Refactor auth to use JWT tokens (0.82)
@johndoe · 2024-03-15 · group/project-one
"...moving away from session cookies to JWT for authentication..."
https://gitlab.example.com/group/project-one/-/merge_requests/847
[2] Issue #234 - Authentication redesign discussion (0.79)
@janedoe · 2024-02-28 · group/project-one
"...we need to redesign the authentication flow because..."
https://gitlab.example.com/group/project-one/-/issues/234
[3] Discussion on Issue #234 (0.76)
@johndoe · 2024-03-01 · group/project-one
"I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Checkpoint 5: Incremental Sync
Deliverable: Efficient ongoing synchronization with GitLab
Automated Tests (Vitest):
tests/unit/cursor-management.test.ts
✓ advances cursor after successful page commit
✓ uses tie-breaker id for identical timestamps
✓ does not advance cursor on failure
✓ resets cursor on --full flag
tests/unit/change-detection.test.ts
✓ detects content_hash mismatch
✓ queues document for re-embedding on change
✓ skips re-embedding when hash unchanged
tests/integration/incremental-sync.test.ts
✓ fetches only items updated after cursor
✓ refetches discussions for updated issues
✓ refetches discussions for updated MRs
✓ updates existing records (not duplicates)
✓ creates new records for new items
✓ re-embeds documents with changed content
tests/integration/sync-recovery.test.ts
✓ resumes from cursor after interrupted sync
✓ marks failed run with error message
✓ handles rate limiting (429) with backoff
✓ respects Retry-After header
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| gi sync (no changes) | 0 issues, 0 MRs updated | Fast completion, no API calls beyond cursor check |
| gi sync (after GitLab change) | 1 issue updated, 3 discussions refetched | Detects and syncs the change |
| gi sync --full | Full re-sync progress | Resets cursors, fetches everything |
| gi sync-status | Cursor positions, last sync time | Shows current state |
| gi sync (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| gi search "new content" (after sync) | Returns new content | New content is searchable |
End-to-End Sync Verification:
- Note the current sync_cursors values
- Create a new comment on an issue in GitLab
- Run gi sync
- Verify:
  - Issue's updated_at in DB matches GitLab
  - New discussion row exists
  - New note row exists
  - New document row exists for discussion
  - New embedding exists for document
  - gi search "new comment text" returns the new discussion
  - Cursor advanced past the updated issue
Data Integrity Checks:
- sync_cursors timestamp <= max updated_at in corresponding table
- No orphaned documents (all have valid source_id)
- embedding_metadata.content_hash = documents.content_hash for all rows
- sync_runs has complete audit trail
Scope:
- Delta sync based on stable cursor (updated_at + tie-breaker id)
- Dependent resources sync strategy (discussions refetched when parent updates)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
- Recommended: run via cron every 10 minutes
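The re-embedding rule in the scope above reduces to a hash comparison per document. A minimal sketch, assuming simplified in-memory row shapes (`DocumentRow`, `EmbeddingMetaRow`) and a hypothetical helper `documentsToEmbed`; in the real system these rows live in SQLite. Treating a document with no metadata row as "changed" is also an assumption of this sketch.

```typescript
// Simplified, hypothetical row shapes for illustration only.
interface DocumentRow {
  id: number;
  contentHash: string;
}

interface EmbeddingMetaRow {
  documentId: number;
  contentHash: string;
}

// Returns ids of documents whose current hash differs from the hash
// recorded when they were last embedded (or that were never embedded).
function documentsToEmbed(docs: DocumentRow[], meta: EmbeddingMetaRow[]): number[] {
  const embeddedHash = new Map<number, string>();
  for (const m of meta) {
    embeddedHash.set(m.documentId, m.contentHash);
  }
  return docs
    .filter((d) => embeddedHash.get(d.id) !== d.contentHash)
    .map((d) => d.id);
}
```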
Correctness Rules (MVP):
- Fetch pages ordered by updated_at ASC, within identical timestamps advance by gitlab_id ASC
- Cursor advances only after successful DB commit for that page
- Dependent resources:
  - For each updated issue/MR, refetch ALL its discussions
  - Discussion documents are regenerated and re-embedded if content_hash changes
- A document is queued for embedding iff documents.content_hash != embedding_metadata.content_hash
- Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
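The ordering and cursor rules above can be sketched as a pair of pure functions. The type names `SyncCursor` and `SyncItem` and both function names are hypothetical; the sketch assumes timestamps are ISO 8601 strings as returned by the GitLab API and that `advanceCursor` is called only after the page's DB transaction has committed.

```typescript
// Hypothetical shapes: a cursor is the (updated_at, gitlab_id) of the last committed item.
interface SyncCursor {
  updatedAt: string; // ISO 8601 timestamp
  gitlabId: number; // tie-breaker for identical timestamps
}

interface SyncItem {
  id: number;
  updated_at: string;
}

// True when the item sorts strictly after the cursor in (updated_at ASC, id ASC) order.
function isAfterCursor(item: SyncItem, cursor: SyncCursor): boolean {
  const itemTime = Date.parse(item.updated_at);
  const cursorTime = Date.parse(cursor.updatedAt);
  if (itemTime !== cursorTime) {
    return itemTime > cursorTime;
  }
  return item.id > cursor.gitlabId;
}

// Computes the new cursor for a page that has already been committed.
// An empty page, or a page with nothing newer, leaves the cursor unchanged.
function advanceCursor(cursor: SyncCursor, committedPage: SyncItem[]): SyncCursor {
  let next = cursor;
  for (const item of committedPage) {
    if (isAfterCursor(item, next)) {
      next = { updatedAt: item.updated_at, gitlabId: item.id };
    }
  }
  return next;
}
```

Because the cursor is only replaced after a commit, a failed page leaves it where it was and the next run resumes from the same position.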
Why Dependent Resource Model:
- GitLab Discussions API doesn't provide a global updated_after stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct
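To illustrate the dependent resource model, the sketch below turns a list of updated parents into the per-parent Discussions API paths that would be refetched. The type and function names are hypothetical. The path shapes follow the GitLab REST API v4 discussions endpoints, which address issues and merge requests by project-scoped `iid`.

```typescript
// Hypothetical description of a parent that the primary sync found to be updated.
type ParentKind = "issue" | "merge_request";

interface UpdatedParent {
  kind: ParentKind;
  projectId: number;
  iid: number; // project-scoped id, as used in GitLab URLs
}

// Builds the per-parent discussions path for one updated issue or MR.
function discussionsPath(parent: UpdatedParent): string {
  const collection = parent.kind === "issue" ? "issues" : "merge_requests";
  return `/api/v4/projects/${parent.projectId}/${collection}/${parent.iid}/discussions`;
}

// One refetch per updated parent; unchanged parents are never listed here.
function discussionRefetchPlan(updated: UpdatedParent[]): string[] {
  return updated.map(discussionsPath);
}
```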
CLI Commands:
# Incremental sync (respects cursors, only fetches new/updated)
gi sync
# Force full re-sync (resets cursors)
gi sync --full
# Override stale 'running' run after operator review
gi sync --force
# Show sync status
gi sync-status
Future Work (Post-MVP)
The following features are explicitly deferred to keep MVP scope focused:
| Feature | Description | Depends On |
|---|---|---|
| File History | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| Personal Dashboard | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| Person Context | Aggregate contributions by author, expertise inference | MVP complete |
| Decision Graph | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| MCP Server | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| Custom Tokenizer | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |
Checkpoint 6 (File History) Schema Preview:
-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER REFERENCES merge_requests(id),
old_path TEXT,
new_path TEXT,
new_file BOOLEAN,
deleted_file BOOLEAN,
renamed_file BOOLEAN,
UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);
-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
note_id INTEGER PRIMARY KEY REFERENCES notes(id),
old_path TEXT,
new_path TEXT,
old_line INTEGER,
new_line INTEGER,
position_type TEXT -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
Verification Strategy
Each checkpoint includes:
- Automated tests - Unit tests for data transformations, integration tests for API calls
- CLI smoke tests - Manual commands with expected outputs documented
- Data integrity checks - Count verification against GitLab, schema validation
- Search quality tests - Known queries with expected results (for Checkpoint 4+)
Risk Mitigation
| Risk | Mitigation |
|---|---|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | Single-flight protection (refuse to start if another run is still marked 'running') |
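For the hybrid retrieval row above, Reciprocal Rank Fusion combines the vector and FTS5 result lists using ranks only, so no score normalization is needed. A minimal sketch, assuming each retriever returns document ids in best-first order; the function name `rrfFuse` is hypothetical, and k = 60 is the commonly used default, not a value this spec fixes.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank(d)),
// where rank is 1-based and documents missing from a list contribute nothing.
function rrfFuse(rankings: number[][], k: number = 60): Array<{ id: number; score: number }> {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    // Highest fused score first; ties broken by id for a stable order.
    .sort((a, b) => b.score - a.score || a.id - b.id);
}
```

A document that appears in both lists accumulates two terms, so it tends to outrank one that appears high in only a single list.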
SQLite Performance Defaults (MVP):
- Enable PRAGMA journal_mode=WAL; on every connection
- Enable PRAGMA foreign_keys=ON; on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on (project_id, updated_at) for primary resources
Schema Summary
| Table | Checkpoint | Purpose |
|---|---|---|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage |
| issues | 1 | Normalized issues |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| merge_requests | 2 | Normalized MRs |
| discussions | 2 | Discussion threads (the semantic unit for conversations) |
| notes | 2 | Individual comments within discussions |
| mr_labels | 2 | MR-label junction |
| documents | 3 | Unified searchable documents (issues, MRs, discussions) |
| document_labels | 3 | Document-label junction for fast filtering |
| embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) |
| embedding_metadata | 3 | Embedding provenance + change detection |
| documents_fts | 4 | Full-text search index (fts5 with porter stemmer) |
| mr_files | 6 | MR file changes (deferred to File History feature) |
Resolved Decisions
| Question | Decision | Rationale |
|---|---|---|
| Comments structure | Discussions as first-class entities | Thread context is essential for decision traceability; individual notes are meaningless without their thread |
| System notes | Exclude during ingestion | System notes (assignments, label changes) add noise without semantic value |
| MR file linkage | Deferred to post-MVP (CP6) | Only needed for file-history feature; reduces initial API calls |
| Labels | Index as filters | Labels are well-used; document_labels table enables fast --label=X filtering |
| Labels uniqueness | By (project_id, name) | GitLab API returns labels as strings; gitlab_id isn't always available |
| Sync method | Polling only for MVP | Webhooks add complexity; polling every 10min is sufficient |
| Discussions sync | Dependent resource model | Discussions API is per-parent, not global; refetch all discussions when parent updates |
| Hybrid ranking | RRF over weighted sums | Simpler, no score normalization needed |
| Embedding rowid | rowid = documents.id | Eliminates fragile rowid mapping during upserts |
| Embedding truncation | 8000 tokens, truncate middle | Preserve first/last notes for context; nomic-embed-text limit is 8192 |
| Embedding batching | 32 documents per batch | Balance between throughput and memory |
| FTS5 tokenizer | porter unicode61 | Stemming improves recall; unicode61 handles international text |
| Ollama unavailable | Graceful degradation to FTS5 | Search still works, just without semantic matching |
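The "truncate middle" decision above can be sketched as follows. This assumes the text has already been split into tokens by some tokenizer (not shown) and uses a hypothetical helper `truncateMiddle`. A real implementation might prefer to cut at note boundaries so that the first and last notes survive intact.

```typescript
// Keeps the beginning and end of a token sequence and drops the middle
// so that the result never exceeds maxTokens.
function truncateMiddle<T>(tokens: T[], maxTokens: number): T[] {
  if (tokens.length <= maxTokens) {
    return tokens.slice();
  }
  // Split the budget between head and tail; the head gets the extra token if odd.
  const headCount = Math.ceil(maxTokens / 2);
  const tailCount = maxTokens - headCount;
  const head = tokens.slice(0, headCount);
  const tail = tailCount > 0 ? tokens.slice(tokens.length - tailCount) : [];
  return head.concat(tail);
}
```

With the limits from the table, the call would be `truncateMiddle(tokens, 8000)`, leaving headroom under the model's 8192-token limit.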
Next Steps
- User approves this spec
- Generate Checkpoint 0 PRD for project setup
- Implement Checkpoint 0
- Human validates → proceed to Checkpoint 1
- Repeat for each checkpoint