teernisse 7702d2a493 initial

2026-01-20 13:11:40 -05:00

GitLab Knowledge Engine - Spec Document

Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP.

Discovery Summary

Pain Points Identified

Knowledge discovery - Tribal knowledge buried in old MRs/issues that nobody can find
Decision traceability - Hard to find why decisions were made; context scattered across issue comments and MR discussions

Constraints

Constraint	Detail
Hosting	Self-hosted only, no external APIs
Compute	Local dev machine (M-series Mac assumed)
GitLab Access	Self-hosted instance, PAT access, no webhooks (could request)
Build Method	AI agents will implement; user is TypeScript expert for review

Target Use Cases (Priority Order)

MVP: Semantic Search - "Find discussions about authentication redesign"
Future: File/Feature History - "What decisions were made about src/auth/login.ts?"
Future: Personal Tracking - "What am I assigned to or mentioned in?"
Future: Person Context - "What's @johndoe's background in this project?"

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        GitLab API                                │
│                    (Issues, MRs, Notes)                          │
└─────────────────────────────────────────────────────────────────┘
  (Commit-level indexing explicitly post-MVP)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Data Ingestion Layer                         │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                       │
│  - Raw JSON storage for replay                                   │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Data Processing Layer                        │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - Build relationship graph (issue↔MR↔note↔file)               │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Storage Layer                               │
│  - SQLite + sqlite-vss + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                      │
│  - Vector embeddings for semantic search                         │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Query Interface                             │
│  - CLI for human testing                                         │
│  - JSON API for AI agent testing                                 │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘

Technology Choices

Component	Recommendation	Rationale
Language	TypeScript/Node.js	User expertise, good GitLab libs, AI agent friendly
Database	SQLite + sqlite-vss	Zero-config, portable, vector search via a loadable extension
Embeddings	Ollama + nomic-embed-text	Self-hosted, runs well on Apple Silicon, 768-dim vectors
CLI Framework	Commander.js or oclif	Standard, well-documented

Alternative Considered: Postgres + pgvector

Pros: More scalable, better for production multi-user
Cons: Requires running Postgres, heavier setup
Decision: Start with SQLite for simplicity; migration path exists if needed

Checkpoint Structure

Each checkpoint is a testable milestone where a human can validate the system works before proceeding.

Checkpoint 0: Project Setup

Deliverable: Scaffolded project with GitLab API connection verified

Tests:

Run gitlab-engine auth-test → returns authenticated user info
Run gitlab-engine doctor → verifies:
- Can reach GitLab baseUrl
- PAT is present and can read configured projects
- SQLite opens DB and migrations apply
- Ollama reachable OR embedding disabled with clear warning

Scope:

Project structure (TypeScript, ESLint, Vitest)
GitLab API client with PAT authentication
Environment and project configuration
Basic CLI scaffold with auth-test command
doctor command for environment verification
Projects table and initial sync

Configuration (MVP):

// gitlab-engine.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434"
  }
}
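
As an illustration of how the configuration above might be consumed, here is a minimal sketch. The `EngineConfig` type and `loadConfig` function are hypothetical names, not part of the spec; the sketch assumes the file on disk is plain JSON (the filename comment in the listing is not part of the file) and that the PAT is read from the environment variable named by `tokenEnvVar`.

```typescript
// Hypothetical shape mirroring gitlab-engine.config.json.
interface EngineConfig {
  gitlab: { baseUrl: string; tokenEnvVar: string };
  projects: { path: string }[];
  embedding: { provider: string; model: string; baseUrl: string };
}

// Parses the config text and resolves the PAT from the named env var.
// Throws a specific message so a command like `doctor` can report what is missing.
function loadConfig(
  jsonText: string,
  env: Record<string, string | undefined>,
): { config: EngineConfig; token: string } {
  const config = JSON.parse(jsonText) as EngineConfig;
  if (!config.gitlab?.baseUrl) throw new Error("gitlab.baseUrl is required");
  if (!config.gitlab?.tokenEnvVar) throw new Error("gitlab.tokenEnvVar is required");
  if (!Array.isArray(config.projects) || config.projects.length === 0) {
    throw new Error("at least one project path is required");
  }
  const token = env[config.gitlab.tokenEnvVar];
  if (!token) {
    throw new Error(`environment variable ${config.gitlab.tokenEnvVar} is not set`);
  }
  return { config, token };
}
```

Passing `env` explicitly (rather than reading `process.env` inside the function) keeps the loader easy to unit test.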

DB Runtime Defaults (Checkpoint 0):

On every connection:
- PRAGMA journal_mode=WAL;
- PRAGMA foreign_keys=ON;

Schema (Checkpoint 0):

-- Projects table (configured targets)
CREATE TABLE projects (
  id INTEGER PRIMARY KEY,
  gitlab_project_id INTEGER UNIQUE NOT NULL,
  path_with_namespace TEXT NOT NULL,
  default_branch TEXT,
  web_url TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);

-- Sync tracking for reliability
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL,          -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL,         -- 'ingest issues' | 'sync' | etc.
  error TEXT
);

-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  resource_type TEXT NOT NULL,   -- 'issues' | 'merge_requests'
  updated_at_cursor INTEGER,     -- last fully processed updated_at (ms epoch)
  tie_breaker_id INTEGER,        -- last fully processed gitlab_id (for stable ordering)
  PRIMARY KEY(project_id, resource_type)
);

-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL,          -- 'gitlab'
  resource_type TEXT NOT NULL,   -- 'project' | 'issue' | 'mr' | 'note'
  gitlab_id INTEGER NOT NULL,
  fetched_at INTEGER NOT NULL,
  json TEXT NOT NULL
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);

Checkpoint 1: Issue Ingestion

Deliverable: All issues from target repos stored locally

Test: Run gitlab-engine ingest --type=issues → count matches GitLab; run gitlab-engine list issues --limit=10 → displays issues correctly

Scope:

Issue fetcher with pagination handling
Raw JSON storage in raw_payloads table
Normalized issue schema in SQLite
Labels ingestion derived from issue payload:
- Always persist label names from labels: string[]
- Optionally request with_labels_details=true to capture color/description when available
Incremental sync support (run tracking + per-project cursor)
Basic list/count CLI commands
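
The label rule above means the ingester has to accept two payload shapes: plain names, or objects when with_labels_details=true is requested. A minimal sketch of that normalization follows; `ApiLabel`, `LabelRow`, and `normalizeLabels` are hypothetical names, and the `id` field on detailed labels is treated as optional because the spec notes it is not always available.

```typescript
// Either a bare label name or a detailed label object.
type ApiLabel =
  | string
  | { id?: number; name: string; color?: string; description?: string | null };

// Hypothetical row shape matching the labels table columns.
interface LabelRow {
  gitlabId: number | null;
  name: string;
  color: string | null;
  description: string | null;
}

// Converts either payload shape into rows keyed by name.
// Later duplicates of the same name overwrite earlier ones.
function normalizeLabels(labels: ApiLabel[]): LabelRow[] {
  const byName = new Map<string, LabelRow>();
  for (const label of labels) {
    const row: LabelRow =
      typeof label === "string"
        ? { gitlabId: null, name: label, color: null, description: null }
        : {
            gitlabId: label.id ?? null,
            name: label.name,
            color: label.color ?? null,
            description: label.description ?? null,
          };
    byName.set(row.name, row);
  }
  return Array.from(byName.values());
}
```

Keying by name matches the (project_id, name) uniqueness rule, so the same rows can be upserted regardless of which payload shape was fetched.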

Reliability/Idempotency Rules:

Every ingest/sync creates a sync_runs row
Single-flight: refuse to start if an existing run is running (unless --force)
Cursor advances only after successful transaction commit per page/batch
Ordering: updated_at ASC, tie-breaker gitlab_id ASC
Use explicit transactions for batch inserts
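
The ordering and cursor rules above can be sketched as a pure function that decides which items on a page still need processing and which cursor to persist once the page's transaction commits. This is only a sketch: `Cursor`, `ApiItem`, and `planPage` are hypothetical names, and it assumes GitLab timestamps arrive as ISO 8601 strings that are converted to ms epoch.

```typescript
interface Cursor {
  updatedAt: number; // ms epoch, as stored in sync_cursors.updated_at_cursor
  gitlabId: number;  // tie_breaker_id
}

// Minimal view of an issue/MR payload (assumed fields).
interface ApiItem {
  id: number;
  updated_at: string; // ISO 8601
}

function toCursor(item: ApiItem): Cursor {
  return { updatedAt: Date.parse(item.updated_at), gitlabId: item.id };
}

// True when `a` sorts strictly after `b` under (updated_at ASC, gitlab_id ASC).
function isAfter(a: Cursor, b: Cursor | null): boolean {
  if (b === null) return true;
  return (
    a.updatedAt > b.updatedAt ||
    (a.updatedAt === b.updatedAt && a.gitlabId > b.gitlabId)
  );
}

// Returns the unprocessed items in stable order, plus the cursor to persist
// only after the batch has been committed.
function planPage(
  page: ApiItem[],
  cursor: Cursor | null,
): { toProcess: ApiItem[]; nextCursor: Cursor | null } {
  const toProcess = page
    .filter((item) => isAfter(toCursor(item), cursor))
    .sort((x, y) => {
      const a = toCursor(x);
      const b = toCursor(y);
      return a.updatedAt - b.updatedAt || a.gitlabId - b.gitlabId;
    });
  const last = toProcess[toProcess.length - 1];
  return { toProcess, nextCursor: last ? toCursor(last) : cursor };
}
```

Because items at or before the cursor are filtered out, re-running a page after a crash does not reprocess rows that were already committed.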

Schema Preview:

CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER,                  -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);

Checkpoint 2: MR + Comments + File Links Ingestion

Deliverable: All MRs, discussion threads, and file-change links stored locally

Test: Run gitlab-engine ingest --type=merge_requests → count matches; run gitlab-engine show mr 1234 → displays MR with comments and files changed

Scope:

MR fetcher with pagination
Notes fetcher (issue notes + MR notes) as a dependent resource:
- During initial ingest: fetch notes for every issue/MR
- During sync: refetch notes only for issues/MRs updated since cursor
MR changes/diffs fetcher as a dependent resource:
- During initial ingest: fetch changes for every MR
- During sync: refetch changes only for MRs updated since cursor
Relationship linking (note → parent issue/MR via foreign keys, MR → files)
Extended CLI commands for MR display

Schema Additions:

CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);

-- Notes with explicit parent foreign keys for referential integrity
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER REFERENCES merge_requests(id),
  noteable_type TEXT NOT NULL,      -- 'Issue' | 'MergeRequest'
  noteable_iid INTEGER NOT NULL,    -- parent IID (from API path)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  system BOOLEAN,
  raw_payload_id INTEGER REFERENCES raw_payloads(id),
  -- Exactly one parent FK must be set
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE INDEX idx_notes_issue ON notes(issue_id);
CREATE INDEX idx_notes_mr ON notes(merge_request_id);
CREATE INDEX idx_notes_author ON notes(author_username);

-- File linkage for "what MRs touched this file?" queries (with rename support)
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
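
To illustrate why mr_files stores both old_path and new_path, here is a rough in-memory sketch of a "what MRs touched this file?" lookup that follows renames backwards. `MrFileRow`, `pathAliases`, and `mergeRequestsTouchingPath` are hypothetical names; a real implementation would query the indexed table instead. The sketch ignores time ordering, so it can over-match if a path was reused for an unrelated file.

```typescript
// Hypothetical in-memory view of mr_files rows.
interface MrFileRow {
  mergeRequestId: number;
  oldPath: string | null;
  newPath: string | null;
  renamedFile: boolean;
}

// Collects earlier names of `path` by walking rename rows to a fixpoint.
function pathAliases(rows: MrFileRow[], path: string): Set<string> {
  const aliases = new Set<string>([path]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const row of rows) {
      if (
        row.renamedFile &&
        row.oldPath !== null &&
        row.newPath !== null &&
        aliases.has(row.newPath) &&
        !aliases.has(row.oldPath)
      ) {
        aliases.add(row.oldPath);
        grew = true;
      }
    }
  }
  return aliases;
}

// Returns the ids of MRs that touched the path under any of its earlier names.
function mergeRequestsTouchingPath(rows: MrFileRow[], path: string): number[] {
  const aliases = pathAliases(rows, path);
  const ids = new Set<number>();
  for (const row of rows) {
    const hitOld = row.oldPath !== null && aliases.has(row.oldPath);
    const hitNew = row.newPath !== null && aliases.has(row.newPath);
    if (hitOld || hitNew) ids.add(row.mergeRequestId);
  }
  return Array.from(ids).sort((a, b) => a - b);
}
```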

Checkpoint 3: Embedding Generation

Deliverable: Vector embeddings generated for all text content

Test: Run gitlab-engine embed --all → progress indicator; run gitlab-engine stats → shows embedding coverage percentage

Scope:

Ollama integration (nomic-embed-text model)
Embedding generation pipeline (batch processing)
Vector storage in SQLite (sqlite-vss extension)
Progress tracking and resumability
Document extraction layer:
- Canonical "search documents" derived from issues/MRs/notes
- Stable content hashing for change detection (SHA-256 of content_text)
- Single embedding per document (chunking deferred to post-MVP)
Denormalized metadata for fast filtering (author, labels, dates)
Fast label filtering via document_labels join table

Schema Additions:

-- Unified searchable documents (derived from issues/MRs/notes)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'note'
  source_id INTEGER NOT NULL,    -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT,
  label_names TEXT,              -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT,                    -- null for notes
  content_text TEXT NOT NULL,    -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL,    -- SHA-256 for change detection
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL,           -- 'nomic-embed-text'
  dims INTEGER NOT NULL,         -- 768
  content_hash TEXT NOT NULL,    -- copied from documents.content_hash
  created_at INTEGER NOT NULL
);

Storage Rule (MVP):

Insert embedding with rowid = documents.id
Upsert embedding_metadata by document_id
This alignment simplifies joins and eliminates rowid mapping fragility

Document Extraction Rules:

Issue → title + "\n\n" + description
MR → title + "\n\n" + description
Note → body (skip system notes unless they contain meaningful content)
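
The extraction rules and the SHA-256 content hash can be sketched together as follows. `Source`, `extractContentText`, and `contentHash` are hypothetical names. The spec leaves "meaningful content" for system notes undefined, so this sketch makes the simplifying assumption that all system notes are skipped.

```typescript
import { createHash } from "node:crypto";

type Source =
  | { type: "issue" | "merge_request"; title: string | null; description: string | null }
  | { type: "note"; body: string; system: boolean };

// Builds the canonical text for a document, or null when the source is skipped.
function extractContentText(source: Source): string | null {
  if (source.type === "note") {
    // Assumption: every system note is skipped in this sketch.
    return source.system ? null : source.body;
  }
  // Issue and MR share the same rule: title + blank line + description.
  return `${source.title ?? ""}\n\n${source.description ?? ""}`;
}

// Hex-encoded SHA-256 of the canonical text, used for change detection.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```

Hashing the canonical text rather than the raw payload means metadata-only changes (for example a new updated_at) do not trigger re-embedding.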

Checkpoint 4: Semantic Search

Deliverable: Working semantic search across all indexed content

Tests:

Run gitlab-engine search "authentication redesign" → returns ranked results with snippets
Golden queries: curated list of 10 queries with expected result containment (e.g., "at least one of these 3 known URLs appears in top 10")
gitlab-engine search "..." --json validates against JSON schema (stable fields present)

Scope:

Hybrid retrieval:
- Vector recall (sqlite-vss) + FTS lexical recall (fts5)
- Merge + rerank results using Reciprocal Rank Fusion (RRF)
Result ranking and scoring (document-level)
Search filters: --type=issue|mr|note, --author=username, --after=date, --label=name
- Label filtering operates on document_labels (indexed, exact-match)
Output formatting: ranked list with title, snippet, score, URL
JSON output mode for AI agent consumption
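
A sketch of how the search filters might be applied to a document's metadata is shown below. `DocumentMeta`, `SearchFilters`, and `matchesFilters` are hypothetical names. Several behaviors are assumptions the spec does not settle: repeated --label flags are combined with AND, --after is compared (inclusively) against created_at, and the CLI value `mr` is mapped to the `merge_request` source type used in the documents table.

```typescript
type SourceType = "issue" | "merge_request" | "note";
type CliType = "issue" | "mr" | "note";

interface DocumentMeta {
  sourceType: SourceType;
  authorUsername: string | null;
  createdAt: number; // ms epoch
  labelNames: string[];
}

interface SearchFilters {
  type?: CliType;
  author?: string;
  after?: string;    // date string such as "2024-01-01"
  labels?: string[]; // one entry per --label flag
}

// The CLI uses "mr" while documents.source_type uses "merge_request".
const CLI_TYPE_TO_SOURCE: Record<CliType, SourceType> = {
  issue: "issue",
  mr: "merge_request",
  note: "note",
};

function matchesFilters(doc: DocumentMeta, filters: SearchFilters): boolean {
  if (filters.type && doc.sourceType !== CLI_TYPE_TO_SOURCE[filters.type]) return false;
  if (filters.author && doc.authorUsername !== filters.author) return false;
  if (filters.after) {
    const afterMs = Date.parse(filters.after);
    if (Number.isNaN(afterMs)) throw new Error(`invalid --after date: ${filters.after}`);
    if (doc.createdAt < afterMs) return false;
  }
  // Assumption: every requested label must be present (AND semantics).
  if (filters.labels && !filters.labels.every((name) => doc.labelNames.includes(name))) {
    return false;
  }
  return true;
}
```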

Schema Additions:

-- Full-text search for hybrid retrieval
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:

Query both vector index (top 50) and FTS5 (top 50)
Merge results by document_id
Combine with Reciprocal Rank Fusion (RRF):
- For each retriever list, assign ranks (1..N)
- rrf_score = Σ 1 / (k + rank) with k=60 (tunable)
- RRF is simpler than weighted sums and doesn't require score normalization
Apply filters (type, author, date, label)
Return top K
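
The fusion step above can be sketched in a few lines. `reciprocalRankFusion` is a hypothetical name; the sketch assumes each retriever returns document ids ordered best-first with no duplicates within a list, and it breaks score ties by document id so the output is deterministic.

```typescript
interface FusedResult {
  documentId: number;
  score: number;
}

// Combines ranked id lists with Reciprocal Rank Fusion:
// score(d) = sum over lists of 1 / (k + rank of d in that list), ranks start at 1.
function reciprocalRankFusion(rankedLists: number[][], k = 60): FusedResult[] {
  const scores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((documentId, index) => {
      const rank = index + 1;
      scores.set(documentId, (scores.get(documentId) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([documentId, score]) => ({ documentId, score }))
    .sort((a, b) => b.score - a.score || a.documentId - b.documentId);
}

// Example: vector recall returns [1, 2, 3], FTS5 returns [3, 1, 4].
const fused = reciprocalRankFusion([[1, 2, 3], [3, 1, 4]]);
console.log(fused.map((r) => r.documentId)); // [1, 3, 2, 4]
```

Document 1 wins because it ranks first and second (1/61 + 1/62), narrowly ahead of document 3 (1/63 + 1/61); documents found by only one retriever trail behind.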

Why RRF over Weighted Sums:

FTS5 BM25 scores and vector distances use different scales
Weighted sums (0.7 * vector + 0.3 * fts) require careful normalization
RRF operates on ranks, not scores, making it robust to scale differences
Well-established in information retrieval literature

CLI Interface:

# Basic semantic search
gitlab-engine search "why did we choose Redis"

# Pure FTS search (fallback if embeddings unavailable)
gitlab-engine search "redis" --mode=lexical

# Filtered search
gitlab-engine search "authentication" --type=mr --after=2024-01-01

# Filter by label
gitlab-engine search "performance" --label=bug --label=critical

# JSON output for programmatic use
gitlab-engine search "payment processing" --json

Checkpoint 5: Incremental Sync

Deliverable: Efficient ongoing synchronization with GitLab

Test: Make a change in GitLab; run gitlab-engine sync → only fetches changed items; verify change appears in search

Scope:

Delta sync based on stable cursor (updated_at + tie-breaker id)
Dependent resources sync strategy (notes, MR changes)
Webhook handler (optional, if webhook access granted)
Re-embedding based on content_hash change (no embedding_metadata row yet, or documents.content_hash != embedding_metadata.content_hash)
Sync status reporting

Correctness Rules (MVP):

Fetch pages ordered by updated_at ASC, within identical timestamps advance by gitlab_id ASC
Cursor advances only after successful DB commit for that page
Dependent resources:
- For each updated issue/MR, refetch its notes (sorted by updated_at)
- For each updated MR, refetch its file changes
A document is queued for embedding iff it has no embedding_metadata row or documents.content_hash != embedding_metadata.content_hash
Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
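
The re-embedding rule can be sketched as a comparison between current document hashes and the hashes recorded at embedding time. `DocumentHash` and `documentsNeedingEmbedding` are hypothetical names, and the in-memory `Map` stands in for the embedding_metadata table. The sketch treats a document with no recorded hash as needing an embedding.

```typescript
interface DocumentHash {
  documentId: number;
  contentHash: string; // documents.content_hash
}

// `embedded` maps document_id -> embedding_metadata.content_hash.
// Returns ids that were never embedded or whose content changed since.
function documentsNeedingEmbedding(
  documents: DocumentHash[],
  embedded: Map<number, string>,
): number[] {
  return documents
    .filter((doc) => embedded.get(doc.documentId) !== doc.contentHash)
    .map((doc) => doc.documentId);
}
```

If the same check is written in SQL with a LEFT JOIN, a bare `!=` comparison evaluates to NULL for never-embedded documents and silently excludes them, so the query also needs an explicit `IS NULL` branch.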

Why Dependent Resource Model:

GitLab Notes API doesn't provide a clean global updated_after stream
Notes are listed per-issue or per-MR, not as a top-level resource
Treating notes as dependent resources (refetch when parent updates) is simpler and more correct
Same applies to MR changes/diffs

CLI Commands:

# Incremental sync (respects cursors, only fetches new/updated)
gitlab-engine sync

# Force full re-sync (resets cursors)
gitlab-engine sync --full

# Override stale 'running' run after operator review
gitlab-engine sync --force

# Show sync status
gitlab-engine sync-status

Future Checkpoints (Post-MVP)

Checkpoint 6: File/Feature History View

Map commits to MRs to discussions
Query: "Show decision history for src/auth/login.ts"
Ship gitlab-engine file-history <path> as a first-class feature here
This command is deferred from MVP to sharpen checkpoint focus

Checkpoint 7: Personal Dashboard

Filter by assigned/mentioned
Integrate with existing gitlab-inbox tool

Checkpoint 8: Person Context

Aggregate contributions by author
Expertise inference from activity

Checkpoint 9: Decision Graph

Extract decisions from discussions (LLM-assisted)
Visualize decision relationships

Verification Strategy

Each checkpoint includes:

Automated tests - Unit tests for data transformations, integration tests for API calls
CLI smoke tests - Manual commands with expected outputs documented
Data integrity checks - Count verification against GitLab, schema validation
Search quality tests - Known queries with expected results (for Checkpoint 4+)

Risk Mitigation

Risk	Mitigation
GitLab rate limiting	Exponential backoff, respect Retry-After headers, incremental sync
Embedding model quality	Start with nomic-embed-text; architecture allows model swap
SQLite scale limits	Monitor performance; Postgres migration path documented
Stale data	Incremental sync with change detection
Mid-sync failures	Cursor-based resumption, sync_runs audit trail
Search quality	Hybrid (vector + FTS5) retrieval with RRF, golden query test suite
Concurrent sync corruption	Single-flight protection (refuse if existing run is 'running')
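
For the rate-limiting mitigation, a minimal sketch of the delay calculation is shown below. `retryDelayMs` and its default values (1 s base, 60 s cap) are assumptions for illustration, not values from the spec. The Retry-After header is honored when present, in either its delay-seconds or HTTP-date form; otherwise the delay doubles with each attempt up to the cap.

```typescript
// Computes how long to wait before retrying a rate-limited request.
// `attempt` is zero-based; `nowMs` is injectable so the function stays testable.
function retryDelayMs(
  attempt: number,
  retryAfterHeader: string | null,
  baseMs = 1000,
  maxMs = 60_000,
  nowMs: number = Date.now(),
): number {
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    // Form 1: delay in seconds, e.g. "30".
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
    // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    const dateMs = Date.parse(retryAfterHeader);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
  }
  // No usable header: exponential backoff, capped at maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

Random jitter is left out to keep the sketch deterministic; adding it would help avoid synchronized retries if several processes share a token.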

SQLite Performance Defaults (MVP):

Enable PRAGMA journal_mode=WAL; on every connection
Enable PRAGMA foreign_keys=ON; on every connection
Use explicit transactions for page/batch inserts
Targeted indexes on (project_id, updated_at) for primary resources

Schema Summary

Table	Checkpoint	Purpose
projects	0	Configured GitLab projects
sync_runs	0	Audit trail of sync operations
sync_cursors	0	Resumable sync state per primary resource
raw_payloads	0	Decoupled raw JSON storage
issues	1	Normalized issues
labels	1	Label definitions (unique by project + name)
issue_labels	1	Issue-label junction
merge_requests	2	Normalized MRs
notes	2	Issue and MR comments (with parent FKs)
mr_files	2	MR file changes (with rename tracking)
mr_labels	2	MR-label junction
documents	3	Unified searchable documents
document_labels	3	Document-label junction for fast filtering
embeddings	3	Vector embeddings (sqlite-vss, rowid=document_id)
embedding_metadata	3	Embedding provenance + change detection
documents_fts	4	Full-text search index (fts5)

Resolved Decisions

Question	Decision	Rationale
Commit/file linkage	Include MR→file links	Enables "what MRs touched this file?" without full commit history
Labels	Index as filters	Labels are well-used; document_labels table enables fast --label=X filtering
Labels uniqueness	By (project_id, name)	GitLab API returns labels as strings; gitlab_id isn't always available
Sync method	Polling for MVP	Decide on webhooks after using the system
Notes sync	Dependent resource	Notes API is per-parent, not global; refetch on parent update
Hybrid ranking	RRF over weighted sums	Simpler, no score normalization needed
Embedding rowid	rowid = documents.id	Eliminates fragile rowid mapping during upserts
file-history CLI	Post-MVP (CP6)	Sharpens MVP checkpoint focus

Next Steps

User approves this spec
Generate Checkpoint 0 PRD for project setup
Implement Checkpoint 0
Human validates → proceed to Checkpoint 1
Repeat for each checkpoint
