# GitLab Knowledge Engine - Spec Document

## Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP.

---

## Discovery Summary

### Pain Points Identified
1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find
2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions

### Constraints
| Constraint | Detail |
|------------|--------|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |

### Target Use Cases (Priority Order)
1. **MVP: Semantic Search** - "Find discussions about authentication redesign"
2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?"
3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?"
4. **Future: Person Context** - "What's @johndoe's background in this project?"

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                           GitLab API                            │
│                      (Issues, MRs, Notes)                       │
└─────────────────────────────────────────────────────────────────┘
            (Commit-level indexing explicitly post-MVP)
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Ingestion Layer                       │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                      │
│  - Raw JSON storage for replay                                  │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Processing Layer                      │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - Build relationship graph (issue↔MR↔note↔file)                │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Storage Layer                          │
│  - SQLite + sqlite-vss + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                     │
│  - Vector embeddings for semantic search                        │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Query Interface                         │
│  - CLI for human testing                                        │
│  - JSON API for AI agent testing                                │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘
```

### Technology Choices

| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libraries, AI-agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable; vector search provided by the sqlite-vss extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |

### Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
- Cons: Requires running Postgres, heavier setup
- Decision: Start with SQLite for simplicity; migration path exists if needed

---

## Checkpoint Structure

Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.

### Checkpoint 0: Project Setup
**Deliverable:** Scaffolded project with GitLab API connection verified

**Tests:**
1. Run `gitlab-engine auth-test` → returns authenticated user info
2. Run `gitlab-engine doctor` → verifies:
   - Can reach GitLab baseUrl
   - PAT is present and can read configured projects
   - SQLite opens DB and migrations apply
   - Ollama reachable OR embedding disabled with clear warning

**Scope:**
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
- Projects table and initial sync

**Configuration (MVP):** `gitlab-engine.config.json`
```json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434"
  }
}
```
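
The configuration above could be checked before any network call is made. The following is a minimal sketch under stated assumptions: the `EngineConfig` type, the `validateConfig` function, and the specific rules it applies are hypothetical illustrations, not part of the spec. The environment is passed in as a plain object so the check stays easy to test.

```typescript
// Hypothetical type mirroring the example configuration (an assumption, not a fixed contract).
interface EngineConfig {
  gitlab: { baseUrl: string; tokenEnvVar: string };
  projects: { path: string }[];
  embedding: { provider: string; model: string; baseUrl: string };
}

// Returns human-readable problems; an empty array means the config looks usable.
function validateConfig(
  config: EngineConfig,
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];
  if (!/^https?:\/\//.test(config.gitlab.baseUrl)) {
    problems.push("gitlab.baseUrl must start with http:// or https://");
  }
  // The token itself never lives in the file; only the variable name does.
  if (!env[config.gitlab.tokenEnvVar]) {
    problems.push(`environment variable ${config.gitlab.tokenEnvVar} is not set`);
  }
  if (config.projects.length === 0) {
    problems.push("at least one project must be configured");
  }
  for (const project of config.projects) {
    if (!project.path.includes("/")) {
      problems.push(`project path "${project.path}" should look like group/project`);
    }
  }
  return problems;
}

const exampleConfig: EngineConfig = {
  gitlab: { baseUrl: "https://gitlab.example.com", tokenEnvVar: "GITLAB_TOKEN" },
  projects: [{ path: "group/project-one" }, { path: "group/project-two" }],
  embedding: { provider: "ollama", model: "nomic-embed-text", baseUrl: "http://localhost:11434" },
};

// With the token present there is nothing to report.
console.log(validateConfig(exampleConfig, { GITLAB_TOKEN: "glpat-example" })); // prints []
```

A `doctor` command could run a check like this first and only then attempt to reach GitLab and Ollama.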

**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
  - `PRAGMA journal_mode=WAL;`
  - `PRAGMA foreign_keys=ON;`

**Schema (Checkpoint 0):**
```sql
-- Projects table (configured targets)
CREATE TABLE projects (
  id INTEGER PRIMARY KEY,
  gitlab_project_id INTEGER UNIQUE NOT NULL,
  path_with_namespace TEXT NOT NULL,
  default_branch TEXT,
  web_url TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);

-- Sync tracking for reliability
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL,   -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL,  -- 'ingest issues' | 'sync' | etc.
  error TEXT
);

-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  resource_type TEXT NOT NULL,  -- 'issues' | 'merge_requests'
  updated_at_cursor INTEGER,    -- last fully processed updated_at (ms epoch)
  tie_breaker_id INTEGER,       -- last fully processed gitlab_id (for stable ordering)
  PRIMARY KEY(project_id, resource_type)
);

-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL,         -- 'gitlab'
  resource_type TEXT NOT NULL,  -- 'project' | 'issue' | 'mr' | 'note'
  gitlab_id INTEGER NOT NULL,
  fetched_at INTEGER NOT NULL,
  json TEXT NOT NULL
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
```

---

### Checkpoint 1: Issue Ingestion
**Deliverable:** All issues from target repos stored locally

**Test:** Run `gitlab-engine ingest --type=issues` → count matches GitLab; run `gitlab-engine list issues --limit=10` → displays issues correctly

**Scope:**
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payload:
  - Always persist label names from `labels: string[]`
  - Optionally request `with_labels_details=true` to capture color/description when available
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands
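
The label rules in the scope list can be sketched as a small normalizer. This is an illustration under assumptions: `GitLabLabelDetail`, `LabelRow`, and `toLabelRows` are hypothetical names, and the detailed label shape lists only the fields this spec cares about. It accepts both the plain `string[]` form and the object form returned when `with_labels_details=true` is requested.

```typescript
// Subset of a detailed label object (present only with with_labels_details=true).
interface GitLabLabelDetail {
  id: number;
  name: string;
  color?: string | null;
  description?: string | null;
}
type GitLabLabel = string | GitLabLabelDetail;

// Hypothetical row shape for the labels table.
interface LabelRow {
  gitlabId: number | null; // optional, only known in the detailed form
  projectId: number;
  name: string;
  color: string | null;
  description: string | null;
}

function toLabelRows(projectId: number, labels: GitLabLabel[]): LabelRow[] {
  const byName = new Map<string, LabelRow>();
  for (const label of labels) {
    const row: LabelRow =
      typeof label === "string"
        ? { gitlabId: null, projectId, name: label, color: null, description: null }
        : {
            gitlabId: label.id,
            projectId,
            name: label.name,
            color: label.color ?? null,
            description: label.description ?? null,
          };
    // De-duplicate on name within one project; the last occurrence wins.
    byName.set(row.name, row);
  }
  return Array.from(byName.values());
}
```

Keying on the name matches the `(project_id, name)` uniqueness rule, so rows produced this way can be upserted without relying on a GitLab label id.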

**Reliability/Idempotency Rules:**
- Every ingest/sync creates a `sync_runs` row
- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
- Cursor advances only after successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
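
The cursor and ordering rules above can be illustrated with a small in-memory sketch. The names `Cursor`, `Item`, `isAfterCursor`, and `processPage` are hypothetical, and the `commit` callback stands in for a real database transaction; this is one possible reading of the rules, not a prescribed implementation.

```typescript
interface Cursor {
  updatedAt: number;    // ms epoch of the last fully processed item
  tieBreakerId: number; // gitlab_id of the last fully processed item
}

interface Item {
  gitlabId: number;
  updatedAt: number; // ms epoch
}

// True when the item sorts strictly after the cursor in (updated_at, gitlab_id) order.
function isAfterCursor(item: Item, cursor: Cursor | null): boolean {
  if (cursor === null) return true;
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.gitlabId > cursor.tieBreakerId;
}

// Processes one page and returns the cursor to persist afterwards.
// `commit` is assumed to throw on failure, in which case no new cursor is returned.
function processPage(
  page: Item[],
  cursor: Cursor | null,
  commit: (batch: Item[]) => void,
): Cursor | null {
  const batch = page
    .filter((item) => isAfterCursor(item, cursor))
    .sort((a, b) => a.updatedAt - b.updatedAt || a.gitlabId - b.gitlabId);
  if (batch.length === 0) return cursor;

  commit(batch); // the cursor below is only computed after a successful commit

  let next = cursor;
  for (const item of batch) {
    next = { updatedAt: item.updatedAt, tieBreakerId: item.gitlabId };
  }
  return next;
}
```

Because items at or before the cursor are filtered out, re-running a page after a crash is idempotent: already-committed items are skipped and the cursor does not move backwards.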

**Schema Preview:**
```sql
CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER,  -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
```

---

### Checkpoint 2: MR + Comments + File Links Ingestion
**Deliverable:** All MRs, discussion threads, and file-change links stored locally

**Test:** Run `gitlab-engine ingest --type=merge_requests` → count matches; run `gitlab-engine show mr 1234` → displays MR with comments and files changed

**Scope:**
- MR fetcher with pagination
- Notes fetcher (issue notes + MR notes) as a dependent resource:
  - During initial ingest: fetch notes for every issue/MR
  - During sync: refetch notes only for issues/MRs updated since cursor
- MR changes/diffs fetcher as a dependent resource:
  - During initial ingest: fetch changes for every MR
  - During sync: refetch changes only for MRs updated since cursor
- Relationship linking (note → parent issue/MR via foreign keys, MR → files)
- Extended CLI commands for MR display

**Schema Additions:**
```sql
CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);

-- Notes with explicit parent foreign keys for referential integrity
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER REFERENCES merge_requests(id),
  noteable_type TEXT NOT NULL,    -- 'Issue' | 'MergeRequest'
  noteable_iid INTEGER NOT NULL,  -- parent IID (from API path)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  system BOOLEAN,
  raw_payload_id INTEGER REFERENCES raw_payloads(id),
  -- Exactly one parent FK must be set
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE INDEX idx_notes_issue ON notes(issue_id);
CREATE INDEX idx_notes_mr ON notes(merge_request_id);
CREATE INDEX idx_notes_author ON notes(author_username);

-- File linkage for "what MRs touched this file?" queries (with rename support)
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
```
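
To make the `mr_files` rename handling concrete, here is a minimal sketch of mapping one MR's change entries to rows. `GitLabChange` lists only the fields the table needs, and `MrFileRow`, `toMrFileRows`, and `touchesPath` are hypothetical names introduced for illustration.

```typescript
// Subset of one entry in a merge request's `changes` array.
interface GitLabChange {
  old_path: string;
  new_path: string;
  new_file: boolean;
  renamed_file: boolean;
  deleted_file: boolean;
}

// Hypothetical row shape for the mr_files table.
interface MrFileRow {
  mergeRequestId: number;
  oldPath: string;
  newPath: string;
  newFile: boolean;
  deletedFile: boolean;
  renamedFile: boolean;
}

function toMrFileRows(mergeRequestId: number, changes: GitLabChange[]): MrFileRow[] {
  const rows = new Map<string, MrFileRow>();
  for (const change of changes) {
    // Mirrors UNIQUE(merge_request_id, old_path, new_path) within a single MR.
    const key = `${change.old_path}\u0000${change.new_path}`;
    rows.set(key, {
      mergeRequestId,
      oldPath: change.old_path,
      newPath: change.new_path,
      newFile: change.new_file,
      deletedFile: change.deleted_file,
      renamedFile: change.renamed_file,
    });
  }
  return Array.from(rows.values());
}

// "Which MRs touched this path?" has to check both sides of a rename.
function touchesPath(row: MrFileRow, path: string): boolean {
  return row.oldPath === path || row.newPath === path;
}
```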

---

### Checkpoint 3: Embedding Generation
**Deliverable:** Vector embeddings generated for all text content

**Test:** Run `gitlab-engine embed --all` → progress indicator; run `gitlab-engine stats` → shows embedding coverage percentage

**Scope:**
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing)
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/notes
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Single embedding per document (chunking deferred to post-MVP)
  - Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table

**Schema Additions:**
```sql
-- Unified searchable documents (derived from issues/MRs/notes)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,   -- 'issue' | 'merge_request' | 'note'
  source_id INTEGER NOT NULL,  -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT,
  label_names TEXT,            -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT,                  -- null for notes
  content_text TEXT NOT NULL,  -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL,  -- SHA-256 for change detection
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL,         -- 'nomic-embed-text'
  dims INTEGER NOT NULL,       -- 768
  content_hash TEXT NOT NULL,  -- copied from documents.content_hash
  created_at INTEGER NOT NULL
);
```

**Storage Rule (MVP):**
- Insert embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid mapping fragility

**Document Extraction Rules:**
- Issue → title + "\n\n" + description
- MR → title + "\n\n" + description
- Note → body (skip system notes unless they contain meaningful content)
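
The extraction rules and the SHA-256 content hash can be sketched as two small functions. This is an illustration under assumptions: `Source`, `extractContentText`, and `contentHash` are hypothetical names, missing titles or descriptions are treated as empty strings, and every system note is skipped because the spec does not define which system notes count as meaningful.

```typescript
import { createHash } from "node:crypto";

// Hypothetical input shape covering the three source types.
type Source =
  | { type: "issue" | "merge_request"; title: string | null; description: string | null }
  | { type: "note"; body: string | null; system: boolean };

// Returns the canonical text for a source, or null when no document should be created.
function extractContentText(source: Source): string | null {
  if (source.type === "note") {
    // Assumption: all system notes are skipped in this sketch.
    if (source.system) return null;
    const body = source.body ?? "";
    return body.trim() === "" ? null : body;
  }
  // Issue and MR: title + "\n\n" + description.
  return `${source.title ?? ""}\n\n${source.description ?? ""}`;
}

// Hex-encoded SHA-256 of the canonical text, as stored in content_hash.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```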

---

### Checkpoint 4: Semantic Search
**Deliverable:** Working semantic search across all indexed content

**Tests:**
1. Run `gitlab-engine search "authentication redesign"` → returns ranked results with snippets
2. Golden queries: curated list of 10 queries with expected result *containment* (e.g., "at least one of these 3 known URLs appears in top 10")
3. `gitlab-engine search "..." --json` validates against JSON schema (stable fields present)
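
The containment criterion for golden queries is easy to automate. The sketch below assumes a hypothetical `GoldenQuery` shape and takes the ranked result URLs as plain input, so it does not depend on how search is implemented.

```typescript
// Hypothetical fixture shape for one golden query.
interface GoldenQuery {
  query: string;
  expectedUrls: string[]; // at least one of these should appear in the top K
}

// True when any expected URL appears among the first `topK` results.
function passesGoldenQuery(golden: GoldenQuery, rankedUrls: string[], topK = 10): boolean {
  const top = new Set(rankedUrls.slice(0, topK));
  return golden.expectedUrls.some((url) => top.has(url));
}

// Summarizes a whole suite; `search` is whatever function returns ranked URLs for a query.
function runGoldenSuite(
  suite: GoldenQuery[],
  search: (query: string) => string[],
): { passed: number; failed: string[] } {
  const failed = suite
    .filter((golden) => !passesGoldenQuery(golden, search(golden.query)))
    .map((golden) => golden.query);
  return { passed: suite.length - failed.length, failed };
}
```

Checking containment rather than exact rank keeps the suite stable when the embedding model or the fusion parameters change slightly.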

**Scope:**
- Hybrid retrieval:
  - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|note`, `--author=username`, `--after=date`, `--label=name`
- Label filtering operates on `document_labels` (indexed, exact-match)
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI agent consumption

**Schema Additions:**
```sql
-- Full-text search for hybrid retrieval
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;
```

**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**
1. Query both vector index (top 50) and FTS5 (top 50)
2. Merge results by document_id
3. Combine with Reciprocal Rank Fusion (RRF):
   - For each retriever list, assign ranks (1..N)
   - `rrf_score = Σ 1 / (k + rank)` with k=60 (tunable)
   - RRF is simpler than weighted sums and doesn't require score normalization
4. Apply filters (type, author, date, label)
5. Return top K
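
Steps 2 and 3 of the algorithm can be sketched as a single fusion function. The name `reciprocalRankFusion` and the input shape (one array of document ids per retriever, best match first) are assumptions made for illustration; filtering and the top-K cut are left to the caller.

```typescript
interface FusedResult {
  documentId: number;
  score: number;
}

// Fuses several ranked lists of document ids into one list ordered by RRF score.
function reciprocalRankFusion(rankedLists: number[][], k = 60): FusedResult[] {
  const scores = new Map<number, number>();
  for (const list of rankedLists) {
    list.forEach((documentId, index) => {
      const rank = index + 1; // ranks start at 1
      scores.set(documentId, (scores.get(documentId) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([documentId, score]) => ({ documentId, score })).sort(
    // Highest score first; ties broken by document id for deterministic output.
    (a, b) => b.score - a.score || a.documentId - b.documentId,
  );
}

const vectorHits = [3, 1, 2]; // document ids from the vector index, best first
const lexicalHits = [1, 4];   // document ids from FTS5, best first
const fused = reciprocalRankFusion([vectorHits, lexicalHits]);
console.log(fused.map((r) => r.documentId)); // prints [ 1, 3, 4, 2 ]
```

Document 1 wins because it appears in both lists (1/62 + 1/61), even though it is not the top vector hit. Documents found by only one retriever keep a single-term score and are ordered by their rank in that list.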

**Why RRF over Weighted Sums:**
- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in information retrieval literature

**CLI Interface:**
```bash
# Basic semantic search
gitlab-engine search "why did we choose Redis"

# Pure FTS search (fallback if embeddings unavailable)
gitlab-engine search "redis" --mode=lexical

# Filtered search
gitlab-engine search "authentication" --type=mr --after=2024-01-01

# Filter by label
gitlab-engine search "performance" --label=bug --label=critical

# JSON output for programmatic use
gitlab-engine search "payment processing" --json
```

---

### Checkpoint 5: Incremental Sync
**Deliverable:** Efficient ongoing synchronization with GitLab

**Test:** Make a change in GitLab; run `gitlab-engine sync` → only fetches changed items; verify change appears in search

**Scope:**
- Delta sync based on stable cursor (updated_at + tie-breaker id)
- Dependent resources sync strategy (notes, MR changes)
- Webhook handler (optional, if webhook access granted)
- Re-embedding based on content_hash change (`documents.content_hash != embedding_metadata.content_hash`, or no `embedding_metadata` row yet)
- Sync status reporting

**Correctness Rules (MVP):**
1. Fetch pages ordered by `updated_at ASC`; within identical timestamps, advance by `gitlab_id ASC`
2. Cursor advances only after successful DB commit for that page
3. Dependent resources:
   - For each updated issue/MR, refetch its notes (sorted by `updated_at`)
   - For each updated MR, refetch its file changes
4. A document is queued for embedding iff it has no `embedding_metadata` row or `documents.content_hash != embedding_metadata.content_hash`
5. Sync run is marked 'failed' with an error message if any page fails (the next run can resume from the cursor)
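
The re-embedding rule (rule 4) reduces to a hash comparison per document. The sketch below uses hypothetical names (`DocumentHash`, `selectDocumentsToEmbed`) and assumes that a document with no stored embedding hash is also treated as pending, which the hash inequality alone would not capture in SQL because of NULL semantics.

```typescript
interface DocumentHash {
  documentId: number;
  contentHash: string; // documents.content_hash
}

// `embeddedHashes` maps document_id → embedding_metadata.content_hash.
// A missing entry means the document has never been embedded.
function selectDocumentsToEmbed(
  documents: DocumentHash[],
  embeddedHashes: Map<number, string>,
): number[] {
  return documents
    .filter((doc) => embeddedHashes.get(doc.documentId) !== doc.contentHash)
    .map((doc) => doc.documentId);
}
```

In SQL the same selection would typically be a `LEFT JOIN` from `documents` to `embedding_metadata` with a condition that also accepts rows where the metadata side is NULL.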

**Why Dependent Resource Model:**
- GitLab Notes API doesn't provide a clean global `updated_after` stream
- Notes are listed per-issue or per-MR, not as a top-level resource
- Treating notes as dependent resources (refetch when parent updates) is simpler and more correct
- Same applies to MR changes/diffs
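
The dependent resource model means each updated parent expands into a small, fixed set of follow-up requests. The sketch below builds the GitLab REST v4 paths for those requests; the `UpdatedParent` type and `dependentFetchPaths` function are hypothetical, and pagination parameters are omitted for brevity.

```typescript
interface UpdatedParent {
  kind: "issue" | "merge_request";
  gitlabProjectId: number;
  iid: number; // project-scoped IID used in API paths
}

// Returns the API paths to refetch for one updated issue or MR.
function dependentFetchPaths(parent: UpdatedParent): string[] {
  const base = `/api/v4/projects/${parent.gitlabProjectId}`;
  const noteQuery = "order_by=updated_at&sort=asc";
  if (parent.kind === "issue") {
    return [`${base}/issues/${parent.iid}/notes?${noteQuery}`];
  }
  return [
    `${base}/merge_requests/${parent.iid}/notes?${noteQuery}`,
    `${base}/merge_requests/${parent.iid}/changes`,
  ];
}
```

An updated issue yields one notes request, while an updated MR yields a notes request plus a changes request, matching the two sub-rules for dependent resources.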

**CLI Commands:**
```bash
# Incremental sync (respects cursors; fetches only new/updated items)
gitlab-engine sync

# Force full re-sync (resets cursors)
gitlab-engine sync --full

# Override stale 'running' run after operator review
gitlab-engine sync --force

# Show sync status
gitlab-engine sync-status
```

---

## Future Checkpoints (Post-MVP)

### Checkpoint 6: File/Feature History View
- Map commits to MRs to discussions
- Query: "Show decision history for src/auth/login.ts"
- Ship `gitlab-engine file-history <path>` as a first-class feature here
- This command is deferred from MVP to sharpen checkpoint focus

### Checkpoint 7: Personal Dashboard
- Filter by assigned/mentioned
- Integrate with existing gitlab-inbox tool

### Checkpoint 8: Person Context
- Aggregate contributions by author
- Expertise inference from activity

### Checkpoint 9: Decision Graph
- Extract decisions from discussions (LLM-assisted)
- Visualize decision relationships

---

## Verification Strategy

Each checkpoint includes:

1. **Automated tests** - Unit tests for data transformations, integration tests for API calls
2. **CLI smoke tests** - Manual commands with expected outputs documented
3. **Data integrity checks** - Count verification against GitLab, schema validation
4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+)

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) |
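
The rate-limiting mitigation (exponential backoff that respects `Retry-After`) can be sketched as a pure delay calculation. The function name `retryDelayMs` and the default base and cap values are assumptions; jitter is left out to keep the example deterministic.

```typescript
// Computes how long to wait before retry number `attempt` (0-based).
// `retryAfter` is the raw Retry-After header value, or null when absent.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  nowMs: number,
  baseMs = 1000,
  maxMs = 60000,
): number {
  if (retryAfter !== null && retryAfter.trim() !== "") {
    // Retry-After may be a number of seconds...
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
    // ...or an HTTP date.
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) return Math.max(0, dateMs - nowMs);
  }
  // No usable header: exponential backoff, capped at maxMs.
  return Math.min(maxMs, baseMs * 2 ** attempt);
}
```

A server-provided delay always takes precedence over the computed backoff, so the client never retries earlier than GitLab asked it to.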

**SQLite Performance Defaults (MVP):**
- Enable `PRAGMA journal_mode=WAL;` on every connection
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources

---

## Schema Summary

| Table | Checkpoint | Purpose |
|-------|------------|---------|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage |
| issues | 1 | Normalized issues |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| merge_requests | 2 | Normalized MRs |
| notes | 2 | Issue and MR comments (with parent FKs) |
| mr_files | 2 | MR file changes (with rename tracking) |
| mr_labels | 2 | MR-label junction |
| documents | 3 | Unified searchable documents |
| document_labels | 3 | Document-label junction for fast filtering |
| embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) |
| embedding_metadata | 3 | Embedding provenance + change detection |
| documents_fts | 4 | Full-text search index (fts5) |

---

## Resolved Decisions

| Question | Decision | Rationale |
|----------|----------|-----------|
| Commit/file linkage | **Include MR→file links** | Enables "what MRs touched this file?" without full commit history |
| Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available |
| Sync method | **Polling for MVP** | Decide on webhooks after using the system |
| Notes sync | **Dependent resource** | Notes API is per-parent, not global; refetch on parent update |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts |
| file-history CLI | **Post-MVP (CP6)** | Sharpens MVP checkpoint focus |

---

## Next Steps

1. User approves this spec
2. Generate Checkpoint 0 PRD for project setup
3. Implement Checkpoint 0
4. Human validates → proceed to Checkpoint 1
5. Repeat for each checkpoint