Spec iterations

teernisse
2026-01-20 16:26:27 -05:00
parent 7702d2a493
commit 97a303eca9

645
SPEC.md

@@ -2,7 +2,7 @@
## Executive Summary ## Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP. A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
--- ---
@@ -89,6 +89,56 @@ A self-hosted tool to extract, index, and semantically search 2+ years of GitLab
--- ---
## GitLab API Strategy
### Primary Resources (Bulk Fetch)
Issues and MRs support efficient bulk fetching with incremental sync:
```
GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100
```
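As a rough sketch of how the bulk endpoints above might be paged through, the loop below keeps requesting pages until one comes back empty or short. The `fetchPage` callback and the function name are hypothetical stand-ins for whatever HTTP client the project ends up using; only the query parameters come from the endpoints listed above, plus GitLab's standard `page` parameter.

```typescript
// Hypothetical page fetcher: returns the items of one page of a GitLab list endpoint.
type FetchPage<T> = (path: string, params: Record<string, string>) => Promise<T[]>;

// Collects every item updated after `updatedAfter`, oldest update first.
async function fetchAllUpdatedSince<T>(
  fetchPage: FetchPage<T>,
  path: string,
  updatedAfter: string,
  perPage = 100,
): Promise<T[]> {
  const items: T[] = [];
  for (let page = 1; ; page++) {
    const batch = await fetchPage(path, {
      updated_after: updatedAfter,
      order_by: "updated_at",
      sort: "asc",
      per_page: String(perPage),
      page: String(page),
    });
    if (batch.length === 0) break; // an empty page marks the end
    items.push(...batch);
    if (batch.length < perPage) break; // a short page is also the last one
  }
  return items;
}
```

Injecting `fetchPage` keeps the loop testable without a live GitLab instance; a real client would also apply the rate limiting described later in this section.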
### Dependent Resources (Per-Parent Fetch)
Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
```
GET /projects/:id/issues/:iid/discussions
GET /projects/:id/merge_requests/:iid/discussions
```
### Sync Pattern
**Initial sync:**
1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
2. For EACH issue → fetch all discussions (~3K calls)
3. Fetch all MRs (paginated, ~60 calls)
4. For EACH MR → fetch all discussions (~3K calls)
5. Total: ~6,100+ API calls for initial sync
**Incremental sync:**
1. Fetch issues where `updated_after=cursor` (bulk)
2. For EACH updated issue → refetch ALL its discussions
3. Fetch MRs where `updated_after=cursor` (bulk)
4. For EACH updated MR → refetch ALL its discussions
### Critical Assumption
**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.
Mitigation: Periodic full re-sync (weekly) as a safety net.
### Rate Limiting
- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid thundering herd on retry
- Initial sync estimate: 10-20 minutes depending on rate limits
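The retry rules above could be expressed as a single delay calculation. This is a minimal sketch, not the project's client: the function name, the 1-second base, and the 60-second cap are assumptions, and a `Retry-After` value given as an HTTP date (rather than seconds) simply falls through to the backoff path.

```typescript
// Hypothetical helper: milliseconds to wait before retrying after a 429.
// `attempt` is 0 for the first retry; `random` is injectable so jitter is testable.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  random: () => number = Math.random,
  baseMs = 1000, // assumed base delay
  maxMs = 60000, // assumed cap
): number {
  // Honor Retry-After when the server sends a number of seconds.
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) return Math.round(seconds * 1000);
  }
  // Otherwise: exponential backoff, capped, with "full jitter" in [0, ceiling).
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Drawing the delay uniformly below the exponential ceiling spreads out retries from concurrent workers, which is the thundering-herd concern the list mentions.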
---
## Checkpoint Structure ## Checkpoint Structure
Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding. Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
@@ -96,13 +146,39 @@ Each checkpoint is a **testable milestone** where a human can validate the syste
### Checkpoint 0: Project Setup ### Checkpoint 0: Project Setup
**Deliverable:** Scaffolded project with GitLab API connection verified **Deliverable:** Scaffolded project with GitLab API connection verified
**Tests:** **Automated Tests (Vitest):**
1. Run `gitlab-engine auth-test` → returns authenticated user info ```
2. Run `gitlab-engine doctor` → verifies: tests/unit/config.test.ts
- Can reach GitLab baseUrl ✓ loads config from gi.config.json
- PAT is present and can read configured projects ✓ throws if config file missing
- SQLite opens DB and migrations apply ✓ throws if required fields missing (baseUrl, projects)
- Ollama reachable OR embedding disabled with clear warning ✓ validates project paths are non-empty strings
tests/unit/db.test.ts
✓ creates database file if not exists
✓ applies migrations in order
✓ sets WAL journal mode
✓ enables foreign keys
tests/integration/gitlab-client.test.ts
✓ authenticates with valid PAT
✓ returns 401 for invalid PAT
✓ fetches project by path
✓ handles rate limiting (429) with retry
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
**Data Integrity Checks:**
- [ ] `projects` table contains rows for each configured project path
- [ ] `gitlab_project_id` matches actual GitLab project IDs
- [ ] `raw_payloads` contains project JSON for each synced project
**Scope:** **Scope:**
- Project structure (TypeScript, ESLint, Vitest) - Project structure (TypeScript, ESLint, Vitest)
@@ -114,7 +190,7 @@ Each checkpoint is a **testable milestone** where a human can validate the syste
**Configuration (MVP):** **Configuration (MVP):**
```json ```json
// gitlab-engine.config.json // gi.config.json
{ {
"gitlab": { "gitlab": {
"baseUrl": "https://gitlab.example.com", "baseUrl": "https://gitlab.example.com",
@@ -189,7 +265,50 @@ CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
### Checkpoint 1: Issue Ingestion ### Checkpoint 1: Issue Ingestion
**Deliverable:** All issues from target repos stored locally **Deliverable:** All issues from target repos stored locally
**Test:** Run `gitlab-engine ingest --type=issues` → count matches GitLab; run `gitlab-engine list issues --limit=10` → displays issues correctly **Automated Tests (Vitest):**
```
tests/unit/issue-transformer.test.ts
✓ transforms GitLab issue payload to normalized schema
✓ extracts labels from issue payload
✓ handles missing optional fields gracefully
tests/unit/pagination.test.ts
✓ fetches all pages when multiple exist
✓ respects per_page parameter
✓ stops when empty page returned
tests/integration/issue-ingestion.test.ts
✓ inserts issues into database
✓ creates labels from issue payloads
✓ links issues to labels via junction table
✓ stores raw payload for each issue
✓ updates cursor after successful page commit
✓ resumes from cursor on subsequent runs
tests/integration/sync-runs.test.ts
✓ creates sync_run record on start
✓ marks run as succeeded on completion
✓ marks run as failed with error message on failure
✓ refuses concurrent run (single-flight)
✓ allows --force to override stale running status
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
| `gi show issue 123` | Issue detail view | Shows title, description, labels, URL |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects
- [ ] Every issue has a corresponding `raw_payloads` row
- [ ] Labels in `issue_labels` junction all exist in `labels` table
- [ ] `sync_cursors` has entry for each (project_id, 'issues') pair
- [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)
**Scope:** **Scope:**
- Issue fetcher with pagination handling - Issue fetcher with pagination handling
@@ -250,21 +369,70 @@ CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
--- ---
### Checkpoint 2: MR + Comments + File Links Ingestion ### Checkpoint 2: MR + Discussions Ingestion
**Deliverable:** All MRs, discussion threads, and file-change links stored locally **Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context
**Test:** Run `gitlab-engine ingest --type=merge_requests` → count matches; run `gitlab-engine show mr 1234` → displays MR with comments and files changed **Automated Tests (Vitest):**
```
tests/unit/mr-transformer.test.ts
✓ transforms GitLab MR payload to normalized schema
✓ extracts labels from MR payload
✓ handles missing optional fields gracefully
tests/unit/discussion-transformer.test.ts
✓ transforms discussion payload to normalized schema
✓ extracts notes array from discussion
✓ sets individual_note flag correctly
✓ filters out system notes (system: true)
✓ preserves note order via position field
tests/integration/mr-ingestion.test.ts
✓ inserts MRs into database
✓ creates labels from MR payloads
✓ links MRs to labels via junction table
✓ stores raw payload for each MR
tests/integration/discussion-ingestion.test.ts
✓ fetches discussions for each issue
✓ fetches discussions for each MR
✓ creates discussion rows with correct parent FK
✓ creates note rows linked to discussions
✓ excludes system notes from storage
✓ captures note-level resolution status
✓ captures note type (DiscussionNote, DiffNote)
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Non-zero count |
| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
- [ ] Every discussion has at least one note
- [ ] `individual_note = true` discussions have exactly one note
- [ ] Discussion `first_note_at` <= `last_note_at` for all rows
**Scope:** **Scope:**
- MR fetcher with pagination - MR fetcher with pagination
- Notes fetcher (issue notes + MR notes) as a dependent resource: - Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
- During initial ingest: fetch notes for every issue/MR - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
- During sync: refetch notes only for issues/MRs updated since cursor - During initial ingest: fetch discussions for every issue/MR
- MR changes/diffs fetcher as a dependent resource: - During sync: refetch discussions only for issues/MRs updated since cursor
- During initial ingest: fetch changes for every MR - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
- During sync: refetch changes only for MRs updated since cursor - Relationship linking (discussion → parent issue/MR, notes → discussion)
- Relationship linking (note → parent issue/MR via foreign keys, MR → files) - Extended CLI commands for MR/issue display with threads
- Extended CLI commands for MR display
**Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.
**Schema Additions:** **Schema Additions:**
```sql ```sql
@@ -288,44 +456,49 @@ CREATE TABLE merge_requests (
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username); CREATE INDEX idx_mrs_author ON merge_requests(author_username);
-- Notes with explicit parent foreign keys for referential integrity -- Discussion threads (the semantic unit for conversations)
CREATE TABLE notes ( CREATE TABLE discussions (
id INTEGER PRIMARY KEY, id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL, gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
project_id INTEGER NOT NULL REFERENCES projects(id), project_id INTEGER NOT NULL REFERENCES projects(id),
issue_id INTEGER REFERENCES issues(id), issue_id INTEGER REFERENCES issues(id),
merge_request_id INTEGER REFERENCES merge_requests(id), merge_request_id INTEGER REFERENCES merge_requests(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL, -- parent IID (from API path) individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
author_username TEXT, first_note_at INTEGER, -- for ordering discussions
body TEXT, last_note_at INTEGER, -- for "recently active" queries
created_at INTEGER, resolvable BOOLEAN, -- MR discussions can be resolved
updated_at INTEGER, resolved BOOLEAN,
system BOOLEAN,
raw_payload_id INTEGER REFERENCES raw_payloads(id),
-- Exactly one parent FK must be set
CHECK ( CHECK (
(noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
(noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
) )
); );
CREATE INDEX idx_notes_issue ON notes(issue_id); CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_notes_mr ON notes(merge_request_id); CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_notes_author ON notes(author_username); CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);
-- File linkage for "what MRs touched this file?" queries (with rename support) -- Notes belong to discussions (preserving thread context)
CREATE TABLE mr_files ( CREATE TABLE notes (
id INTEGER PRIMARY KEY, id INTEGER PRIMARY KEY,
merge_request_id INTEGER REFERENCES merge_requests(id), gitlab_id INTEGER UNIQUE NOT NULL,
old_path TEXT, discussion_id INTEGER NOT NULL REFERENCES discussions(id),
new_path TEXT, project_id INTEGER NOT NULL REFERENCES projects(id),
new_file BOOLEAN, type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
deleted_file BOOLEAN, author_username TEXT,
renamed_file BOOLEAN, body TEXT,
UNIQUE(merge_request_id, old_path, new_path) created_at INTEGER,
updated_at INTEGER,
position INTEGER, -- derived from array order in API response (0-indexed)
resolvable BOOLEAN, -- note-level resolvability (MR code comments)
resolved BOOLEAN, -- note-level resolution status
resolved_by TEXT, -- username who resolved
resolved_at INTEGER, -- when resolved
raw_payload_id INTEGER REFERENCES raw_payloads(id)
); );
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path); CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path); CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_type ON notes(type);
-- MR labels (reuse same labels table) -- MR labels (reuse same labels table)
CREATE TABLE mr_labels ( CREATE TABLE mr_labels (
@@ -336,39 +509,96 @@ CREATE TABLE mr_labels (
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id); CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
``` ```
**Discussion Processing Rules:**
- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
- Each discussion from the API becomes one row in `discussions` table
- All notes within a discussion are stored with their `discussion_id` foreign key
- `individual_note: true` discussions have exactly one note (standalone comment)
- `individual_note: false` discussions can have one or more notes (threaded conversation)
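The rules above might translate into a transformer along these lines. The interfaces list only the API fields the sketch needs, the row shapes and function name are hypothetical, and storing timestamps as epoch milliseconds is an assumption (the schema only says `INTEGER`).

```typescript
// Assumed subset of the fields returned by GitLab's discussions endpoints.
interface ApiNote {
  id: number;
  type: string | null; // 'DiscussionNote' | 'DiffNote' | null
  body: string;
  system: boolean;
  author: { username: string };
  created_at: string; // ISO 8601
}
interface ApiDiscussion {
  id: string;
  individual_note: boolean;
  notes: ApiNote[];
}

// Hypothetical row shapes mirroring the discussions/notes tables.
interface NoteRow {
  gitlabId: number;
  type: string | null;
  authorUsername: string;
  body: string;
  createdAt: number; // epoch ms (assumption)
  position: number; // index in the API response array
}
interface DiscussionRow {
  gitlabDiscussionId: string;
  individualNote: boolean;
  firstNoteAt: number;
  lastNoteAt: number;
  notes: NoteRow[];
}

// Returns null when a discussion contains only system notes.
function transformDiscussion(d: ApiDiscussion): DiscussionRow | null {
  const notes: NoteRow[] = [];
  d.notes.forEach((n, index) => {
    if (n.system) return; // system notes are excluded during ingestion
    notes.push({
      gitlabId: n.id,
      type: n.type,
      authorUsername: n.author.username,
      body: n.body,
      createdAt: Date.parse(n.created_at),
      position: index,
    });
  });
  if (notes.length === 0) return null;
  const times = notes.map((n) => n.createdAt);
  return {
    gitlabDiscussionId: d.id,
    individualNote: d.individual_note,
    firstNoteAt: Math.min(...times),
    lastNoteAt: Math.max(...times),
    notes,
  };
}
```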
--- ---
### Checkpoint 3: Embedding Generation ### Checkpoint 3: Embedding Generation
**Deliverable:** Vector embeddings generated for all text content **Deliverable:** Vector embeddings generated for all text content
**Test:** Run `gitlab-engine embed --all` → progress indicator; run `gitlab-engine stats` → shows embedding coverage percentage **Automated Tests (Vitest):**
```
tests/unit/document-extractor.test.ts
✓ extracts issue document (title + description)
✓ extracts MR document (title + description)
✓ extracts discussion document with full thread context
✓ includes parent issue/MR title in discussion header
✓ formats notes with author and timestamp
✓ truncates content exceeding 8000 tokens
✓ preserves first and last notes when truncating middle
✓ computes SHA-256 content hash consistently
tests/unit/embedding-client.test.ts
✓ connects to Ollama API
✓ generates embedding for text input
✓ returns 768-dimension vector
✓ handles Ollama connection failure gracefully
✓ batches requests (32 documents per batch)
tests/integration/document-creation.test.ts
✓ creates document for each issue
✓ creates document for each MR
✓ creates document for each discussion
✓ populates document_labels junction table
✓ computes content_hash for each document
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vss
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
✓ re-embeds when content_hash changes
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash`
- [ ] Every document counted by `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` has a corresponding truncation warning in the logs
- [ ] Discussion documents include parent title in content_text
**Scope:** **Scope:**
- Ollama integration (nomic-embed-text model) - Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing) - Embedding generation pipeline (batch processing, 32 documents per batch)
- Vector storage in SQLite (sqlite-vss extension) - Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability - Progress tracking and resumability
- Document extraction layer: - Document extraction layer:
- Canonical "search documents" derived from issues/MRs/notes - Canonical "search documents" derived from issues/MRs/discussions
- Stable content hashing for change detection (SHA-256 of content_text) - Stable content hashing for change detection (SHA-256 of content_text)
- Single embedding per document (chunking deferred to post-MVP) - Single embedding per document (chunking deferred to post-MVP)
- Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
- Denormalized metadata for fast filtering (author, labels, dates) - Denormalized metadata for fast filtering (author, labels, dates)
- Fast label filtering via `document_labels` join table - Fast label filtering via `document_labels` join table
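The hash-based change detection in the scope above could look roughly like this. It is a sketch assuming a Node.js runtime (for `node:crypto`); the function names are hypothetical, and `embeddedHash` stands for the `content_hash` stored in `embedding_metadata`, or `undefined` when no embedding exists yet.

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the canonical text, hex-encoded.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}

// A document needs (re-)embedding when it has no embedding yet
// or when its current hash differs from the hash that was embedded.
function needsEmbedding(documentHash: string, embeddedHash: string | undefined): boolean {
  return embeddedHash !== documentHash;
}
```

Because the hash covers only `content_text`, metadata-only updates (for example a label change) would not trigger re-embedding under this sketch.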
**Schema Additions:** **Schema Additions:**
```sql ```sql
-- Unified searchable documents (derived from issues/MRs/notes) -- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents ( CREATE TABLE documents (
id INTEGER PRIMARY KEY, id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'note' source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id in the source table source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id), project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT, author_username TEXT, -- for discussions: first note author
label_names TEXT, -- JSON array (display/debug only) label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER, created_at INTEGER,
updated_at INTEGER, updated_at INTEGER,
url TEXT, url TEXT,
title TEXT, -- null for notes title TEXT, -- null for discussions
content_text TEXT NOT NULL, -- canonical text for embedding/snippets content_text TEXT NOT NULL, -- canonical text for embedding/snippets
content_hash TEXT NOT NULL, -- SHA-256 for change detection content_hash TEXT NOT NULL, -- SHA-256 for change detection
UNIQUE(source_type, source_id) UNIQUE(source_type, source_id)
@@ -408,38 +638,131 @@ CREATE TABLE embedding_metadata (
- This alignment simplifies joins and eliminates rowid mapping fragility - This alignment simplifies joins and eliminates rowid mapping fragility
**Document Extraction Rules:** **Document Extraction Rules:**
- Issue → title + "\n\n" + description
- MR → title + "\n\n" + description | Source | content_text Construction |
- Note → body (skip system notes unless they contain meaningful content) |--------|--------------------------|
| Issue | `title + "\n\n" + description` |
| MR | `title + "\n\n" + description` |
| Discussion | Full thread with context (see below) |
**Discussion Document Format:**
```
[Issue #234: Authentication redesign] Discussion
@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```
This format preserves:
- Parent context (issue/MR title and number)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
**Truncation:** If the concatenated discussion exceeds 8000 tokens, truncate from the middle (preserving the first and last notes for context) and log a warning.
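One way the format and truncation rule above might be implemented is sketched below. Several details are assumptions: tokens are approximated as 4 characters each (loosely in line with the 32000-character integrity check in this checkpoint), blocks are joined with blank lines, MR parents use a `!` sigil, and the omission marker text and function names are invented for illustration.

```typescript
interface ThreadNote {
  author: string;
  createdAt: string; // ISO 8601
  body: string;
}

const MAX_CHARS = 8000 * 4; // assumed ~4 characters per token

function renderNote(n: ThreadNote): string {
  return `@${n.author} (${n.createdAt.slice(0, 10)}):\n${n.body}`;
}

function buildDiscussionDocument(
  parentKind: "Issue" | "MR",
  parentIid: number,
  parentTitle: string,
  notes: ThreadNote[],
  maxChars = MAX_CHARS,
): { text: string; truncated: boolean } {
  const sigil = parentKind === "Issue" ? "#" : "!";
  const header = `[${parentKind} ${sigil}${parentIid}: ${parentTitle}] Discussion`;
  const blocks = notes.map(renderNote);
  const render = (parts: string[]) => [header, ...parts].join("\n\n");

  let text = render(blocks);
  if (text.length <= maxChars) return { text, truncated: false };

  // Split into head and tail, then drop the middle-most note until the text fits.
  const mid = Math.ceil(blocks.length / 2);
  const head = blocks.slice(0, mid);
  const tail = blocks.slice(mid);
  const marker = "[... notes omitted ...]"; // hypothetical marker text
  while (head.length + tail.length > 2) {
    if (head.length >= tail.length) head.pop();
    else tail.shift();
    text = render([...head, marker, ...tail]);
    if (text.length <= maxChars) return { text, truncated: true };
  }
  // Fallback: first and last notes alone are too long, so hard-cut the text.
  return { text: text.slice(0, maxChars), truncated: true };
}
```

The caller would log the warning whenever `truncated` is true.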
--- ---
### Checkpoint 4: Semantic Search ### Checkpoint 4: Semantic Search
**Deliverable:** Working semantic search across all indexed content **Deliverable:** Working semantic search across all indexed content
**Tests:** **Automated Tests (Vitest):**
1. Run `gitlab-engine search "authentication redesign"` → returns ranked results with snippets ```
2. Golden queries: curated list of 10 queries with expected result *containment* (e.g., "at least one of these 3 known URLs appears in top 10") tests/unit/search-query.test.ts
3. `gitlab-engine search "..." --json` validates against JSON schema (stable fields present) ✓ parses filter flags (--type, --author, --after, --label)
✓ validates date format for --after
✓ handles multiple --label flags
tests/unit/rrf-ranking.test.ts
✓ computes RRF score correctly
✓ merges results from vector and FTS retrievers
✓ handles documents appearing in only one retriever
✓ respects k=60 parameter
tests/integration/vector-search.test.ts
✓ returns results for semantic query
✓ ranks similar content higher
✓ returns empty for nonsense query
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ handles porter stemming (search/searching)
✓ returns empty for non-matching query
tests/integration/hybrid-search.test.ts
✓ combines vector and FTS results
✓ applies type filter correctly
✓ applies author filter correctly
✓ applies date filter correctly
✓ applies label filter correctly
✓ falls back to FTS when Ollama unavailable
tests/e2e/golden-queries.test.ts
✓ "authentication redesign" returns known auth-related items
✓ "database migration" returns known migration items
✓ [8 more domain-specific golden queries]
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
| `gi search "authentication" --json` | JSON output | Valid JSON array with schema |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
**Golden Query Test Suite:**
Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:
```json
[
{
"query": "authentication redesign",
"expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
"minResults": 1,
"maxRank": 10
}
]
```
Each query must have at least one expected URL appear in top 10 results.
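A containment check for one fixture entry might be sketched as follows. The interpretation of `minResults` (the number of expected URLs that must appear within the top `maxRank` results) is an assumption, as is matching URLs by exact equality; the function name is hypothetical.

```typescript
interface GoldenQuery {
  query: string;
  expectedUrls: string[];
  minResults: number;
  maxRank: number;
}

// `rankedUrls` is the search output for golden.query, best result first.
function passesGoldenQuery(golden: GoldenQuery, rankedUrls: string[]): boolean {
  const top = rankedUrls.slice(0, golden.maxRank);
  const hits = golden.expectedUrls.filter((expected) => top.indexOf(expected) !== -1);
  return hits.length >= golden.minResults;
}
```

Testing containment rather than exact rank keeps the suite stable when ranking weights or the embedding model change slightly.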
**Data Integrity Checks:**
- [ ] `documents_fts` row count matches `documents` row count
- [ ] Search returns results for known content (not empty)
- [ ] JSON output validates against defined schema
- [ ] All result URLs are valid GitLab URLs
**Scope:** **Scope:**
- Hybrid retrieval: - Hybrid retrieval:
- Vector recall (sqlite-vss) + FTS lexical recall (fts5) - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
- Merge + rerank results using Reciprocal Rank Fusion (RRF) - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Result ranking and scoring (document-level) - Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|note`, `--author=username`, `--after=date`, `--label=name` - Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
- Label filtering operates on `document_labels` (indexed, exact-match) - Label filtering operates on `document_labels` (indexed, exact-match)
- Output formatting: ranked list with title, snippet, score, URL - Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI agent consumption - JSON output mode for AI agent consumption
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
**Schema Additions:** **Schema Additions:**
```sql ```sql
-- Full-text search for hybrid retrieval -- Full-text search for hybrid retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5( CREATE VIRTUAL TABLE documents_fts USING fts5(
title, title,
content_text, content_text,
content='documents', content='documents',
content_rowid='id' content_rowid='id',
tokenize='porter unicode61'
); );
-- Triggers to keep FTS in sync -- Triggers to keep FTS in sync
@@ -461,6 +784,11 @@ CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
END; END;
``` ```
**FTS5 Tokenizer Notes:**
- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize well; a custom tokenizer is a post-MVP consideration
**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:** **Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**
1. Query both vector index (top 50) and FTS5 (top 50) 1. Query both vector index (top 50) and FTS5 (top 50)
2. Merge results by document_id 2. Merge results by document_id
@@ -477,22 +805,49 @@ END;
- RRF operates on ranks, not scores, making it robust to scale differences - RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in information retrieval literature - Well-established in information retrieval literature
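For reference, the textbook form of RRF scores each document as the sum of `1 / (k + rank)` over the retrievers that returned it. The sketch below implements that standard formula with `k = 60`; the function name and the id tie-break are assumptions, and the spec's own steps may normalise or weight scores differently.

```typescript
// Each ranking is a list of document ids, best first (e.g. vector top 50, FTS top 50).
function reciprocalRankFusion(
  rankings: number[][],
  k = 60,
): Array<{ documentId: number; score: number }> {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((documentId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(documentId, (scores.get(documentId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by document id for determinism.
  return Array.from(scores, ([documentId, score]) => ({ documentId, score })).sort(
    (a, b) => b.score - a.score || a.documentId - b.documentId,
  );
}
```

A document found by only one retriever still receives a score from that single list, so it is ranked below documents that both retrievers agree on at similar positions.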
**Graceful Degradation:**
- If Ollama is unreachable during search, automatically fall back to FTS5-only
- Display warning: "Embedding service unavailable, using lexical search only"
- `embed` command fails with actionable error if Ollama is down
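The search-time fallback might be structured like this. The retriever callbacks and the outcome type are hypothetical; treating any vector-search failure as "Ollama unreachable" is a simplifying assumption, and a real implementation would probably inspect the error first.

```typescript
// Hypothetical retriever signature: returns ranked document ids for a query.
type Retriever = (query: string) => Promise<number[]>;

interface SearchOutcome {
  mode: "hybrid" | "lexical";
  vectorIds: number[];
  ftsIds: number[];
  warning?: string;
}

async function searchWithFallback(
  query: string,
  vectorSearch: Retriever,
  ftsSearch: Retriever,
): Promise<SearchOutcome> {
  const ftsIds = await ftsSearch(query);
  try {
    const vectorIds = await vectorSearch(query);
    return { mode: "hybrid", vectorIds, ftsIds };
  } catch {
    // Assumption: any failure here means the embedding service is unavailable.
    return {
      mode: "lexical",
      vectorIds: [],
      ftsIds,
      warning: "Embedding service unavailable, using lexical search only",
    };
  }
}
```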
**CLI Interface:** **CLI Interface:**
```bash ```bash
# Basic semantic search # Basic semantic search
gitlab-engine search "why did we choose Redis" gi search "why did we choose Redis"
# Pure FTS search (fallback if embeddings unavailable) # Pure FTS search (fallback if embeddings unavailable)
gitlab-engine search "redis" --mode=lexical gi search "redis" --mode=lexical
# Filtered search # Filtered search
gitlab-engine search "authentication" --type=mr --after=2024-01-01 gi search "authentication" --type=mr --after=2024-01-01
# Filter by label # Filter by label
gitlab-engine search "performance" --label=bug --label=critical gi search "performance" --label=bug --label=critical
# JSON output for programmatic use # JSON output for programmatic use
gitlab-engine search "payment processing" --json gi search "payment processing" --json
```
**CLI Output Example:**
```
$ gi search "authentication redesign"
Found 23 results (hybrid search, 0.34s)
[1] MR !847 - Refactor auth to use JWT tokens (0.82)
@johndoe · 2024-03-15 · group/project-one
"...moving away from session cookies to JWT for authentication..."
https://gitlab.example.com/group/project-one/-/merge_requests/847
[2] Issue #234 - Authentication redesign discussion (0.79)
@janedoe · 2024-02-28 · group/project-one
"...we need to redesign the authentication flow because..."
https://gitlab.example.com/group/project-one/-/issues/234
[3] Discussion on Issue #234 (0.76)
@johndoe · 2024-03-01 · group/project-one
"I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345
``` ```
--- ---
@@ -500,66 +855,142 @@ gitlab-engine search "payment processing" --json
### Checkpoint 5: Incremental Sync ### Checkpoint 5: Incremental Sync
**Deliverable:** Efficient ongoing synchronization with GitLab **Deliverable:** Efficient ongoing synchronization with GitLab
**Test:** Make a change in GitLab; run `gitlab-engine sync` → only fetches changed items; verify change appears in search **Automated Tests (Vitest):**
```
tests/unit/cursor-management.test.ts
✓ advances cursor after successful page commit
✓ uses tie-breaker id for identical timestamps
✓ does not advance cursor on failure
✓ resets cursor on --full flag
tests/unit/change-detection.test.ts
✓ detects content_hash mismatch
✓ queues document for re-embedding on change
✓ skips re-embedding when hash unchanged
tests/integration/incremental-sync.test.ts
✓ fetches only items updated after cursor
✓ refetches discussions for updated issues
✓ refetches discussions for updated MRs
✓ updates existing records (not duplicates)
✓ creates new records for new items
✓ re-embeds documents with changed content
tests/integration/sync-recovery.test.ts
✓ resumes from cursor after interrupted sync
✓ marks failed run with error message
✓ handles rate limiting (429) with backoff
✓ respects Retry-After header
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
| `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Cursor positions, last sync time | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
**End-to-End Sync Verification:**
1. Note the current `sync_cursors` values
2. Create a new comment on an issue in GitLab
3. Run `gi sync`
4. Verify:
- [ ] Issue's `updated_at` in DB matches GitLab
- [ ] New discussion row exists
- [ ] New note row exists
- [ ] New document row exists for discussion
- [ ] New embedding exists for document
- [ ] `gi search "new comment text"` returns the new discussion
- [ ] Cursor advanced past the updated issue
**Data Integrity Checks:**
- [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table
- [ ] No orphaned documents (all have valid source_id)
- [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- [ ] `sync_runs` has complete audit trail
**Scope:** **Scope:**
- Delta sync based on stable cursor (updated_at + tie-breaker id) - Delta sync based on stable cursor (updated_at + tie-breaker id)
- Dependent resources sync strategy (notes, MR changes) - Dependent resources sync strategy (discussions refetched when parent updates)
- Webhook handler (optional, if webhook access granted)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting - Sync status reporting
- Recommended: run via cron every 10 minutes
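The re-embedding item in the scope list can be illustrated with a small in-memory sketch. The row shapes and the `documentsToEmbed` name are hypothetical; a real implementation would more likely express this as a SQL join over `documents` and `embedding_metadata`. One assumption beyond the spec: a document with no `embedding_metadata` row at all is treated as needing an embedding.

```typescript
// Hypothetical row shapes mirroring the two tables named in the spec.
interface DocumentRow {
  id: number;
  contentHash: string;
}

interface EmbeddingMetadataRow {
  documentId: number;
  contentHash: string; // hash of the content that was embedded
}

// Returns ids of documents whose current hash differs from the embedded hash.
// Assumption: documents without any metadata row are also queued.
function documentsToEmbed(
  documents: DocumentRow[],
  metadata: EmbeddingMetadataRow[],
): number[] {
  const embeddedHash = new Map<number, string>();
  for (const row of metadata) embeddedHash.set(row.documentId, row.contentHash);
  return documents
    .filter((doc) => embeddedHash.get(doc.id) !== doc.contentHash)
    .map((doc) => doc.id);
}
```

Running the same function after a sync and getting an empty list corresponds to the integrity check that the two `content_hash` columns agree for all rows.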
**Correctness Rules (MVP):** **Correctness Rules (MVP):**
1. Fetch pages ordered by `updated_at ASC`; within identical timestamps, advance by `gitlab_id ASC` 1. Fetch pages ordered by `updated_at ASC`; within identical timestamps, advance by `gitlab_id ASC`
2. Cursor advances only after successful DB commit for that page 2. Cursor advances only after successful DB commit for that page
3. Dependent resources: 3. Dependent resources:
- For each updated issue/MR, refetch its notes (sorted by `updated_at`) - For each updated issue/MR, refetch ALL its discussions
- For each updated MR, refetch its file changes - Discussion documents are regenerated and re-embedded if content_hash changes
4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` 4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor) 5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
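Rules 1, 2, and 5 can be sketched as follows. All names here (`SyncCursor`, `SyncItem`, `processPage`, `commitPage`) are hypothetical, and the sketch assumes the tie-breaker comparison happens client-side because the API filter works on timestamps only. `commitPage` stands in for the database transaction: if it throws, the error propagates and the caller still holds the old cursor, so a later run can resume from it.

```typescript
// Hypothetical shapes for illustration; field names are not taken from the spec's schema.
interface SyncCursor {
  updatedAt: string; // ISO-8601 timestamp of the last committed item
  gitlabId: number;  // tie-breaker among items sharing that timestamp
}

interface SyncItem {
  gitlabId: number;
  updatedAt: string;
}

// Orders items by (updated_at ASC, gitlab_id ASC), matching rule 1.
function compareItems(a: SyncItem, b: SyncItem): number {
  const byTime = Date.parse(a.updatedAt) - Date.parse(b.updatedAt);
  return byTime !== 0 ? byTime : a.gitlabId - b.gitlabId;
}

// True when the item sorts strictly after the cursor position.
function isAfterCursor(item: SyncItem, cursor: SyncCursor | null): boolean {
  if (cursor === null) return true;
  return compareItems(item, cursor) > 0;
}

// Commits one page and returns the advanced cursor (rule 2).
// If commitPage throws, nothing is returned and the old cursor stays in effect (rule 5).
function processPage(
  page: SyncItem[],
  cursor: SyncCursor | null,
  commitPage: (items: SyncItem[]) => void, // stand-in for a DB transaction
): SyncCursor | null {
  const fresh = page
    .filter((item) => isAfterCursor(item, cursor))
    .sort(compareItems);
  if (fresh.length === 0) return cursor;
  commitPage(fresh);
  const last = fresh[fresh.length - 1];
  return { updatedAt: last.updatedAt, gitlabId: last.gitlabId };
}
```

Items that share the cursor's timestamp but have a lower or equal id are skipped, which is what keeps a resumed sync from reprocessing the same rows.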
**Why Dependent Resource Model:** **Why Dependent Resource Model:**
- GitLab Notes API doesn't provide a clean global `updated_after` stream - GitLab Discussions API doesn't provide a global `updated_after` stream
- Notes are listed per-issue or per-MR, not as a top-level resource - Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating notes as dependent resources (refetch when parent updates) is simpler and more correct - Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct
- Same applies to MR changes/diffs
**CLI Commands:** **CLI Commands:**
```bash ```bash
# Incremental sync (respects cursors, only fetches new/updated) # Incremental sync (respects cursors, only fetches new/updated)
gitlab-engine sync gi sync
# Force full re-sync (resets cursors) # Force full re-sync (resets cursors)
gitlab-engine sync --full gi sync --full
# Override stale 'running' run after operator review # Override stale 'running' run after operator review
gitlab-engine sync --force gi sync --force
# Show sync status # Show sync status
gitlab-engine sync-status gi sync-status
``` ```
--- ---
## Future Checkpoints (Post-MVP) ## Future Work (Post-MVP)
### Checkpoint 6: File/Feature History View The following features are explicitly deferred to keep MVP scope focused:
- Map commits to MRs to discussions
- Query: "Show decision history for src/auth/login.ts"
- Ship `gitlab-engine file-history <path>` as a first-class feature here
- This command is deferred from MVP to sharpen checkpoint focus
### Checkpoint 7: Personal Dashboard | Feature | Description | Depends On |
- Filter by assigned/mentioned |---------|-------------|------------|
- Integrate with existing gitlab-inbox tool | **File History** | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |
### Checkpoint 8: Person Context **Checkpoint 6 (File History) Schema Preview:**
- Aggregate contributions by author ```sql
- Expertise inference from activity -- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER REFERENCES merge_requests(id),
old_path TEXT,
new_path TEXT,
new_file BOOLEAN,
deleted_file BOOLEAN,
renamed_file BOOLEAN,
UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);
### Checkpoint 9: Decision Graph -- DiffNote position data (for "show me comments on this file" queries)
- Extract decisions from discussions (LLM-assisted) -- Populated from notes.type='DiffNote' position object in GitLab API
- Visualize decision relationships CREATE TABLE note_positions (
note_id INTEGER PRIMARY KEY REFERENCES notes(id),
old_path TEXT,
new_path TEXT,
old_line INTEGER,
new_line INTEGER,
position_type TEXT -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```
--- ---
@@ -606,14 +1037,15 @@ Each checkpoint includes:
| labels | 1 | Label definitions (unique by project + name) | | labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction | | issue_labels | 1 | Issue-label junction |
| merge_requests | 2 | Normalized MRs | | merge_requests | 2 | Normalized MRs |
| notes | 2 | Issue and MR comments (with parent FKs) | | discussions | 2 | Discussion threads (the semantic unit for conversations) |
| mr_files | 2 | MR file changes (with rename tracking) | | notes | 2 | Individual comments within discussions |
| mr_labels | 2 | MR-label junction | | mr_labels | 2 | MR-label junction |
| documents | 3 | Unified searchable documents | | documents | 3 | Unified searchable documents (issues, MRs, discussions) |
| document_labels | 3 | Document-label junction for fast filtering | | document_labels | 3 | Document-label junction for fast filtering |
| embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) |
| embedding_metadata | 3 | Embedding provenance + change detection | | embedding_metadata | 3 | Embedding provenance + change detection |
| documents_fts | 4 | Full-text search index (fts5) | | documents_fts | 4 | Full-text search index (fts5 with porter stemmer) |
| mr_files | 6 | MR file changes (deferred to File History feature) |
--- ---
@@ -621,14 +1053,19 @@ Each checkpoint includes:
| Question | Decision | Rationale | | Question | Decision | Rationale |
|----------|----------|-----------| |----------|----------|-----------|
| Commit/file linkage | **Include MR→file links** | Enables "what MRs touched this file?" without full commit history | | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread |
| System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value |
| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls |
| Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | | Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available |
| Sync method | **Polling for MVP** | Decide on webhooks after using the system | | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient |
| Notes sync | **Dependent resource** | Notes API is per-parent, not global; refetch on parent update | | Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts |
| file-history CLI | **Post-MVP (CP6)** | Sharpens MVP checkpoint focus | | Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 |
| Embedding batching | **32 documents per batch** | Balance between throughput and memory |
| FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text |
| Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching |
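
The "RRF over weighted sums" decision in the table above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. The function name and the constant `k = 60` are assumptions (60 is a commonly used default, not a value stated in this spec). Each input is a list of document ids ordered best-first, for example one list from FTS5 and one from vector search.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank),
// where rank is the 1-based position of d in that list.
// Hypothetical helper; k = 60 is an assumed default.
function reciprocalRankFusion(
  rankings: number[][],
  k = 60,
): Array<{ id: number; score: number }> {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first; ties broken by id for a stable order.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id - b.id,
  );
}
```

Only ranks are used, so the raw FTS5 and vector scores never need to be put on a common scale. With a single input list the output keeps that list's order, which fits the FTS5-only fallback when Ollama is unavailable.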
--- ---