From 97a303eca943ec56b4476024f2a42287292eb9c8 Mon Sep 17 00:00:00 2001
From: teernisse
Date: Tue, 20 Jan 2026 16:26:27 -0500
Subject: [PATCH] Spec iterations

---
 SPEC.md | 645 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 541 insertions(+), 104 deletions(-)

diff --git a/SPEC.md b/SPEC.md
index 3279dc8..0d242bf 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -2,7 +2,7 @@

 ## Executive Summary

-A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP.
+A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.

 ---

@@ -89,6 +89,56 @@ A self-hosted tool to extract, index, and semantically search 2+ years of GitLab

 ---

+## GitLab API Strategy
+
+### Primary Resources (Bulk Fetch)
+
+Issues and MRs support efficient bulk fetching with incremental sync:
+
+```
+GET /projects/:id/issues?updated_after=X&order_by=updated_at&sort=asc&per_page=100
+GET /projects/:id/merge_requests?updated_after=X&order_by=updated_at&sort=asc&per_page=100
+```
+
+### Dependent Resources (Per-Parent Fetch)
+
+Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
+
+```
+GET /projects/:id/issues/:iid/discussions
+GET /projects/:id/merge_requests/:iid/discussions
+```
+
+### Sync Pattern
+
+**Initial sync:**
+1. 
Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
+2. For EACH issue → fetch all discussions (~3K calls)
+3. Fetch all MRs (paginated, ~60 calls)
+4. For EACH MR → fetch all discussions (~3K calls)
+5. Total: ~6,100+ API calls for initial sync
+
+**Incremental sync:**
+1. Fetch issues where `updated_after=cursor` (bulk)
+2. For EACH updated issue → refetch ALL its discussions
+3. Fetch MRs where `updated_after=cursor` (bulk)
+4. For EACH updated MR → refetch ALL its discussions
+
+### Critical Assumption
+
+**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.
+
+Mitigation: Periodic full re-sync (weekly) as a safety net.
+
+### Rate Limiting
+
+- Default: 10 requests/second with exponential backoff
+- Respect `Retry-After` headers on 429 responses
+- Add jitter to avoid thundering herd on retry
+- Initial sync estimate: 10-20 minutes depending on rate limits
+
+---
+
 ## Checkpoint Structure

 Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
@@ -96,13 +146,39 @@ Each checkpoint is a **testable milestone** where a human can validate the syste
 ### Checkpoint 0: Project Setup
 **Deliverable:** Scaffolded project with GitLab API connection verified

-**Tests:**
-1. Run `gitlab-engine auth-test` → returns authenticated user info
-2. 
Run `gitlab-engine doctor` → verifies:
-   - Can reach GitLab baseUrl
-   - PAT is present and can read configured projects
-   - SQLite opens DB and migrations apply
-   - Ollama reachable OR embedding disabled with clear warning
+**Automated Tests (Vitest):**
+```
+tests/unit/config.test.ts
+  ✓ loads config from gi.config.json
+  ✓ throws if config file missing
+  ✓ throws if required fields missing (baseUrl, projects)
+  ✓ validates project paths are non-empty strings
+
+tests/unit/db.test.ts
+  ✓ creates database file if not exists
+  ✓ applies migrations in order
+  ✓ sets WAL journal mode
+  ✓ enables foreign keys
+
+tests/integration/gitlab-client.test.ts
+  ✓ authenticates with valid PAT
+  ✓ returns 401 for invalid PAT
+  ✓ fetches project by path
+  ✓ handles rate limiting (429) with retry
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
+| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
+| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
+| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
+
+**Data Integrity Checks:**
+- [ ] `projects` table contains rows for each configured project path
+- [ ] `gitlab_project_id` matches actual GitLab project IDs
+- [ ] `raw_payloads` contains project JSON for each synced project

 **Scope:**
 - Project structure (TypeScript, ESLint, Vitest)
@@ -114,7 +190,7 @@ Each checkpoint is a **testable milestone** where a human can validate the syste

 **Configuration (MVP):**
 ```json
-// gitlab-engine.config.json
+// gi.config.json
 {
   "gitlab": {
     "baseUrl": "https://gitlab.example.com",
@@ -189,7 +265,50 @@ CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);

 ### 
Checkpoint 1: Issue Ingestion
 **Deliverable:** All issues from target repos stored locally

-**Test:** Run `gitlab-engine ingest --type=issues` → count matches GitLab; run `gitlab-engine list issues --limit=10` → displays issues correctly
+**Automated Tests (Vitest):**
+```
+tests/unit/issue-transformer.test.ts
+  ✓ transforms GitLab issue payload to normalized schema
+  ✓ extracts labels from issue payload
+  ✓ handles missing optional fields gracefully
+
+tests/unit/pagination.test.ts
+  ✓ fetches all pages when multiple exist
+  ✓ respects per_page parameter
+  ✓ stops when empty page returned
+
+tests/integration/issue-ingestion.test.ts
+  ✓ inserts issues into database
+  ✓ creates labels from issue payloads
+  ✓ links issues to labels via junction table
+  ✓ stores raw payload for each issue
+  ✓ updates cursor after successful page commit
+  ✓ resumes from cursor on subsequent runs
+
+tests/integration/sync-runs.test.ts
+  ✓ creates sync_run record on start
+  ✓ marks run as succeeded on completion
+  ✓ marks run as failed with error message on failure
+  ✓ refuses concurrent run (single-flight)
+  ✓ allows --force to override stale running status
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
+| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
+| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
+| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
+| `gi show issue 123` | Issue detail view | Shows title, description, labels, URL |
+| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
+
+**Data Integrity Checks:**
+- [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects
+- [ ] Every issue has a corresponding `raw_payloads` row
+- [ ] Labels 
in `issue_labels` junction all exist in `labels` table
+- [ ] `sync_cursors` has entry for each (project_id, 'issues') pair
+- [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)

 **Scope:**
 - Issue fetcher with pagination handling
@@ -250,21 +369,70 @@ CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);

 ---

-### Checkpoint 2: MR + Comments + File Links Ingestion
-**Deliverable:** All MRs, discussion threads, and file-change links stored locally
+### Checkpoint 2: MR + Discussions Ingestion
+**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context

-**Test:** Run `gitlab-engine ingest --type=merge_requests` → count matches; run `gitlab-engine show mr 1234` → displays MR with comments and files changed
+**Automated Tests (Vitest):**
+```
+tests/unit/mr-transformer.test.ts
+  ✓ transforms GitLab MR payload to normalized schema
+  ✓ extracts labels from MR payload
+  ✓ handles missing optional fields gracefully
+
+tests/unit/discussion-transformer.test.ts
+  ✓ transforms discussion payload to normalized schema
+  ✓ extracts notes array from discussion
+  ✓ sets individual_note flag correctly
+  ✓ filters out system notes (system: true)
+  ✓ preserves note order via position field
+
+tests/integration/mr-ingestion.test.ts
+  ✓ inserts MRs into database
+  ✓ creates labels from MR payloads
+  ✓ links MRs to labels via junction table
+  ✓ stores raw payload for each MR
+
+tests/integration/discussion-ingestion.test.ts
+  ✓ fetches discussions for each issue
+  ✓ fetches discussions for each MR
+  ✓ creates discussion rows with correct parent FK
+  ✓ creates note rows linked to discussions
+  ✓ excludes system notes from storage
+  ✓ captures note-level resolution status
+  ✓ captures note type (DiscussionNote, DiffNote)
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi ingest --type=merge_requests` | Progress 
bar, final count | Completes without error |
+| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
+| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
+| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
+| `gi show issue 456` | Issue detail with discussions | Shows title, description, discussion threads |
+| `gi count discussions` | `Discussions: 12,345` | Non-zero count |
+| `gi count notes` | `Notes: 45,678` | Non-zero count, no system notes |
+
+**Data Integrity Checks:**
+- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
+- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
+- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
+- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
+- [ ] Every discussion has at least one note
+- [ ] `individual_note = true` discussions have exactly one note
+- [ ] Discussion `first_note_at` <= `last_note_at` for all rows

 **Scope:**
 - MR fetcher with pagination
-- Notes fetcher (issue notes + MR notes) as a dependent resource:
-  - During initial ingest: fetch notes for every issue/MR
-  - During sync: refetch notes only for issues/MRs updated since cursor
-- MR changes/diffs fetcher as a dependent resource:
-  - During initial ingest: fetch changes for every MR
-  - During sync: refetch changes only for MRs updated since cursor
-- Relationship linking (note → parent issue/MR via foreign keys, MR → files)
-- Extended CLI commands for MR display
+- Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
+  - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
+  - During initial ingest: fetch discussions for every issue/MR
+  - During sync: refetch discussions only for issues/MRs updated since 
cursor
+  - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
+- Relationship linking (discussion → parent issue/MR, notes → discussion)
+- Extended CLI commands for MR/issue display with threads
+
+**Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.

 **Schema Additions:**
 ```sql
@@ -288,44 +456,49 @@ CREATE TABLE merge_requests (
 CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
 CREATE INDEX idx_mrs_author ON merge_requests(author_username);

--- Notes with explicit parent foreign keys for referential integrity
-CREATE TABLE notes (
+-- Discussion threads (the semantic unit for conversations)
+CREATE TABLE discussions (
   id INTEGER PRIMARY KEY,
-  gitlab_id INTEGER UNIQUE NOT NULL,
+  gitlab_discussion_id TEXT UNIQUE NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
   project_id INTEGER NOT NULL REFERENCES projects(id),
   issue_id INTEGER REFERENCES issues(id),
   merge_request_id INTEGER REFERENCES merge_requests(id),
-  noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
-  noteable_iid INTEGER NOT NULL, -- parent IID (from API path)
-  author_username TEXT,
-  body TEXT,
-  created_at INTEGER,
-  updated_at INTEGER,
-  system BOOLEAN,
-  raw_payload_id INTEGER REFERENCES raw_payloads(id),
-  -- Exactly one parent FK must be set
+  noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
+  individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
+  first_note_at INTEGER, -- for ordering discussions
+  last_note_at INTEGER, -- for "recently active" queries
+  resolvable BOOLEAN, -- MR discussions can be resolved
+  resolved BOOLEAN,
   CHECK (
     (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
     (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
   )
 );
-CREATE INDEX idx_notes_issue ON notes(issue_id);
-CREATE 
INDEX idx_notes_mr ON notes(merge_request_id);
-CREATE INDEX idx_notes_author ON notes(author_username);
+CREATE INDEX idx_discussions_issue ON discussions(issue_id);
+CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
+CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

--- File linkage for "what MRs touched this file?" queries (with rename support)
-CREATE TABLE mr_files (
+-- Notes belong to discussions (preserving thread context)
+CREATE TABLE notes (
   id INTEGER PRIMARY KEY,
-  merge_request_id INTEGER REFERENCES merge_requests(id),
-  old_path TEXT,
-  new_path TEXT,
-  new_file BOOLEAN,
-  deleted_file BOOLEAN,
-  renamed_file BOOLEAN,
-  UNIQUE(merge_request_id, old_path, new_path)
+  gitlab_id INTEGER UNIQUE NOT NULL,
+  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
+  project_id INTEGER NOT NULL REFERENCES projects(id),
+  type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
+  author_username TEXT,
+  body TEXT,
+  created_at INTEGER,
+  updated_at INTEGER,
+  position INTEGER, -- derived from array order in API response (0-indexed)
+  resolvable BOOLEAN, -- note-level resolvability (MR code comments)
+  resolved BOOLEAN, -- note-level resolution status
+  resolved_by TEXT, -- username who resolved
+  resolved_at INTEGER, -- when resolved
+  raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
-CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
-CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);
+CREATE INDEX idx_notes_discussion ON notes(discussion_id);
+CREATE INDEX idx_notes_author ON notes(author_username);
+CREATE INDEX idx_notes_type ON notes(type);

 -- MR labels (reuse same labels table)
 CREATE TABLE mr_labels (
@@ -336,39 +509,96 @@ CREATE TABLE mr_labels (
 );
 CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
 ```

+**Discussion Processing Rules:**
+- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
+- Each discussion from the API becomes one row in the `discussions` table
+- All notes within a discussion are stored with their `discussion_id` foreign key
+- `individual_note: true` discussions have exactly one note (standalone comment)
+- `individual_note: false` discussions can have multiple notes (threaded conversation)
+

 ---

 ### Checkpoint 3: Embedding Generation
 **Deliverable:** Vector embeddings generated for all text content

-**Test:** Run `gitlab-engine embed --all` → progress indicator; run `gitlab-engine stats` → shows embedding coverage percentage
+**Automated Tests (Vitest):**
+```
+tests/unit/document-extractor.test.ts
+  ✓ extracts issue document (title + description)
+  ✓ extracts MR document (title + description)
+  ✓ extracts discussion document with full thread context
+  ✓ includes parent issue/MR title in discussion header
+  ✓ formats notes with author and timestamp
+  ✓ truncates content exceeding 8000 tokens
+  ✓ preserves first and last notes when truncating middle
+  ✓ computes SHA-256 content hash consistently
+
+tests/unit/embedding-client.test.ts
+  ✓ connects to Ollama API
+  ✓ generates embedding for text input
+  ✓ returns 768-dimension vector
+  ✓ handles Ollama connection failure gracefully
+  ✓ batches requests (32 documents per batch)
+
+tests/integration/document-creation.test.ts
+  ✓ creates document for each issue
+  ✓ creates document for each MR
+  ✓ creates document for each discussion
+  ✓ populates document_labels junction table
+  ✓ computes content_hash for each document
+
+tests/integration/embedding-storage.test.ts
+  ✓ stores embedding in sqlite-vss
+  ✓ embedding rowid matches document id
+  ✓ creates embedding_metadata record
+  ✓ skips re-embedding when content_hash unchanged
+  ✓ re-embeds when content_hash changes
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi embed --all` | Progress bar with ETA | Completes without error |
+| `gi embed --all` (re-run) 
| `0 documents to embed` | Skips already-embedded docs |
+| `gi stats` | Embedding coverage stats | Shows 100% coverage |
+| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
+| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
+
+**Data Integrity Checks:**
+- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
+- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
+- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
+- [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash`
+- [ ] `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` logs truncation warnings
+- [ ] Discussion documents include parent title in content_text

 **Scope:**
 - Ollama integration (nomic-embed-text model)
-- Embedding generation pipeline (batch processing)
+- Embedding generation pipeline (batch processing, 32 documents per batch)
 - Vector storage in SQLite (sqlite-vss extension)
 - Progress tracking and resumability
 - Document extraction layer:
-  - Canonical "search documents" derived from issues/MRs/notes
+  - Canonical "search documents" derived from issues/MRs/discussions
   - Stable content hashing for change detection (SHA-256 of content_text)
   - Single embedding per document (chunking deferred to post-MVP)
+  - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
   - Denormalized metadata for fast filtering (author, labels, dates)
   - Fast label filtering via `document_labels` join table

 **Schema Additions:**
 ```sql
--- Unified searchable documents (derived from issues/MRs/notes)
+-- Unified searchable documents (derived from issues/MRs/discussions)
 CREATE TABLE documents (
   id INTEGER PRIMARY KEY,
-  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'note'
+  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
   source_id INTEGER NOT NULL, -- local DB id in the 
source table
   project_id INTEGER NOT NULL REFERENCES projects(id),
-  author_username TEXT,
+  author_username TEXT, -- for discussions: first note author
   label_names TEXT, -- JSON array (display/debug only)
   created_at INTEGER,
   updated_at INTEGER,
   url TEXT,
-  title TEXT, -- null for notes
+  title TEXT, -- null for discussions
   content_text TEXT NOT NULL, -- canonical text for embedding/snippets
   content_hash TEXT NOT NULL, -- SHA-256 for change detection
   UNIQUE(source_type, source_id)
@@ -408,38 +638,131 @@ CREATE TABLE embedding_metadata (
 - This alignment simplifies joins and eliminates rowid mapping fragility

 **Document Extraction Rules:**
-- Issue → title + "\n\n" + description
-- MR → title + "\n\n" + description
-- Note → body (skip system notes unless they contain meaningful content)
+
+| Source | content_text Construction |
+|--------|--------------------------|
+| Issue | `title + "\n\n" + description` |
+| MR | `title + "\n\n" + description` |
+| Discussion | Full thread with context (see below) |
+
+**Discussion Document Format:**
+```
+[Issue #234: Authentication redesign] Discussion
+
+@johndoe (2024-03-15):
+I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
+
+@janedoe (2024-03-15):
+Agreed. What about refresh token strategy?
+
+@johndoe (2024-03-16):
+Short-lived access tokens (15min), longer refresh (7 days). Here's why...
+```
+
+This format preserves:
+- Parent context (issue/MR title and number)
+- Author attribution for each note
+- Temporal ordering of the conversation
+- Full thread semantics for decision traceability
+
+**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning.

 ---

 ### Checkpoint 4: Semantic Search
 **Deliverable:** Working semantic search across all indexed content

-**Tests:**
-1. Run `gitlab-engine search "authentication redesign"` → returns ranked results with snippets
-2. 
Golden queries: curated list of 10 queries with expected result *containment* (e.g., "at least one of these 3 known URLs appears in top 10")
-3. `gitlab-engine search "..." --json` validates against JSON schema (stable fields present)
+**Automated Tests (Vitest):**
+```
+tests/unit/search-query.test.ts
+  ✓ parses filter flags (--type, --author, --after, --label)
+  ✓ validates date format for --after
+  ✓ handles multiple --label flags
+
+tests/unit/rrf-ranking.test.ts
+  ✓ computes RRF score correctly
+  ✓ merges results from vector and FTS retrievers
+  ✓ handles documents appearing in only one retriever
+  ✓ respects k=60 parameter
+
+tests/integration/vector-search.test.ts
+  ✓ returns results for semantic query
+  ✓ ranks similar content higher
+  ✓ returns empty for nonsense query
+
+tests/integration/fts-search.test.ts
+  ✓ returns exact keyword matches
+  ✓ handles porter stemming (search/searching)
+  ✓ returns empty for non-matching query
+
+tests/integration/hybrid-search.test.ts
+  ✓ combines vector and FTS results
+  ✓ applies type filter correctly
+  ✓ applies author filter correctly
+  ✓ applies date filter correctly
+  ✓ applies label filter correctly
+  ✓ falls back to FTS when Ollama unavailable
+
+tests/e2e/golden-queries.test.ts
+  ✓ "authentication redesign" returns known auth-related items
+  ✓ "database migration" returns known migration items
+  ✓ [8 more domain-specific golden queries]
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
+| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
+| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
+| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
+| `gi search "authentication" --label=bug` | Label filtered 
| All results have bug label |
+| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
+| `gi search "authentication" --json` | JSON output | Valid JSON array with schema |
+| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
+| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
+
+**Golden Query Test Suite:**
+Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:
+```json
+[
+  {
+    "query": "authentication redesign",
+    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
+    "minResults": 1,
+    "maxRank": 10
+  }
+]
+```
+Each query must have at least one expected URL appear in top 10 results.
+
+**Data Integrity Checks:**
+- [ ] `documents_fts` row count matches `documents` row count
+- [ ] Search returns results for known content (not empty)
+- [ ] JSON output validates against defined schema
+- [ ] All result URLs are valid GitLab URLs

 **Scope:**
 - Hybrid retrieval:
   - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
   - Merge + rerank results using Reciprocal Rank Fusion (RRF)
 - Result ranking and scoring (document-level)
-- Search filters: `--type=issue|mr|note`, `--author=username`, `--after=date`, `--label=name`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
 - Label filtering operates on `document_labels` (indexed, exact-match)
 - Output formatting: ranked list with title, snippet, score, URL
 - JSON output mode for AI agent consumption
+- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning

 **Schema Additions:**
 ```sql
 -- Full-text search for hybrid retrieval
+-- Using porter stemmer for better matching of word variants
 CREATE VIRTUAL TABLE documents_fts USING fts5(
   title,
   content_text,
   content='documents',
-  content_rowid='id'
+  content_rowid='id',
+  tokenize='porter unicode61'
 );

 -- Triggers to keep FTS in sync
@@ -461,6 +784,11 
@@ CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
 END;
 ```

+**FTS5 Tokenizer Notes:**
+- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
+- `unicode61` handles Unicode properly
+- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer
+
 **Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**
 1. Query both vector index (top 50) and FTS5 (top 50)
 2. Merge results by document_id
@@ -477,22 +805,49 @@ END;
 - RRF operates on ranks, not scores, making it robust to scale differences
 - Well-established in information retrieval literature

+**Graceful Degradation:**
+- If Ollama is unreachable during search, automatically fall back to FTS5-only
+- Display warning: "Embedding service unavailable, using lexical search only"
+- `embed` command fails with actionable error if Ollama is down
+
 **CLI Interface:**
 ```bash
 # Basic semantic search
-gitlab-engine search "why did we choose Redis"
+gi search "why did we choose Redis"

 # Pure FTS search (fallback if embeddings unavailable)
-gitlab-engine search "redis" --mode=lexical
+gi search "redis" --mode=lexical

 # Filtered search
-gitlab-engine search "authentication" --type=mr --after=2024-01-01
+gi search "authentication" --type=mr --after=2024-01-01

 # Filter by label
-gitlab-engine search "performance" --label=bug --label=critical
+gi search "performance" --label=bug --label=critical

 # JSON output for programmatic use
-gitlab-engine search "payment processing" --json
+gi search "payment processing" --json
+```
+
+**CLI Output Example:**
+```
+$ gi search "authentication redesign"
+
+Found 23 results (hybrid search, 0.34s)
+
+[1] MR !847 - Refactor auth to use JWT tokens (0.82)
+  @johndoe · 2024-03-15 · group/project-one
+  "...moving away from session cookies to JWT for authentication..."
+  https://gitlab.example.com/group/project-one/-/merge_requests/847
+
+[2] Issue #234 - Authentication redesign discussion (0.79)
+  @janedoe · 2024-02-28 · group/project-one
+  "...we need to redesign the authentication flow because..."
+  https://gitlab.example.com/group/project-one/-/issues/234
+
+[3] Discussion on Issue #234 (0.76)
+  @johndoe · 2024-03-01 · group/project-one
+  "I think we should move to JWT-based auth because the session..."
+  https://gitlab.example.com/group/project-one/-/issues/234#note_12345
 ```

 ---
@@ -500,66 +855,142 @@ gitlab-engine search "payment processing" --json
 ### Checkpoint 5: Incremental Sync
 **Deliverable:** Efficient ongoing synchronization with GitLab

-**Test:** Make a change in GitLab; run `gitlab-engine sync` → only fetches changed items; verify change appears in search
+**Automated Tests (Vitest):**
+```
+tests/unit/cursor-management.test.ts
+  ✓ advances cursor after successful page commit
+  ✓ uses tie-breaker id for identical timestamps
+  ✓ does not advance cursor on failure
+  ✓ resets cursor on --full flag
+
+tests/unit/change-detection.test.ts
+  ✓ detects content_hash mismatch
+  ✓ queues document for re-embedding on change
+  ✓ skips re-embedding when hash unchanged
+
+tests/integration/incremental-sync.test.ts
+  ✓ fetches only items updated after cursor
+  ✓ refetches discussions for updated issues
+  ✓ refetches discussions for updated MRs
+  ✓ updates existing records (not duplicates)
+  ✓ creates new records for new items
+  ✓ re-embeds documents with changed content
+
+tests/integration/sync-recovery.test.ts
+  ✓ resumes from cursor after interrupted sync
+  ✓ marks failed run with error message
+  ✓ handles rate limiting (429) with backoff
+  ✓ respects Retry-After header
+```
+
+**Manual CLI Smoke Tests:**
+| Command | Expected Output | Pass Criteria |
+|---------|-----------------|---------------|
+| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
+| `gi sync` (after 
GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
+| `gi sync --full` | Full re-sync progress | Resets cursors, fetches everything |
+| `gi sync-status` | Cursor positions, last sync time | Shows current state |
+| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
+| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
+
+**End-to-End Sync Verification:**
+1. Note the current `sync_cursors` values
+2. Create a new comment on an issue in GitLab
+3. Run `gi sync`
+4. Verify:
+   - [ ] Issue's `updated_at` in DB matches GitLab
+   - [ ] New discussion row exists
+   - [ ] New note row exists
+   - [ ] New document row exists for discussion
+   - [ ] New embedding exists for document
+   - [ ] `gi search "new comment text"` returns the new discussion
+   - [ ] Cursor advanced past the updated issue
+
+**Data Integrity Checks:**
+- [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table
+- [ ] No orphaned documents (all have valid source_id)
+- [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows
+- [ ] `sync_runs` has complete audit trail

 **Scope:**
 - Delta sync based on stable cursor (updated_at + tie-breaker id)
-- Dependent resources sync strategy (notes, MR changes)
-- Webhook handler (optional, if webhook access granted)
+- Dependent resources sync strategy (discussions refetched when parent updates)
 - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
 - Sync status reporting
+- Recommended: run via cron every 10 minutes

 **Correctness Rules (MVP):**
 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
 2. Cursor advances only after successful DB commit for that page
 3. 
Dependent resources: - - For each updated issue/MR, refetch its notes (sorted by `updated_at`) - - For each updated MR, refetch its file changes + - For each updated issue/MR, refetch ALL its discussions + - Discussion documents are regenerated and re-embedded if content_hash changes 4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` 5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor) **Why Dependent Resource Model:** -- GitLab Notes API doesn't provide a clean global `updated_after` stream -- Notes are listed per-issue or per-MR, not as a top-level resource -- Treating notes as dependent resources (refetch when parent updates) is simpler and more correct -- Same applies to MR changes/diffs +- GitLab Discussions API doesn't provide a global `updated_after` stream +- Discussions are listed per-issue or per-MR, not as a top-level resource +- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct **CLI Commands:** ```bash # Full sync (respects cursors, only fetches new/updated) -gitlab-engine sync +gi sync # Force full re-sync (resets cursors) -gitlab-engine sync --full +gi sync --full # Override stale 'running' run after operator review -gitlab-engine sync --force +gi sync --force # Show sync status -gitlab-engine sync-status +gi sync-status ``` --- -## Future Checkpoints (Post-MVP) +## Future Work (Post-MVP) -### Checkpoint 6: File/Feature History View -- Map commits to MRs to discussions -- Query: "Show decision history for src/auth/login.ts" -- Ship `gitlab-engine file-history ` as a first-class feature here -- This command is deferred from MVP to sharpen checkpoint focus +The following features are explicitly deferred to keep MVP scope focused: -### Checkpoint 7: Personal Dashboard -- Filter by assigned/mentioned -- Integrate with existing gitlab-inbox tool +| Feature | Description | Depends On | 
+|---------|-------------|------------| +| **File History** | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete | +| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete | +| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete | +| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration | +| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 | +| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 | -### Checkpoint 8: Person Context -- Aggregate contributions by author -- Expertise inference from activity +**Checkpoint 6 (File History) Schema Preview:** +```sql +-- Deferred from MVP; added when file-history feature is built +CREATE TABLE mr_files ( + id INTEGER PRIMARY KEY, + merge_request_id INTEGER REFERENCES merge_requests(id), + old_path TEXT, + new_path TEXT, + new_file BOOLEAN, + deleted_file BOOLEAN, + renamed_file BOOLEAN, + UNIQUE(merge_request_id, old_path, new_path) +); +CREATE INDEX idx_mr_files_old_path ON mr_files(old_path); +CREATE INDEX idx_mr_files_new_path ON mr_files(new_path); -### Checkpoint 9: Decision Graph -- Extract decisions from discussions (LLM-assisted) -- Visualize decision relationships +-- DiffNote position data (for "show me comments on this file" queries) +-- Populated from notes.type='DiffNote' position object in GitLab API +CREATE TABLE note_positions ( + note_id INTEGER PRIMARY KEY REFERENCES notes(id), + old_path TEXT, + new_path TEXT, + old_line INTEGER, + new_line INTEGER, + position_type TEXT -- 'text' | 'image' | etc. 
+); +CREATE INDEX idx_note_positions_new_path ON note_positions(new_path); +``` --- @@ -606,14 +1037,15 @@ Each checkpoint includes: | labels | 1 | Label definitions (unique by project + name) | | issue_labels | 1 | Issue-label junction | | merge_requests | 2 | Normalized MRs | -| notes | 2 | Issue and MR comments (with parent FKs) | -| mr_files | 2 | MR file changes (with rename tracking) | +| discussions | 2 | Discussion threads (the semantic unit for conversations) | +| notes | 2 | Individual comments within discussions | | mr_labels | 2 | MR-label junction | -| documents | 3 | Unified searchable documents | +| documents | 3 | Unified searchable documents (issues, MRs, discussions) | | document_labels | 3 | Document-label junction for fast filtering | | embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | | embedding_metadata | 3 | Embedding provenance + change detection | -| documents_fts | 4 | Full-text search index (fts5) | +| documents_fts | 4 | Full-text search index (fts5 with porter stemmer) | +| mr_files | 6 | MR file changes (deferred to File History feature) | --- @@ -621,14 +1053,19 @@ Each checkpoint includes: | Question | Decision | Rationale | |----------|----------|-----------| -| Commit/file linkage | **Include MR→file links** | Enables "what MRs touched this file?" 
without full commit history | +| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability; individual notes are meaningless without their thread | +| System notes | **Exclude during ingestion** | System notes (assignments, label changes) add noise without semantic value | +| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature; reduces initial API calls | | Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | -| Sync method | **Polling for MVP** | Decide on webhooks after using the system | -| Notes sync | **Dependent resource** | Notes API is per-parent, not global; refetch on parent update | +| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient | +| Discussions sync | **Dependent resource model** | Discussions API is per-parent, not global; refetch all discussions when parent updates | | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | -| file-history CLI | **Post-MVP (CP6)** | Sharpens MVP checkpoint focus | +| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context; nomic-embed-text limit is 8192 | +| Embedding batching | **32 documents per batch** | Balance between throughput and memory | +| FTS5 tokenizer | **porter unicode61** | Stemming improves recall; unicode61 handles international text | +| Ollama unavailable | **Graceful degradation to FTS5** | Search still works, just without semantic matching | ---
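
The "Hybrid ranking" decision above (RRF over weighted sums) can be illustrated with a minimal sketch. It assumes each retriever (FTS5 and vector search) returns an ordered list of document ids. The function name `rrf_fuse`, the sample ids, and the constant `k = 60` (a commonly used default for Reciprocal Rank Fusion) are assumptions for illustration, not values taken from this spec.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks are used, so the retrievers' raw
    scores never need to be normalized against each other.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: document ids from FTS5 and from vector search.
fts_hits = [12, 7, 3]
vector_hits = [7, 42, 12]
fused = rrf_fuse([fts_hits, vector_hits])
```

Documents found by both retrievers (7 and 12 here) rise above documents found by only one. Passing a single list returns it in its original order, which is one way the "Graceful degradation to FTS5" decision could fall out of the same code path when Ollama is unavailable.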