# GitLab Knowledge Engine - Spec Document > **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation. ## Executive Summary A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability. --- ## Quick Start ### Prerequisites | Requirement | Version | Notes | |-------------|---------|-------| | Node.js | 20+ | LTS recommended | | npm | 10+ | Comes with Node.js | | Ollama | Latest | Optional for semantic search; lexical search works without it | ### Installation ```bash # Clone and install git clone https://github.com/your-org/gitlab-inbox.git cd gitlab-inbox npm install npm run build npm link # Makes `gi` available globally ``` ### First Run 1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope): ```bash export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx" ``` 2. **Run the setup wizard:** ```bash gi init ``` This creates `gi.config.json` with your GitLab URL and project paths. 3. **Verify your environment:** ```bash gi doctor ``` All checks should pass (Ollama warning is OK if you only need lexical search). 4. **Sync your data:** ```bash gi sync ``` Initial sync takes 10-20 minutes depending on repo size and rate limits. 5. 
**Search:** ```bash gi search "authentication redesign" ``` ### Troubleshooting First Run | Symptom | Solution | |---------|----------| | `Config file not found` | Run `gi init` first | | `GITLAB_TOKEN not set` | Export the environment variable | | `401 Unauthorized` | Check token has `read_api` scope | | `Project not found: group/project` | Verify project path in GitLab URL | | `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search | --- ## Discovery Summary ### Pain Points Identified 1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find 2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions ### Constraints | Constraint | Detail | |------------|--------| | Hosting | Self-hosted only, no external APIs | | Compute | Local dev machine (M-series Mac assumed) | | GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) | | Build Method | AI agents will implement; user is TypeScript expert for review | ### Target Use Cases (Priority Order) 1. **MVP: Semantic Search** - "Find discussions about authentication redesign" 2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?" 3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?" 4. **Future: Person Context** - "What's @johndoe's background in this project?" 
--- ## Architecture Overview ``` ┌─────────────────────────────────────────────────────────────────┐ │ GitLab API │ │ (Issues, MRs, Notes) │ └─────────────────────────────────────────────────────────────────┘ (Commit-level indexing explicitly post-MVP) │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Data Ingestion Layer │ │ - Incremental sync (PAT-based polling) │ │ - Rate limiting / backoff │ │ - Raw JSON storage for replay │ │ - Dependent resource fetching (notes, MR changes) │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Data Processing Layer │ │ - Normalize artifacts to unified schema │ │ - Extract searchable documents (canonical text + metadata) │ │ - Content hashing for change detection │ │ - MVP relationships: parent-child FKs + label/path associations│ │ (full cross-entity "decision graph" is post-MVP scope) │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Storage Layer │ │ - SQLite + sqlite-vec + FTS5 (hybrid search) │ │ - Structured metadata in relational tables │ │ - Vector embeddings for semantic search │ │ - Full-text index for lexical search fallback │ └─────────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Query Interface │ │ - CLI for human testing │ │ - JSON API for AI agent testing │ │ - Semantic search with filters (author, date, type, label) │ └─────────────────────────────────────────────────────────────────┘ ``` ### Technology Choices | Component | Choice | Rationale | |-----------|--------|-----------| | Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly | | Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension | | Embeddings | Ollama + nomic-embed-text | Self-hosted, 
runs well on Apple Silicon, 768-dim vectors | | CLI Framework | Commander.js | Simple, lightweight, well-documented | | Logging | pino | Fast, JSON-structured, low overhead | | Validation | Zod | TypeScript-first schema validation | ### Alternative Considered: sqlite-vss - sqlite-vss was the original choice but is now deprecated - No Apple Silicon support (no prebuilt ARM binaries) - Replaced by sqlite-vec, which is pure C with no dependencies - sqlite-vec uses `vec0` virtual table (vs `vss0`) ### Alternative Considered: Postgres + pgvector - Pros: More scalable, better for production multi-user - Cons: Requires running Postgres, heavier setup - Decision: Start with SQLite for simplicity; migration path exists if needed --- ## Project Structure ``` gitlab-inbox/ ├── src/ │ ├── cli/ │ │ ├── index.ts # CLI entry point (Commander.js) │ │ └── commands/ # One file per command group │ │ ├── init.ts │ │ ├── sync.ts │ │ ├── search.ts │ │ ├── list.ts │ │ └── doctor.ts │ ├── core/ │ │ ├── config.ts # Config loading/validation (Zod) │ │ ├── db.ts # Database connection + migrations │ │ ├── errors.ts # Custom error classes │ │ └── logger.ts # pino logger setup │ ├── gitlab/ │ │ ├── client.ts # GitLab API client with rate limiting │ │ ├── types.ts # GitLab API response types │ │ └── transformers/ # Payload → normalized schema │ │ ├── issue.ts │ │ ├── merge-request.ts │ │ └── discussion.ts │ ├── ingestion/ │ │ ├── issues.ts │ │ ├── merge-requests.ts │ │ └── discussions.ts │ ├── documents/ │ │ ├── extractor.ts # Document generation from entities │ │ └── truncation.ts # Note-boundary aware truncation │ ├── embedding/ │ │ ├── ollama.ts # Ollama client │ │ └── pipeline.ts # Batch embedding orchestration │ ├── search/ │ │ ├── hybrid.ts # RRF ranking logic │ │ ├── fts.ts # FTS5 queries │ │ └── vector.ts # sqlite-vec queries │ └── types/ │ └── index.ts # Shared TypeScript types ├── tests/ │ ├── unit/ │ ├── integration/ │ ├── live/ # Optional GitLab live tests (GITLAB_LIVE_TESTS=1) │ └── 
fixtures/ │ └── golden-queries.json ├── migrations/ # Numbered SQL migration files │ ├── 001_initial.sql │ └── ... ├── gi.config.json # User config (gitignored) ├── package.json ├── tsconfig.json ├── vitest.config.ts ├── eslint.config.js └── README.md ``` --- ## Dependencies ### Runtime Dependencies ```json { "dependencies": { "better-sqlite3": "latest", "sqlite-vec": "latest", "commander": "latest", "zod": "latest", "pino": "latest", "pino-pretty": "latest", "ora": "latest", "chalk": "latest", "cli-table3": "latest" } } ``` | Package | Purpose | |---------|---------| | better-sqlite3 | Synchronous SQLite driver (fast, native) | | sqlite-vec | Vector search extension (pure C, cross-platform) | | commander | CLI argument parsing | | zod | Schema validation for config and inputs | | pino | Structured JSON logging | | pino-pretty | Dev-mode log formatting | | ora | CLI spinners for progress indication | | chalk | Terminal colors | | cli-table3 | ASCII tables for list output | ### Dev Dependencies ```json { "devDependencies": { "typescript": "latest", "@types/better-sqlite3": "latest", "@types/node": "latest", "vitest": "latest", "msw": "latest", "eslint": "latest", "@typescript-eslint/eslint-plugin": "latest", "@typescript-eslint/parser": "latest", "tsx": "latest" } } ``` | Package | Purpose | |---------|---------| | typescript | TypeScript compiler | | vitest | Test runner | | msw | Mock Service Worker for API mocking in tests | | eslint | Linting | | tsx | Run TypeScript directly during development | --- ## GitLab API Strategy ### Primary Resources (Bulk Fetch) Issues and MRs support efficient bulk fetching with incremental sync: ``` GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100 GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100 ``` **Required query params for completeness:** - `scope=all` - include all issues/MRs, not just authored by current user - 
`state=all` - include closed items (GitLab defaults may exclude them) Without these params, the 2+ years of historical data would be incomplete. ### Dependent Resources (Per-Parent Fetch) Discussions must be fetched per-issue and per-MR. There is no bulk endpoint: ``` GET /projects/:id/issues/:iid/discussions?per_page=100&page=N GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N ``` **Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss. ### Sync Pattern **Initial sync:** 1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page) 2. For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead) 3. Fetch all MRs (paginated, ~60 calls) 4. For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead) 5. Total: thousands of API calls for initial sync **API Call Estimation Formula:** ``` total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue + ceil(mrs/100) + mrs × avg_discussion_pages_per_mr ``` Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent: - Issue list: 30 calls - Issue discussions: 3,000 × 1.2 = 3,600 calls - MR list: 30 calls - MR discussions: 3,000 × 1.2 = 3,600 calls - **Total: ~7,260 calls** This matters for rate limit planning and setting realistic "10-20 minutes" expectations. **Incremental sync:** 1. Fetch issues where `updated_after=cursor` (bulk) 2. For EACH updated issue → refetch ALL its discussions 3. Fetch MRs where `updated_after=cursor` (bulk) 4. For EACH updated MR → refetch ALL its discussions ### Critical Assumption (Softened) We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively. **Mitigations (MVP):** 1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor. 2. 
**Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly. 3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`). The backfill window provides 80% of the safety of full resync at <5% of the API cost. ### Rate Limiting - Default: 10 requests/second with exponential backoff - Respect `Retry-After` headers on 429 responses - Add jitter to avoid thundering herd on retry - **Separate concurrency limits:** - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4) - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s) - Bound concurrency per-project to avoid one repo starving the other - Initial sync estimate: 10-20 minutes depending on rate limits --- ## Checkpoint Structure Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding. 
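The rate-limiting policy above (exponential backoff, `Retry-After` honored on 429s, jitter) can be sketched as follows. This is a minimal illustration only: `withRetry`, `backoffDelayMs`, the constants, and the error shape (`status`, `retryAfterSeconds`) are hypothetical names, not part of the spec.

```typescript
// Sketch of the rate-limit retry policy: exponential backoff with full
// jitter, honoring Retry-After on 429 responses. Names are illustrative.
const BASE_DELAY_MS = 500;
const MAX_RETRIES = 5;

function backoffDelayMs(attempt: number, retryAfterSeconds?: number): number {
  // A server-provided Retry-After wins over the computed backoff
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  const cap = BASE_DELAY_MS * 2 ** attempt;
  // Full jitter: uniform in [0, cap) to avoid a thundering herd on retry
  return Math.floor(Math.random() * cap);
}

async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: unknown) {
      const e = err as { status?: number; retryAfterSeconds?: number };
      // Only 429s are retried here; give up after MAX_RETRIES attempts
      if (e.status !== 429 || attempt >= MAX_RETRIES) throw err;
      await new Promise((resolve) =>
        setTimeout(resolve, backoffDelayMs(attempt, e.retryAfterSeconds))
      );
    }
  }
}
```

In the GitLab client, each request would be wrapped in something like `withRetry`, with separate semaphores enforcing the `primaryConcurrency` and `dependentConcurrency` limits described above.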
### Checkpoint 0: Project Setup **Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working **Automated Tests (Vitest):** ``` tests/unit/config.test.ts ✓ loads config from gi.config.json ✓ throws if config file missing ✓ throws if required fields missing (baseUrl, projects) ✓ validates project paths are non-empty strings tests/unit/db.test.ts ✓ creates database file if not exists ✓ applies migrations in order ✓ sets WAL journal mode ✓ enables foreign keys tests/integration/gitlab-client.test.ts ✓ (mocked) authenticates with valid PAT ✓ (mocked) returns 401 for invalid PAT ✓ (mocked) fetches project by path ✓ (mocked) handles rate limiting (429) with retry tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI) ✓ authenticates with real PAT against configured baseUrl ✓ fetches real project by path ✓ handles actual rate limiting behavior tests/integration/app-lock.test.ts ✓ acquires lock successfully ✓ updates heartbeat during operation ✓ detects stale lock and recovers ✓ refuses concurrent acquisition tests/integration/init.test.ts ✓ creates config file with valid structure ✓ validates GitLab URL format ✓ validates GitLab connection before writing config ✓ validates each project path exists in GitLab ✓ fails if token not set ✓ fails if GitLab auth fails ✓ fails if any project path not found ✓ prompts before overwriting existing config ✓ respects --force to skip confirmation ✓ generates gi.config.json with sensible defaults ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name | | `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) | | `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks | | `GITLAB_TOKEN=invalid gi 
auth-test` | Error message | Non-zero exit code, clear error about auth failure | | `gi init` | Interactive prompts | Creates valid gi.config.json | | `gi init` (config exists) | Confirmation prompt | Warns before overwriting | | `gi --help` | Command list | Shows all available commands | | `gi version` | Version number | Shows installed version | | `gi sync-status` | Last sync time, cursor positions | Shows successful last run | **Data Integrity Checks:** - [ ] `projects` table contains rows for each configured project path - [ ] `gitlab_project_id` matches actual GitLab project IDs - [ ] `raw_payloads` contains project JSON for each synced project **Scope:** - Project structure (TypeScript, ESLint, Vitest) - GitLab API client with PAT authentication - Environment and project configuration - Basic CLI scaffold with `auth-test` command - `doctor` command for environment verification - Projects table and initial project resolution (no issue/MR ingestion yet) - DB migrations + WAL + FK enforcement - Sync tracking with crash-safe single-flight lock (heartbeat-based) - Rate limit handling with exponential backoff + jitter - `gi init` command for guided setup: - Prompts for GitLab base URL - Prompts for project paths (comma-separated or multiple prompts) - Prompts for token environment variable name (default: GITLAB_TOKEN) - **Validates before writing config:** - Token must be set in environment - Tests auth with `GET /user` endpoint - Validates each project path with `GET /projects/:path` - Only writes config after all validations pass - Generates `gi.config.json` with sensible defaults - `gi --help` shows all available commands - `gi --help` shows command-specific help - `gi version` shows installed version - First-run detection: if no config exists, suggest `gi init` **Configuration (MVP):** ```json // gi.config.json { "gitlab": { "baseUrl": "https://gitlab.example.com", "tokenEnvVar": "GITLAB_TOKEN" }, "projects": [ { "path": "group/project-one" }, { "path": 
"group/project-two" } ], "sync": { "backfillDays": 14, "staleLockMinutes": 10, "heartbeatIntervalSeconds": 30, "cursorRewindSeconds": 2, "primaryConcurrency": 4, "dependentConcurrency": 2 }, "storage": { "compressRawPayloads": true }, "embedding": { "provider": "ollama", "model": "nomic-embed-text", "baseUrl": "http://localhost:11434", "concurrency": 4 } } ``` **Raw Payload Compression:** - When `storage.compressRawPayloads: true` (default), raw JSON payloads are gzip-compressed before storage - `raw_payloads.content_encoding` indicates `'identity'` (uncompressed) or `'gzip'` (compressed) - Compression typically reduces storage by 70-80% for JSON payloads - Decompression is handled transparently when reading payloads - Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage **Error Classes (src/core/errors.ts):** ```typescript // Base error class with error codes for programmatic handling export class GiError extends Error { constructor(message: string, public readonly code: string) { super(message); this.name = 'GiError'; } } // Config errors export class ConfigNotFoundError extends GiError { constructor() { super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND'); } } export class ConfigValidationError extends GiError { constructor(details: string) { super(`Invalid config: ${details}`, 'CONFIG_INVALID'); } } // GitLab API errors export class GitLabAuthError extends GiError { constructor() { super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED'); } } export class GitLabNotFoundError extends GiError { constructor(resource: string) { super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND'); } } export class GitLabRateLimitError extends GiError { constructor(public readonly retryAfter: number) { super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED'); } } // Database errors export class DatabaseLockError extends GiError { constructor() { super('Another sync is running. 
Use --force to override.', 'DB_LOCKED');
  }
}

// Embedding errors
export class OllamaConnectionError extends GiError {
  constructor() {
    super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
  }
}

export class EmbeddingError extends GiError {
  constructor(documentId: number, reason: string) {
    super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
  }
}
```

**Logging Strategy (src/core/logger.ts):**

```typescript
import pino from 'pino';

// Logs go to stderr, results to stdout (allows clean JSON piping).
// pino throws if both a `transport` option and a destination stream are
// supplied, so pick exactly one based on environment.
export const logger =
  process.env.NODE_ENV === 'production'
    ? pino({ level: process.env.LOG_LEVEL || 'info' }, pino.destination(2)) // 2 = stderr
    : pino({
        level: process.env.LOG_LEVEL || 'info',
        transport: {
          target: 'pino-pretty',
          options: { colorize: true, destination: 2 } // 2 = stderr
        }
      });
```

**Log Levels:**

| Level | When to use |
|-------|-------------|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |

**Logging Conventions:**

- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include an `err` object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs go to stderr so `gi search --json` output stays clean on stdout

**DB Runtime Defaults (Checkpoint 0):**

- On every connection:
  - `PRAGMA journal_mode=WAL;`
  - `PRAGMA foreign_keys=ON;`

**Schema (Checkpoint 0):**

```sql
-- Projects table (configured targets)
CREATE TABLE projects (
  id INTEGER PRIMARY KEY,
  gitlab_project_id INTEGER UNIQUE NOT NULL,
  path_with_namespace TEXT NOT NULL,
  default_branch TEXT,
  web_url TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);

-- Sync tracking for reliability
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  heartbeat_at
INTEGER NOT NULL, finished_at INTEGER, status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. error TEXT ); -- Crash-safe single-flight lock (DB-enforced) CREATE TABLE app_locks ( name TEXT PRIMARY KEY, -- 'sync' owner TEXT NOT NULL, -- random run token (UUIDv4) acquired_at INTEGER NOT NULL, heartbeat_at INTEGER NOT NULL ); -- Sync cursors for primary resources only -- Notes and MR changes are dependent resources (fetched via parent updates) CREATE TABLE sync_cursors ( project_id INTEGER NOT NULL REFERENCES projects(id), resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests' updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch) tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering) PRIMARY KEY(project_id, resource_type) ); -- Raw payload storage (decoupled from entity tables) CREATE TABLE raw_payloads ( id INTEGER PRIMARY KEY, source TEXT NOT NULL, -- 'gitlab' project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion' gitlab_id TEXT NOT NULL, -- TEXT because discussion IDs are strings; numeric IDs stored as strings fetched_at INTEGER NOT NULL, content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip' payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON ); CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id); CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at); -- Schema version tracking for migrations CREATE TABLE schema_version ( version INTEGER PRIMARY KEY, applied_at INTEGER NOT NULL, description TEXT ); ``` --- ### Checkpoint 1: Issue Ingestion **Deliverable:** All issues + labels + issue discussions from target repos stored locally with resumable cursor-based sync **Automated Tests (Vitest):** ``` tests/unit/issue-transformer.test.ts 
✓ transforms GitLab issue payload to normalized schema ✓ extracts labels from issue payload ✓ handles missing optional fields gracefully tests/unit/pagination.test.ts ✓ fetches all pages when multiple exist ✓ respects per_page parameter ✓ follows X-Next-Page header until empty/absent ✓ falls back to empty-page stop if headers missing (robustness) tests/unit/discussion-transformer.test.ts ✓ transforms discussion payload to normalized schema ✓ extracts notes array from discussion ✓ sets individual_note flag correctly ✓ flags system notes with is_system=1 ✓ preserves note order via position field tests/integration/issue-ingestion.test.ts ✓ inserts issues into database ✓ creates labels from issue payloads ✓ links issues to labels via junction table ✓ stores raw payload for each issue ✓ updates cursor after successful page commit ✓ resumes from cursor on subsequent runs tests/integration/issue-discussion-ingestion.test.ts ✓ fetches discussions for each issue ✓ creates discussion rows with correct issue FK ✓ creates note rows linked to discussions ✓ stores system notes with is_system=1 flag ✓ handles individual_note=true discussions tests/integration/sync-runs.test.ts ✓ creates sync_run record on start ✓ marks run as succeeded on completion ✓ marks run as failed with error message on failure ✓ refuses concurrent run (single-flight) ✓ allows --force to override stale running status ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi ingest --type=issues` | Progress bar, final count | Completes without error | | `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author | | `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project | | `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI | | `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL. 
If multiple projects have issue #123, prompts for clarification or use `--project=PATH` | | `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count | | `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count | | `gi sync-status` | Last sync time, cursor positions | Shows successful last run | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects - [ ] Every issue has a corresponding `raw_payloads` row - [ ] Labels in `issue_labels` junction all exist in `labels` table - [ ] `sync_cursors` has entry for each (project_id, 'issues') pair - [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current) - [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero - [ ] Every discussion has at least one note - [ ] `individual_note = true` discussions have exactly one note **Scope:** - Issue fetcher with pagination handling - Raw JSON storage in raw_payloads table - Normalized issue schema in SQLite - Labels ingestion derived from issue payload: - Always persist label names from `labels: string[]` - Optionally request `with_labels_details=true` to capture color/description when available - Issue discussions fetcher: - Uses `GET /projects/:id/issues/:iid/discussions` - Fetches all discussions for each issue during ingest - Preserve system notes but flag them with `is_system=1` - Incremental sync support (run tracking + per-project cursor) - Basic list/count CLI commands **Reliability/Idempotency Rules:** - Every ingest/sync creates a `sync_runs` row - Single-flight via DB-enforced app lock: - On start: acquire lock via transactional compare-and-swap: - `BEGIN IMMEDIATE` (acquires write lock immediately) - If no row exists → INSERT new lock - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps - Else if `owner` matches current run → UPDATE heartbeat (re-entrant) - Else → ROLLBACK and 
fail fast (another run is active) - `COMMIT` - During run: update `heartbeat_at` every 30 seconds - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire - `--force` remains as operator override for edge cases, but should rarely be needed - Cursor advances only after successful transaction commit per page/batch - Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC` - Use explicit transactions for batch inserts **Schema Preview:** ```sql CREATE TABLE issues ( id INTEGER PRIMARY KEY, gitlab_id INTEGER UNIQUE NOT NULL, project_id INTEGER NOT NULL REFERENCES projects(id), iid INTEGER NOT NULL, title TEXT, description TEXT, state TEXT, author_username TEXT, created_at INTEGER, updated_at INTEGER, last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync web_url TEXT, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); CREATE INDEX idx_issues_author ON issues(author_username); CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid); -- Labels are derived from issue payloads (string array) -- Uniqueness is (project_id, name) since gitlab_id isn't always available CREATE TABLE labels ( id INTEGER PRIMARY KEY, gitlab_id INTEGER, -- optional (only if available) project_id INTEGER NOT NULL REFERENCES projects(id), name TEXT NOT NULL, color TEXT, description TEXT ); CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name); CREATE INDEX idx_labels_name ON labels(name); CREATE TABLE issue_labels ( issue_id INTEGER REFERENCES issues(id), label_id INTEGER REFERENCES labels(id), PRIMARY KEY(issue_id, label_id) ); CREATE INDEX idx_issue_labels_label ON issue_labels(label_id); -- Discussion threads for issues (MR discussions added in CP2) CREATE TABLE discussions ( id INTEGER PRIMARY KEY, gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. 
"6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER, -- FK added in CP2 via table rebuild (SQLite has no ADD CONSTRAINT)
  noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
  first_note_at INTEGER, -- for ordering discussions
  last_note_at INTEGER, -- for "recently active" queries
  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
  resolvable BOOLEAN, -- MR discussions can be resolved
  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL)
    OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT, -- 'DiscussionNote' | 'DiffNote' | null
  is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.)
author_username TEXT, body TEXT, created_at INTEGER, updated_at INTEGER, last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync position INTEGER, -- derived from array order in API response (0-indexed) resolvable BOOLEAN, resolved BOOLEAN, resolved_by TEXT, resolved_at INTEGER, -- DiffNote position metadata (only populated for MR DiffNotes in CP2) position_old_path TEXT, position_new_path TEXT, position_old_line INTEGER, position_new_line INTEGER, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_notes_discussion ON notes(discussion_id); CREATE INDEX idx_notes_author ON notes(author_username); CREATE INDEX idx_notes_system ON notes(is_system); ``` --- ### Checkpoint 2: MR Ingestion **Deliverable:** All MRs + MR discussions + notes with DiffNote paths captured **Automated Tests (Vitest):** ``` tests/unit/mr-transformer.test.ts ✓ transforms GitLab MR payload to normalized schema ✓ extracts labels from MR payload ✓ handles missing optional fields gracefully tests/unit/diffnote-transformer.test.ts ✓ extracts DiffNote position metadata (paths and lines) ✓ handles missing position fields gracefully tests/integration/mr-ingestion.test.ts ✓ inserts MRs into database ✓ creates labels from MR payloads ✓ links MRs to labels via junction table ✓ stores raw payload for each MR tests/integration/mr-discussion-ingestion.test.ts ✓ fetches discussions for each MR ✓ creates discussion rows with correct MR FK ✓ creates note rows linked to discussions ✓ extracts position_new_path from DiffNotes ✓ captures note-level resolution status ✓ captures note type (DiscussionNote, DiffNote) ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error | | `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch | | `gi count mrs` | `Merge Requests: 567` (example) | Count matches 
GitLab UI | | `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads | | `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) | | `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only | | `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count - [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero - [ ] DiffNotes have `position_new_path` populated when available - [ ] Discussion `first_note_at` <= `last_note_at` for all rows **Scope:** - MR fetcher with pagination - MR discussions fetcher: - Uses `GET /projects/:id/merge_requests/:iid/discussions` - Fetches all discussions for each MR during ingest - Capture DiffNote file path/line metadata from `position` field for filename search - Relationship linking (discussion → MR, notes → discussion) - Extended CLI commands for MR display with threads - Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries **Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries. 
**Schema Additions:** ```sql CREATE TABLE merge_requests ( id INTEGER PRIMARY KEY, gitlab_id INTEGER UNIQUE NOT NULL, project_id INTEGER NOT NULL REFERENCES projects(id), iid INTEGER NOT NULL, title TEXT, description TEXT, state TEXT, author_username TEXT, source_branch TEXT, target_branch TEXT, created_at INTEGER, updated_at INTEGER, last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync merged_at INTEGER, web_url TEXT, raw_payload_id INTEGER REFERENCES raw_payloads(id) ); CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); CREATE INDEX idx_mrs_author ON merge_requests(author_username); CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid); -- MR labels (reuse same labels table from CP1) CREATE TABLE mr_labels ( merge_request_id INTEGER REFERENCES merge_requests(id), label_id INTEGER REFERENCES labels(id), PRIMARY KEY(merge_request_id, label_id) ); CREATE INDEX idx_mr_labels_label ON mr_labels(label_id); -- Additional indexes for DiffNote queries (tables created in CP1) CREATE INDEX idx_notes_type ON notes(type); CREATE INDEX idx_notes_new_path ON notes(position_new_path); -- Migration: Add FK constraint to discussions table (was deferred from CP1) -- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK -- This is handled by the migration system; pseudocode for clarity: -- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id) -- 2. INSERT INTO discussions_new SELECT * FROM discussions -- 3. DROP TABLE discussions -- 4. ALTER TABLE discussions_new RENAME TO discussions -- 5. 
Recreate indexes ``` **MR Discussion Processing Rules:** - DiffNote position data is extracted and stored: - `position.old_path`, `position.new_path` for file-level search - `position.old_line`, `position.new_line` for line-level context - MR discussions can be resolvable; resolution status is captured at note level --- ### Checkpoint 3A: Document Generation + FTS (Lexical Search) **Deliverable:** Documents generated + FTS5 index; `gi search --mode=lexical` works end-to-end (no Ollama required) **Automated Tests (Vitest):** ``` tests/unit/document-extractor.test.ts ✓ extracts issue document (title + description) ✓ extracts MR document (title + description) ✓ extracts discussion document with full thread context ✓ includes parent issue/MR title in discussion header ✓ formats notes with author and timestamp ✓ excludes system notes from discussion documents by default ✓ includes system notes only when --include-system-notes enabled (debug) ✓ truncates content exceeding 8000 tokens at note boundaries ✓ preserves first and last notes when truncating middle ✓ computes SHA-256 content hash consistently tests/integration/document-creation.test.ts ✓ creates document for each issue ✓ creates document for each MR ✓ creates document for each discussion ✓ populates document_labels junction table ✓ computes content_hash for each document ✓ excludes system notes from discussion content tests/integration/fts-index.test.ts ✓ documents_fts row count matches documents ✓ FTS triggers fire on insert/update/delete ✓ updates propagate via triggers tests/integration/fts-search.test.ts ✓ returns exact keyword matches ✓ porter stemming works (search/searching) ✓ returns empty for non-matching query ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi generate-docs` | Progress bar, final count | Completes without error | | `gi generate-docs` (re-run) | `0 documents to regenerate` | Skips unchanged docs | | `gi 
search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama | | `gi stats` | Document count stats | Shows document coverage | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions - [ ] `SELECT COUNT(*) FROM documents_fts` = `SELECT COUNT(*) FROM documents` (via FTS triggers) - [ ] `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` returns 0 (the character cap is enforced; over-length sources are logged as truncation warnings during generation) - [ ] Discussion documents include parent title in content_text - [ ] Discussion documents exclude system notes **Scope:** - Document extraction layer: - Canonical "search documents" derived from issues/MRs/discussions - Stable content hashing for change detection (SHA-256 of content_text) - Truncation: content_text capped at 8000 tokens at NOTE boundaries - **Implementation:** Use character budget, not exact token count - `maxChars = 32000` (conservative 4 chars/token estimate) - Drop whole notes from middle, never cut mid-note - `approxTokens = ceil(charCount / 4)` for reporting/logging only - System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search) - Denormalized metadata for fast filtering (author, labels, dates) - Fast label filtering via `document_labels` join table - FTS5 index for lexical search - `gi search --mode=lexical` CLI command (works without Ollama) This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
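The character-budget truncation above can be sketched as a pure function; a hedged sketch assuming an illustrative `Note` shape and thread renderer (the real document builder also prepends the parent header block described in CP3B):

```typescript
// A minimal sketch of note-boundary truncation, under an assumed Note shape.
interface Note {
  author: string;
  createdAt: string;
  body: string;
}

const MAX_CHARS = 32_000; // ~8000 tokens at a conservative 4 chars/token

// Reporting only: approxTokens = ceil(charCount / 4), no tokenizer needed.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

// Render a single note in the thread format used by discussion documents.
function renderNote(n: Note): string {
  return `@${n.author} (${n.createdAt}):\n${n.body}`;
}

// Drop whole notes from the MIDDLE until the rendered thread fits maxChars;
// never cut mid-note. Falls back to a character cut only when even the
// remaining first/last notes are oversized (the edge cases listed below).
function truncateThread(
  notes: Note[],
  maxChars: number = MAX_CHARS,
): { text: string; truncated: boolean } {
  const kept = notes.slice();
  let omitted = 0;
  const render = (): string => {
    const parts = kept.map(renderNote);
    if (omitted > 0) {
      // Marker goes where the middle notes were removed.
      parts.splice(Math.ceil(kept.length / 2), 0, `[... ${omitted} notes omitted for length ...]`);
    }
    return parts.join("\n\n");
  };
  while (render().length > maxChars && kept.length > 2) {
    kept.splice(Math.floor(kept.length / 2), 1); // remove one whole middle note
    omitted++;
  }
  let text = render();
  const charCut = text.length > maxChars;
  if (charCut) {
    text = text.slice(0, maxChars - "[truncated]".length) + "[truncated]";
  }
  return { text, truncated: omitted > 0 || charCut };
}
```

Because the middle indices are removed one at a time, the first and last notes survive until only they remain, matching the preserve-first/last rule.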
**Schema Additions (CP3A):** ```sql -- Unified searchable documents (derived from issues/MRs/discussions) -- Note: CP3B adds the embedding tables that reference documents.id CREATE TABLE documents ( id INTEGER PRIMARY KEY, source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' source_id INTEGER NOT NULL, -- local DB id in the source table project_id INTEGER NOT NULL REFERENCES projects(id), author_username TEXT, -- for discussions: first note author label_names TEXT, -- JSON array (display/debug only) created_at INTEGER, updated_at INTEGER, url TEXT, title TEXT, -- null for discussions content_text TEXT NOT NULL, -- canonical text for embedding/snippets content_hash TEXT NOT NULL, -- SHA-256 for change detection is_truncated BOOLEAN NOT NULL DEFAULT 0, truncated_reason TEXT, -- 'token_limit_middle_drop' | null UNIQUE(source_type, source_id) ); CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at); CREATE INDEX idx_documents_author ON documents(author_username); CREATE INDEX idx_documents_source ON documents(source_type, source_id); -- Fast label filtering for documents (indexed exact-match) CREATE TABLE document_labels ( document_id INTEGER NOT NULL REFERENCES documents(id), label_name TEXT NOT NULL, PRIMARY KEY(document_id, label_name) ); CREATE INDEX idx_document_labels_label ON document_labels(label_name); -- Fast path filtering for documents (extracted from DiffNote positions) CREATE TABLE document_paths ( document_id INTEGER NOT NULL REFERENCES documents(id), path TEXT NOT NULL, PRIMARY KEY(document_id, path) ); CREATE INDEX idx_document_paths_path ON document_paths(path); -- Track sources that require document regeneration (populated during ingestion) CREATE TABLE dirty_sources ( source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion' source_id INTEGER NOT NULL, -- local DB id queued_at INTEGER NOT NULL, PRIMARY KEY(source_type, source_id) ); -- Resumable dependent fetches
(discussions are per-parent resources) CREATE TABLE pending_discussion_fetches ( project_id INTEGER NOT NULL REFERENCES projects(id), noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' noteable_iid INTEGER NOT NULL, -- parent iid (stable human identifier) queued_at INTEGER NOT NULL, attempt_count INTEGER NOT NULL DEFAULT 0, last_attempt_at INTEGER, last_error TEXT, PRIMARY KEY(project_id, noteable_type, noteable_iid) ); CREATE INDEX idx_pending_discussions_retry ON pending_discussion_fetches(attempt_count, last_attempt_at) WHERE last_error IS NOT NULL; -- Full-text search for lexical retrieval -- Using porter stemmer for better matching of word variants CREATE VIRTUAL TABLE documents_fts USING fts5( title, content_text, content='documents', content_rowid='id', tokenize='porter unicode61' ); -- Triggers to keep FTS in sync CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN INSERT INTO documents_fts(rowid, title, content_text) VALUES (new.id, new.title, new.content_text); END; CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN INSERT INTO documents_fts(documents_fts, rowid, title, content_text) VALUES('delete', old.id, old.title, old.content_text); END; CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN INSERT INTO documents_fts(documents_fts, rowid, title, content_text) VALUES('delete', old.id, old.title, old.content_text); INSERT INTO documents_fts(rowid, title, content_text) VALUES (new.id, new.title, new.content_text); END; ``` **FTS5 Tokenizer Notes:** - `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated") - `unicode61` handles Unicode properly - Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer --- ### Checkpoint 3B: Embedding Generation (Semantic Search) **Deliverable:** Embeddings generated + `gi search --mode=semantic` works; graceful fallback if Ollama unavailable **Automated Tests (Vitest):** ``` 
tests/unit/embedding-client.test.ts ✓ connects to Ollama API ✓ generates embedding for text input ✓ returns 768-dimension vector ✓ handles Ollama connection failure gracefully ✓ batches requests (32 documents per batch) tests/integration/embedding-storage.test.ts ✓ stores embedding in sqlite-vec ✓ embedding rowid matches document id ✓ creates embedding_metadata record ✓ skips re-embedding when content_hash unchanged ✓ re-embeds when content_hash changes ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi embed --all` | Progress bar with ETA | Completes without error | | `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs | | `gi stats` | Embedding coverage stats | Shows 100% coverage | | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts | | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error | | `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents | **Data Integrity Checks:** - [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents` - [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents` - [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash` **Scope:** - Ollama integration (nomic-embed-text model) - Embedding generation pipeline: - Batch size: 32 documents per batch - Concurrency: configurable (default 4 workers) - Retry with exponential backoff for transient failures (max 3 attempts) - Per-document failure recording to enable targeted re-runs - Vector storage in SQLite (sqlite-vec extension) - Progress tracking and resumability - `gi search --mode=semantic` CLI command **Ollama API Contract:** ```typescript // POST http://localhost:11434/api/embed (batch endpoint - preferred) interface OllamaEmbedRequest { model: string; // "nomic-embed-text" input: string[]; // 
array of texts to embed (up to 32) } interface OllamaEmbedResponse { model: string; embeddings: number[][]; // array of 768-dim vectors } // POST http://localhost:11434/api/embeddings (single text - fallback) interface OllamaEmbeddingsRequest { model: string; prompt: string; } interface OllamaEmbeddingsResponse { embedding: number[]; } ``` **Usage:** - Use `/api/embed` for batching (up to 32 documents per request) - Fall back to `/api/embeddings` for single documents or if batch fails - Check Ollama availability with `GET http://localhost:11434/api/tags` **Schema Additions (CP3B):** ```sql -- sqlite-vec virtual table for vector search -- Storage rule: embeddings.rowid = documents.id CREATE VIRTUAL TABLE embeddings USING vec0( embedding float[768] ); -- Embedding provenance + change detection -- document_id is PRIMARY KEY and equals embeddings.rowid CREATE TABLE embedding_metadata ( document_id INTEGER PRIMARY KEY REFERENCES documents(id), model TEXT NOT NULL, -- 'nomic-embed-text' dims INTEGER NOT NULL, -- 768 content_hash TEXT NOT NULL, -- copied from documents.content_hash created_at INTEGER NOT NULL, -- Error tracking for resumable embedding last_error TEXT, -- error message from last failed attempt attempt_count INTEGER NOT NULL DEFAULT 0, last_attempt_at INTEGER -- when last attempt occurred ); -- Index for finding failed embeddings to retry CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL; ``` **Storage Rule (MVP):** - Insert embedding with `rowid = documents.id` - Upsert `embedding_metadata` by `document_id` - This alignment simplifies joins and eliminates rowid mapping fragility **Document Extraction Rules:** | Source | content_text Construction | |--------|--------------------------| | Issue | `title + "\n\n" + description` | | MR | `title + "\n\n" + description` | | Discussion | Full thread with context (see below) | **Discussion Document Format:** ``` [[Discussion]] Issue #234: Authentication 
redesign Project: group/project-one URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345 Labels: ["bug", "auth"] Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread --- Thread --- @johndoe (2024-03-15): I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients... @janedoe (2024-03-15): Agreed. What about refresh token strategy? @johndoe (2024-03-16): Short-lived access tokens (15min), longer refresh (7 days). Here's why... ``` **System Notes Exclusion Rule:** - System notes (is_system=1) are stored in the DB for audit purposes - System notes are EXCLUDED from discussion documents by default - This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings - Debug flag `--include-system-notes` available for troubleshooting This format preserves: - Parent context (issue/MR title and number) - Project path for scoped search - Direct URL for navigation - Labels for context - File paths from DiffNotes (enables immediate file search) - Author attribution for each note - Temporal ordering of the conversation - Full thread semantics for decision traceability **Truncation (Note-Boundary Aware):** If content exceeds 8000 tokens (~32000 chars): **Algorithm:** 1. Count non-system notes in the discussion 2. If total chars ≤ maxChars, no truncation needed 3. Otherwise, drop whole notes from the MIDDLE: - Preserve first N notes and last M notes - Never cut mid-note (produces unreadable snippets and worse embeddings) - Continue dropping middle notes until under maxChars 4. Insert marker: `\n\n[... N notes omitted for length ...]\n\n` 5. Set `documents.is_truncated = 1` 6. Set `documents.truncated_reason = 'token_limit_middle_drop'` 7. 
Log a warning with document ID and original/truncated token count **Edge Cases:** - **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'` - **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'` - **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]` **Why note-boundary truncation:** - Cutting mid-note produces unreadable snippets ("...the authentication flow because--") - Keeping whole notes preserves semantic coherence for embeddings - First notes contain context/problem statement; last notes contain conclusions - Middle notes are often back-and-forth that's less critical **Token estimation:** `approxTokens = ceil(charCount / 4)`. No tokenizer dependency. This metadata enables: - Monitoring truncation frequency in production - Future investigation of high-value truncated documents - Debugging when search misses expected content --- ### Checkpoint 4: Hybrid Search (Semantic + Lexical) **Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content **Automated Tests (Vitest):** ``` tests/unit/search-query.test.ts ✓ parses filter flags (--type, --author, --after, --label) ✓ validates date format for --after ✓ handles multiple --label flags tests/unit/rrf-ranking.test.ts ✓ computes RRF score correctly ✓ merges results from vector and FTS retrievers ✓ handles documents appearing in only one retriever ✓ respects k=60 parameter tests/integration/vector-search.test.ts ✓ returns results for semantic query ✓ ranks similar content higher ✓ returns empty for nonsense query tests/integration/fts-search.test.ts ✓ returns exact keyword matches ✓ handles porter stemming (search/searching) ✓ returns empty for non-matching query tests/integration/hybrid-search.test.ts ✓ combines vector and FTS results ✓ applies type filter correctly ✓ applies author 
filter correctly ✓ applies date filter correctly ✓ applies label filter correctly ✓ falls back to FTS when Ollama unavailable tests/e2e/golden-queries.test.ts ✓ "authentication redesign" returns known auth-related items ✓ "database migration" returns known migration items ✓ [8 more domain-specific golden queries] ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score | | `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project | | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output | | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe | | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date | | `gi search "authentication" --label=bug` | Label filtered | All results have bug label | | `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings | | `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ | | `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema | | `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions | | `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results | | `gi search "xyznonexistent123"` | No results message | Graceful empty state | | `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" | | `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results | **Golden Query Test Suite:** Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs: ```json [ { "query": "authentication redesign", "expectedUrls": [".../-/issues/234", 
".../-/merge_requests/847"], "minResults": 1, "maxRank": 10 } ] ``` Each query must have at least one expected URL appear in top 10 results. **Data Integrity Checks:** - [ ] `documents_fts` row count matches `documents` row count - [ ] Search returns results for known content (not empty) - [ ] JSON output validates against defined schema - [ ] All result URLs are valid GitLab URLs **Scope:** - Hybrid retrieval: - Vector recall (sqlite-vec) + FTS lexical recall (fts5) - Merge + rerank results using Reciprocal Rank Fusion (RRF) - Query embedding generation (same Ollama pipeline as documents) - Result ranking and scoring (document-level) - Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N` - `--limit=N` controls result count (default: 20, max: 100) - `--path` filters documents by referenced file paths (from DiffNote positions): - If `--path` ends with `/`: prefix match (`path LIKE 'src/auth/%'`) - Otherwise: exact match OR prefix on directory boundary - Examples: `--path=src/auth/` matches `src/auth/login.ts`, `src/auth/utils/helpers.ts` - Examples: `--path=src/auth/login.ts` matches only that exact file - Glob patterns deferred to post-MVP - Label filtering operates on `document_labels` (indexed, exact-match) - Filters work identically in hybrid and lexical modes - Debug: `--explain` returns rank contributions from vector + FTS + RRF - Output formatting: ranked list with title, snippet, score, URL - JSON output mode for AI/agent consumption (stable documented schema) - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning - Empty state handling: - No documents indexed: `No data indexed. 
Run 'gi sync' first.` - Query returns no results: `No results found for "query".` - Filters exclude all results: `No results match the specified filters.` - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search") **Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:** 1. Determine recall size (adaptive based on filters): - `baseTopK = 50` - If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200` - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall 2. Query both vector index (top topK) and FTS5 (top topK) - Vector recall via sqlite-vec + FTS lexical recall via fts5 - Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type) 3. Merge results by document_id 4. Combine with Reciprocal Rank Fusion (RRF): - For each retriever list, assign ranks (1..N) - `rrfScore = Σ 1 / (k + rank)` with k=60 (tunable) - RRF is simpler than weighted sums and doesn't require score normalization 5. Apply remaining filters (date ranges, labels, paths that weren't applied in SQL) 6. 
Return top K results **Why Adaptive Recall:** - Fixed top-50 + filter can easily return 0 results even when relevant docs exist - Increasing recall when filters are present catches more candidates before filtering - SQL-level filtering is preferred (faster, uses indexes) but not always possible **Why RRF over Weighted Sums:** - FTS5 BM25 scores and vector distances use different scales - Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization - RRF operates on ranks, not scores, making it robust to scale differences - Well-established in information retrieval literature **Graceful Degradation:** - If Ollama is unreachable during search, automatically fall back to FTS5-only search - Display warning: "Embedding service unavailable, using lexical search only" - `embed` command fails with actionable error if Ollama is down **CLI Interface:** ```bash # Basic semantic search gi search "authentication redesign" # Search within specific project gi search "authentication" --project=group/project-one # Search by file path (finds discussions/MRs touching this file) gi search "rate limit" --path=src/client.ts # Pure FTS search (fallback if embeddings unavailable) gi search "redis" --mode=lexical # Filtered search gi search "authentication" --type=mr --after=2024-01-01 # Filter by label gi search "performance" --label=bug --label=critical # JSON output for programmatic use gi search "payment processing" --json # Explain search (shows RRF contributions) gi search "auth" --explain ``` **CLI Output Example:** ``` $ gi search "authentication" Found 23 results (hybrid search, 0.34s) [1] MR !847 - Refactor auth to use JWT tokens (0.82) @johndoe · 2024-03-15 · group/project-one "...moving away from session cookies to JWT for authentication..." https://gitlab.example.com/group/project-one/-/merge_requests/847 [2] Issue #234 - Authentication redesign discussion (0.79) @janedoe · 2024-02-28 · group/project-one "...we need to redesign the authentication flow because..." 
https://gitlab.example.com/group/project-one/-/issues/234 [3] Discussion on Issue #234 (0.76) @johndoe · 2024-03-01 · group/project-one "I think we should move to JWT-based auth because the session..." https://gitlab.example.com/group/project-one/-/issues/234#note_12345 ``` **JSON Output Schema (Stable):** For AI/agent consumption, `--json` output follows this stable schema: ```typescript interface SearchResult { documentId: number; sourceType: "issue" | "merge_request" | "discussion"; title: string | null; url: string; projectPath: string; author: string | null; createdAt: string; // ISO 8601 updatedAt: string; // ISO 8601 score: number; // normalized 0-1 (rrfScore / maxRrfScore in this result set) snippet: string; // truncated content_text labels: string[]; // Only present with --explain flag explain?: { vectorRank?: number; // null if not in vector results ftsRank?: number; // null if not in FTS results rrfScore: number; // raw RRF score (rank-based, comparable within a query) }; } // Note on score normalization: // - `score` is normalized 0-1 for UI display convenience // - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set) // - Use `explain.rrfScore` for raw scores when comparing across queries // - Scores are NOT comparable across different queries interface SearchResponse { query: string; mode: "hybrid" | "lexical" | "semantic"; totalResults: number; results: SearchResult[]; warnings?: string[]; // e.g., "Embedding service unavailable" } ``` **Schema versioning:** Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed. 
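The Reciprocal Rank Fusion step in the hybrid algorithm above reduces to a small pure function over the two retrievers' rank lists; a minimal sketch (function names are illustrative):

```typescript
// RRF merge: each retriever contributes 1 / (k + rank) per document,
// where rank is 1-based within that retriever's result list. k = 60
// matches the tunable parameter named in the algorithm.
function rrfMerge(rankings: number[][], k: number = 60): Map<number, number> {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Sort merged candidates by RRF score, descending, and keep the top `limit`.
function topK(scores: Map<number, number>, limit: number): number[] {
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([docId]) => docId);
}
```

A document ranked first by both retrievers scores 2/(60+1), which beats the best possible single-retriever score of 1/(60+1); no score normalization is needed because only ranks enter the formula.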
--- ### Checkpoint 5: Incremental Sync **Deliverable:** Efficient ongoing synchronization with GitLab **Automated Tests (Vitest):** ``` tests/unit/cursor-management.test.ts ✓ advances cursor after successful page commit ✓ uses tie-breaker id for identical timestamps ✓ does not advance cursor on failure ✓ resets cursor on --full flag tests/unit/change-detection.test.ts ✓ detects content_hash mismatch ✓ queues document for re-embedding on change ✓ skips re-embedding when hash unchanged tests/integration/incremental-sync.test.ts ✓ fetches only items updated after cursor ✓ refetches discussions for updated issues ✓ refetches discussions for updated MRs ✓ updates existing records (not duplicates) ✓ creates new records for new items ✓ re-embeds documents with changed content_hash tests/integration/sync-recovery.test.ts ✓ resumes from cursor after interrupted sync ✓ marks failed run with error message ✓ handles rate limiting (429) with backoff ✓ respects Retry-After header ``` **Manual CLI Smoke Tests:** | Command | Expected Output | Pass Criteria | |---------|-----------------|---------------| | `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check | | `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change | | `gi sync --full` | Full sync progress | Resets cursors, fetches everything | | `gi sync-status` | Last sync time, cursor positions | Shows current state | | `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually | | `gi search "new content"` (after sync) | Returns new content | New content is searchable | **End-to-End Sync Verification:** 1. Note the current `sync_cursors` values 2. Create a new comment on an issue in GitLab 3. Run `gi sync` 4. 
Verify: - [ ] Issue's `updated_at` in DB matches GitLab - [ ] New discussion row exists - [ ] New note row exists - [ ] New document row exists for discussion - [ ] New embedding exists for document - [ ] `gi search "new comment text"` returns the new discussion - [ ] Cursor advanced past the updated issue **Data Integrity Checks:** - [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table - [ ] No orphaned documents (all have valid source_id) - [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows - [ ] `sync_runs` has complete audit trail **Scope:** - Delta sync based on stable tuple cursor `(updated_at, gitlab_id)` - Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates - Dependent resources sync strategy (discussions refetched when parent updates) - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) - Sync status reporting - Recommended: run via cron every 10 minutes **Correctness Rules (MVP):** 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC` 2. Cursor is a stable tuple `(updated_at, gitlab_id)`: - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.** - Use **cursor rewind + local filtering**: - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable) - Locally discard items where: - `updated_at < cursor_updated_at`, OR - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id` - This makes the tuple cursor rule true in practice while keeping API calls simple. - Cursor advances only after successful DB commit for that page - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)` 3. Dependent resources: - For each updated issue/MR, refetch ALL its discussions - Discussion documents are regenerated and re-embedded if content_hash changes 4. 
Rolling backfill window: - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays` - This catches any items whose timestamps were updated without triggering our cursor 5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` 6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor) **Why Dependent Resource Model:** - GitLab Discussions API doesn't provide a global `updated_after` stream - Discussions are listed per-issue or per-MR, not as a top-level resource - Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct **CLI Commands:** ```bash # Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced) gi sync # orchestrates all steps gi sync --no-embed # skip embedding step (fast ingest/debug) gi sync --no-docs # skip document regeneration (debug) # Force full re-sync (resets cursors) gi sync --full # Override stale 'running' run after operator review gi sync --force # Show sync status gi sync-status ``` **Orchestration steps (in order):** 1. Acquire app lock with heartbeat 2. Ingest delta (issues, MRs) based on cursors - For each upserted issue/MR, enqueue into `pending_discussion_fetches` - INSERT into `dirty_sources` for each upserted issue/MR 3. Process `pending_discussion_fetches` queue (bounded per run, retryable): - Fetch discussions for each queued parent - On success: upsert discussions/notes, INSERT into `dirty_sources`, DELETE from queue - On failure: increment `attempt_count`, record `last_error`, leave in queue for retry - Bound processing: max N parents per sync run to avoid unbounded API calls 4. Apply rolling backfill window 5. Regenerate documents for entities in `dirty_sources` (process + delete from queue) 6. Embed documents with changed content_hash 7. FTS triggers auto-sync (no explicit step needed) 8. 
Release lock, record sync_run as succeeded

**Why queue-based discussion fetching:**

- One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
- Primary resource cursors can advance independently
- Discussions can be retried without re-fetching all issues/MRs
- Bounded processing prevents unbounded API calls per sync run

Individual commands remain available for checkpoint testing and debugging:

- `gi ingest --type=issues`
- `gi ingest --type=merge_requests`
- `gi embed --all`
- `gi embed --retry-failed`

---

## CLI Command Reference

All commands support `--help` for detailed usage information.

### Setup & Diagnostics

| Command | CP | Description |
|---------|-----|-------------|
| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
| `gi auth-test` | 0 | Verify GitLab authentication |
| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
| `gi doctor --json` | 0 | JSON output for scripting |
| `gi version` | 0 | Show installed version |

### Data Ingestion

| Command | CP | Description |
|---------|-----|-------------|
| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
| `gi generate-docs` | 3A | Extract documents from issues/MRs/discussions |
| `gi embed --all` | 3B | Generate embeddings for all documents |
| `gi embed --retry-failed` | 3B | Retry failed embeddings |
| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
| `gi sync --force` | 5 | Override stale lock after operator review |
| `gi sync --no-embed` | 5 | Sync without embedding (faster) |

### Data Inspection

| Command | CP | Description |
|---------|-----|-------------|
| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
| `gi list mrs [--limit=N]` | 2 | List merge requests |
| `gi count issues` | 1 | Count issues |
| `gi count mrs` | 2 | Count merge requests |
| `gi count discussions --type=issue` | 1 | Count issue discussions |
| `gi count discussions --type=mr` | 2 | Count MR discussions |
| `gi count discussions` | 2 | Count all discussions |
| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
| `gi count notes` | 2 | Count all notes (excluding system) |
| `gi show issue <iid> [--project=PATH]` | 1 | Show issue details (prompts if iid is ambiguous across projects) |
| `gi show mr <iid> [--project=PATH]` | 2 | Show MR details with discussions |
| `gi stats` | 3B | Embedding coverage statistics |
| `gi stats --json` | 3B | JSON stats for scripting |
| `gi sync-status` | 1 | Show cursor positions and last sync |

### Search

| Command | CP | Description |
|---------|-----|-------------|
| `gi search "query"` | 4 | Hybrid semantic + lexical search |
| `gi search "query" --mode=lexical` | 3A | Lexical-only search (no Ollama required) |
| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type |
| `gi search "query" --author=USERNAME` | 4 | Filter by author |
| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date |
| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) |
| `gi search "query" --project=PATH` | 4 | Filter by project |
| `gi search "query" --path=FILE` | 4 | Filter by file path |
| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) |
| `gi search "query" --json` | 4 | JSON output for scripting |
| `gi search "query" --explain` | 4 | Show ranking breakdown |

### Database Management

| Command | CP | Description |
|---------|-----|-------------|
| `gi backup` | 0 | Create timestamped database backup |
| `gi reset --confirm` | 0 | Delete database and reset cursors |

---

## Error Handling

Common errors and their resolutions:

### Configuration Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Config file not found` | No gi.config.json | Run `gi init` to create configuration |
| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually |
| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config |

### Authentication Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` |
| `401 Unauthorized` | Invalid or expired token | Generate a new token with `read_api` scope |
| `403 Forbidden` | Token lacks permissions | Ensure the token has `read_api` scope |

### GitLab API Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Project not found: group/project` | Invalid project path | Verify the path matches the GitLab URL (case-sensitive) |
| `429 Too Many Requests` | Rate limited | Wait for the Retry-After period; sync will auto-retry |
| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity |

### Data Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `No documents indexed` | Sync not run | Run `gi sync` first |
| `No results found` | Query too specific | Try broader search terms |
| `Database locked` | Concurrent access | Wait for the other process; use `gi sync --force` if the lock is stale |

### Embedding Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` |
| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` |
| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` |

### Operational Behavior

| Scenario | Behavior |
|----------|----------|
| **Ctrl+C during sync** | Graceful shutdown: finishes the current page, commits the cursor, exits cleanly. Resume with `gi sync`. |
| **Disk full during write** | Fails with a clear error. Cursor is preserved at the last successful commit. Free space and resume. |
| **Stale lock detected** | A lock held > 10 minutes without a heartbeat is considered stale. The next sync auto-recovers. |
| **Network interruption** | Retries with exponential backoff. After max retries, the sync fails but the cursor is preserved. |
| **Embedding permanent failure** | After 3 retries, the document stays in `embedding_metadata` with `last_error` populated. Use `gi embed --retry-failed` to retry later, or `gi stats` to see the failed count. Documents with failed embeddings are excluded from vector search but included in FTS. |
| **Orphaned records** | MVP: no automatic cleanup. The `last_seen_at` field enables future detection of items deleted in GitLab. Post-MVP: `gi gc --dry-run` to identify orphans, `gi gc --confirm` to remove them. |

---

## Database Management

### Database Location

The SQLite database is stored at an XDG-compliant location:

```
~/.local/share/gi/data.db
```

This can be overridden in `gi.config.json`:

```json
{
  "storage": {
    "dbPath": "/custom/path/to/data.db"
  }
}
```

### Backup

Create a timestamped backup of the database:

```bash
gi backup
# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db
```

Backups are created with SQLite's `.backup` command (safe even during active writes thanks to WAL mode).

### Reset

To completely reset the database and all sync cursors:

```bash
gi reset --confirm
```

This deletes:

- The database file
- All sync cursors
- All embeddings

You'll need to run `gi sync` again to repopulate.

### Schema Migrations

The database schema is version-tracked, and migrations auto-apply on startup:

1. On first run, the schema is created at the latest version
2. On subsequent runs, pending migrations are applied automatically
3. The migration version is stored in the `schema_version` table
4. Migrations are idempotent and reversible where possible

**Manual migration check:**

```bash
gi doctor --json | jq '.checks.database'
# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }
```

---

## Future Work (Post-MVP)

The following features are explicitly deferred to keep the MVP scope focused:

| Feature | Description | Depends On |
|---------|-------------|------------|
| **File History** | Query "what decisions were made about src/auth/login.ts?" Requires an mr_files table (MR→file linkage) and commit-level indexing | MVP complete |
| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| **MCP Server** | Expose search as an MCP tool for Claude Code integration | Checkpoint 4 |
| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |

**Checkpoint 6 (File History) Schema Preview:**

```sql
-- Deferred from MVP; added when the file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from the notes.type='DiffNote' position object in the GitLab API
CREATE TABLE note_positions (
  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
  old_path TEXT,
  new_path TEXT,
  old_line INTEGER,
  new_line INTEGER,
  position_type TEXT -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```

---

## Verification Strategy

Each checkpoint includes:

1. **Automated tests** - Unit tests for data transformations, integration tests for API calls
2. **CLI smoke tests** - Manual commands with expected outputs documented
3. **Data integrity checks** - Count verification against GitLab, schema validation
4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+)

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery |
| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
| Pathological discussions | Queue-based discussion fetching; one bad thread doesn't block the entire sync |
| Empty search results with filters | Adaptive recall (topK 50→200 when filtered) |

**SQLite Performance Defaults (MVP):**

- Enable `PRAGMA journal_mode=WAL;` on every connection
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources

---

## Schema Summary

| Table | Checkpoint | Purpose |
|-------|------------|---------|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations (with heartbeat) |
| app_locks | 0 | Crash-safe single-flight lock |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage (gitlab_id as TEXT) |
| schema_version | 0 | Database migration version tracking |
| issues | 1 | Normalized issues (unique by project+iid) |
| labels | 1 | Label definitions (unique by project+name) |
| issue_labels | 1 | Issue-label junction |
| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) |
| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) |
| merge_requests | 2 | Normalized MRs (unique by project+iid) |
| mr_labels | 2 | MR-label junction |
| documents | 3A | Unified searchable documents with truncation metadata |
| document_labels | 3A | Document-label junction for fast filtering |
| document_paths | 3A | Fast path filtering for documents (DiffNote file paths) |
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (FTS5 with Porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |

---

## Resolved Decisions

| Question | Decision | Rationale |
|----------|----------|-----------|
| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability |
| System notes | **Store flagged, exclude from embeddings** | Preserves the audit trail while avoiding semantic noise |
| DiffNote paths | **Capture now** | Enables immediate file/path search without the full file-history feature |
| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for the file-history feature |
| Labels | **Index as filters** | The `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 min is sufficient |
| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
| Discussions sync | **Resumable queue model** | Queue-based fetching ensures one pathological thread doesn't block the entire sync |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
| Embedding truncation | **Note-boundary-aware middle drop** | Never cut mid-note; preserves semantic coherence |
| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balances throughput, memory, and error isolation |
| FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
| Database location | **XDG-compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
| Ctrl+C handling | **Graceful shutdown** | Finish the page, commit the cursor, exit cleanly |
| Empty state UX | **Actionable messages** | Guide the user to the next step |
| raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | **Always scope=all&state=all** | Ensures all historical data is fetched, including closed items |
| Pagination | **X-Next-Page headers with empty-page fallback** | Headers are more robust than empty-page detection |
| Integration tests | **Mocked by default, live tests optional** | Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1 |
| Search recall with filters | **Adaptive topK (50→200 when filtered)** | Prevents "no results" when relevant docs exist outside the top 50 |
| RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score shown in `--explain` |
| --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does a prefix match; otherwise exact match |
| CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before taking on embedding infrastructure risk |
| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss is deprecated and lacks Apple Silicon support; sqlite-vec is pure C and runs anywhere |
| CLI framework | **Commander.js** | Simple, lightweight, sufficient for a single-user CLI tool |
| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars are truncated at a char boundary with a `[truncated]` marker |
| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single documents |

---

## Next Steps

1. User approves this spec
2. Generate the Checkpoint 0 PRD for project setup
3. Implement Checkpoint 0
4. Human validates → proceed to Checkpoint 1
5. Repeat for each checkpoint
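The hybrid-ranking decisions above (RRF over weighted sums, per-query 0-1 normalization) can be sketched in TypeScript. This is an illustrative sketch, not the implementation: the function name `rrfFuse` and the damping constant `k = 60` are assumptions; the spec itself only fixes the normalization rule `score = rrfScore / max(rrfScore)`.

```typescript
interface RankedDoc {
  id: number;
  score: number;    // normalized 0-1 per query (spec decision)
  rawScore: number; // raw RRF sum, the kind of value `--explain` would surface
}

// Reciprocal Rank Fusion: combine vector and FTS result lists without
// normalizing their incomparable raw scores. Each document's fused score
// is the sum of 1 / (k + rank) over every list it appears in.
// k = 60 is the conventional RRF damping constant (assumed default).
function rrfFuse(rankings: number[][], k = 60): RankedDoc[] {
  const raw = new Map<number, number>();
  for (const list of rankings) {
    list.forEach((docId, idx) => {
      // Ranks are 1-based: the top hit contributes 1 / (k + 1)
      raw.set(docId, (raw.get(docId) ?? 0) + 1 / (k + idx + 1));
    });
  }
  const max = Math.max(...Array.from(raw.values()));
  return Array.from(raw.entries())
    .map(([id, rawScore]) => ({ id, rawScore, score: rawScore / max }))
    .sort((a, b) => b.rawScore - a.rawScore);
}
```

A document ranked near the top of both the vector and FTS lists accumulates the largest sum and surfaces first, even though the two retrievers score on incomparable scales; that is the reason the spec prefers RRF to weighted score sums.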
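The retry behavior described above (exponential backoff that respects GitLab's Retry-After header on 429s) can be reduced to a pure delay schedule. The function name `backoffDelayMs`, the 1 s base, the 60 s cap, and the doubling schedule are illustrative assumptions, not spec values; jitter is omitted for clarity.

```typescript
// Delay before retry attempt `attempt` (1-based). A Retry-After value
// (seconds, from a 429 response) takes precedence over the exponential
// schedule; otherwise the delay doubles each attempt, capped at capMs.
function backoffDelayMs(
  attempt: number,
  retryAfterSeconds?: number,
  baseMs = 1000,
  capMs = 60_000,
): number {
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}
```

A retry loop around the GitLab client would sleep for this value between attempts; adding a small random jitter on top is a common refinement to avoid synchronized retries.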
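The "note-boundary-aware middle drop" truncation decision can also be sketched: keep whole notes from the head and tail of a thread and drop notes from the middle, so no note is ever cut mid-text. The function name, the alternating head/tail strategy, and the marker string are assumptions; only the never-cut-mid-note rule and the ~32000-char budget come from the spec. The separate edge case of a single note longer than the budget (char-boundary cut with a `[truncated]` marker) is omitted here, and the small marker string is ignored in the budget for simplicity.

```typescript
// Drop middle notes of an over-budget thread, keeping whole notes only.
function truncateNotes(notes: string[], budget = 32_000): string[] {
  const total = notes.reduce((n, s) => n + s.length, 0);
  if (total <= budget) return notes;
  const head: string[] = [];
  const tail: string[] = [];
  let used = 0;
  let lo = 0;
  let hi = notes.length - 1;
  let fromFront = true;
  // Alternate taking whole notes from the front and the back until the
  // next whole note would exceed the budget; the rest is dropped.
  while (lo <= hi) {
    const next = fromFront ? notes[lo] : notes[hi];
    if (used + next.length > budget) break;
    used += next.length;
    if (fromFront) head.push(notes[lo++]);
    else tail.unshift(notes[hi--]);
    fromFront = !fromFront;
  }
  return [...head, "[... middle of thread truncated ...]", ...tail];
}
```

Keeping the opening notes (problem statement) and the closing notes (resolution) preserves the parts of a discussion most useful for decision traceability.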