Taylor Eernisse 986bc59f6a docs: Add comprehensive documentation and planning artifacts
README.md provides complete user documentation:
- Installation via cargo install or build from source
- Quick start guide with example commands
- Configuration file format with all options documented
- Full command reference for init, auth-test, doctor, ingest,
  list, show, count, sync-status, migrate, and version
- Database schema overview covering projects, issues, milestones,
  assignees, labels, discussions, notes, and raw payloads
- Development setup with test, lint, and debug commands

SPEC.md updated from original TypeScript planning document:
- Added note clarifying this is historical (implementation uses Rust)
- Updated references from the now-deprecated sqlite-vss library to sqlite-vec
- Added architecture overview with Technology Choices rationale
- Expanded project structure showing all planned modules

docs/prd/ contains detailed checkpoint planning:
- checkpoint-0.md: Initial project vision and requirements
- checkpoint-1.md: Revised planning after technology decisions

These documents capture the evolution from initial concept through
the decision to use Rust for performance and type safety.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 11:27:40 -05:00


GitLab Knowledge Engine - Spec Document

Note: This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See README.md for current documentation.

Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.


Quick Start

Prerequisites

Requirement Version Notes
Node.js 20+ LTS recommended
npm 10+ Comes with Node.js
Ollama Latest Optional for semantic search; lexical search works without it

Installation

# Clone and install
git clone https://github.com/your-org/gitlab-inbox.git
cd gitlab-inbox
npm install
npm run build
npm link  # Makes `gi` available globally

First Run

  1. Set your GitLab token (create at GitLab > Settings > Access Tokens with read_api scope):

    export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"
    
  2. Run the setup wizard:

    gi init
    

    This creates gi.config.json with your GitLab URL and project paths.

  3. Verify your environment:

    gi doctor
    

    All checks should pass (Ollama warning is OK if you only need lexical search).

  4. Sync your data:

    gi sync
    

    Initial sync takes 10-20 minutes depending on repo size and rate limits.

  5. Search:

    gi search "authentication redesign"
    

Troubleshooting First Run

Symptom Solution
Config file not found Run gi init first
GITLAB_TOKEN not set Export the environment variable
401 Unauthorized Check token has read_api scope
Project not found: group/project Verify project path in GitLab URL
Ollama connection refused Start Ollama or use --mode=lexical for search

Discovery Summary

Pain Points Identified

  1. Knowledge discovery - Tribal knowledge buried in old MRs/issues that nobody can find
  2. Decision traceability - Hard to find why decisions were made; context scattered across issue comments and MR discussions

Constraints

Constraint Detail
Hosting Self-hosted only, no external APIs
Compute Local dev machine (M-series Mac assumed)
GitLab Access Self-hosted instance, PAT access, no webhooks (could request)
Build Method AI agents will implement; user is TypeScript expert for review

Target Use Cases (Priority Order)

  1. MVP: Semantic Search - "Find discussions about authentication redesign"
  2. Future: File/Feature History - "What decisions were made about src/auth/login.ts?"
  3. Future: Personal Tracking - "What am I assigned to or mentioned in?"
  4. Future: Person Context - "What's @johndoe's background in this project?"

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        GitLab API                                │
│                    (Issues, MRs, Notes)                          │
└─────────────────────────────────────────────────────────────────┘
  (commit-level indexing is explicitly post-MVP)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Data Ingestion Layer                         │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                       │
│  - Raw JSON storage for replay                                   │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Data Processing Layer                        │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - MVP relationships: parent-child FKs + label/path associations│
│    (full cross-entity "decision graph" is post-MVP scope)       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Storage Layer                               │
│  - SQLite + sqlite-vec + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                      │
│  - Vector embeddings for semantic search                         │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Query Interface                             │
│  - CLI for human testing                                         │
│  - JSON API for AI agent testing                                 │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘
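The "content hashing for change detection" step in the processing layer can be sketched with Node's crypto module: hash each document's canonical text, and only re-extract/re-embed when the stored hash differs. The function names are illustrative, not the spec's final API.

```typescript
import { createHash } from 'node:crypto';

// Hash the canonical searchable text of a document (sha256 hex digest).
export function contentHash(canonicalText: string): string {
  return createHash('sha256').update(canonicalText, 'utf8').digest('hex');
}

// Skip re-embedding when the text is byte-identical to what was indexed.
export function needsReembedding(canonicalText: string, storedHash: string | null): boolean {
  return storedHash !== contentHash(canonicalText);
}
```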

Technology Choices

Component Choice Rationale
Language TypeScript/Node.js User expertise, good GitLab libs, AI agent friendly
Database SQLite + sqlite-vec + FTS5 Zero-config, portable, vector search via pure-C extension
Embeddings Ollama + nomic-embed-text Self-hosted, runs well on Apple Silicon, 768-dim vectors
CLI Framework Commander.js Simple, lightweight, well-documented
Logging pino Fast, JSON-structured, low overhead
Validation Zod TypeScript-first schema validation

Alternative Considered: sqlite-vss

  • sqlite-vss was the original choice but is now deprecated
  • No Apple Silicon support (no prebuilt ARM binaries)
  • Replaced by sqlite-vec, which is pure C with no dependencies
  • sqlite-vec uses vec0 virtual table (vs vss0)

Alternative Considered: Postgres + pgvector

  • Pros: More scalable, better for production multi-user
  • Cons: Requires running Postgres, heavier setup
  • Decision: Start with SQLite for simplicity; migration path exists if needed

Project Structure

gitlab-inbox/
├── src/
│   ├── cli/
│   │   ├── index.ts          # CLI entry point (Commander.js)
│   │   └── commands/         # One file per command group
│   │       ├── init.ts
│   │       ├── sync.ts
│   │       ├── search.ts
│   │       ├── list.ts
│   │       └── doctor.ts
│   ├── core/
│   │   ├── config.ts         # Config loading/validation (Zod)
│   │   ├── db.ts             # Database connection + migrations
│   │   ├── errors.ts         # Custom error classes
│   │   └── logger.ts         # pino logger setup
│   ├── gitlab/
│   │   ├── client.ts         # GitLab API client with rate limiting
│   │   ├── types.ts          # GitLab API response types
│   │   └── transformers/     # Payload → normalized schema
│   │       ├── issue.ts
│   │       ├── merge-request.ts
│   │       └── discussion.ts
│   ├── ingestion/
│   │   ├── issues.ts
│   │   ├── merge-requests.ts
│   │   └── discussions.ts
│   ├── documents/
│   │   ├── extractor.ts      # Document generation from entities
│   │   └── truncation.ts     # Note-boundary aware truncation
│   ├── embedding/
│   │   ├── ollama.ts         # Ollama client
│   │   └── pipeline.ts       # Batch embedding orchestration
│   ├── search/
│   │   ├── hybrid.ts         # RRF ranking logic
│   │   ├── fts.ts            # FTS5 queries
│   │   └── vector.ts         # sqlite-vec queries
│   └── types/
│       └── index.ts          # Shared TypeScript types
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── live/                 # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│   └── fixtures/
│       └── golden-queries.json
├── migrations/               # Numbered SQL migration files
│   ├── 001_initial.sql
│   └── ...
├── gi.config.json           # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md

Dependencies

Runtime Dependencies

{
  "dependencies": {
    "better-sqlite3": "latest",
    "sqlite-vec": "latest",
    "commander": "latest",
    "zod": "latest",
    "pino": "latest",
    "pino-pretty": "latest",
    "ora": "latest",
    "chalk": "latest",
    "cli-table3": "latest"
  }
}
Package Purpose
better-sqlite3 Synchronous SQLite driver (fast, native)
sqlite-vec Vector search extension (pure C, cross-platform)
commander CLI argument parsing
zod Schema validation for config and inputs
pino Structured JSON logging
pino-pretty Dev-mode log formatting
ora CLI spinners for progress indication
chalk Terminal colors
cli-table3 ASCII tables for list output

Dev Dependencies

{
  "devDependencies": {
    "typescript": "latest",
    "@types/better-sqlite3": "latest",
    "@types/node": "latest",
    "vitest": "latest",
    "msw": "latest",
    "eslint": "latest",
    "@typescript-eslint/eslint-plugin": "latest",
    "@typescript-eslint/parser": "latest",
    "tsx": "latest"
  }
}
Package Purpose
typescript TypeScript compiler
vitest Test runner
msw Mock Service Worker for API mocking in tests
eslint Linting
tsx Run TypeScript directly during development

GitLab API Strategy

Primary Resources (Bulk Fetch)

Issues and MRs support efficient bulk fetching with incremental sync:

GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100

Required query params for completeness:

  • scope=all - include all issues/MRs, not just authored by current user
  • state=all - include closed items (GitLab defaults may exclude them)

Without these params, the 2+ years of historical data would be incomplete.

Dependent Resources (Per-Parent Fetch)

Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:

GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N

Pagination: Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss.
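The pagination rules above (follow x-next-page while present; fall back to stopping on an empty page when the header is missing) can be sketched as a small loop. `PageFetcher` is a hypothetical abstraction standing in for the real HTTP client so the loop logic can be exercised without a live GitLab instance.

```typescript
type Page<T> = { items: T[]; nextPage: string | null };   // nextPage: x-next-page header value, null if absent
type PageFetcher<T> = (page: number) => Promise<Page<T>>;

// Drain every page of a paginated endpoint.
export async function fetchAllPages<T>(fetchPage: PageFetcher<T>): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  for (;;) {
    const { items, nextPage } = await fetchPage(page);
    all.push(...items);
    if (nextPage) {
      page = Number(nextPage);          // header present: trust it
    } else if (nextPage === null && items.length > 0) {
      page += 1;                        // header missing: advance until an empty page
    } else {
      break;                            // empty x-next-page (last page) or empty page
    }
  }
  return all;
}
```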

Sync Pattern

Initial sync:

  1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
  2. For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead)
  3. Fetch all MRs (paginated, ~60 calls)
  4. For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead)
  5. Total: thousands of API calls for initial sync

API Call Estimation Formula:

total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue
            + ceil(mrs/100) + mrs × avg_discussion_pages_per_mr

Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent:

  • Issue list: 30 calls
  • Issue discussions: 3,000 × 1.2 = 3,600 calls
  • MR list: 30 calls
  • MR discussions: 3,000 × 1.2 = 3,600 calls
  • Total: ~7,260 calls

This matters for rate limit planning and setting realistic "10-20 minutes" expectations.
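The estimation formula, as a checkable helper (discussion-page products are rounded up, since partial pages still cost a full API call). The numbers in the test are the document's worked example; the average-pages inputs remain assumptions about thread length.

```typescript
export function estimateInitialSyncCalls(
  issues: number,
  mrs: number,
  avgPagesPerIssue: number,
  avgPagesPerMr: number,
  perPage = 100,
): number {
  const issueList = Math.ceil(issues / perPage);               // bulk issue pages
  const mrList = Math.ceil(mrs / perPage);                     // bulk MR pages
  const issueDiscussions = Math.ceil(issues * avgPagesPerIssue); // one or more calls per issue
  const mrDiscussions = Math.ceil(mrs * avgPagesPerMr);          // one or more calls per MR
  return issueList + issueDiscussions + mrList + mrDiscussions;
}
```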

Incremental sync:

  1. Fetch issues where updated_after=cursor (bulk)
  2. For EACH updated issue → refetch ALL its discussions
  3. Fetch MRs where updated_after=cursor (bulk)
  4. For EACH updated MR → refetch ALL its discussions

Critical Assumption (Softened)

We expect that adding a note or discussion bumps the parent's updated_at, but we do not rely on that behavior exclusively.

Mitigations (MVP):

  1. Tuple cursor semantics: Cursor is a stable tuple (updated_at, gitlab_id). Ties are handled explicitly - process all items with equal updated_at before advancing cursor.
  2. Rolling backfill window: Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
  3. Periodic full re-sync: Remains optional as an extra safety net (gi sync --full).

The backfill window provides 80% of the safety of full resync at <5% of the API cost.
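Mitigations 1 and 2 can be sketched as two pure functions: a strict (updated_at, gitlab_id) tuple comparison for cursor advancement, and an effective `updated_after` that rewinds the stored cursor by the backfill window. Names are illustrative, not the final API.

```typescript
export interface Cursor { updatedAt: number; gitlabId: number }  // ms epoch + GitLab id

// True when `item` sorts strictly after the cursor in (updated_at, id) order,
// i.e. it has not been fully processed yet. Ties on updated_at fall through
// to the id tie-breaker, so equal-timestamp items are handled explicitly.
export function isAfterCursor(item: Cursor, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.gitlabId > cursor.gitlabId;
}

// Effective updated_after sent to GitLab: the stored cursor, rewound so the
// last `backfillDays` are always re-fetched.
export function effectiveUpdatedAfter(cursorMs: number, nowMs: number, backfillDays = 14): number {
  const windowStart = nowMs - backfillDays * 24 * 60 * 60 * 1000;
  return Math.min(cursorMs, windowStart);
}
```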

Rate Limiting

  • Default: 10 requests/second with exponential backoff
  • Respect Retry-After headers on 429 responses
  • Add jitter to avoid thundering herd on retry
  • Separate concurrency limits:
    • sync.primaryConcurrency: concurrent requests for issues/MRs list endpoints (default 4)
    • sync.dependentConcurrency: concurrent requests for discussions endpoints (default 2, lower to avoid 429s)
    • Bound concurrency per-project to avoid one repo starving the other
  • Initial sync estimate: 10-20 minutes depending on rate limits
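The retry-delay policy above can be sketched as one function: honor Retry-After when the server sends it, otherwise exponential backoff with full jitter. The base and cap values are illustrative defaults, not mandated by the spec.

```typescript
export function retryDelayMs(
  attempt: number,                 // 0-based retry attempt
  retryAfterSeconds?: number,      // from a 429 Retry-After header, if present
  baseMs = 500,
  capMs = 30_000,
  random: () => number = Math.random,  // injectable for testing
): number {
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;  // server knows best
  const exp = Math.min(capMs, baseMs * 2 ** attempt);  // capped exponential growth
  return Math.floor(random() * exp);                   // full jitter avoids thundering herd
}
```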

Checkpoint Structure

Each checkpoint is a testable milestone where a human can validate the system works before proceeding.

Checkpoint 0: Project Setup

Deliverable: Scaffolded project with GitLab API connection verified and project resolution working

Automated Tests (Vitest):

tests/unit/config.test.ts
  ✓ loads config from gi.config.json
  ✓ throws if config file missing
  ✓ throws if required fields missing (baseUrl, projects)
  ✓ validates project paths are non-empty strings

tests/unit/db.test.ts
  ✓ creates database file if not exists
  ✓ applies migrations in order
  ✓ sets WAL journal mode
  ✓ enables foreign keys

tests/integration/gitlab-client.test.ts
  ✓ (mocked) authenticates with valid PAT
  ✓ (mocked) returns 401 for invalid PAT
  ✓ (mocked) fetches project by path
  ✓ (mocked) handles rate limiting (429) with retry

tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI)
  ✓ authenticates with real PAT against configured baseUrl
  ✓ fetches real project by path
  ✓ handles actual rate limiting behavior

tests/integration/app-lock.test.ts
  ✓ acquires lock successfully
  ✓ updates heartbeat during operation
  ✓ detects stale lock and recovers
  ✓ refuses concurrent acquisition

tests/integration/init.test.ts
  ✓ creates config file with valid structure
  ✓ validates GitLab URL format
  ✓ validates GitLab connection before writing config
  ✓ validates each project path exists in GitLab
  ✓ fails if token not set
  ✓ fails if GitLab auth fails
  ✓ fails if any project path not found
  ✓ prompts before overwriting existing config
  ✓ respects --force to skip confirmation
  ✓ generates gi.config.json with sensible defaults

Manual CLI Smoke Tests:

Command Expected Output Pass Criteria
gi auth-test Authenticated as @username (User Name) Shows GitLab username and display name
gi doctor Status table with ✓/✗ for each check All checks pass (or Ollama shows warning if not running)
gi doctor --json JSON object with check results Valid JSON, success: true for required checks
GITLAB_TOKEN=invalid gi auth-test Error message Non-zero exit code, clear error about auth failure
gi init Interactive prompts Creates valid gi.config.json
gi init (config exists) Confirmation prompt Warns before overwriting
gi --help Command list Shows all available commands
gi version Version number Shows installed version
gi sync-status Last sync time, cursor positions Shows successful last run

Data Integrity Checks:

  • projects table contains rows for each configured project path
  • gitlab_project_id matches actual GitLab project IDs
  • raw_payloads contains project JSON for each synced project

Scope:

  • Project structure (TypeScript, ESLint, Vitest)
  • GitLab API client with PAT authentication
  • Environment and project configuration
  • Basic CLI scaffold with auth-test command
  • doctor command for environment verification
  • Projects table and initial project resolution (no issue/MR ingestion yet)
  • DB migrations + WAL + FK enforcement
  • Sync tracking with crash-safe single-flight lock (heartbeat-based)
  • Rate limit handling with exponential backoff + jitter
  • gi init command for guided setup:
    • Prompts for GitLab base URL
    • Prompts for project paths (comma-separated or multiple prompts)
    • Prompts for token environment variable name (default: GITLAB_TOKEN)
    • Validates before writing config:
      • Token must be set in environment
      • Tests auth with GET /user endpoint
      • Validates each project path with GET /projects/:path
      • Only writes config after all validations pass
    • Generates gi.config.json with sensible defaults
  • gi --help shows all available commands
  • gi <command> --help shows command-specific help
  • gi version shows installed version
  • First-run detection: if no config exists, suggest gi init

Configuration (MVP):

// gi.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "sync": {
    "backfillDays": 14,
    "staleLockMinutes": 10,
    "heartbeatIntervalSeconds": 30,
    "cursorRewindSeconds": 2,
    "primaryConcurrency": 4,
    "dependentConcurrency": 2
  },
  "storage": {
    "compressRawPayloads": true
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434",
    "concurrency": 4
  }
}
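
The real config loader uses Zod, but the checks the Checkpoint 0 tests describe (required baseUrl, non-empty projects, non-empty paths, tokenEnvVar default) can be sketched dependency-free. The row shapes and error messages here are illustrative.

```typescript
export interface GiConfig {
  gitlab: { baseUrl: string; tokenEnvVar: string };
  projects: { path: string }[];
}

export function validateConfig(raw: any): GiConfig {
  const baseUrl = raw?.gitlab?.baseUrl;
  if (typeof baseUrl !== 'string' || !/^https?:\/\//.test(baseUrl)) {
    throw new Error('Invalid config: gitlab.baseUrl must be an http(s) URL');
  }
  const projects = raw.projects;
  if (!Array.isArray(projects) || projects.length === 0) {
    throw new Error('Invalid config: projects must be a non-empty array');
  }
  for (const p of projects) {
    if (typeof p?.path !== 'string' || p.path.trim() === '') {
      throw new Error('Invalid config: every project needs a non-empty path');
    }
  }
  return {
    gitlab: { baseUrl, tokenEnvVar: raw.gitlab.tokenEnvVar ?? 'GITLAB_TOKEN' },
    projects,
  };
}
```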

Raw Payload Compression:

  • When storage.compressRawPayloads: true (default), raw JSON payloads are gzip-compressed before storage
  • raw_payloads.content_encoding indicates 'identity' (uncompressed) or 'gzip' (compressed)
  • Compression typically reduces storage by 70-80% for JSON payloads
  • Decompression is handled transparently when reading payloads
  • Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage
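The encode/decode path can be sketched with node:zlib: gzip on write when compression is enabled, transparent decode on read keyed off content_encoding. The row shape mirrors the raw_payloads columns; function names are illustrative.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

export interface StoredPayload { contentEncoding: 'identity' | 'gzip'; payload: Buffer }

// Write path: compress when storage.compressRawPayloads is enabled.
export function encodePayload(json: string, compress: boolean): StoredPayload {
  return compress
    ? { contentEncoding: 'gzip', payload: gzipSync(Buffer.from(json, 'utf8')) }
    : { contentEncoding: 'identity', payload: Buffer.from(json, 'utf8') };
}

// Read path: decompression is transparent to callers.
export function decodePayload(row: StoredPayload): string {
  const buf = row.contentEncoding === 'gzip' ? gunzipSync(row.payload) : row.payload;
  return buf.toString('utf8');
}
```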

Error Classes (src/core/errors.ts):

// Base error class with error codes for programmatic handling
export class GiError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
    this.name = 'GiError';
  }
}

// Config errors
export class ConfigNotFoundError extends GiError {
  constructor() {
    super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
  }
}

export class ConfigValidationError extends GiError {
  constructor(details: string) {
    super(`Invalid config: ${details}`, 'CONFIG_INVALID');
  }
}

// GitLab API errors
export class GitLabAuthError extends GiError {
  constructor() {
    super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
  }
}

export class GitLabNotFoundError extends GiError {
  constructor(resource: string) {
    super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
  }
}

export class GitLabRateLimitError extends GiError {
  constructor(public readonly retryAfter: number) {
    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
  }
}

// Database errors
export class DatabaseLockError extends GiError {
  constructor() {
    super('Another sync is running. Use --force to override.', 'DB_LOCKED');
  }
}

// Embedding errors
export class OllamaConnectionError extends GiError {
  constructor() {
    super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
  }
}

export class EmbeddingError extends GiError {
  constructor(documentId: number, reason: string) {
    super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
  }
}

Logging Strategy (src/core/logger.ts):

import pino from 'pino';

// Logs go to stderr, results to stdout (allows clean JSON piping)
const level = process.env.LOG_LEVEL || 'info';

// pino does not allow combining the `transport` option with a destination
// stream, so each environment picks exactly one output path (both on stderr).
export const logger = process.env.NODE_ENV === 'production'
  ? pino({ level }, pino.destination(2))
  : pino({
      level,
      transport: {
        target: 'pino-pretty',
        options: { colorize: true, destination: 2 }  // 2 = stderr
      }
    });

Log Levels:

Level When to use
debug Detailed sync progress, API calls, SQL queries
info Sync start/complete, document counts, search timing
warn Rate limits hit, Ollama unavailable (fallback to FTS), retries
error Failures that stop operations

Logging Conventions:

  • Always include structured context: logger.info({ project, count }, 'Fetched issues')
  • Errors include err object: logger.error({ err, documentId }, 'Embedding failed')
  • All logs to stderr so gi search --json output stays clean on stdout

DB Runtime Defaults (Checkpoint 0):

  • On every connection:
    • PRAGMA journal_mode=WAL;
    • PRAGMA foreign_keys=ON;

Schema (Checkpoint 0):

-- Projects table (configured targets)
CREATE TABLE projects (
  id INTEGER PRIMARY KEY,
  gitlab_project_id INTEGER UNIQUE NOT NULL,
  path_with_namespace TEXT NOT NULL,
  default_branch TEXT,
  web_url TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);

-- Sync tracking for reliability
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  heartbeat_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL,          -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL,         -- 'ingest issues' | 'sync' | etc.
  error TEXT
);

-- Crash-safe single-flight lock (DB-enforced)
CREATE TABLE app_locks (
  name TEXT PRIMARY KEY,         -- 'sync'
  owner TEXT NOT NULL,           -- random run token (UUIDv4)
  acquired_at INTEGER NOT NULL,
  heartbeat_at INTEGER NOT NULL
);

-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  resource_type TEXT NOT NULL,   -- 'issues' | 'merge_requests'
  updated_at_cursor INTEGER,     -- last fully processed updated_at (ms epoch)
  tie_breaker_id INTEGER,        -- last fully processed gitlab_id (for stable ordering)
  PRIMARY KEY(project_id, resource_type)
);

-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL,          -- 'gitlab'
  project_id INTEGER REFERENCES projects(id),  -- nullable for instance-level resources
  resource_type TEXT NOT NULL,   -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
  gitlab_id TEXT NOT NULL,       -- TEXT because discussion IDs are strings; numeric IDs stored as strings
  fetched_at INTEGER NOT NULL,
  content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
  payload BLOB NOT NULL          -- raw JSON or gzip-compressed JSON
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);

-- Schema version tracking for migrations
CREATE TABLE schema_version (
  version INTEGER PRIMARY KEY,
  applied_at INTEGER NOT NULL,
  description TEXT
);

Checkpoint 1: Issue Ingestion

Deliverable: All issues + labels + issue discussions from target repos stored locally with resumable cursor-based sync

Automated Tests (Vitest):

tests/unit/issue-transformer.test.ts
  ✓ transforms GitLab issue payload to normalized schema
  ✓ extracts labels from issue payload
  ✓ handles missing optional fields gracefully

tests/unit/pagination.test.ts
  ✓ fetches all pages when multiple exist
  ✓ respects per_page parameter
  ✓ follows X-Next-Page header until empty/absent
  ✓ falls back to empty-page stop if headers missing (robustness)

tests/unit/discussion-transformer.test.ts
  ✓ transforms discussion payload to normalized schema
  ✓ extracts notes array from discussion
  ✓ sets individual_note flag correctly
  ✓ flags system notes with is_system=1
  ✓ preserves note order via position field

tests/integration/issue-ingestion.test.ts
  ✓ inserts issues into database
  ✓ creates labels from issue payloads
  ✓ links issues to labels via junction table
  ✓ stores raw payload for each issue
  ✓ updates cursor after successful page commit
  ✓ resumes from cursor on subsequent runs

tests/integration/issue-discussion-ingestion.test.ts
  ✓ fetches discussions for each issue
  ✓ creates discussion rows with correct issue FK
  ✓ creates note rows linked to discussions
  ✓ stores system notes with is_system=1 flag
  ✓ handles individual_note=true discussions

tests/integration/sync-runs.test.ts
  ✓ creates sync_run record on start
  ✓ marks run as succeeded on completion
  ✓ marks run as failed with error message on failure
  ✓ refuses concurrent run (single-flight)
  ✓ allows --force to override stale running status

Manual CLI Smoke Tests:

Command Expected Output Pass Criteria
gi ingest --type=issues Progress bar, final count Completes without error
gi list issues --limit=10 Table of 10 issues Shows iid, title, state, author
gi list issues --project=group/project-one Filtered list Only shows issues from that project
gi count issues Issues: 1,234 (example) Count matches GitLab UI
gi show issue 123 Issue detail view Shows title, description, labels, discussions, URL. If multiple projects have issue #123, prompts for clarification or use --project=PATH
gi count discussions --type=issue Issue Discussions: 5,678 Non-zero count
gi count notes --type=issue Issue Notes: 12,345 (excluding 2,345 system) Non-zero count
gi sync-status Last sync time, cursor positions Shows successful last run

Data Integrity Checks:

  • SELECT COUNT(*) FROM issues matches GitLab issue count for configured projects
  • Every issue has a corresponding raw_payloads row
  • Labels in issue_labels junction all exist in labels table
  • sync_cursors has entry for each (project_id, 'issues') pair
  • Re-running gi ingest --type=issues fetches 0 new items (cursor is current)
  • SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue' is non-zero
  • Every discussion has at least one note
  • individual_note = true discussions have exactly one note

Scope:

  • Issue fetcher with pagination handling
  • Raw JSON storage in raw_payloads table
  • Normalized issue schema in SQLite
  • Labels ingestion derived from issue payload:
    • Always persist label names from labels: string[]
    • Optionally request with_labels_details=true to capture color/description when available
  • Issue discussions fetcher:
    • Uses GET /projects/:id/issues/:iid/discussions
    • Fetches all discussions for each issue during ingest
    • Preserve system notes but flag them with is_system=1
  • Incremental sync support (run tracking + per-project cursor)
  • Basic list/count CLI commands
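An illustrative transformer for a GitLab discussion payload into the normalized rows above: individual_note carried through, system notes flagged, and note order preserved as a 0-indexed position. The payload field names follow the GitLab discussions API; the output row shapes are this sketch's assumption, not the final schema mapping.

```typescript
interface GitLabNote { id: number; system: boolean; body: string; author: { username: string } }
interface GitLabDiscussion { id: string; individual_note: boolean; notes: GitLabNote[] }

export function transformDiscussion(d: GitLabDiscussion) {
  return {
    discussion: {
      gitlabDiscussionId: d.id,           // GitLab's string discussion ID
      individualNote: d.individual_note,  // standalone comment vs thread
    },
    notes: d.notes.map((n, position) => ({
      gitlabId: n.id,
      isSystem: n.system ? 1 : 0,         // preserved but flagged
      authorUsername: n.author.username,
      body: n.body,
      position,                           // array order from the API response
    })),
  };
}
```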

Reliability/Idempotency Rules:

  • Every ingest/sync creates a sync_runs row
  • Single-flight via DB-enforced app lock:
    • On start: acquire lock via transactional compare-and-swap:
      • BEGIN IMMEDIATE (acquires write lock immediately)
      • If no row exists → INSERT new lock
      • Else if heartbeat_at is stale (> staleLockMinutes) → UPDATE owner + timestamps
      • Else if owner matches current run → UPDATE heartbeat (re-entrant)
      • Else → ROLLBACK and fail fast (another run is active)
      • COMMIT
    • During run: update heartbeat_at every 30 seconds
    • If existing lock's heartbeat_at is stale (> 10 minutes), treat as abandoned and acquire
    • --force remains as operator override for edge cases, but should rarely be needed
  • Cursor advances only after successful transaction commit per page/batch
  • Ordering: updated_at ASC, tie-breaker gitlab_id ASC
  • Use explicit transactions for batch inserts
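The compare-and-swap steps above reduce to a decision table over the current app_locks row. A sketch of just that decision logic (the real version wraps it in BEGIN IMMEDIATE against SQLite; names here are illustrative):

```typescript
export interface LockRow { owner: string; heartbeatAt: number }  // ms epoch

export type LockDecision = 'insert' | 'steal-stale' | 'reenter' | 'fail';

export function decideLock(
  row: LockRow | null,
  me: string,            // this run's random token
  nowMs: number,
  staleLockMinutes = 10,
): LockDecision {
  if (row === null) return 'insert';                            // no lock held
  const staleMs = staleLockMinutes * 60 * 1000;
  if (nowMs - row.heartbeatAt > staleMs) return 'steal-stale';  // abandoned run
  if (row.owner === me) return 'reenter';                       // re-entrant heartbeat
  return 'fail';                                                // another live run
}
```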

Schema Preview:

CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  last_seen_at INTEGER NOT NULL,    -- updated on every upsert during sync
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER,                  -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);

-- Discussion threads for issues (MR discussions added in CP2)
CREATE TABLE discussions (
  id INTEGER PRIMARY KEY,
  gitlab_discussion_id TEXT NOT NULL,         -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER,                   -- FK enforced in CP2 via table-rebuild migration (SQLite ALTER TABLE cannot add constraints)
  noteable_type TEXT NOT NULL,                -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL,           -- standalone comment vs threaded discussion
  first_note_at INTEGER,                      -- for ordering discussions
  last_note_at INTEGER,                       -- for "recently active" queries
  last_seen_at INTEGER NOT NULL,              -- updated on every upsert during sync
  resolvable BOOLEAN,                         -- MR discussions can be resolved
  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT,                                  -- 'DiscussionNote' | 'DiffNote' | null
  is_system BOOLEAN NOT NULL DEFAULT 0,       -- system notes (assignments, label changes, etc.)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  last_seen_at INTEGER NOT NULL,              -- updated on every upsert during sync
  position INTEGER,                           -- derived from array order in API response (0-indexed)
  resolvable BOOLEAN,
  resolved BOOLEAN,
  resolved_by TEXT,
  resolved_at INTEGER,
  -- DiffNote position metadata (only populated for MR DiffNotes in CP2)
  position_old_path TEXT,
  position_new_path TEXT,
  position_old_line INTEGER,
  position_new_line INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_system ON notes(is_system);

Checkpoint 2: MR Ingestion

Deliverable: All MRs + MR discussions + notes with DiffNote paths captured

Automated Tests (Vitest):

tests/unit/mr-transformer.test.ts
  ✓ transforms GitLab MR payload to normalized schema
  ✓ extracts labels from MR payload
  ✓ handles missing optional fields gracefully

tests/unit/diffnote-transformer.test.ts
  ✓ extracts DiffNote position metadata (paths and lines)
  ✓ handles missing position fields gracefully

tests/integration/mr-ingestion.test.ts
  ✓ inserts MRs into database
  ✓ creates labels from MR payloads
  ✓ links MRs to labels via junction table
  ✓ stores raw payload for each MR

tests/integration/mr-discussion-ingestion.test.ts
  ✓ fetches discussions for each MR
  ✓ creates discussion rows with correct MR FK
  ✓ creates note rows linked to discussions
  ✓ extracts position_new_path from DiffNotes
  ✓ captures note-level resolution status
  ✓ captures note type (DiscussionNote, DiffNote)

Manual CLI Smoke Tests:

| Command | Expected Output | Pass Criteria |
| --- | --- | --- |
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | Merge Requests: 567 (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | Discussions: 12,345 | Total count (issue + MR) |
| `gi count discussions --type=mr` | MR Discussions: 6,789 | MR discussions only |
| `gi count notes` | Notes: 45,678 (excluding 8,901 system) | Total with system note count |

Data Integrity Checks:

  • SELECT COUNT(*) FROM merge_requests matches GitLab MR count
  • SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest' is non-zero
  • DiffNotes have position_new_path populated when available
  • Discussion first_note_at <= last_note_at for all rows

Scope:

  • MR fetcher with pagination
  • MR discussions fetcher:
    • Uses GET /projects/:id/merge_requests/:iid/discussions
    • Fetches all discussions for each MR during ingest
    • Capture DiffNote file path/line metadata from position field for filename search
  • Relationship linking (discussion → MR, notes → discussion)
  • Extended CLI commands for MR display with threads
  • Add idx_notes_type and idx_notes_new_path indexes for DiffNote queries

Note: MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.
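The per-MR discussion fetch described above can be sketched as follows (TypeScript, matching the original planning language; `discussionsUrl` and `fetchAllDiscussions` are hypothetical helper names, and the `PRIVATE-TOKEN` header assumes personal-access-token auth):

```typescript
// Build the paginated discussions URL for one MR.
// GitLab paginates this endpoint; per_page=100 minimizes round trips.
function discussionsUrl(base: string, projectId: number, mrIid: number, page: number): string {
  const u = new URL(`${base}/api/v4/projects/${projectId}/merge_requests/${mrIid}/discussions`);
  u.searchParams.set("per_page", "100");
  u.searchParams.set("page", String(page));
  return u.toString();
}

// Fetch every page until GitLab returns an empty array.
async function fetchAllDiscussions(
  base: string,
  token: string,
  projectId: number,
  mrIid: number
): Promise<unknown[]> {
  const all: unknown[] = [];
  for (let page = 1; ; page++) {
    const res = await fetch(discussionsUrl(base, projectId, mrIid, page), {
      headers: { "PRIVATE-TOKEN": token },
    });
    if (!res.ok) throw new Error(`GitLab API ${res.status} for MR !${mrIid}`);
    const batch = (await res.json()) as unknown[];
    if (batch.length === 0) break;
    all.push(...batch);
  }
  return all;
}
```

A real fetcher would also respect rate limits and record failures into a retry queue, but the page-until-empty loop is the core shape.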

Schema Additions:

CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  last_seen_at INTEGER NOT NULL,    -- updated on every upsert during sync
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);

-- MR labels (reuse same labels table from CP1)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);

-- Additional indexes for DiffNote queries (tables created in CP1)
CREATE INDEX idx_notes_type ON notes(type);
CREATE INDEX idx_notes_new_path ON notes(position_new_path);

-- Migration: Add FK constraint to discussions table (was deferred from CP1)
-- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK
-- This is handled by the migration system; pseudocode for clarity:
-- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id)
-- 2. INSERT INTO discussions_new SELECT * FROM discussions
-- 3. DROP TABLE discussions
-- 4. ALTER TABLE discussions_new RENAME TO discussions
-- 5. Recreate indexes

MR Discussion Processing Rules:

  • DiffNote position data is extracted and stored:
    • position.old_path, position.new_path for file-level search
    • position.old_line, position.new_line for line-level context
  • MR discussions can be resolvable; resolution status is captured at note level

Checkpoint 3A: Document Generation + Lexical Search

Deliverable: Documents generated + FTS5 index; gi search --mode=lexical works end-to-end (no Ollama required)

Automated Tests (Vitest):

tests/unit/document-extractor.test.ts
  ✓ extracts issue document (title + description)
  ✓ extracts MR document (title + description)
  ✓ extracts discussion document with full thread context
  ✓ includes parent issue/MR title in discussion header
  ✓ formats notes with author and timestamp
  ✓ excludes system notes from discussion documents by default
  ✓ includes system notes only when --include-system-notes enabled (debug)
  ✓ truncates content exceeding 8000 tokens at note boundaries
  ✓ preserves first and last notes when truncating middle
  ✓ computes SHA-256 content hash consistently

tests/integration/document-creation.test.ts
  ✓ creates document for each issue
  ✓ creates document for each MR
  ✓ creates document for each discussion
  ✓ populates document_labels junction table
  ✓ computes content_hash for each document
  ✓ excludes system notes from discussion content

tests/integration/fts-index.test.ts
  ✓ documents_fts row count matches documents
  ✓ FTS triggers fire on insert/update/delete
  ✓ updates propagate via triggers

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ porter stemming works (search/searching)
  ✓ returns empty for non-matching query

Manual CLI Smoke Tests:

| Command | Expected Output | Pass Criteria |
| --- | --- | --- |
| `gi generate-docs` | Progress bar, final count | Completes without error |
| `gi generate-docs` (re-run) | 0 documents to regenerate | Skips unchanged docs |
| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama |
| `gi stats` | Document count stats | Shows document coverage |

Data Integrity Checks:

  • SELECT COUNT(*) FROM documents = issues + MRs + discussions
  • SELECT COUNT(*) FROM documents_fts = SELECT COUNT(*) FROM documents (via FTS triggers)
  • SELECT COUNT(*) FROM documents WHERE is_truncated = 1 matches the number of truncation warnings logged during generation
  • Discussion documents include parent title in content_text
  • Discussion documents exclude system notes

Scope:

  • Document extraction layer:
    • Canonical "search documents" derived from issues/MRs/discussions
    • Stable content hashing for change detection (SHA-256 of content_text)
    • Truncation: content_text capped at 8000 tokens at NOTE boundaries
      • Implementation: Use character budget, not exact token count
      • maxChars = 32000 (conservative 4 chars/token estimate)
      • Drop whole notes from middle, never cut mid-note
      • approxTokens = ceil(charCount / 4) for reporting/logging only
  • System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search)
  • Denormalized metadata for fast filtering (author, labels, dates)
  • Fast label filtering via document_labels join table
  • FTS5 index for lexical search
  • gi search --mode=lexical CLI command (works without Ollama)

This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
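The stable content hashing used for change detection can be sketched with Node's built-in crypto module (the `contentHash` helper name is illustrative):

```typescript
import { createHash } from "node:crypto";

// SHA-256 over the canonical content_text. Stored in documents.content_hash
// and later compared against embedding_metadata.content_hash to decide
// whether a document needs re-embedding.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```

Because the hash is computed over the final (post-truncation) content_text, any change to truncation rules also triggers regeneration, which is the desired behavior.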

Schema Additions (CP3A):

-- Unified searchable documents (derived from issues/MRs/discussions)
-- Note: embedding storage that references these documents is defined in CP3B
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL,    -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT,          -- for discussions: first note author
  label_names TEXT,              -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT,                    -- null for discussions
  content_text TEXT NOT NULL,    -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL,    -- SHA-256 for change detection
  is_truncated BOOLEAN NOT NULL DEFAULT 0,
  truncated_reason TEXT,         -- 'token_limit_middle_drop' | null
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- Fast path filtering for documents (extracted from DiffNote positions)
CREATE TABLE document_paths (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  path TEXT NOT NULL,
  PRIMARY KEY(document_id, path)
);
CREATE INDEX idx_document_paths_path ON document_paths(path);

-- Track sources that require document regeneration (populated during ingestion)
CREATE TABLE dirty_sources (
  source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL,    -- local DB id
  queued_at INTEGER NOT NULL,
  PRIMARY KEY(source_type, source_id)
);

-- Resumable dependent fetches (discussions are per-parent resources)
CREATE TABLE pending_discussion_fetches (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  noteable_type TEXT NOT NULL,            -- 'Issue' | 'MergeRequest'
  noteable_iid INTEGER NOT NULL,          -- parent iid (stable human identifier)
  queued_at INTEGER NOT NULL,
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER,
  last_error TEXT,
  PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
  ON pending_discussion_fetches(attempt_count, last_attempt_at)
  WHERE last_error IS NOT NULL;

-- Full-text search for lexical retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

FTS5 Tokenizer Notes:

  • porter enables stemming (searching "authentication" matches "authenticating", "authenticated")
  • unicode61 handles Unicode properly
  • Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer

Checkpoint 3B: Embeddings + Semantic Search

Deliverable: Embeddings generated + gi search --mode=semantic works; graceful fallback if Ollama unavailable

Automated Tests (Vitest):

tests/unit/embedding-client.test.ts
  ✓ connects to Ollama API
  ✓ generates embedding for text input
  ✓ returns 768-dimension vector
  ✓ handles Ollama connection failure gracefully
  ✓ batches requests (32 documents per batch)

tests/integration/embedding-storage.test.ts
  ✓ stores embedding in sqlite-vec
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
  ✓ re-embeds when content_hash changes

Manual CLI Smoke Tests:

| Command | Expected Output | Pass Criteria |
| --- | --- | --- |
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | 0 documents to embed | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
| `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents |

Data Integrity Checks:

  • SELECT COUNT(*) FROM embeddings = SELECT COUNT(*) FROM documents
  • SELECT COUNT(*) FROM embedding_metadata = SELECT COUNT(*) FROM documents
  • All embedding_metadata.content_hash matches corresponding documents.content_hash

Scope:

  • Ollama integration (nomic-embed-text model)
  • Embedding generation pipeline:
    • Batch size: 32 documents per batch
    • Concurrency: configurable (default 4 workers)
    • Retry with exponential backoff for transient failures (max 3 attempts)
    • Per-document failure recording to enable targeted re-runs
  • Vector storage in SQLite (sqlite-vec extension)
  • Progress tracking and resumability
  • gi search --mode=semantic CLI command

Ollama API Contract:

// POST http://localhost:11434/api/embed (batch endpoint - preferred)
interface OllamaEmbedRequest {
  model: string;      // "nomic-embed-text"
  input: string[];    // array of texts to embed (up to 32)
}

interface OllamaEmbedResponse {
  model: string;
  embeddings: number[][];  // array of 768-dim vectors
}

// POST http://localhost:11434/api/embeddings (single text - fallback)
interface OllamaEmbeddingsRequest {
  model: string;
  prompt: string;
}

interface OllamaEmbeddingsResponse {
  embedding: number[];
}

Usage:

  • Use /api/embed for batching (up to 32 documents per request)
  • Fall back to /api/embeddings for single documents or if batch fails
  • Check Ollama availability with GET http://localhost:11434/api/tags
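Under the contract above, the batch request builder and response validation might look like this (hypothetical helper names; the validator enforces the one-768-dim-vector-per-input invariant before anything is stored):

```typescript
// Build the body for POST /api/embed; batches are capped at 32 documents.
function buildEmbedRequest(texts: string[]): { model: string; input: string[] } {
  if (texts.length === 0 || texts.length > 32) {
    throw new Error(`batch size must be 1-32, got ${texts.length}`);
  }
  return { model: "nomic-embed-text", input: texts };
}

// Validate a response before storage: one 768-dim vector per input text.
function validateEmbedResponse(
  resp: { embeddings: number[][] },
  expectedCount: number
): number[][] {
  if (resp.embeddings.length !== expectedCount) {
    throw new Error(`expected ${expectedCount} embeddings, got ${resp.embeddings.length}`);
  }
  for (const v of resp.embeddings) {
    if (v.length !== 768) throw new Error(`expected 768 dims, got ${v.length}`);
  }
  return resp.embeddings;
}
```

Rejecting malformed responses early keeps the embeddings table and embedding_metadata from drifting out of sync.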

Schema Additions (CP3B):

-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
  embedding float[768]
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL,           -- 'nomic-embed-text'
  dims INTEGER NOT NULL,         -- 768
  content_hash TEXT NOT NULL,    -- copied from documents.content_hash
  created_at INTEGER NOT NULL,
  -- Error tracking for resumable embedding
  last_error TEXT,               -- error message from last failed attempt
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER        -- when last attempt occurred
);

-- Index for finding failed embeddings to retry
CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;

Storage Rule (MVP):

  • Insert embedding with rowid = documents.id
  • Upsert embedding_metadata by document_id
  • This alignment simplifies joins and eliminates rowid mapping fragility

Document Extraction Rules:

| Source | content_text Construction |
| --- | --- |
| Issue | title + "\n\n" + description |
| MR | title + "\n\n" + description |
| Discussion | Full thread with context (see below) |

Discussion Document Format:

[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"]     -- present if any DiffNotes exist in thread

--- Thread ---

@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...

System Notes Exclusion Rule:

  • System notes (is_system=1) are stored in the DB for audit purposes
  • System notes are EXCLUDED from discussion documents by default
  • This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings
  • Debug flag --include-system-notes available for troubleshooting

This format preserves:

  • Parent context (issue/MR title and number)
  • Project path for scoped search
  • Direct URL for navigation
  • Labels for context
  • File paths from DiffNotes (enables immediate file search)
  • Author attribution for each note
  • Temporal ordering of the conversation
  • Full thread semantics for decision traceability

Truncation (Note-Boundary Aware): If content exceeds 8000 tokens (~32000 chars):

Algorithm:

  1. Count non-system notes in the discussion
  2. If total chars ≤ maxChars, no truncation needed
  3. Otherwise, drop whole notes from the MIDDLE:
    • Preserve first N notes and last M notes
    • Never cut mid-note (produces unreadable snippets and worse embeddings)
    • Continue dropping middle notes until under maxChars
  4. Insert marker: \n\n[... N notes omitted for length ...]\n\n
  5. Set documents.is_truncated = 1
  6. Set documents.truncated_reason = 'token_limit_middle_drop'
  7. Log a warning with document ID and original/truncated token count

Edge Cases:

  • Single note > 32000 chars: Truncate at character boundary, append [truncated], set truncated_reason = 'single_note_oversized'
  • First + last note > 32000 chars: Keep only first note (truncated if needed), set truncated_reason = 'first_last_oversized'
  • Only one note in discussion: If it exceeds limit, truncate at char boundary with [truncated]

Why note-boundary truncation:

  • Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
  • Keeping whole notes preserves semantic coherence for embeddings
  • First notes contain context/problem statement; last notes contain conclusions
  • Middle notes are often back-and-forth that's less critical

Token estimation: approxTokens = ceil(charCount / 4). No tokenizer dependency.
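The middle-drop algorithm can be sketched as a pure function over rendered note strings (helper names are illustrative; the single-oversized-note edge cases above are omitted for brevity):

```typescript
const MAX_CHARS = 32000; // ~8000 tokens at 4 chars/token

// Token estimation used for reporting/logging only.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

// Drop whole notes from the MIDDLE until the thread fits the character
// budget, preserving early context and late conclusions. Never cuts mid-note.
function truncateNotes(
  notes: string[],
  maxChars: number = MAX_CHARS
): { kept: string[]; omitted: number } {
  const total = notes.reduce((n, note) => n + note.length, 0);
  if (total <= maxChars) return { kept: notes, omitted: 0 };

  // Grow a prefix (head) and suffix (tail) of whole notes, alternating ends.
  let head = 0, tail = 0, used = 0;
  while (head + tail < notes.length) {
    const next = head <= tail ? notes[head] : notes[notes.length - 1 - tail];
    if (used + next.length > maxChars) break;
    used += next.length;
    if (head <= tail) head++; else tail++;
  }
  const omitted = notes.length - head - tail;
  if (omitted <= 0) return { kept: notes, omitted: 0 };
  const marker = `\n\n[... ${omitted} notes omitted for length ...]\n\n`;
  return { kept: [...notes.slice(0, head), marker, ...notes.slice(notes.length - tail)], omitted };
}
```

The caller would then set is_truncated and truncated_reason on the document whenever omitted is greater than zero.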

This metadata enables:

  • Monitoring truncation frequency in production
  • Future investigation of high-value truncated documents
  • Debugging when search misses expected content

Checkpoint 4: Hybrid Search (Semantic + Lexical)

Deliverable: Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content

Automated Tests (Vitest):

tests/unit/search-query.test.ts
  ✓ parses filter flags (--type, --author, --after, --label)
  ✓ validates date format for --after
  ✓ handles multiple --label flags

tests/unit/rrf-ranking.test.ts
  ✓ computes RRF score correctly
  ✓ merges results from vector and FTS retrievers
  ✓ handles documents appearing in only one retriever
  ✓ respects k=60 parameter

tests/integration/vector-search.test.ts
  ✓ returns results for semantic query
  ✓ ranks similar content higher
  ✓ returns empty for nonsense query

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ handles porter stemming (search/searching)
  ✓ returns empty for non-matching query

tests/integration/hybrid-search.test.ts
  ✓ combines vector and FTS results
  ✓ applies type filter correctly
  ✓ applies author filter correctly
  ✓ applies date filter correctly
  ✓ applies label filter correctly
  ✓ falls back to FTS when Ollama unavailable

tests/e2e/golden-queries.test.ts
  ✓ "authentication redesign" returns known auth-related items
  ✓ "database migration" returns known migration items
  ✓ [8 more domain-specific golden queries]

Manual CLI Smoke Tests:

| Command | Expected Output | Pass Criteria |
| --- | --- | --- |
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings |
| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema |
| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" |
| `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |

Golden Query Test Suite: Create tests/fixtures/golden-queries.json with 10 queries and expected URLs:

[
  {
    "query": "authentication redesign",
    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
    "minResults": 1,
    "maxRank": 10
  }
]

Each query must have at least one expected URL appear in top 10 results.
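The per-entry assertion can be expressed as a small pure check (hypothetical helper; field names follow the fixture above):

```typescript
// One entry from tests/fixtures/golden-queries.json.
interface GoldenQuery {
  query: string;
  expectedUrls: string[];
  minResults: number;
  maxRank: number;
}

// A golden query passes when enough results come back AND at least one
// expected URL appears within the top maxRank results.
function goldenQueryPasses(golden: GoldenQuery, resultUrls: string[]): boolean {
  if (resultUrls.length < golden.minResults) return false;
  const topN = resultUrls.slice(0, golden.maxRank);
  return golden.expectedUrls.some((expected) => topN.includes(expected));
}
```

The Vitest suite would load the fixture, run each query through the search pipeline, and assert goldenQueryPasses for every entry.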

Data Integrity Checks:

  • documents_fts row count matches documents row count
  • Search returns results for known content (not empty)
  • JSON output validates against defined schema
  • All result URLs are valid GitLab URLs

Scope:

  • Hybrid retrieval:
    • Vector recall (sqlite-vec) + FTS lexical recall (fts5)
    • Merge + rerank results using Reciprocal Rank Fusion (RRF)
  • Query embedding generation (same Ollama pipeline as documents)
  • Result ranking and scoring (document-level)
  • Search filters: --type=issue|mr|discussion, --author=username, --after=date, --label=name, --project=path, --path=file, --limit=N
    • --limit=N controls result count (default: 20, max: 100)
    • --path filters documents by referenced file paths (from DiffNote positions):
      • If --path ends with /: prefix match (path LIKE 'src/auth/%')
      • Otherwise: exact match OR prefix on directory boundary
      • Examples: --path=src/auth/ matches src/auth/login.ts, src/auth/utils/helpers.ts
      • Examples: --path=src/auth/login.ts matches only that exact file
      • Glob patterns deferred to post-MVP
    • Label filtering operates on document_labels (indexed, exact-match)
    • Filters work identically in hybrid and lexical modes
  • Debug: --explain returns rank contributions from vector + FTS + RRF
  • Output formatting: ranked list with title, snippet, score, URL
  • JSON output mode for AI/agent consumption (stable documented schema)
  • Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
  • Empty state handling:
    • No documents indexed: No data indexed. Run 'gi sync' first.
    • Query returns no results: No results found for "query".
    • Filters exclude all results: No results match the specified filters.
    • Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
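The --path matching rules above can be captured in one small predicate (illustrative sketch):

```typescript
// --path semantics:
//   trailing "/"  -> prefix match on the directory
//   otherwise     -> exact file match OR prefix on a directory boundary
function pathMatches(filter: string, candidate: string): boolean {
  if (filter.endsWith("/")) return candidate.startsWith(filter);
  return candidate === filter || candidate.startsWith(filter + "/");
}
```

Appending "/" before the prefix test is what enforces the directory boundary, so src/auth never matches src/authx/file.ts.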

Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:

  1. Determine recall size (adaptive based on filters):
    • baseTopK = 50
    • If any filters present (--project, --type, --author, --label, --path, --after): topK = 200
    • This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
  2. Query both vector index (top topK) and FTS5 (top topK)
    • Vector recall via sqlite-vec + FTS lexical recall via fts5
    • Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
  3. Merge results by document_id
  4. Combine with Reciprocal Rank Fusion (RRF):
    • For each retriever list, assign ranks (1..N)
    • rrfScore = Σ 1 / (k + rank) with k=60 (tunable)
    • RRF is simpler than weighted sums and doesn't require score normalization
  5. Apply remaining filters (date ranges, labels, paths that weren't applied in SQL)
  6. Return top K results
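Step 4's rank fusion can be sketched as a pure function over the two retriever lists (illustrative; assumes each list is already ordered best-first):

```typescript
// Reciprocal Rank Fusion over two ranked lists of document ids, k=60.
// Documents appearing in both lists accumulate score from each.
function rrfFuse(
  vectorIds: number[],
  ftsIds: number[],
  k: number = 60
): { id: number; rrfScore: number }[] {
  const scores = new Map<number, number>();
  for (const list of [vectorIds, ftsIds]) {
    list.forEach((id, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, rrfScore]) => ({ id, rrfScore }))
    .sort((a, b) => b.rrfScore - a.rrfScore);
}
```

Because only ranks feed the formula, BM25 scores and vector distances never need to be normalized against each other.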

Why Adaptive Recall:

  • Fixed top-50 + filter can easily return 0 results even when relevant docs exist
  • Increasing recall when filters are present catches more candidates before filtering
  • SQL-level filtering is preferred (faster, uses indexes) but not always possible

Why RRF over Weighted Sums:

  • FTS5 BM25 scores and vector distances use different scales
  • Weighted sums (0.7 * vector + 0.3 * fts) require careful normalization
  • RRF operates on ranks, not scores, making it robust to scale differences
  • Well-established in information retrieval literature

Graceful Degradation:

  • If Ollama is unreachable during search, automatically fall back to FTS5-only search
  • Display warning: "Embedding service unavailable, using lexical search only"
  • embed command fails with actionable error if Ollama is down

CLI Interface:

# Basic semantic search
gi search "authentication redesign"

# Search within specific project
gi search "authentication" --project=group/project-one

# Search by file path (finds discussions/MRs touching this file)
gi search "rate limit" --path=src/client.ts

# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical

# Filtered search
gi search "authentication" --type=mr --after=2024-01-01

# Filter by label
gi search "performance" --label=bug --label=critical

# JSON output for programmatic use
gi search "payment processing" --json

# Explain search (shows RRF contributions)
gi search "auth" --explain

CLI Output Example:

$ gi search "authentication"

Found 23 results (hybrid search, 0.34s)

[1] MR !847 - Refactor auth to use JWT tokens (0.82)
    @johndoe · 2024-03-15 · group/project-one
    "...moving away from session cookies to JWT for authentication..."
    https://gitlab.example.com/group/project-one/-/merge_requests/847

[2] Issue #234 - Authentication redesign discussion (0.79)
    @janedoe · 2024-02-28 · group/project-one
    "...we need to redesign the authentication flow because..."
    https://gitlab.example.com/group/project-one/-/issues/234

[3] Discussion on Issue #234 (0.76)
    @johndoe · 2024-03-01 · group/project-one
    "I think we should move to JWT-based auth because the session..."
    https://gitlab.example.com/group/project-one/-/issues/234#note_12345

JSON Output Schema (Stable):

For AI/agent consumption, --json output follows this stable schema:

interface SearchResult {
  documentId: number;
  sourceType: "issue" | "merge_request" | "discussion";
  title: string | null;
  url: string;
  projectPath: string;
  author: string | null;
  createdAt: string;       // ISO 8601
  updatedAt: string;       // ISO 8601
  score: number;           // normalized 0-1 (rrfScore / maxRrfScore in this result set)
  snippet: string;         // truncated content_text
  labels: string[];
  // Only present with --explain flag
  explain?: {
    vectorRank?: number;   // null if not in vector results
    ftsRank?: number;      // null if not in FTS results
    rrfScore: number;      // raw RRF score (rank-based, comparable within a query)
  };
}

// Note on score normalization:
// - `score` is normalized 0-1 for UI display convenience
// - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set)
// - Use `explain.rrfScore` for raw scores when comparing across queries
// - Scores are NOT comparable across different queries

interface SearchResponse {
  query: string;
  mode: "hybrid" | "lexical" | "semantic";
  totalResults: number;
  results: SearchResult[];
  warnings?: string[];     // e.g., "Embedding service unavailable"
}

Schema versioning: Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed.


Checkpoint 5: Incremental Sync

Deliverable: Efficient ongoing synchronization with GitLab

Automated Tests (Vitest):

tests/unit/cursor-management.test.ts
  ✓ advances cursor after successful page commit
  ✓ uses tie-breaker id for identical timestamps
  ✓ does not advance cursor on failure
  ✓ resets cursor on --full flag

tests/unit/change-detection.test.ts
  ✓ detects content_hash mismatch
  ✓ queues document for re-embedding on change
  ✓ skips re-embedding when hash unchanged

tests/integration/incremental-sync.test.ts
  ✓ fetches only items updated after cursor
  ✓ refetches discussions for updated issues
  ✓ refetches discussions for updated MRs
  ✓ updates existing records (not duplicates)
  ✓ creates new records for new items
  ✓ re-embeds documents with changed content_hash

tests/integration/sync-recovery.test.ts
  ✓ resumes from cursor after interrupted sync
  ✓ marks failed run with error message
  ✓ handles rate limiting (429) with backoff
  ✓ respects Retry-After header

Manual CLI Smoke Tests:

| Command | Expected Output | Pass Criteria |
| --- | --- | --- |
| `gi sync` (no changes) | 0 issues, 0 MRs updated | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | 1 issue updated, 3 discussions refetched | Detects and syncs the change |
| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Last sync time, cursor positions | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |

End-to-End Sync Verification:

  1. Note the current sync_cursors values
  2. Create a new comment on an issue in GitLab
  3. Run gi sync
  4. Verify:
    • Issue's updated_at in DB matches GitLab
    • New discussion row exists
    • New note row exists
    • New document row exists for discussion
    • New embedding exists for document
    • gi search "new comment text" returns the new discussion
    • Cursor advanced past the updated issue

Data Integrity Checks:

  • sync_cursors timestamp <= max updated_at in corresponding table
  • No orphaned documents (all have valid source_id)
  • embedding_metadata.content_hash = documents.content_hash for all rows
  • sync_runs has complete audit trail

Scope:

  • Delta sync based on stable tuple cursor (updated_at, gitlab_id)
  • Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
  • Dependent resources sync strategy (discussions refetched when parent updates)
  • Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
  • Sync status reporting
  • Recommended: run via cron every 10 minutes

Correctness Rules (MVP):

  1. Fetch pages ordered by updated_at ASC, within identical timestamps by gitlab_id ASC
  2. Cursor is a stable tuple (updated_at, gitlab_id):
    • GitLab API cannot express (updated_at = X AND id > Y) server-side.
    • Use cursor rewind + local filtering:
      • Call GitLab with updated_after = cursor_updated_at - rewindSeconds (default 2s, configurable)
      • Locally discard items where:
        • updated_at < cursor_updated_at, OR
        • updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id
      • This makes the tuple cursor rule true in practice while keeping API calls simple.
    • Cursor advances only after successful DB commit for that page
    • When advancing, set cursor to the last processed item's (updated_at, gitlab_id)
  3. Dependent resources:
    • For each updated issue/MR, refetch ALL its discussions
    • Discussion documents are regenerated and re-embedded if content_hash changes
  4. Rolling backfill window:
    • After cursor-based delta sync, also fetch items where updated_at > NOW() - backfillDays
    • This catches any items whose timestamps were updated without triggering our cursor
  5. A document is queued for embedding iff documents.content_hash != embedding_metadata.content_hash
  6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
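The rewind-plus-filter rule (rule 2) can be sketched as a pure function. The Item and Cursor shapes here are illustrative, and ISO-8601 UTC timestamps compare correctly as plain strings:

```typescript
// Illustrative shapes; the real rows carry more fields.
interface Item { gitlabId: number; updatedAt: string }   // ISO-8601 UTC
interface Cursor { updatedAt: string; gitlabId: number }

// After a rewound updated_after fetch, discard everything at or before the
// tuple cursor: keep items strictly after (cursor.updatedAt, cursor.gitlabId).
function filterAfterCursor(items: Item[], cursor: Cursor): Item[] {
  return items.filter(
    (it) =>
      it.updatedAt > cursor.updatedAt ||
      (it.updatedAt === cursor.updatedAt && it.gitlabId > cursor.gitlabId),
  );
}

// After a page commits, the cursor advances to the last processed item.
function advanceCursor(processed: Item[]): Cursor | null {
  const last = processed[processed.length - 1];
  return last ? { updatedAt: last.updatedAt, gitlabId: last.gitlabId } : null;
}
```

Items must already be ordered by (updated_at ASC, gitlab_id ASC) for advanceCursor to be correct, matching rule 1.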

Why Dependent Resource Model:

  • GitLab Discussions API doesn't provide a global updated_after stream
  • Discussions are listed per-issue or per-MR, not as a top-level resource
  • Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct

CLI Commands:

# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
gi sync                    # orchestrates all steps
gi sync --no-embed         # skip embedding step (fast ingest/debug)
gi sync --no-docs          # skip document regeneration (debug)

# Force full re-sync (resets cursors)
gi sync --full

# Override stale 'running' run after operator review
gi sync --force

# Show sync status
gi sync-status

Orchestration steps (in order):

  1. Acquire app lock with heartbeat
  2. Ingest delta (issues, MRs) based on cursors
    • For each upserted issue/MR, enqueue into pending_discussion_fetches
    • INSERT into dirty_sources for each upserted issue/MR
  3. Process pending_discussion_fetches queue (bounded per run, retryable):
    • Fetch discussions for each queued parent
    • On success: upsert discussions/notes, INSERT into dirty_sources, DELETE from queue
    • On failure: increment attempt_count, record last_error, leave in queue for retry
    • Bound processing: max N parents per sync run to avoid unbounded API calls
  4. Apply rolling backfill window
  5. Regenerate documents for entities in dirty_sources (process + delete from queue)
  6. Embed documents with changed content_hash
  7. FTS triggers auto-sync (no explicit step needed)
  8. Release lock, record sync_run as succeeded

Why queue-based discussion fetching:

  • One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
  • Primary resource cursors can advance independently
  • Discussions can be retried without re-fetching all issues/MRs
  • Bounded processing prevents unbounded API calls per sync run
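A minimal sketch of one bounded, retryable queue pass, using a hypothetical fetchDiscussions callback and an in-memory queue in place of the real pending_discussion_fetches table:

```typescript
// Illustrative only: the PendingFetch shape and fetchDiscussions callback are
// assumptions, not the tool's real API.
interface PendingFetch {
  parentType: "issue" | "mr";
  parentId: number;
  attemptCount: number;
  lastError?: string;
}

async function processQueue(
  queue: PendingFetch[],
  maxPerRun: number,
  fetchDiscussions: (p: PendingFetch) => Promise<void>,
): Promise<PendingFetch[]> {
  // Anything beyond this run's bound stays queued untouched.
  const remaining: PendingFetch[] = [...queue.slice(maxPerRun)];
  for (const item of queue.slice(0, maxPerRun)) {
    try {
      await fetchDiscussions(item); // success: upsert + dirty_sources + dequeue
    } catch (err) {
      item.attemptCount += 1;       // failure: record error, keep for retry
      item.lastError = String(err);
      remaining.push(item);
    }
  }
  return remaining;                 // items still pending after this run
}
```

A failing parent stays in the queue with its attempt count incremented, while the rest of the run proceeds.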

Individual commands remain available for checkpoint testing and debugging:

  • gi ingest --type=issues
  • gi ingest --type=merge_requests
  • gi embed --all
  • gi embed --retry-failed

CLI Command Reference

All commands support --help for detailed usage information.

Setup & Diagnostics

Command CP Description
gi init 0 Interactive setup wizard; creates gi.config.json
gi auth-test 0 Verify GitLab authentication
gi doctor 0 Check environment (GitLab, Ollama, DB)
gi doctor --json 0 JSON output for scripting
gi version 0 Show installed version

Data Ingestion

Command CP Description
gi ingest --type=issues 1 Fetch issues from GitLab
gi ingest --type=merge_requests 2 Fetch MRs and discussions
gi generate-docs 3A Extract documents from issues/MRs/discussions
gi embed --all 3B Generate embeddings for all documents
gi embed --retry-failed 3 Retry failed embeddings
gi sync 5 Full sync orchestration (ingest + docs + embed)
gi sync --full 5 Force complete re-sync (reset cursors)
gi sync --force 5 Override stale lock after operator review
gi sync --no-embed 5 Sync without embedding (faster)

Data Inspection

Command CP Description
gi list issues [--limit=N] [--project=PATH] 1 List issues
gi list mrs --limit=N 2 List merge requests
gi count issues 1 Count issues
gi count mrs 2 Count merge requests
gi count discussions --type=issue 1 Count issue discussions
gi count discussions 2 Count all discussions
gi count discussions --type=mr 2 Count MR discussions
gi count notes --type=issue 1 Count issue notes (excluding system)
gi count notes 2 Count all notes (excluding system)
gi show issue <iid> [--project=PATH] 1 Show issue details (prompts if iid ambiguous across projects)
gi show mr <iid> [--project=PATH] 2 Show MR details with discussions
gi stats 3 Embedding coverage statistics
gi stats --json 3 JSON stats for scripting
gi sync-status 1 Show cursor positions and last sync

Search

Command CP Description
gi search "query" 4 Hybrid semantic + lexical search
gi search "query" --mode=lexical 3 Lexical-only search (no Ollama required)
gi search "query" --type=issue|mr|discussion 4 Filter by document type
gi search "query" --author=USERNAME 4 Filter by author
gi search "query" --after=YYYY-MM-DD 4 Filter by date
gi search "query" --label=NAME 4 Filter by label (repeatable)
gi search "query" --project=PATH 4 Filter by project
gi search "query" --path=FILE 4 Filter by file path
gi search "query" --limit=N 4 Limit results (default: 20, max: 100)
gi search "query" --json 4 JSON output for scripting
gi search "query" --explain 4 Show ranking breakdown

Database Management

Command CP Description
gi backup 0 Create timestamped database backup
gi reset --confirm 0 Delete database and reset cursors

Error Handling

Common errors and their resolutions:

Configuration Errors

Error Cause Resolution
Config file not found No gi.config.json Run gi init to create configuration
Invalid config: missing baseUrl Malformed config Re-run gi init or fix gi.config.json manually
Invalid config: no projects defined Empty projects array Add at least one project path to config

Authentication Errors

Error Cause Resolution
GITLAB_TOKEN environment variable not set Token not exported export GITLAB_TOKEN="glpat-xxx"
401 Unauthorized Invalid or expired token Generate new token with read_api scope
403 Forbidden Token lacks permissions Ensure token has read_api scope

GitLab API Errors

Error Cause Resolution
Project not found: group/project Invalid project path Verify path matches GitLab URL (case-sensitive)
429 Too Many Requests Rate limited Wait for Retry-After period; sync will auto-retry
Connection refused GitLab unreachable Check GitLab URL and network connectivity

Data Errors

Error Cause Resolution
No documents indexed Sync not run Run gi sync first
No results found Query too specific Try broader search terms
Database locked Concurrent access Wait for other process; use gi sync --force if stale

Embedding Errors

Error Cause Resolution
Ollama connection refused Ollama not running Start Ollama or use --mode=lexical
Model not found: nomic-embed-text Model not pulled Run ollama pull nomic-embed-text
Embedding failed for N documents Transient failures Run gi embed --retry-failed

Operational Behavior

Scenario Behavior
Ctrl+C during sync Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with gi sync.
Disk full during write Fails with clear error. Cursor preserved at last successful commit. Free space and resume.
Stale lock detected Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers.
Network interruption Retries with exponential backoff. After max retries, sync fails but cursor is preserved.
Embedding permanent failure After 3 retries, document stays in embedding_metadata with last_error populated. Use gi embed --retry-failed to retry later, or gi stats to see failed count. Documents with failed embeddings are excluded from vector search but included in FTS.
Orphaned records MVP: No automatic cleanup. last_seen_at field enables future detection of items deleted in GitLab. Post-MVP: gi gc --dry-run to identify orphans, gi gc --confirm to remove.
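The Ctrl+C behavior above amounts to checking a stop flag only at page boundaries. A sketch, with nextPage and commit as illustrative stand-ins for the real sync loop (commit is assumed to write both data and cursor atomically):

```typescript
// Sketch of graceful shutdown: the current page always finishes and commits;
// the stop flag is honored only between pages.
function makeSyncLoop(opts: {
  nextPage: () => Promise<string[] | null>;   // null = no more pages
  commit: (page: string[]) => Promise<void>;  // data + cursor in one commit
}) {
  let stop = false;
  const requestStop = () => { stop = true; }; // wire to process.once("SIGINT", ...)
  const run = async (): Promise<number> => {
    let pages = 0;
    for (;;) {
      const page = await opts.nextPage();
      if (page === null) break;
      await opts.commit(page);                // current page always completes
      pages++;
      if (stop) break;                        // exit only at a page boundary
    }
    return pages;
  };
  return { run, requestStop };
}
```

Because the cursor is committed with each page, resuming is just running gi sync again.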

Database Management

Database Location

The SQLite database is stored at an XDG-compliant location:

~/.local/share/gi/data.db

This can be overridden in gi.config.json:

{
  "storage": {
    "dbPath": "/custom/path/to/data.db"
  }
}

Backup

Create a timestamped backup of the database:

gi backup
# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db

Backups are SQLite .backup command copies (safe even during active writes due to WAL mode).

Reset

To completely reset the database and all sync cursors:

gi reset --confirm

This deletes:

  • The database file
  • All sync cursors
  • All embeddings

You'll need to run gi sync again to repopulate.

Schema Migrations

Database schema is version-tracked and migrations auto-apply on startup:

  1. On first run, schema is created at latest version
  2. On subsequent runs, pending migrations are applied automatically
  3. Migration version is stored in schema_version table
  4. Migrations are idempotent and reversible where possible

Manual migration check:

gi doctor --json | jq '.checks.database'
# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }

Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:

Feature Description Depends On
File History Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing MVP complete
Personal Dashboard Filter by assigned/mentioned, integrate with gitlab-inbox tool MVP complete
Person Context Aggregate contributions by author, expertise inference MVP complete
Decision Graph LLM-assisted decision extraction, relationship visualization MVP + LLM integration
MCP Server Expose search as MCP tool for Claude Code integration Checkpoint 4
Custom Tokenizer Better handling of code identifiers (snake_case, paths) Checkpoint 4

Checkpoint 6 (File History) Schema Preview:

-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
  old_path TEXT,
  new_path TEXT,
  old_line INTEGER,
  new_line INTEGER,
  position_type TEXT                          -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);

Verification Strategy

Each checkpoint includes:

  1. Automated tests - Unit tests for data transformations, integration tests for API calls
  2. CLI smoke tests - Manual commands with expected outputs documented
  3. Data integrity checks - Count verification against GitLab, schema validation
  4. Search quality tests - Known queries with expected results (for Checkpoint 4+)

Risk Mitigation

Risk Mitigation
GitLab rate limiting Exponential backoff, respect Retry-After headers, incremental sync
Embedding model quality Start with nomic-embed-text; architecture allows model swap
SQLite scale limits Monitor performance; Postgres migration path documented
Stale data Incremental sync with change detection
Mid-sync failures Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery
Missed updates Rolling backfill window (14 days), tuple cursor semantics
Search quality Hybrid (vector + FTS5) retrieval with RRF, golden query test suite
Concurrent sync corruption DB lock + heartbeat + rolling backfill, automatic stale lock recovery
Embedding failures Per-document error tracking, retry with backoff, targeted re-runs
Pathological discussions Queue-based discussion fetching; one bad thread doesn't block entire sync
Empty search results with filters Adaptive recall (topK 50→200 when filtered)

SQLite Performance Defaults (MVP):

  • Enable PRAGMA journal_mode=WAL; on every connection
  • Enable PRAGMA foreign_keys=ON; on every connection
  • Use explicit transactions for page/batch inserts
  • Targeted indexes on (project_id, updated_at) for primary resources

Schema Summary

Table Checkpoint Purpose
projects 0 Configured GitLab projects
sync_runs 0 Audit trail of sync operations (with heartbeat)
app_locks 0 Crash-safe single-flight lock
sync_cursors 0 Resumable sync state per primary resource
raw_payloads 0 Decoupled raw JSON storage (gitlab_id as TEXT)
schema_version 0 Database migration version tracking
issues 1 Normalized issues (unique by project+iid)
labels 1 Label definitions (unique by project + name)
issue_labels 1 Issue-label junction
discussions 1 Discussion threads (issue discussions in CP1, MR discussions in CP2)
notes 1 Individual comments with is_system flag (DiffNote paths added in CP2)
merge_requests 2 Normalized MRs (unique by project+iid)
mr_labels 2 MR-label junction
documents 3A Unified searchable documents with truncation metadata
document_labels 3A Document-label junction for fast filtering
document_paths 3A Fast path filtering for documents (DiffNote file paths)
dirty_sources 3A Queue for incremental document regeneration
pending_discussion_fetches 3A Resumable queue for dependent discussion fetching
documents_fts 3A Full-text search index (fts5 with porter stemmer)
embeddings 3B Vector embeddings (sqlite-vec vec0, rowid=document_id)
embedding_metadata 3B Embedding provenance + error tracking
mr_files 6 MR file changes (deferred to post-MVP)

Resolved Decisions

Question Decision Rationale
Comments structure Discussions as first-class entities Thread context is essential for decision traceability
System notes Store flagged, exclude from embeddings Preserves audit trail while avoiding semantic noise
DiffNote paths Capture now Enables immediate file/path search without full file-history feature
MR file linkage Deferred to post-MVP (CP6) Only needed for file-history feature
Labels Index as filters document_labels table enables fast --label=X filtering
Labels uniqueness By (project_id, name) GitLab API returns labels as strings
Sync method Polling only for MVP Webhooks add complexity; polling every 10 min is sufficient
Sync safety DB lock + heartbeat + rolling backfill Prevents race conditions and missed updates
Discussions sync Resumable queue model Queue-based fetching keeps one pathological thread from blocking the entire sync
Hybrid ranking RRF over weighted sums Simpler, no score normalization needed
Embedding rowid rowid = documents.id Eliminates fragile rowid mapping
Embedding truncation Note-boundary aware middle drop Never cut mid-note; preserves semantic coherence
Embedding batching 32 docs/batch, 4 concurrent workers Balance throughput, memory, and error isolation
FTS5 tokenizer porter unicode61 Stemming improves recall
Ollama unavailable Graceful degradation to FTS5 Search still works without semantic matching
JSON output Stable documented schema Enables reliable agent/MCP consumption
Database location XDG compliant: ~/.local/share/gi/ Standard location, user-configurable
gi init validation Validate GitLab before writing config Fail fast, better UX
Ctrl+C handling Graceful shutdown Finish page, commit cursor, exits cleanly
Empty state UX Actionable messages Guide user to next step
raw_payloads.gitlab_id TEXT not INTEGER Discussion IDs are strings; numeric IDs stored as strings
GitLab list params Always scope=all&state=all Ensures all historical data including closed items
Pagination X-Next-Page headers with empty-page fallback Headers are more robust than empty-page detection
Integration tests Mocked by default, live tests optional Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1
Search recall with filters Adaptive topK (50→200 when filtered) Prevents "no results" when relevant docs exist outside top-50
RRF score normalization Per-query normalized 0-1 score = rrfScore / max(rrfScore); raw score in explain
--path semantics Trailing / = prefix match --path=src/auth/ does prefix; otherwise exact match
CP3 structure Split into 3A (FTS) and 3B (embeddings) Lexical search works before embedding infra risk
Vector extension sqlite-vec (not sqlite-vss) sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere
CLI framework Commander.js Simple, lightweight, sufficient for single-user CLI tool
Logging pino to stderr JSON-structured, fast; stderr keeps stdout clean for JSON output piping
Error handling Custom error class hierarchy GiError base with codes; specific classes for config/gitlab/db/embedding errors
Truncation edge cases Char-boundary cut for oversized notes Single notes > 32000 chars truncated at char boundary with [truncated] marker
Ollama API Use /api/embed for batching Batch up to 32 docs per request; fall back to /api/embeddings for single
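Two of the ranking decisions above (RRF over weighted sums, and the per-query 0-1 score normalization) can be sketched together. Here k = 60 is the conventional RRF constant and an assumption, and each ranking is a best-first list of document ids:

```typescript
// Reciprocal Rank Fusion over N ranked lists, then normalize so the top
// result scores 1.0; raw RRF scores would be what --explain surfaces.
function rrfFuse(rankings: number[][], k = 60): Map<number, number> {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, idx) => {
      // rank is 1-based, so the contribution is 1 / (k + rank).
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + idx + 1));
    });
  }
  const max = Math.max(...scores.values());
  for (const [id, s] of scores) scores.set(id, s / max); // normalized 0-1
  return scores;
}
```

No score normalization across the vector and FTS5 lists is needed beforehand: RRF consumes only ranks, which is the stated reason for preferring it over weighted sums.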

Next Steps

  1. User approves this spec
  2. Generate Checkpoint 0 PRD for project setup
  3. Implement Checkpoint 0
  4. Human validates → proceed to Checkpoint 1
  5. Repeat for each checkpoint