GitLab Knowledge Engine - Spec Document
Note: This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See README.md for current documentation.
Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
Quick Start
Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Node.js | 20+ | LTS recommended |
| npm | 10+ | Comes with Node.js |
| Ollama | Latest | Optional for semantic search; lexical search works without it |
Installation
# Clone and install
git clone https://github.com/your-org/gitlab-inbox.git
cd gitlab-inbox
npm install
npm run build
npm link # Makes `gi` available globally
First Run
1. Set your GitLab token (create at GitLab > Settings > Access Tokens with the `read_api` scope): `export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"`
2. Run the setup wizard: `gi init`. This creates `gi.config.json` with your GitLab URL and project paths.
3. Verify your environment: `gi doctor`. All checks should pass (an Ollama warning is OK if you only need lexical search).
4. Sync your data: `gi sync`. The initial sync takes 10-20 minutes depending on repo size and rate limits.
5. Search: `gi search "authentication redesign"`
Troubleshooting First Run
| Symptom | Solution |
|---|---|
| Config file not found | Run `gi init` first |
| GITLAB_TOKEN not set | Export the environment variable |
| 401 Unauthorized | Check that the token has the `read_api` scope |
| Project not found: group/project | Verify the project path in the GitLab URL |
| Ollama connection refused | Start Ollama or use `--mode=lexical` for search |
Discovery Summary
Pain Points Identified
- Knowledge discovery - Tribal knowledge buried in old MRs/issues that nobody can find
- Decision traceability - Hard to find why decisions were made; context scattered across issue comments and MR discussions
Constraints
| Constraint | Detail |
|---|---|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |
Target Use Cases (Priority Order)
- MVP: Semantic Search - "Find discussions about authentication redesign"
- Future: File/Feature History - "What decisions were made about src/auth/login.ts?"
- Future: Personal Tracking - "What am I assigned to or mentioned in?"
- Future: Person Context - "What's @johndoe's background in this project?"
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ GitLab API │
│ (Issues, MRs, Notes) │
└─────────────────────────────────────────────────────────────────┘
(Commit-level indexing explicitly post-MVP)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Ingestion Layer │
│ - Incremental sync (PAT-based polling) │
│ - Rate limiting / backoff │
│ - Raw JSON storage for replay │
│ - Dependent resource fetching (notes, MR changes) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Processing Layer │
│ - Normalize artifacts to unified schema │
│ - Extract searchable documents (canonical text + metadata) │
│ - Content hashing for change detection │
│ - MVP relationships: parent-child FKs + label/path associations│
│ (full cross-entity "decision graph" is post-MVP scope) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ - SQLite + sqlite-vec + FTS5 (hybrid search) │
│ - Structured metadata in relational tables │
│ - Vector embeddings for semantic search │
│ - Full-text index for lexical search fallback │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Query Interface │
│ - CLI for human testing │
│ - JSON API for AI agent testing │
│ - Semantic search with filters (author, date, type, label) │
└─────────────────────────────────────────────────────────────────┘
Technology Choices
| Component | Choice | Rationale |
|---|---|---|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js | Simple, lightweight, well-documented |
| Logging | pino | Fast, JSON-structured, low overhead |
| Validation | Zod | TypeScript-first schema validation |
Alternative Considered: sqlite-vss
- sqlite-vss was the original choice but is now deprecated
- No Apple Silicon support (no prebuilt ARM binaries)
- Replaced by sqlite-vec, which is pure C with no dependencies
- sqlite-vec uses the `vec0` virtual table (vs. sqlite-vss's `vss0`)
Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
- Cons: Requires running Postgres, heavier setup
- Decision: Start with SQLite for simplicity; migration path exists if needed
Project Structure
gitlab-inbox/
├── src/
│ ├── cli/
│ │ ├── index.ts # CLI entry point (Commander.js)
│ │ └── commands/ # One file per command group
│ │ ├── init.ts
│ │ ├── sync.ts
│ │ ├── search.ts
│ │ ├── list.ts
│ │ └── doctor.ts
│ ├── core/
│ │ ├── config.ts # Config loading/validation (Zod)
│ │ ├── db.ts # Database connection + migrations
│ │ ├── errors.ts # Custom error classes
│ │ └── logger.ts # pino logger setup
│ ├── gitlab/
│ │ ├── client.ts # GitLab API client with rate limiting
│ │ ├── types.ts # GitLab API response types
│ │ └── transformers/ # Payload → normalized schema
│ │ ├── issue.ts
│ │ ├── merge-request.ts
│ │ └── discussion.ts
│ ├── ingestion/
│ │ ├── issues.ts
│ │ ├── merge-requests.ts
│ │ └── discussions.ts
│ ├── documents/
│ │ ├── extractor.ts # Document generation from entities
│ │ └── truncation.ts # Note-boundary aware truncation
│ ├── embedding/
│ │ ├── ollama.ts # Ollama client
│ │ └── pipeline.ts # Batch embedding orchestration
│ ├── search/
│ │ ├── hybrid.ts # RRF ranking logic
│ │ ├── fts.ts # FTS5 queries
│ │ └── vector.ts # sqlite-vec queries
│ └── types/
│ └── index.ts # Shared TypeScript types
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── live/ # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│ └── fixtures/
│ └── golden-queries.json
├── migrations/ # Numbered SQL migration files
│ ├── 001_initial.sql
│ └── ...
├── gi.config.json # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md
Dependencies
Runtime Dependencies
{
"dependencies": {
"better-sqlite3": "latest",
"sqlite-vec": "latest",
"commander": "latest",
"zod": "latest",
"pino": "latest",
"pino-pretty": "latest",
"ora": "latest",
"chalk": "latest",
"cli-table3": "latest"
}
}
| Package | Purpose |
|---|---|
| better-sqlite3 | Synchronous SQLite driver (fast, native) |
| sqlite-vec | Vector search extension (pure C, cross-platform) |
| commander | CLI argument parsing |
| zod | Schema validation for config and inputs |
| pino | Structured JSON logging |
| pino-pretty | Dev-mode log formatting |
| ora | CLI spinners for progress indication |
| chalk | Terminal colors |
| cli-table3 | ASCII tables for list output |
Dev Dependencies
{
"devDependencies": {
"typescript": "latest",
"@types/better-sqlite3": "latest",
"@types/node": "latest",
"vitest": "latest",
"msw": "latest",
"eslint": "latest",
"@typescript-eslint/eslint-plugin": "latest",
"@typescript-eslint/parser": "latest",
"tsx": "latest"
}
}
| Package | Purpose |
|---|---|
| typescript | TypeScript compiler |
| vitest | Test runner |
| msw | Mock Service Worker for API mocking in tests |
| eslint | Linting |
| tsx | Run TypeScript directly during development |
GitLab API Strategy
Primary Resources (Bulk Fetch)
Issues and MRs support efficient bulk fetching with incremental sync:
GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
Required query params for completeness:
- `scope=all` - include all issues/MRs, not just those authored by the current user
- `state=all` - include closed items (GitLab defaults may exclude them)
Without these params, the 2+ years of historical data would be incomplete.
Dependent Resources (Per-Parent Fetch)
Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
Pagination: Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss.
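A minimal sketch of that page-following loop. Names like `nextPage` and `fetchAllPages` are illustrative, not part of the spec; it trusts GitLab's `x-next-page` response header when present and falls back to stopping on a short/empty page when headers are missing:

```typescript
// Sketch only (illustrative names). GitLab sends `x-next-page: ''` on the
// last page; if headers are absent entirely, a short page ends the loop.
export function nextPage(
  current: number,
  itemsReturned: number,
  perPage: number,
  xNextPage: string | null,
): number | null {
  if (xNextPage !== null) {
    return xNextPage === '' ? null : Number(xNextPage); // '' marks the last page
  }
  return itemsReturned < perPage ? null : current + 1; // header-less fallback
}

type Page<T> = { items: T[]; xNextPage: string | null };

// `getPage` stands in for an HTTP call such as
// GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
export async function fetchAllPages<T>(
  getPage: (page: number) => Promise<Page<T>>,
  perPage = 100,
): Promise<T[]> {
  const all: T[] = [];
  let page: number | null = 1;
  while (page !== null) {
    const { items, xNextPage } = await getPage(page);
    all.push(...items);
    page = nextPage(page, items.length, perPage, xNextPage);
  }
  return all;
}
```

Keeping the continuation decision in a pure function makes the header-fallback behavior trivially unit-testable without mocking HTTP.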
Sync Pattern
Initial sync:
- Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
- For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead)
- Fetch all MRs (paginated, ~60 calls)
- For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead)
- Total: thousands of API calls for initial sync
API Call Estimation Formula:
total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue
+ ceil(mrs/100) + mrs × avg_discussion_pages_per_mr
Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent:
- Issue list: 30 calls
- Issue discussions: 3,000 × 1.2 = 3,600 calls
- MR list: 30 calls
- MR discussions: 3,000 × 1.2 = 3,600 calls
- Total: ~7,260 calls
This matters for rate limit planning and setting realistic "10-20 minutes" expectations.
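The estimation formula transcribes directly into a small planning helper (illustrative, not a CLI feature; all inputs are assumptions you supply):

```typescript
// Direct transcription of the estimation formula above.
export function estimateInitialSyncCalls(
  issues: number,
  mrs: number,
  avgDiscussionPagesPerIssue: number,
  avgDiscussionPagesPerMr: number,
  perPage = 100,
): number {
  return (
    Math.ceil(issues / perPage) + issues * avgDiscussionPagesPerIssue +
    Math.ceil(mrs / perPage) + mrs * avgDiscussionPagesPerMr
  );
}

// The worked example: 3K issues, 3K MRs, 1.2 discussion pages per parent
// → 30 + 3,600 + 30 + 3,600 = 7,260 calls.
```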
Incremental sync:
- Fetch issues where `updated_after=cursor` (bulk)
- For EACH updated issue → refetch ALL its discussions
- Fetch MRs where `updated_after=cursor` (bulk)
- For EACH updated MR → refetch ALL its discussions
Critical Assumption (Softened)
We expect that adding a note or discussion updates the parent's `updated_at`, but we do not rely on this exclusively.
Mitigations (MVP):
- Tuple cursor semantics: The cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing the cursor.
- Rolling backfill window: Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
- Periodic full re-sync: Remains optional as an extra safety net (`gi sync --full`).
The backfill window provides 80% of the safety of full resync at <5% of the API cost.
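The cursor comparison and the backfill rewind can be sketched as pure functions (illustrative names; `effectiveUpdatedAfter` reflects one reasonable reading of the window semantics, not a spec'd algorithm):

```typescript
// Sketch of the (updated_at, gitlab_id) tuple cursor. Timestamps are
// ms-epoch integers, as in the schema.
export interface Cursor { updatedAt: number; tieBreakerId: number; }

// An item is "new" if it sorts strictly after the cursor in
// (updated_at ASC, gitlab_id ASC) order.
export function isAfterCursor(item: Cursor, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.tieBreakerId > cursor.tieBreakerId;
}

// The `updated_after` value actually sent to GitLab: the stored cursor,
// rewound so that anything updated inside the backfill window is re-fetched.
export function effectiveUpdatedAfter(
  cursor: Cursor | null,
  nowMs: number,
  backfillDays = 14, // sync.backfillDays
): number {
  if (cursor === null) return 0; // initial sync: fetch everything
  const windowMs = backfillDays * 24 * 60 * 60 * 1000;
  return Math.max(0, Math.min(cursor.updatedAt, nowMs - windowMs));
}
```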
Rate Limiting
- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid a thundering herd on retry
- Separate concurrency limits:
  - `sync.primaryConcurrency`: concurrent requests to issue/MR list endpoints (default 4)
  - `sync.dependentConcurrency`: concurrent requests to discussion endpoints (default 2, lower to avoid 429s)
  - Bound concurrency per project so one repo cannot starve the other
- Initial sync estimate: 10-20 minutes depending on rate limits
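The retry policy above reduces to a small delay function; the base delay and cap below are assumed values, not spec'd:

```typescript
// Sketch: exponential backoff with full jitter, honoring Retry-After on 429s.
export function retryDelayMs(
  attempt: number,                  // 0-based retry attempt
  retryAfterSeconds: number | null, // parsed Retry-After header, if any
  random: () => number = Math.random,
  baseMs = 500,                     // assumed base delay
  capMs = 30_000,                   // assumed ceiling
): number {
  // A server-provided Retry-After always wins.
  if (retryAfterSeconds !== null) return retryAfterSeconds * 1000;
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * exp; // full jitter avoids a thundering herd
}
```

Injecting `random` keeps the jitter deterministic under test.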
Checkpoint Structure
Each checkpoint is a testable milestone where a human can validate the system works before proceeding.
Checkpoint 0: Project Setup
Deliverable: Scaffolded project with GitLab API connection verified and project resolution working
Automated Tests (Vitest):
tests/unit/config.test.ts
✓ loads config from gi.config.json
✓ throws if config file missing
✓ throws if required fields missing (baseUrl, projects)
✓ validates project paths are non-empty strings
tests/unit/db.test.ts
✓ creates database file if not exists
✓ applies migrations in order
✓ sets WAL journal mode
✓ enables foreign keys
tests/integration/gitlab-client.test.ts
✓ (mocked) authenticates with valid PAT
✓ (mocked) returns 401 for invalid PAT
✓ (mocked) fetches project by path
✓ (mocked) handles rate limiting (429) with retry
tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI)
✓ authenticates with real PAT against configured baseUrl
✓ fetches real project by path
✓ handles actual rate limiting behavior
tests/integration/app-lock.test.ts
✓ acquires lock successfully
✓ updates heartbeat during operation
✓ detects stale lock and recovers
✓ refuses concurrent acquisition
tests/integration/init.test.ts
✓ creates config file with valid structure
✓ validates GitLab URL format
✓ validates GitLab connection before writing config
✓ validates each project path exists in GitLab
✓ fails if token not set
✓ fails if GitLab auth fails
✓ fails if any project path not found
✓ prompts before overwriting existing config
✓ respects --force to skip confirmation
✓ generates gi.config.json with sensible defaults
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows a warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
| `gi init` | Interactive prompts | Creates a valid `gi.config.json` |
| `gi init` (config exists) | Confirmation prompt | Warns before overwriting |
| `gi --help` | Command list | Shows all available commands |
| `gi version` | Version number | Shows installed version |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
Data Integrity Checks:
- `projects` table contains a row for each configured project path
- `gitlab_project_id` matches actual GitLab project IDs
- `raw_payloads` contains project JSON for each synced project
Scope:
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with an `auth-test` command
- `doctor` command for environment verification
- Projects table and initial project resolution (no issue/MR ingestion yet)
- DB migrations + WAL + FK enforcement
- Sync tracking with a crash-safe single-flight lock (heartbeat-based)
- Rate limit handling with exponential backoff + jitter
- `gi init` command for guided setup:
  - Prompts for the GitLab base URL
  - Prompts for project paths (comma-separated or multiple prompts)
  - Prompts for the token environment variable name (default: GITLAB_TOKEN)
  - Validates before writing config:
    - Token must be set in the environment
    - Tests auth with the `GET /user` endpoint
    - Validates each project path with `GET /projects/:path`
    - Only writes config after all validations pass
  - Generates `gi.config.json` with sensible defaults
- `gi --help` shows all available commands
- `gi <command> --help` shows command-specific help
- `gi version` shows the installed version
- First-run detection: if no config exists, suggest `gi init`
Configuration (MVP):
// gi.config.json
{
"gitlab": {
"baseUrl": "https://gitlab.example.com",
"tokenEnvVar": "GITLAB_TOKEN"
},
"projects": [
{ "path": "group/project-one" },
{ "path": "group/project-two" }
],
"sync": {
"backfillDays": 14,
"staleLockMinutes": 10,
"heartbeatIntervalSeconds": 30,
"cursorRewindSeconds": 2,
"primaryConcurrency": 4,
"dependentConcurrency": 2
},
"storage": {
"compressRawPayloads": true
},
"embedding": {
"provider": "ollama",
"model": "nomic-embed-text",
"baseUrl": "http://localhost:11434",
"concurrency": 4
}
}
Raw Payload Compression:
- When `storage.compressRawPayloads: true` (the default), raw JSON payloads are gzip-compressed before storage
- `raw_payloads.content_encoding` indicates `'identity'` (uncompressed) or `'gzip'` (compressed)
- Compression typically reduces storage by 70-80% for JSON payloads
- Decompression is handled transparently when reading payloads
- Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage
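A sketch of the payload codec these rules imply, using Node's built-in `zlib` (function names are illustrative):

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

export type Encoding = 'identity' | 'gzip';

// Encode a raw API payload for the raw_payloads table. The content_encoding
// tag travels with the blob so readers can decode transparently.
export function encodePayload(
  json: unknown,
  compress: boolean, // storage.compressRawPayloads
): { payload: Buffer; contentEncoding: Encoding } {
  const raw = Buffer.from(JSON.stringify(json), 'utf8');
  return compress
    ? { payload: gzipSync(raw), contentEncoding: 'gzip' }
    : { payload: raw, contentEncoding: 'identity' };
}

export function decodePayload(payload: Buffer, contentEncoding: Encoding): unknown {
  const raw = contentEncoding === 'gzip' ? gunzipSync(payload) : payload;
  return JSON.parse(raw.toString('utf8'));
}
```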
Error Classes (src/core/errors.ts):
// Base error class with error codes for programmatic handling
export class GiError extends Error {
constructor(message: string, public readonly code: string) {
super(message);
this.name = 'GiError';
}
}
// Config errors
export class ConfigNotFoundError extends GiError {
constructor() {
super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
}
}
export class ConfigValidationError extends GiError {
constructor(details: string) {
super(`Invalid config: ${details}`, 'CONFIG_INVALID');
}
}
// GitLab API errors
export class GitLabAuthError extends GiError {
constructor() {
super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
}
}
export class GitLabNotFoundError extends GiError {
constructor(resource: string) {
super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
}
}
export class GitLabRateLimitError extends GiError {
constructor(public readonly retryAfter: number) {
super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
}
}
// Database errors
export class DatabaseLockError extends GiError {
constructor() {
super('Another sync is running. Use --force to override.', 'DB_LOCKED');
}
}
// Embedding errors
export class OllamaConnectionError extends GiError {
constructor() {
super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
}
}
export class EmbeddingError extends GiError {
constructor(documentId: number, reason: string) {
super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
}
}
Logging Strategy (src/core/logger.ts):
import pino from 'pino';
// Logs go to stderr, results to stdout (allows clean JSON piping)
export const logger = pino({
level: process.env.LOG_LEVEL || 'info',
transport: process.env.NODE_ENV === 'production' ? undefined : {
target: 'pino-pretty',
options: { colorize: true, destination: 2 } // 2 = stderr
}
}, pino.destination(2));
Log Levels:
| Level | When to use |
|---|---|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |
Logging Conventions:
- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include an `err` object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs go to stderr so `gi search --json` output stays clean on stdout
DB Runtime Defaults (Checkpoint 0):
- On every connection:
  - `PRAGMA journal_mode=WAL;`
  - `PRAGMA foreign_keys=ON;`
Schema (Checkpoint 0):
-- Projects table (configured targets)
CREATE TABLE projects (
id INTEGER PRIMARY KEY,
gitlab_project_id INTEGER UNIQUE NOT NULL,
path_with_namespace TEXT NOT NULL,
default_branch TEXT,
web_url TEXT,
created_at INTEGER,
updated_at INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);
-- Sync tracking for reliability
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
started_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL,
finished_at INTEGER,
status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc.
error TEXT
);
-- Crash-safe single-flight lock (DB-enforced)
CREATE TABLE app_locks (
name TEXT PRIMARY KEY, -- 'sync'
owner TEXT NOT NULL, -- random run token (UUIDv4)
acquired_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL
);
-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
project_id INTEGER NOT NULL REFERENCES projects(id),
resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests'
updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch)
tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering)
PRIMARY KEY(project_id, resource_type)
);
-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
id INTEGER PRIMARY KEY,
source TEXT NOT NULL, -- 'gitlab'
project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources
resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
gitlab_id TEXT NOT NULL, -- TEXT because discussion IDs are strings; numeric IDs stored as strings
fetched_at INTEGER NOT NULL,
content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);
-- Schema version tracking for migrations
CREATE TABLE schema_version (
version INTEGER PRIMARY KEY,
applied_at INTEGER NOT NULL,
description TEXT
);
Checkpoint 1: Issue Ingestion
Deliverable: All issues + labels + issue discussions from target repos stored locally with resumable cursor-based sync
Automated Tests (Vitest):
tests/unit/issue-transformer.test.ts
✓ transforms GitLab issue payload to normalized schema
✓ extracts labels from issue payload
✓ handles missing optional fields gracefully
tests/unit/pagination.test.ts
✓ fetches all pages when multiple exist
✓ respects per_page parameter
✓ follows X-Next-Page header until empty/absent
✓ falls back to empty-page stop if headers missing (robustness)
tests/unit/discussion-transformer.test.ts
✓ transforms discussion payload to normalized schema
✓ extracts notes array from discussion
✓ sets individual_note flag correctly
✓ flags system notes with is_system=1
✓ preserves note order via position field
tests/integration/issue-ingestion.test.ts
✓ inserts issues into database
✓ creates labels from issue payloads
✓ links issues to labels via junction table
✓ stores raw payload for each issue
✓ updates cursor after successful page commit
✓ resumes from cursor on subsequent runs
tests/integration/issue-discussion-ingestion.test.ts
✓ fetches discussions for each issue
✓ creates discussion rows with correct issue FK
✓ creates note rows linked to discussions
✓ stores system notes with is_system=1 flag
✓ handles individual_note=true discussions
tests/integration/sync-runs.test.ts
✓ creates sync_run record on start
✓ marks run as succeeded on completion
✓ marks run as failed with error message on failure
✓ refuses concurrent run (single-flight)
✓ allows --force to override stale running status
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL. If multiple projects have issue #123, prompts for clarification or use `--project=PATH` |
| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count |
| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
Data Integrity Checks:
- `SELECT COUNT(*) FROM issues` matches the GitLab issue count for configured projects
- Every issue has a corresponding `raw_payloads` row
- Labels in the `issue_labels` junction all exist in the `labels` table
- `sync_cursors` has an entry for each (project_id, 'issues') pair
- Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)
- `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero
- Every discussion has at least one note
- `individual_note = true` discussions have exactly one note
Scope:
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payloads:
  - Always persist label names from `labels: string[]`
  - Optionally request `with_labels_details=true` to capture color/description when available
- Issue discussions fetcher:
  - Uses `GET /projects/:id/issues/:iid/discussions`
  - Fetches all discussions for each issue during ingest
  - Preserves system notes but flags them with `is_system=1`
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands
Reliability/Idempotency Rules:
- Every ingest/sync creates a `sync_runs` row
- Single-flight via a DB-enforced app lock:
  - On start: acquire the lock via a transactional compare-and-swap:
    - `BEGIN IMMEDIATE` (acquires the write lock immediately)
    - If no row exists → INSERT a new lock
    - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
    - Else if `owner` matches the current run → UPDATE heartbeat (re-entrant)
    - Else → ROLLBACK and fail fast (another run is active)
    - `COMMIT`
  - During the run: update `heartbeat_at` every 30 seconds
  - If an existing lock's `heartbeat_at` is stale (> 10 minutes), treat it as abandoned and acquire
  - `--force` remains an operator override for edge cases, but should rarely be needed
- Cursor advances only after a successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
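The compare-and-swap branches can be isolated as a pure decision function, which keeps the lock policy testable apart from SQLite (illustrative names; the actual statements run inside the `BEGIN IMMEDIATE` transaction):

```typescript
// Sketch of the lock decision logic. Timestamps are ms epoch.
export interface LockRow { owner: string; heartbeatAt: number; }

export type LockAction = 'insert' | 'steal' | 'heartbeat' | 'fail';

export function decideLockAction(
  existing: LockRow | null,
  me: string,  // this run's random owner token (UUIDv4)
  nowMs: number,
  staleLockMinutes = 10,
): LockAction {
  if (existing === null) return 'insert';                     // no lock yet
  const staleMs = staleLockMinutes * 60_000;
  if (nowMs - existing.heartbeatAt > staleMs) return 'steal'; // abandoned lock
  if (existing.owner === me) return 'heartbeat';              // re-entrant
  return 'fail';                                              // active holder
}
```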
Schema Preview:
CREATE TABLE issues (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);
-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER, -- optional (only if available)
project_id INTEGER NOT NULL REFERENCES projects(id),
name TEXT NOT NULL,
color TEXT,
description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);
CREATE TABLE issue_labels (
issue_id INTEGER REFERENCES issues(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
-- Discussion threads for issues (MR discussions added in CP2)
CREATE TABLE discussions (
id INTEGER PRIMARY KEY,
gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
project_id INTEGER NOT NULL REFERENCES projects(id),
issue_id INTEGER REFERENCES issues(id),
merge_request_id INTEGER, -- FK added in CP2 via ALTER TABLE
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
first_note_at INTEGER, -- for ordering discussions
last_note_at INTEGER, -- for "recently active" queries
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
resolvable BOOLEAN, -- MR discussions can be resolved
resolved BOOLEAN,
CHECK (
(noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
(noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
)
);
CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);
-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
discussion_id INTEGER NOT NULL REFERENCES discussions(id),
project_id INTEGER NOT NULL REFERENCES projects(id),
type TEXT, -- 'DiscussionNote' | 'DiffNote' | null
is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.)
author_username TEXT,
body TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
position INTEGER, -- derived from array order in API response (0-indexed)
resolvable BOOLEAN,
resolved BOOLEAN,
resolved_by TEXT,
resolved_at INTEGER,
-- DiffNote position metadata (only populated for MR DiffNotes in CP2)
position_old_path TEXT,
position_new_path TEXT,
position_old_line INTEGER,
position_new_line INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_system ON notes(is_system);
Checkpoint 2: MR Ingestion
Deliverable: All MRs + MR discussions + notes with DiffNote paths captured
Automated Tests (Vitest):
tests/unit/mr-transformer.test.ts
✓ transforms GitLab MR payload to normalized schema
✓ extracts labels from MR payload
✓ handles missing optional fields gracefully
tests/unit/diffnote-transformer.test.ts
✓ extracts DiffNote position metadata (paths and lines)
✓ handles missing position fields gracefully
tests/integration/mr-ingestion.test.ts
✓ inserts MRs into database
✓ creates labels from MR payloads
✓ links MRs to labels via junction table
✓ stores raw payload for each MR
tests/integration/mr-discussion-ingestion.test.ts
✓ fetches discussions for each MR
✓ creates discussion rows with correct MR FK
✓ creates note rows linked to discussions
✓ extracts position_new_path from DiffNotes
✓ captures note-level resolution status
✓ captures note type (DiscussionNote, DiffNote)
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) |
| `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only |
| `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count |
Data Integrity Checks:
- `SELECT COUNT(*) FROM merge_requests` matches the GitLab MR count
- `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero
- DiffNotes have `position_new_path` populated when available
- Discussion `first_note_at` <= `last_note_at` for all rows
Scope:
- MR fetcher with pagination
- MR discussions fetcher:
  - Uses `GET /projects/:id/merge_requests/:iid/discussions`
  - Fetches all discussions for each MR during ingest
  - Captures DiffNote file path/line metadata from the `position` field for filename search
- Relationship linking (discussion → MR, notes → discussion)
- Extended CLI commands for MR display with threads
- Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries
Note: MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.
Schema Additions:
CREATE TABLE merge_requests (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
source_branch TEXT,
target_branch TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
merged_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);
-- MR labels (reuse same labels table from CP1)
CREATE TABLE mr_labels (
merge_request_id INTEGER REFERENCES merge_requests(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
-- Additional indexes for DiffNote queries (tables created in CP1)
CREATE INDEX idx_notes_type ON notes(type);
CREATE INDEX idx_notes_new_path ON notes(position_new_path);
-- Migration: Add FK constraint to discussions table (was deferred from CP1)
-- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK
-- This is handled by the migration system; pseudocode for clarity:
-- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id)
-- 2. INSERT INTO discussions_new SELECT * FROM discussions
-- 3. DROP TABLE discussions
-- 4. ALTER TABLE discussions_new RENAME TO discussions
-- 5. Recreate indexes
MR Discussion Processing Rules:
- DiffNote position data is extracted and stored:
  - `position.old_path`, `position.new_path` for file-level search
  - `position.old_line`, `position.new_line` for line-level context
- MR discussions can be resolvable; resolution status is captured at note level
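The extraction rule above can be sketched as a small pure function. The type and function names here are illustrative (not the actual implementation); the `position` field shape follows the GitLab notes payload described in this section.

```typescript
// Sketch: pulling file/line metadata out of a GitLab DiffNote payload.
// Type and function names are illustrative, not the actual implementation.
interface DiffPosition {
  oldPath: string | null;
  newPath: string | null;
  oldLine: number | null;
  newLine: number | null;
}

// GitLab note payloads carry a `position` object only for DiffNotes.
function extractDiffPosition(note: { type?: string; position?: any }): DiffPosition | null {
  if (note.type !== "DiffNote" || !note.position) return null;
  return {
    oldPath: note.position.old_path ?? null,
    newPath: note.position.new_path ?? null,
    oldLine: note.position.old_line ?? null,
    newLine: note.position.new_line ?? null,
  };
}
```

Returning `null` for non-DiffNotes keeps the caller's upsert logic simple: only rows with a non-null position populate `position_new_path` and friends.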
Checkpoint 3A: Document Generation + FTS (Lexical Search)
Deliverable: Documents generated + FTS5 index; gi search --mode=lexical works end-to-end (no Ollama required)
Automated Tests (Vitest):
tests/unit/document-extractor.test.ts
✓ extracts issue document (title + description)
✓ extracts MR document (title + description)
✓ extracts discussion document with full thread context
✓ includes parent issue/MR title in discussion header
✓ formats notes with author and timestamp
✓ excludes system notes from discussion documents by default
✓ includes system notes only when --include-system-notes enabled (debug)
✓ truncates content exceeding 8000 tokens at note boundaries
✓ preserves first and last notes when truncating middle
✓ computes SHA-256 content hash consistently
tests/integration/document-creation.test.ts
✓ creates document for each issue
✓ creates document for each MR
✓ creates document for each discussion
✓ populates document_labels junction table
✓ computes content_hash for each document
✓ excludes system notes from discussion content
tests/integration/fts-index.test.ts
✓ documents_fts row count matches documents
✓ FTS triggers fire on insert/update/delete
✓ updates propagate via triggers
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ porter stemming works (search/searching)
✓ returns empty for non-matching query
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi generate-docs` | Progress bar, final count | Completes without error |
| `gi generate-docs` (re-run) | `0 documents to regenerate` | Skips unchanged docs |
| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama |
| `gi stats` | Document count stats | Shows document coverage |
Data Integrity Checks:
- `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- `SELECT COUNT(*) FROM documents_fts` = `SELECT COUNT(*) FROM documents` (via FTS triggers)
- `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` logs truncation warnings
- Discussion documents include parent title in content_text
- Discussion documents exclude system notes
Scope:
- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Truncation: content_text capped at 8000 tokens at NOTE boundaries
    - Implementation: use a character budget, not an exact token count
    - `maxChars = 32000` (conservative 4 chars/token estimate)
    - Drop whole notes from the middle, never cut mid-note
    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
  - System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search)
  - Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table
- FTS5 index for lexical search
- `gi search --mode=lexical` CLI command (works without Ollama)
This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
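The hashing and token-estimation rules above are small enough to sketch directly. This is a minimal sketch using Node's built-in `crypto`; the function names are illustrative.

```typescript
import { createHash } from "node:crypto";

// Change-detection hash: SHA-256 over the canonical content_text.
// Hashing the exact stored text means any regeneration that produces
// identical output is skipped for re-embedding.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}

// Token estimate for reporting/logging only (no tokenizer dependency):
// conservative 4 chars per token.
function approxTokens(contentText: string): number {
  return Math.ceil(contentText.length / 4);
}
```

Because the hash is computed over `content_text` itself (not the raw payload), cosmetic ingestion changes that don't alter the generated document never trigger re-embedding.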
Schema Additions (CP3A):
-- Unified searchable documents (derived from issues/MRs/discussions)
-- Note: Full documents table schema is in CP3B section for continuity with embeddings
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT, -- for discussions: first note author
label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER,
updated_at INTEGER,
url TEXT,
title TEXT, -- null for discussions
content_text TEXT NOT NULL, -- canonical text for embedding/snippets
content_hash TEXT NOT NULL, -- SHA-256 for change detection
is_truncated BOOLEAN NOT NULL DEFAULT 0,
truncated_reason TEXT, -- 'token_limit_middle_drop' | null
UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id),
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
-- Fast path filtering for documents (extracted from DiffNote positions)
CREATE TABLE document_paths (
document_id INTEGER NOT NULL REFERENCES documents(id),
path TEXT NOT NULL,
PRIMARY KEY(document_id, path)
);
CREATE INDEX idx_document_paths_path ON document_paths(path);
-- Track sources that require document regeneration (populated during ingestion)
CREATE TABLE dirty_sources (
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id
queued_at INTEGER NOT NULL,
PRIMARY KEY(source_type, source_id)
);
-- Resumable dependent fetches (discussions are per-parent resources)
CREATE TABLE pending_discussion_fetches (
project_id INTEGER NOT NULL REFERENCES projects(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL, -- parent iid (stable human identifier)
queued_at INTEGER NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
ON pending_discussion_fetches(attempt_count, last_attempt_at)
WHERE last_error IS NOT NULL;
-- Full-text search for lexical retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
FTS5 Tokenizer Notes:
- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer
Checkpoint 3B: Embedding Generation (Semantic Search)
Deliverable: Embeddings generated + gi search --mode=semantic works; graceful fallback if Ollama unavailable
Automated Tests (Vitest):
tests/unit/embedding-client.test.ts
✓ connects to Ollama API
✓ generates embedding for text input
✓ returns 768-dimension vector
✓ handles Ollama connection failure gracefully
✓ batches requests (32 documents per batch)
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vec
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
✓ re-embeds when content_hash changes
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
| `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents |
Data Integrity Checks:
- `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- All `embedding_metadata.content_hash` values match the corresponding `documents.content_hash`
Scope:
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline:
- Batch size: 32 documents per batch
- Concurrency: configurable (default 4 workers)
- Retry with exponential backoff for transient failures (max 3 attempts)
- Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vec extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
Ollama API Contract:
// POST http://localhost:11434/api/embed (batch endpoint - preferred)
interface OllamaEmbedRequest {
model: string; // "nomic-embed-text"
input: string[]; // array of texts to embed (up to 32)
}
interface OllamaEmbedResponse {
model: string;
embeddings: number[][]; // array of 768-dim vectors
}
// POST http://localhost:11434/api/embeddings (single text - fallback)
interface OllamaEmbeddingsRequest {
model: string;
prompt: string;
}
interface OllamaEmbeddingsResponse {
embedding: number[];
}
Usage:
- Use `/api/embed` for batching (up to 32 documents per request)
- Fall back to `/api/embeddings` for single documents or if batch fails
- Check Ollama availability with `GET http://localhost:11434/api/tags`
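A batch client against the endpoints above could look like the sketch below. The chunking helper is pure; `embedAll` assumes a local Ollama at the default port and the `/api/embed` request/response shapes documented in this section. Names are illustrative.

```typescript
// Sketch of a batch embedding client for the Ollama API contract above.
const BATCH_SIZE = 32;

// Split an array into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embed all texts in batches of 32 via POST /api/embed.
// Assumes Ollama is running locally with nomic-embed-text pulled.
async function embedAll(texts: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) {
    const res = await fetch("http://localhost:11434/api/embed", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "nomic-embed-text", input: batch }),
    });
    if (!res.ok) throw new Error(`Ollama embed failed: ${res.status}`);
    const data = (await res.json()) as { embeddings: number[][] };
    vectors.push(...data.embeddings);
  }
  return vectors;
}
```

Retry/backoff and per-document failure recording (per the pipeline scope above) would wrap this loop; they are omitted here for brevity.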
Schema Additions (CP3B):
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
embedding float[768]
);
-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
document_id INTEGER PRIMARY KEY REFERENCES documents(id),
model TEXT NOT NULL, -- 'nomic-embed-text'
dims INTEGER NOT NULL, -- 768
content_hash TEXT NOT NULL, -- copied from documents.content_hash
created_at INTEGER NOT NULL,
-- Error tracking for resumable embedding
last_error TEXT, -- error message from last failed attempt
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER -- when last attempt occurred
);
-- Index for finding failed embeddings to retry
CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
Storage Rule (MVP):
- Insert embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid mapping fragility
Document Extraction Rules:
| Source | content_text Construction |
|---|---|
| Issue | title + "\n\n" + description |
| MR | title + "\n\n" + description |
| Discussion | Full thread with context (see below) |
Discussion Document Format:
[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread
--- Thread ---
@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
System Notes Exclusion Rule:
- System notes (is_system=1) are stored in the DB for audit purposes
- System notes are EXCLUDED from discussion documents by default
- This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings
- Debug flag `--include-system-notes` available for troubleshooting
This format preserves:
- Parent context (issue/MR title and number)
- Project path for scoped search
- Direct URL for navigation
- Labels for context
- File paths from DiffNotes (enables immediate file search)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
Truncation (Note-Boundary Aware): If content exceeds 8000 tokens (~32000 chars):
Algorithm:
- Count non-system notes in the discussion
- If total chars ≤ maxChars, no truncation needed
- Otherwise, drop whole notes from the MIDDLE:
- Preserve first N notes and last M notes
- Never cut mid-note (produces unreadable snippets and worse embeddings)
- Continue dropping middle notes until under maxChars
- Insert marker: `\n\n[... N notes omitted for length ...]\n\n`
- Set `documents.is_truncated = 1`
- Set `documents.truncated_reason = 'token_limit_middle_drop'`
- Log a warning with document ID and original/truncated token count
Edge Cases:
- Single note > 32000 chars: truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
- First + last note > 32000 chars: keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
- Only one note in discussion: if it exceeds the limit, truncate at char boundary with `[truncated]`
Why note-boundary truncation:
- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
- First notes contain context/problem statement; last notes contain conclusions
- Middle notes are often back-and-forth that's less critical
Token estimation: approxTokens = ceil(charCount / 4). No tokenizer dependency.
This metadata enables:
- Monitoring truncation frequency in production
- Future investigation of high-value truncated documents
- Debugging when search misses expected content
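The middle-drop algorithm above can be sketched as follows. This is a simplified illustration (notes are plain strings, and the marker rendering is left to the caller); the real implementation would also handle the single-oversized-note edge cases listed above.

```typescript
// Sketch of note-boundary middle-drop truncation: drop whole notes from
// the middle until the joined text fits the character budget, keeping the
// opening context (head) and the closing conclusions (tail).
function truncateMiddle(
  notes: string[],
  maxChars: number
): { head: string[]; tail: string[]; omitted: number } {
  let head = notes.slice(0, Math.ceil(notes.length / 2));
  let tail = notes.slice(Math.ceil(notes.length / 2));
  const fits = () => [...head, ...tail].join("\n\n").length <= maxChars;
  let omitted = 0;
  // Drop the innermost note of whichever half is longer, so the first and
  // last notes survive longest; stop once under budget or only two remain.
  while (!fits() && head.length + tail.length > 2) {
    if (head.length >= tail.length && head.length > 1) head = head.slice(0, -1);
    else tail = tail.slice(1);
    omitted++;
  }
  return { head, tail, omitted };
}
```

The caller would then join `head`, the `[... N notes omitted for length ...]` marker (when `omitted > 0`), and `tail`, and set `is_truncated`/`truncated_reason` accordingly.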
Checkpoint 4: Hybrid Search (Semantic + Lexical)
Deliverable: Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content
Automated Tests (Vitest):
tests/unit/search-query.test.ts
✓ parses filter flags (--type, --author, --after, --label)
✓ validates date format for --after
✓ handles multiple --label flags
tests/unit/rrf-ranking.test.ts
✓ computes RRF score correctly
✓ merges results from vector and FTS retrievers
✓ handles documents appearing in only one retriever
✓ respects k=60 parameter
tests/integration/vector-search.test.ts
✓ returns results for semantic query
✓ ranks similar content higher
✓ returns empty for nonsense query
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ handles porter stemming (search/searching)
✓ returns empty for non-matching query
tests/integration/hybrid-search.test.ts
✓ combines vector and FTS results
✓ applies type filter correctly
✓ applies author filter correctly
✓ applies date filter correctly
✓ applies label filter correctly
✓ falls back to FTS when Ollama unavailable
tests/e2e/golden-queries.test.ts
✓ "authentication redesign" returns known auth-related items
✓ "database migration" returns known migration items
✓ [8 more domain-specific golden queries]
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings |
| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema |
| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" |
| `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
Golden Query Test Suite:
Create tests/fixtures/golden-queries.json with 10 queries and expected URLs:
[
{
"query": "authentication redesign",
"expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
"minResults": 1,
"maxRank": 10
}
]
Each query must have at least one expected URL appear in top 10 results.
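The pass criterion can be expressed as a small predicate over a ranked URL list. This is a sketch; the fixture field names follow the JSON example above, and `passesGolden` is an illustrative name.

```typescript
// Sketch of the golden-query pass criterion: at least `minResults` of the
// expected URLs must appear within the top `maxRank` ranked results.
interface GoldenQuery {
  query: string;
  expectedUrls: string[];
  minResults: number;
  maxRank: number;
}

function passesGolden(golden: GoldenQuery, rankedUrls: string[]): boolean {
  const top = rankedUrls.slice(0, golden.maxRank);
  const hits = golden.expectedUrls.filter((u) => top.includes(u)).length;
  return hits >= golden.minResults;
}
```

Each e2e test would run the real search pipeline for `golden.query` and assert `passesGolden(golden, urlsFromResults)`.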
Data Integrity Checks:
- `documents_fts` row count matches `documents` row count
- Search returns results for known content (not empty)
- JSON output validates against defined schema
- All result URLs are valid GitLab URLs
Scope:
- Hybrid retrieval:
  - Vector recall (sqlite-vec) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Query embedding generation (same Ollama pipeline as documents)
- Result ranking and scoring (document-level)
- Search filters:
  - `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N`
  - `--limit=N` controls result count (default: 20, max: 100)
  - `--path` filters documents by referenced file paths (from DiffNote positions):
    - If `--path` ends with `/`: prefix match (`path LIKE 'src/auth/%'`)
    - Otherwise: exact match OR prefix on directory boundary
    - Example: `--path=src/auth/` matches `src/auth/login.ts`, `src/auth/utils/helpers.ts`
    - Example: `--path=src/auth/login.ts` matches only that exact file
    - Glob patterns deferred to post-MVP
  - Label filtering operates on `document_labels` (indexed, exact-match)
  - Filters work identically in hybrid and lexical modes
- Debug: `--explain` returns rank contributions from vector + FTS + RRF
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI/agent consumption (stable documented schema)
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
- Empty state handling:
  - No documents indexed: `No data indexed. Run 'gi sync' first.`
  - Query returns no results: `No results found for "query".`
  - Filters exclude all results: `No results match the specified filters.`
  - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
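The `--path` matching rules in the Scope can be captured in one small predicate. A sketch with an illustrative function name:

```typescript
// Sketch of the --path filter semantics:
// - trailing "/" means directory prefix match
// - otherwise exact file match OR prefix on a directory boundary
function pathMatches(filter: string, docPath: string): boolean {
  if (filter.endsWith("/")) return docPath.startsWith(filter); // directory prefix
  if (docPath === filter) return true;                          // exact file
  return docPath.startsWith(filter + "/");                      // directory boundary
}
```

The directory-boundary check prevents `--path=src/auth` from matching `src/auth-legacy/...`, which a naive prefix match would allow.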
Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:
- Determine recall size (adaptive based on filters):
  - `baseTopK = 50`
  - If any filters are present (`--project`, `--type`, `--author`, `--label`, `--path`, `--after`): `topK = 200`
topK = 200 - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
- Query both vector index (top topK) and FTS5 (top topK)
- Vector recall via sqlite-vec + FTS lexical recall via fts5
- Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
- Merge results by document_id
- Combine with Reciprocal Rank Fusion (RRF):
  - For each retriever list, assign ranks (1..N)
  - `rrfScore = Σ 1 / (k + rank)` with k=60 (tunable)
  - RRF is simpler than weighted sums and doesn't require score normalization
- Apply remaining filters (date ranges, labels, paths that weren't applied in SQL)
- Return top K results
Why Adaptive Recall:
- Fixed top-50 + filter can easily return 0 results even when relevant docs exist
- Increasing recall when filters are present catches more candidates before filtering
- SQL-level filtering is preferred (faster, uses indexes) but not always possible
Why RRF over Weighted Sums:
- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- Well-established in information retrieval literature
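The RRF merge is a few lines in practice. A minimal sketch (illustrative name; input lists are document ids already ordered by each retriever's ranking):

```typescript
// Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) for every
// document it returned; documents found by both retrievers accumulate both
// contributions and therefore rank higher.
function rrfMerge(vectorIds: number[], ftsIds: number[], k = 60): Map<number, number> {
  const scores = new Map<number, number>();
  for (const list of [vectorIds, ftsIds]) {
    list.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}
```

Sorting the map entries by score descending yields the fused ranking; note that only ranks enter the formula, so BM25 scores and vector distances never need to be normalized against each other.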
Graceful Degradation:
- If Ollama is unreachable during search, automatically fall back to FTS5-only search
- Display warning: "Embedding service unavailable, using lexical search only"
- The `embed` command fails with an actionable error if Ollama is down
CLI Interface:
# Basic semantic search
gi search "authentication redesign"
# Search within specific project
gi search "authentication" --project=group/project-one
# Search by file path (finds discussions/MRs touching this file)
gi search "rate limit" --path=src/client.ts
# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical
# Filtered search
gi search "authentication" --type=mr --after=2024-01-01
# Filter by label
gi search "performance" --label=bug --label=critical
# JSON output for programmatic use
gi search "payment processing" --json
# Explain search (shows RRF contributions)
gi search "auth" --explain
CLI Output Example:
$ gi search "authentication"
Found 23 results (hybrid search, 0.34s)
[1] MR !847 - Refactor auth to use JWT tokens (0.82)
@johndoe · 2024-03-15 · group/project-one
"...moving away from session cookies to JWT for authentication..."
https://gitlab.example.com/group/project-one/-/merge_requests/847
[2] Issue #234 - Authentication redesign discussion (0.79)
@janedoe · 2024-02-28 · group/project-one
"...we need to redesign the authentication flow because..."
https://gitlab.example.com/group/project-one/-/issues/234
[3] Discussion on Issue #234 (0.76)
@johndoe · 2024-03-01 · group/project-one
"I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345
JSON Output Schema (Stable):
For AI/agent consumption, --json output follows this stable schema:
interface SearchResult {
documentId: number;
sourceType: "issue" | "merge_request" | "discussion";
title: string | null;
url: string;
projectPath: string;
author: string | null;
createdAt: string; // ISO 8601
updatedAt: string; // ISO 8601
score: number; // normalized 0-1 (rrfScore / maxRrfScore in this result set)
snippet: string; // truncated content_text
labels: string[];
// Only present with --explain flag
explain?: {
vectorRank?: number; // null if not in vector results
ftsRank?: number; // null if not in FTS results
rrfScore: number; // raw RRF score (rank-based, comparable within a query)
};
}
// Note on score normalization:
// - `score` is normalized 0-1 for UI display convenience
// - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set)
// - Use `explain.rrfScore` for raw scores when comparing across queries
// - Scores are NOT comparable across different queries
interface SearchResponse {
query: string;
mode: "hybrid" | "lexical" | "semantic";
totalResults: number;
results: SearchResult[];
warnings?: string[]; // e.g., "Embedding service unavailable"
}
Schema versioning: Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed.
Checkpoint 5: Incremental Sync
Deliverable: Efficient ongoing synchronization with GitLab
Automated Tests (Vitest):
tests/unit/cursor-management.test.ts
✓ advances cursor after successful page commit
✓ uses tie-breaker id for identical timestamps
✓ does not advance cursor on failure
✓ resets cursor on --full flag
tests/unit/change-detection.test.ts
✓ detects content_hash mismatch
✓ queues document for re-embedding on change
✓ skips re-embedding when hash unchanged
tests/integration/incremental-sync.test.ts
✓ fetches only items updated after cursor
✓ refetches discussions for updated issues
✓ refetches discussions for updated MRs
✓ updates existing records (not duplicates)
✓ creates new records for new items
✓ re-embeds documents with changed content_hash
tests/integration/sync-recovery.test.ts
✓ resumes from cursor after interrupted sync
✓ marks failed run with error message
✓ handles rate limiting (429) with backoff
✓ respects Retry-After header
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Last sync time, cursor positions | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
End-to-End Sync Verification:
- Note the current `sync_cursors` values
- Create a new comment on an issue in GitLab
- Run `gi sync`
- Verify:
  - Issue's `updated_at` in DB matches GitLab
  - New discussion row exists
  - New note row exists
  - New document row exists for discussion
  - New embedding exists for document
  - `gi search "new comment text"` returns the new discussion
  - Cursor advanced past the updated issue
Data Integrity Checks:
- `sync_cursors` timestamp <= max `updated_at` in the corresponding table
- No orphaned documents (all have valid source_id)
- `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- `sync_runs` has a complete audit trail
Scope:
- Delta sync based on a stable tuple cursor `(updated_at, gitlab_id)`
- Dependent resources sync strategy (discussions refetched when parent updates)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
- Recommended: run via cron every 10 minutes
Correctness Rules (MVP):
- Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
- Cursor is a stable tuple `(updated_at, gitlab_id)`:
  - GitLab API cannot express `(updated_at = X AND id > Y)` server-side.
  - Use cursor rewind + local filtering:
    - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
    - Locally discard items where:
      - `updated_at < cursor_updated_at`, OR
      - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
    - This makes the tuple cursor rule true in practice while keeping API calls simple.
  - Cursor advances only after a successful DB commit for that page
  - When advancing, set the cursor to the last processed item's `(updated_at, gitlab_id)`
- Dependent resources:
  - For each updated issue/MR, refetch ALL its discussions
  - Discussion documents are regenerated and re-embedded if content_hash changes
- Rolling backfill window:
  - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
  - This catches any items whose timestamps were updated without triggering our cursor
- A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
- Sync run is marked 'failed' with an error message if any page fails (can resume from cursor)
Why Dependent Resource Model:
- GitLab Discussions API doesn't provide a global `updated_after` stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct
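The cursor rewind + local filtering rule reduces to one predicate over the tuple ordering. A sketch with illustrative types (camelCase field names here stand in for the snake_case DB columns):

```typescript
// Keep only items strictly after the cursor tuple (updated_at, gitlab_id).
// Applied locally to the page fetched with the rewound updated_after value.
interface Cursor { updatedAt: number; gitlabId: number; }
interface SyncItem { updatedAt: number; gitlabId: number; }

function afterCursor(items: SyncItem[], cursor: Cursor): SyncItem[] {
  return items.filter(
    (it) =>
      it.updatedAt > cursor.updatedAt ||
      (it.updatedAt === cursor.updatedAt && it.gitlabId > cursor.gitlabId)
  );
}
```

Items at exactly the cursor timestamp with an id at or below the cursor id are the ones already processed in a previous run; discarding them locally is what makes the API-level rewind safe to repeat.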
CLI Commands:
# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
gi sync # orchestrates all steps
gi sync --no-embed # skip embedding step (fast ingest/debug)
gi sync --no-docs # skip document regeneration (debug)
# Force full re-sync (resets cursors)
gi sync --full
# Override stale 'running' run after operator review
gi sync --force
# Show sync status
gi sync-status
Orchestration steps (in order):
- Acquire app lock with heartbeat
- Ingest delta (issues, MRs) based on cursors
  - For each upserted issue/MR, enqueue into `pending_discussion_fetches`
  - INSERT into `dirty_sources` for each upserted issue/MR
- Process `pending_discussion_fetches` queue (bounded per run, retryable):
  - Fetch discussions for each queued parent
  - On success: upsert discussions/notes, INSERT into `dirty_sources`, DELETE from queue
  - On failure: increment `attempt_count`, record `last_error`, leave in queue for retry
  - Bound processing: max N parents per sync run to avoid unbounded API calls
- Apply rolling backfill window
- Regenerate documents for entities in `dirty_sources` (process + delete from queue)
- Embed documents with changed content_hash
- FTS triggers auto-sync (no explicit step needed)
- Release lock, record sync_run as succeeded
Why queue-based discussion fetching:
- One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
- Primary resource cursors can advance independently
- Discussions can be retried without re-fetching all issues/MRs
- Bounded processing prevents unbounded API calls per sync run
Individual commands remain available for checkpoint testing and debugging:
- `gi ingest --type=issues`
- `gi ingest --type=merge_requests`
- `gi embed --all`
- `gi embed --retry-failed`
CLI Command Reference
All commands support --help for detailed usage information.
Setup & Diagnostics
| Command | CP | Description |
|---|---|---|
| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
| `gi auth-test` | 0 | Verify GitLab authentication |
| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
| `gi doctor --json` | 0 | JSON output for scripting |
| `gi version` | 0 | Show installed version |
Data Ingestion
| Command | CP | Description |
|---|---|---|
| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
| `gi generate-docs` | 3A | Extract documents from issues/MRs/discussions |
| `gi embed --all` | 3B | Generate embeddings for all documents |
| `gi embed --retry-failed` | 3 | Retry failed embeddings |
| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
| `gi sync --force` | 5 | Override stale lock after operator review |
| `gi sync --no-embed` | 5 | Sync without embedding (faster) |
### Data Inspection

| Command | CP | Description |
|---|---|---|
| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
| `gi list mrs --limit=N` | 2 | List merge requests |
| `gi count issues` | 1 | Count issues |
| `gi count mrs` | 2 | Count merge requests |
| `gi count discussions --type=issue` | 1 | Count issue discussions |
| `gi count discussions` | 2 | Count all discussions |
| `gi count discussions --type=mr` | 2 | Count MR discussions |
| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
| `gi count notes` | 2 | Count all notes (excluding system) |
| `gi show issue <iid> [--project=PATH]` | 1 | Show issue details (prompts if iid is ambiguous across projects) |
| `gi show mr <iid> [--project=PATH]` | 2 | Show MR details with discussions |
| `gi stats` | 3 | Embedding coverage statistics |
| `gi stats --json` | 3 | JSON stats for scripting |
| `gi sync-status` | 1 | Show cursor positions and last sync |
### Search

| Command | CP | Description |
|---|---|---|
| `gi search "query"` | 4 | Hybrid semantic + lexical search |
| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) |
| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type |
| `gi search "query" --author=USERNAME` | 4 | Filter by author |
| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date |
| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) |
| `gi search "query" --project=PATH` | 4 | Filter by project |
| `gi search "query" --path=FILE` | 4 | Filter by file path |
| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) |
| `gi search "query" --json` | 4 | JSON output for scripting |
| `gi search "query" --explain` | 4 | Show ranking breakdown |
### Database Management

| Command | CP | Description |
|---|---|---|
| `gi backup` | 0 | Create timestamped database backup |
| `gi reset --confirm` | 0 | Delete database and reset cursors |
## Error Handling

Common errors and their resolutions:

### Configuration Errors

| Error | Cause | Resolution |
|---|---|---|
| `Config file not found` | No `gi.config.json` | Run `gi init` to create configuration |
| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix `gi.config.json` manually |
| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config |
### Authentication Errors

| Error | Cause | Resolution |
|---|---|---|
| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` |
| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope |
| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope |
### GitLab API Errors

| Error | Cause | Resolution |
|---|---|---|
| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) |
| `429 Too Many Requests` | Rate limited | Wait for `Retry-After` period; sync will auto-retry |
| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity |
### Data Errors

| Error | Cause | Resolution |
|---|---|---|
| `No documents indexed` | Sync not run | Run `gi sync` first |
| `No results found` | Query too specific | Try broader search terms |
| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale |
### Embedding Errors

| Error | Cause | Resolution |
|---|---|---|
| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` |
| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` |
| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` |
### Operational Behavior
| Scenario | Behavior |
|---|---|
| Ctrl+C during sync | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. |
| Disk full during write | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. |
| Stale lock detected | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. |
| Network interruption | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. |
| Embedding permanent failure | After 3 retries, document stays in `embedding_metadata` with `last_error` populated. Use `gi embed --retry-failed` to retry later, or `gi stats` to see failed count. Documents with failed embeddings are excluded from vector search but included in FTS. |
| Orphaned records | MVP: No automatic cleanup. `last_seen_at` field enables future detection of items deleted in GitLab. Post-MVP: `gi gc --dry-run` to identify orphans, `gi gc --confirm` to remove. |
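The stale-lock and Ctrl+C rows can be sketched as follows (illustrative names; the real sync loop would persist the heartbeat in `app_locks` and the cursor in `sync_cursors`):

```typescript
const STALE_LOCK_MS = 10 * 60 * 1000; // 10 minutes without a heartbeat

// A lock whose heartbeat is older than the threshold is treated as stale
// and auto-recovered by the next sync run.
export function isLockStale(
  heartbeatAtMs: number,
  nowMs: number,
  thresholdMs = STALE_LOCK_MS,
): boolean {
  return nowMs - heartbeatAtMs > thresholdMs;
}

// Graceful shutdown: Ctrl+C only sets a flag; the loop finishes the current
// page (fetch + upsert + cursor commit) before exiting cleanly.
let shutdownRequested = false;
process.once("SIGINT", () => { shutdownRequested = true; });

export async function syncPages(fetchPage: () => Promise<boolean>): Promise<void> {
  while (!shutdownRequested) {
    const hasMore = await fetchPage(); // commits the cursor before returning
    if (!hasMore) break;
  }
}
```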
## Database Management

### Database Location

The SQLite database is stored at an XDG-compliant location:

```
~/.local/share/gi/data.db
```

This can be overridden in `gi.config.json`:

```json
{
  "storage": {
    "dbPath": "/custom/path/to/data.db"
  }
}
```
### Backup

Create a timestamped backup of the database:

```bash
gi backup
# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db
```

Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode).
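The timestamped name above can be derived as in this sketch (`backupFileName` is a hypothetical helper; the copy itself would go through SQLite's online backup API):

```typescript
// Build a backup file name like data-2026-01-21T14-30-00.db
export function backupFileName(now: Date): string {
  return `data-${now.toISOString().slice(0, 19).replace(/:/g, "-")}.db`;
}

// With better-sqlite3 (an assumption about the driver), the copy itself is:
//   await db.backup(join(backupDir, backupFileName(new Date())));
// which uses SQLite's online backup API and is safe while WAL writers are active.
```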
### Reset

To completely reset the database and all sync cursors:

```bash
gi reset --confirm
```

This deletes:

- The database file
- All sync cursors
- All embeddings

You'll need to run `gi sync` again to repopulate.
### Schema Migrations

Database schema is version-tracked and migrations auto-apply on startup:

- On first run, schema is created at latest version
- On subsequent runs, pending migrations are applied automatically
- Migration version is stored in `schema_version` table
- Migrations are idempotent and reversible where possible

Manual migration check:

```bash
gi doctor --json | jq '.checks.database'
# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }
```
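The auto-apply behavior can be sketched like this (illustrative shape; the real runner would wrap each migration in a transaction and read/write the `schema_version` table):

```typescript
// Minimal view of the database the migration runner needs.
type Db = { getVersion(): number; setVersion(v: number): void };

interface Migration {
  version: number;
  up: (db: Db) => void; // DDL/DML for this step
}

// Apply pending migrations in order; re-running is a no-op because
// already-applied versions are skipped.
export function migrate(db: Db, migrations: Migration[]): number {
  const sorted = [...migrations].sort((a, b) => a.version - b.version);
  let applied = 0;
  for (const m of sorted) {
    if (m.version <= db.getVersion()) continue; // already applied
    m.up(db);
    db.setVersion(m.version);
    applied++;
  }
  return applied;
}
```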
## Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:
| Feature | Description | Depends On |
|---|---|---|
| File History | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| Personal Dashboard | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| Person Context | Aggregate contributions by author, expertise inference | MVP complete |
| Decision Graph | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| MCP Server | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| Custom Tokenizer | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |
**Checkpoint 6 (File History) Schema Preview:**

```sql
-- Deferred from MVP; added when the file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);

CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
  old_path TEXT,
  new_path TEXT,
  old_line INTEGER,
  new_line INTEGER,
  position_type TEXT -- 'text' | 'image' | etc.
);

CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```
## Verification Strategy
Each checkpoint includes:
- **Automated tests** - Unit tests for data transformations, integration tests for API calls
- **CLI smoke tests** - Manual commands with expected outputs documented
- **Data integrity checks** - Count verification against GitLab, schema validation
- **Search quality tests** - Known queries with expected results (for Checkpoint 4+)
## Risk Mitigation
| Risk | Mitigation |
|---|---|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery |
| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
| Pathological discussions | Queue-based discussion fetching; one bad thread doesn't block entire sync |
| Empty search results with filters | Adaptive recall (topK 50→200 when filtered) |
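The adaptive-recall mitigation can be sketched as follows (hypothetical signature; the 50→200 topK steps come from the Resolved Decisions table):

```typescript
// Re-query with a wider candidate pool when post-filtering starves results.
export async function searchWithAdaptiveRecall(
  query: (topK: number) => Promise<string[]>,   // ranked candidate doc ids
  passesFilters: (id: string) => boolean,       // author/label/date/etc. predicates
  limit: number,
): Promise<string[]> {
  for (const topK of [50, 200]) {
    const hits = (await query(topK)).filter(passesFilters);
    // Return early if we have enough, or if this was the widest pass.
    if (hits.length >= limit || topK === 200) return hits.slice(0, limit);
  }
  return []; // unreachable: the loop always returns on the last step
}
```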
**SQLite Performance Defaults (MVP):**

- Enable `PRAGMA journal_mode=WAL;` on every connection
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources
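A driver-agnostic sketch of these defaults (the `exec` callback stands in for whatever raw-statement API the chosen SQLite driver exposes):

```typescript
// Statements applied to every new connection, per the defaults above.
export const CONNECTION_PRAGMAS = [
  "PRAGMA journal_mode=WAL;",
  "PRAGMA foreign_keys=ON;",
];

// Wrap a page/batch insert in one explicit transaction so a mid-page
// failure rolls back cleanly and the cursor stays at the last commit.
export function inTransaction(exec: (sql: string) => void, work: () => void): void {
  exec("BEGIN");
  try {
    work();
    exec("COMMIT");
  } catch (err) {
    exec("ROLLBACK");
    throw err; // caller records the failure; cursor is untouched
  }
}
```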
## Schema Summary
| Table | Checkpoint | Purpose |
|---|---|---|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations (with heartbeat) |
| app_locks | 0 | Crash-safe single-flight lock |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage (gitlab_id as TEXT) |
| schema_version | 0 | Database migration version tracking |
| issues | 1 | Normalized issues (unique by project+iid) |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) |
| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) |
| merge_requests | 2 | Normalized MRs (unique by project+iid) |
| mr_labels | 2 | MR-label junction |
| documents | 3A | Unified searchable documents with truncation metadata |
| document_labels | 3A | Document-label junction for fast filtering |
| document_paths | 3A | Fast path filtering for documents (DiffNote file paths) |
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |
## Resolved Decisions
| Question | Decision | Rationale |
|---|---|---|
| Comments structure | Discussions as first-class entities | Thread context is essential for decision traceability |
| System notes | Store flagged, exclude from embeddings | Preserves audit trail while avoiding semantic noise |
| DiffNote paths | Capture now | Enables immediate file/path search without full file-history feature |
| MR file linkage | Deferred to post-MVP (CP6) | Only needed for file-history feature |
| Labels | Index as filters | document_labels table enables fast --label=X filtering |
| Labels uniqueness | By (project_id, name) | GitLab API returns labels as strings |
| Sync method | Polling only for MVP | Webhooks add complexity; polling every 10 min is sufficient |
| Sync safety | DB lock + heartbeat + rolling backfill | Prevents race conditions and missed updates |
| Discussions sync | Resumable queue model | Queue-based fetching allows one pathological thread to not block entire sync |
| Hybrid ranking | RRF over weighted sums | Simpler, no score normalization needed |
| Embedding rowid | rowid = documents.id | Eliminates fragile rowid mapping |
| Embedding truncation | Note-boundary aware middle drop | Never cut mid-note; preserves semantic coherence |
| Embedding batching | 32 docs/batch, 4 concurrent workers | Balance throughput, memory, and error isolation |
| FTS5 tokenizer | porter unicode61 | Stemming improves recall |
| Ollama unavailable | Graceful degradation to FTS5 | Search still works without semantic matching |
| JSON output | Stable documented schema | Enables reliable agent/MCP consumption |
| Database location | XDG compliant: `~/.local/share/gi/` | Standard location, user-configurable |
| `gi init` validation | Validate GitLab before writing config | Fail fast, better UX |
| Ctrl+C handling | Graceful shutdown | Finish page, commit cursor, exit cleanly |
| Empty state UX | Actionable messages | Guide user to next step |
| raw_payloads.gitlab_id | TEXT not INTEGER | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | Always scope=all&state=all | Ensures all historical data including closed items |
| Pagination | X-Next-Page headers with empty-page fallback | Headers are more robust than empty-page detection |
| Integration tests | Mocked by default, live tests optional | Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1 |
| Search recall with filters | Adaptive topK (50→200 when filtered) | Prevents "no results" when relevant docs exist outside top-50 |
| RRF score normalization | Per-query normalized 0-1 | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | Trailing / = prefix match | --path=src/auth/ does prefix; otherwise exact match |
| CP3 structure | Split into 3A (FTS) and 3B (embeddings) | Lexical search works before embedding infra risk |
| Vector extension | sqlite-vec (not sqlite-vss) | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere |
| CLI framework | Commander.js | Simple, lightweight, sufficient for single-user CLI tool |
| Logging | pino to stderr | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | Custom error class hierarchy | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | Char-boundary cut for oversized notes | Single notes > 32000 chars truncated at char boundary with [truncated] marker |
| Ollama API | Use /api/embed for batching | Batch up to 32 docs per request; fall back to /api/embeddings for single |
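The hybrid-ranking and normalization decisions above combine as in this sketch (k = 60 is the conventional RRF constant, an assumption here rather than a value the spec fixes):

```typescript
// Reciprocal Rank Fusion over the vector and FTS5 result lists, followed by
// the per-query 0-1 normalization the decisions table specifies.
export function rrfFuse(
  rankedLists: string[][],   // e.g. [vectorResults, ftsResults], best first
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  const fused = [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
  // Normalize so the top hit scores 1.0; the raw score would go in --explain.
  const max = fused[0]?.score ?? 1;
  return fused.map((r) => ({ id: r.id, score: r.score / max }));
}
```

No score normalization across the two retrievers is needed beforehand, which is exactly why RRF was chosen over weighted sums.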
## Next Steps
- User approves this spec
- Generate Checkpoint 0 PRD for project setup
- Implement Checkpoint 0
- Human validates → proceed to Checkpoint 1
- Repeat for each checkpoint