GitLab Knowledge Engine - Spec Document
Note: This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See README.md for current documentation.
Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
Quick Start
Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Node.js | 20+ | LTS recommended |
| npm | 10+ | Comes with Node.js |
| Ollama | Latest | Optional for semantic search; lexical search works without it |
Installation
# Clone and install
git clone https://github.com/your-org/gitlab-inbox.git
cd gitlab-inbox
npm install
npm run build
npm link # Makes `gi` available globally
First Run
1. Set your GitLab token (create at GitLab > Settings > Access Tokens with the `read_api` scope): `export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"`
2. Run the setup wizard: `gi init`. This creates `gi.config.json` with your GitLab URL and project paths.
3. Verify your environment: `gi doctor`. All checks should pass (an Ollama warning is OK if you only need lexical search).
4. Sync your data: `gi sync`. The initial sync takes 10-20 minutes depending on repo size and rate limits.
5. Search: `gi search "authentication redesign"`
Troubleshooting First Run
| Symptom | Solution |
|---|---|
| Config file not found | Run `gi init` first |
| GITLAB_TOKEN not set | Export the environment variable |
| 401 Unauthorized | Check that the token has the `read_api` scope |
| Project not found: group/project | Verify the project path in the GitLab URL |
| Ollama connection refused | Start Ollama or use `--mode=lexical` for search |
Discovery Summary
Pain Points Identified
- Knowledge discovery - Tribal knowledge buried in old MRs/issues that nobody can find
- Decision traceability - Hard to find why decisions were made; context scattered across issue comments and MR discussions
Constraints
| Constraint | Detail |
|---|---|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |
Target Use Cases (Priority Order)
- MVP: Semantic Search - "Find discussions about authentication redesign"
- Future: File/Feature History - "What decisions were made about src/auth/login.ts?"
- Future: Personal Tracking - "What am I assigned to or mentioned in?"
- Future: Person Context - "What's @johndoe's background in this project?"
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ GitLab API │
│ (Issues, MRs, Notes) │
└─────────────────────────────────────────────────────────────────┘
(Commit-level indexing explicitly post-MVP)
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Ingestion Layer │
│ - Incremental sync (PAT-based polling) │
│ - Rate limiting / backoff │
│ - Raw JSON storage for replay │
│ - Dependent resource fetching (notes, MR changes) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Data Processing Layer │
│ - Normalize artifacts to unified schema │
│ - Extract searchable documents (canonical text + metadata) │
│ - Content hashing for change detection │
│ - MVP relationships: parent-child FKs + label/path associations│
│ (full cross-entity "decision graph" is post-MVP scope) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ - SQLite + sqlite-vec + FTS5 (hybrid search) │
│ - Structured metadata in relational tables │
│ - Vector embeddings for semantic search │
│ - Full-text index for lexical search fallback │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Query Interface │
│ - CLI for human testing │
│ - JSON API for AI agent testing │
│ - Semantic search with filters (author, date, type, label) │
└─────────────────────────────────────────────────────────────────┘
Technology Choices
| Component | Choice | Rationale |
|---|---|---|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js | Simple, lightweight, well-documented |
| Logging | pino | Fast, JSON-structured, low overhead |
| Validation | Zod | TypeScript-first schema validation |
Alternative Considered: sqlite-vss
- sqlite-vss was the original choice but is now deprecated
- No Apple Silicon support (no prebuilt ARM binaries)
- Replaced by sqlite-vec, which is pure C with no dependencies
- sqlite-vec uses the `vec0` virtual table (vs. sqlite-vss's `vss0`)
Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
- Cons: Requires running Postgres, heavier setup
- Decision: Start with SQLite for simplicity; migration path exists if needed
Project Structure
gitlab-inbox/
├── src/
│ ├── cli/
│ │ ├── index.ts # CLI entry point (Commander.js)
│ │ └── commands/ # One file per command group
│ │ ├── init.ts
│ │ ├── sync.ts
│ │ ├── search.ts
│ │ ├── list.ts
│ │ └── doctor.ts
│ ├── core/
│ │ ├── config.ts # Config loading/validation (Zod)
│ │ ├── db.ts # Database connection + migrations
│ │ ├── errors.ts # Custom error classes
│ │ └── logger.ts # pino logger setup
│ ├── gitlab/
│ │ ├── client.ts # GitLab API client with rate limiting
│ │ ├── types.ts # GitLab API response types
│ │ └── transformers/ # Payload → normalized schema
│ │ ├── issue.ts
│ │ ├── merge-request.ts
│ │ └── discussion.ts
│ ├── ingestion/
│ │ ├── issues.ts
│ │ ├── merge-requests.ts
│ │ └── discussions.ts
│ ├── documents/
│ │ ├── extractor.ts # Document generation from entities
│ │ └── truncation.ts # Note-boundary aware truncation
│ ├── embedding/
│ │ ├── ollama.ts # Ollama client
│ │ └── pipeline.ts # Batch embedding orchestration
│ ├── search/
│ │ ├── hybrid.ts # RRF ranking logic
│ │ ├── fts.ts # FTS5 queries
│ │ └── vector.ts # sqlite-vec queries
│ └── types/
│ └── index.ts # Shared TypeScript types
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── live/ # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│ └── fixtures/
│ └── golden-queries.json
├── migrations/ # Numbered SQL migration files
│ ├── 001_initial.sql
│ └── ...
├── gi.config.json # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md
Dependencies
Runtime Dependencies
{
"dependencies": {
"better-sqlite3": "latest",
"sqlite-vec": "latest",
"commander": "latest",
"zod": "latest",
"pino": "latest",
"pino-pretty": "latest",
"ora": "latest",
"chalk": "latest",
"cli-table3": "latest"
}
}
| Package | Purpose |
|---|---|
| better-sqlite3 | Synchronous SQLite driver (fast, native) |
| sqlite-vec | Vector search extension (pure C, cross-platform) |
| commander | CLI argument parsing |
| zod | Schema validation for config and inputs |
| pino | Structured JSON logging |
| pino-pretty | Dev-mode log formatting |
| ora | CLI spinners for progress indication |
| chalk | Terminal colors |
| cli-table3 | ASCII tables for list output |
Dev Dependencies
{
"devDependencies": {
"typescript": "latest",
"@types/better-sqlite3": "latest",
"@types/node": "latest",
"vitest": "latest",
"msw": "latest",
"eslint": "latest",
"@typescript-eslint/eslint-plugin": "latest",
"@typescript-eslint/parser": "latest",
"tsx": "latest"
}
}
| Package | Purpose |
|---|---|
| typescript | TypeScript compiler |
| vitest | Test runner |
| msw | Mock Service Worker for API mocking in tests |
| eslint | Linting |
| tsx | Run TypeScript directly during development |
GitLab API Strategy
Primary Resources (Bulk Fetch)
Issues and MRs support efficient bulk fetching with incremental sync:
GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
Required query params for completeness:
- `scope=all` - include all issues/MRs, not just those authored by the current user
- `state=all` - include closed items (GitLab defaults may exclude them)
Without these params, the 2+ years of historical data would be incomplete.
Dependent Resources (Per-Parent Fetch)
Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
Pagination: Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss.
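A minimal sketch of that page-following loop. Names like `nextPage` and `fetchAllPages` are illustrative, not part of the spec; it trusts GitLab's `x-next-page` response header when present and falls back to stopping on a short/empty page when headers are missing:

```typescript
// Sketch only (illustrative names). GitLab sends `x-next-page: ''` on the
// last page; if headers are absent entirely, a short page ends the loop.
export function nextPage(
  current: number,
  itemsReturned: number,
  perPage: number,
  xNextPage: string | null,
): number | null {
  if (xNextPage !== null) {
    return xNextPage === '' ? null : Number(xNextPage); // '' marks the last page
  }
  return itemsReturned < perPage ? null : current + 1; // header-less fallback
}

type Page<T> = { items: T[]; xNextPage: string | null };

// `getPage` stands in for an HTTP call such as
// GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
export async function fetchAllPages<T>(
  getPage: (page: number) => Promise<Page<T>>,
  perPage = 100,
): Promise<T[]> {
  const all: T[] = [];
  let page: number | null = 1;
  while (page !== null) {
    const { items, xNextPage } = await getPage(page);
    all.push(...items);
    page = nextPage(page, items.length, perPage, xNextPage);
  }
  return all;
}
```

Keeping the continuation decision in a pure function makes the header-fallback behavior trivially unit-testable without mocking HTTP.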
Sync Pattern
Initial sync:
- Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
- For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead)
- Fetch all MRs (paginated, ~60 calls)
- For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead)
- Total: thousands of API calls for initial sync
API Call Estimation Formula:
total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue
+ ceil(mrs/100) + mrs × avg_discussion_pages_per_mr
Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent:
- Issue list: 30 calls
- Issue discussions: 3,000 × 1.2 = 3,600 calls
- MR list: 30 calls
- MR discussions: 3,000 × 1.2 = 3,600 calls
- Total: ~7,260 calls
This matters for rate limit planning and setting realistic "10-20 minutes" expectations.
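The estimation formula transcribes directly into a small planning helper (illustrative, not a CLI feature; all inputs are assumptions you supply):

```typescript
// Direct transcription of the estimation formula above.
export function estimateInitialSyncCalls(
  issues: number,
  mrs: number,
  avgDiscussionPagesPerIssue: number,
  avgDiscussionPagesPerMr: number,
  perPage = 100,
): number {
  return (
    Math.ceil(issues / perPage) + issues * avgDiscussionPagesPerIssue +
    Math.ceil(mrs / perPage) + mrs * avgDiscussionPagesPerMr
  );
}

// The worked example: 3K issues, 3K MRs, 1.2 discussion pages per parent
// → 30 + 3,600 + 30 + 3,600 = 7,260 calls.
```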
Incremental sync:
- Fetch issues where `updated_after=cursor` (bulk)
- For EACH updated issue → refetch ALL its discussions
- Fetch MRs where `updated_after=cursor` (bulk)
- For EACH updated MR → refetch ALL its discussions
Critical Assumption (Softened)
We expect that adding a note or discussion updates the parent's `updated_at`, but we do not rely on this exclusively.
Mitigations (MVP):
- Tuple cursor semantics: The cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing the cursor.
- Rolling backfill window: Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
- Periodic full re-sync: Remains optional as an extra safety net (`gi sync --full`).
The backfill window provides 80% of the safety of full resync at <5% of the API cost.
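The cursor comparison and the backfill rewind can be sketched as pure functions (illustrative names; `effectiveUpdatedAfter` reflects one reasonable reading of the window semantics, not a spec'd algorithm):

```typescript
// Sketch of the (updated_at, gitlab_id) tuple cursor. Timestamps are
// ms-epoch integers, as in the schema.
export interface Cursor { updatedAt: number; tieBreakerId: number; }

// An item is "new" if it sorts strictly after the cursor in
// (updated_at ASC, gitlab_id ASC) order.
export function isAfterCursor(item: Cursor, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.tieBreakerId > cursor.tieBreakerId;
}

// The `updated_after` value actually sent to GitLab: the stored cursor,
// rewound so that anything updated inside the backfill window is re-fetched.
export function effectiveUpdatedAfter(
  cursor: Cursor | null,
  nowMs: number,
  backfillDays = 14, // sync.backfillDays
): number {
  if (cursor === null) return 0; // initial sync: fetch everything
  const windowMs = backfillDays * 24 * 60 * 60 * 1000;
  return Math.max(0, Math.min(cursor.updatedAt, nowMs - windowMs));
}
```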
Rate Limiting
- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid a thundering herd on retry
- Separate concurrency limits:
  - `sync.primaryConcurrency`: concurrent requests to issue/MR list endpoints (default 4)
  - `sync.dependentConcurrency`: concurrent requests to discussion endpoints (default 2, lower to avoid 429s)
  - Bound concurrency per project so one repo cannot starve the other
- Initial sync estimate: 10-20 minutes depending on rate limits
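The retry policy above reduces to a small delay function; the base delay and cap below are assumed values, not spec'd:

```typescript
// Sketch: exponential backoff with full jitter, honoring Retry-After on 429s.
export function retryDelayMs(
  attempt: number,                  // 0-based retry attempt
  retryAfterSeconds: number | null, // parsed Retry-After header, if any
  random: () => number = Math.random,
  baseMs = 500,                     // assumed base delay
  capMs = 30_000,                   // assumed ceiling
): number {
  // A server-provided Retry-After always wins.
  if (retryAfterSeconds !== null) return retryAfterSeconds * 1000;
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * exp; // full jitter avoids a thundering herd
}
```

Injecting `random` keeps the jitter deterministic under test.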
Checkpoint Structure
Each checkpoint is a testable milestone where a human can validate the system works before proceeding.
Checkpoint 0: Project Setup
Deliverable: Scaffolded project with GitLab API connection verified and project resolution working
Automated Tests (Vitest):
tests/unit/config.test.ts
✓ loads config from gi.config.json
✓ throws if config file missing
✓ throws if required fields missing (baseUrl, projects)
✓ validates project paths are non-empty strings
tests/unit/db.test.ts
✓ creates database file if not exists
✓ applies migrations in order
✓ sets WAL journal mode
✓ enables foreign keys
tests/integration/gitlab-client.test.ts
✓ (mocked) authenticates with valid PAT
✓ (mocked) returns 401 for invalid PAT
✓ (mocked) fetches project by path
✓ (mocked) handles rate limiting (429) with retry
tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI)
✓ authenticates with real PAT against configured baseUrl
✓ fetches real project by path
✓ handles actual rate limiting behavior
tests/integration/app-lock.test.ts
✓ acquires lock successfully
✓ updates heartbeat during operation
✓ detects stale lock and recovers
✓ refuses concurrent acquisition
tests/integration/init.test.ts
✓ creates config file with valid structure
✓ validates GitLab URL format
✓ validates GitLab connection before writing config
✓ validates each project path exists in GitLab
✓ fails if token not set
✓ fails if GitLab auth fails
✓ fails if any project path not found
✓ prompts before overwriting existing config
✓ respects --force to skip confirmation
✓ generates gi.config.json with sensible defaults
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows a warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
| `gi init` | Interactive prompts | Creates a valid `gi.config.json` |
| `gi init` (config exists) | Confirmation prompt | Warns before overwriting |
| `gi --help` | Command list | Shows all available commands |
| `gi version` | Version number | Shows installed version |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
Data Integrity Checks:
- `projects` table contains a row for each configured project path
- `gitlab_project_id` matches actual GitLab project IDs
- `raw_payloads` contains project JSON for each synced project
Scope:
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with an `auth-test` command
- `doctor` command for environment verification
- Projects table and initial project resolution (no issue/MR ingestion yet)
- DB migrations + WAL + FK enforcement
- Sync tracking with a crash-safe single-flight lock (heartbeat-based)
- Rate limit handling with exponential backoff + jitter
- `gi init` command for guided setup:
  - Prompts for the GitLab base URL
  - Prompts for project paths (comma-separated or multiple prompts)
  - Prompts for the token environment variable name (default: GITLAB_TOKEN)
  - Validates before writing config:
    - Token must be set in the environment
    - Tests auth with the `GET /user` endpoint
    - Validates each project path with `GET /projects/:path`
    - Only writes config after all validations pass
  - Generates `gi.config.json` with sensible defaults
- `gi --help` shows all available commands
- `gi <command> --help` shows command-specific help
- `gi version` shows the installed version
- First-run detection: if no config exists, suggest `gi init`
Configuration (MVP):
// gi.config.json
{
"gitlab": {
"baseUrl": "https://gitlab.example.com",
"tokenEnvVar": "GITLAB_TOKEN"
},
"projects": [
{ "path": "group/project-one" },
{ "path": "group/project-two" }
],
"sync": {
"backfillDays": 14,
"staleLockMinutes": 10,
"heartbeatIntervalSeconds": 30,
"cursorRewindSeconds": 2,
"primaryConcurrency": 4,
"dependentConcurrency": 2
},
"storage": {
"compressRawPayloads": true
},
"embedding": {
"provider": "ollama",
"model": "nomic-embed-text",
"baseUrl": "http://localhost:11434",
"concurrency": 4
}
}
Raw Payload Compression:
- When `storage.compressRawPayloads: true` (the default), raw JSON payloads are gzip-compressed before storage
- `raw_payloads.content_encoding` indicates `'identity'` (uncompressed) or `'gzip'` (compressed)
- Compression typically reduces storage by 70-80% for JSON payloads
- Decompression is handled transparently when reading payloads
- Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage
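A sketch of the payload codec these rules imply, using Node's built-in `zlib` (function names are illustrative):

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

export type Encoding = 'identity' | 'gzip';

// Encode a raw API payload for the raw_payloads table. The content_encoding
// tag travels with the blob so readers can decode transparently.
export function encodePayload(
  json: unknown,
  compress: boolean, // storage.compressRawPayloads
): { payload: Buffer; contentEncoding: Encoding } {
  const raw = Buffer.from(JSON.stringify(json), 'utf8');
  return compress
    ? { payload: gzipSync(raw), contentEncoding: 'gzip' }
    : { payload: raw, contentEncoding: 'identity' };
}

export function decodePayload(payload: Buffer, contentEncoding: Encoding): unknown {
  const raw = contentEncoding === 'gzip' ? gunzipSync(payload) : payload;
  return JSON.parse(raw.toString('utf8'));
}
```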
Error Classes (src/core/errors.ts):
// Base error class with error codes for programmatic handling
export class GiError extends Error {
constructor(message: string, public readonly code: string) {
super(message);
this.name = 'GiError';
}
}
// Config errors
export class ConfigNotFoundError extends GiError {
constructor() {
super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
}
}
export class ConfigValidationError extends GiError {
constructor(details: string) {
super(`Invalid config: ${details}`, 'CONFIG_INVALID');
}
}
// GitLab API errors
export class GitLabAuthError extends GiError {
constructor() {
super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
}
}
export class GitLabNotFoundError extends GiError {
constructor(resource: string) {
super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
}
}
export class GitLabRateLimitError extends GiError {
constructor(public readonly retryAfter: number) {
super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
}
}
// Database errors
export class DatabaseLockError extends GiError {
constructor() {
super('Another sync is running. Use --force to override.', 'DB_LOCKED');
}
}
// Embedding errors
export class OllamaConnectionError extends GiError {
constructor() {
super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
}
}
export class EmbeddingError extends GiError {
constructor(documentId: number, reason: string) {
super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
}
}
Logging Strategy (src/core/logger.ts):
import pino from 'pino';
// Logs go to stderr, results to stdout (allows clean JSON piping)
export const logger = pino({
level: process.env.LOG_LEVEL || 'info',
transport: process.env.NODE_ENV === 'production' ? undefined : {
target: 'pino-pretty',
options: { colorize: true, destination: 2 } // 2 = stderr
}
}, pino.destination(2));
Log Levels:
| Level | When to use |
|---|---|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |
Logging Conventions:
- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include an `err` object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs go to stderr so `gi search --json` output stays clean on stdout
DB Runtime Defaults (Checkpoint 0):
- On every connection:
  - `PRAGMA journal_mode=WAL;`
  - `PRAGMA foreign_keys=ON;`
Schema (Checkpoint 0):
-- Projects table (configured targets)
CREATE TABLE projects (
id INTEGER PRIMARY KEY,
gitlab_project_id INTEGER UNIQUE NOT NULL,
path_with_namespace TEXT NOT NULL,
default_branch TEXT,
web_url TEXT,
created_at INTEGER,
updated_at INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);
-- Sync tracking for reliability
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
started_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL,
finished_at INTEGER,
status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc.
error TEXT
);
-- Crash-safe single-flight lock (DB-enforced)
CREATE TABLE app_locks (
name TEXT PRIMARY KEY, -- 'sync'
owner TEXT NOT NULL, -- random run token (UUIDv4)
acquired_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL
);
-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
project_id INTEGER NOT NULL REFERENCES projects(id),
resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests'
updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch)
tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering)
PRIMARY KEY(project_id, resource_type)
);
-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
id INTEGER PRIMARY KEY,
source TEXT NOT NULL, -- 'gitlab'
project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources
resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
gitlab_id TEXT NOT NULL, -- TEXT because discussion IDs are strings; numeric IDs stored as strings
fetched_at INTEGER NOT NULL,
content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);
-- Schema version tracking for migrations
CREATE TABLE schema_version (
version INTEGER PRIMARY KEY,
applied_at INTEGER NOT NULL,
description TEXT
);
Checkpoint 1: Issue Ingestion
Deliverable: All issues + labels + issue discussions from target repos stored locally with resumable cursor-based sync
Automated Tests (Vitest):
tests/unit/issue-transformer.test.ts
✓ transforms GitLab issue payload to normalized schema
✓ extracts labels from issue payload
✓ handles missing optional fields gracefully
tests/unit/pagination.test.ts
✓ fetches all pages when multiple exist
✓ respects per_page parameter
✓ follows X-Next-Page header until empty/absent
✓ falls back to empty-page stop if headers missing (robustness)
tests/unit/discussion-transformer.test.ts
✓ transforms discussion payload to normalized schema
✓ extracts notes array from discussion
✓ sets individual_note flag correctly
✓ flags system notes with is_system=1
✓ preserves note order via position field
tests/integration/issue-ingestion.test.ts
✓ inserts issues into database
✓ creates labels from issue payloads
✓ links issues to labels via junction table
✓ stores raw payload for each issue
✓ updates cursor after successful page commit
✓ resumes from cursor on subsequent runs
tests/integration/issue-discussion-ingestion.test.ts
✓ fetches discussions for each issue
✓ creates discussion rows with correct issue FK
✓ creates note rows linked to discussions
✓ stores system notes with is_system=1 flag
✓ handles individual_note=true discussions
tests/integration/sync-runs.test.ts
✓ creates sync_run record on start
✓ marks run as succeeded on completion
✓ marks run as failed with error message on failure
✓ refuses concurrent run (single-flight)
✓ allows --force to override stale running status
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL. If multiple projects have issue #123, prompts for clarification or use `--project=PATH` |
| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count |
| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
Data Integrity Checks:
- `SELECT COUNT(*) FROM issues` matches the GitLab issue count for configured projects
- Every issue has a corresponding `raw_payloads` row
- Labels in the `issue_labels` junction all exist in the `labels` table
- `sync_cursors` has an entry for each (project_id, 'issues') pair
- Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)
- `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero
- Every discussion has at least one note
- `individual_note = true` discussions have exactly one note
Scope:
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payloads:
  - Always persist label names from `labels: string[]`
  - Optionally request `with_labels_details=true` to capture color/description when available
- Issue discussions fetcher:
  - Uses `GET /projects/:id/issues/:iid/discussions`
  - Fetches all discussions for each issue during ingest
  - Preserves system notes but flags them with `is_system=1`
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands
Reliability/Idempotency Rules:
- Every ingest/sync creates a `sync_runs` row
- Single-flight via a DB-enforced app lock:
  - On start: acquire the lock via a transactional compare-and-swap:
    - `BEGIN IMMEDIATE` (acquires the write lock immediately)
    - If no row exists → INSERT a new lock
    - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
    - Else if `owner` matches the current run → UPDATE heartbeat (re-entrant)
    - Else → ROLLBACK and fail fast (another run is active)
    - `COMMIT`
  - During the run: update `heartbeat_at` every 30 seconds
  - If an existing lock's `heartbeat_at` is stale (> 10 minutes), treat it as abandoned and acquire
  - `--force` remains an operator override for edge cases, but should rarely be needed
- Cursor advances only after a successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
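The compare-and-swap branches can be isolated as a pure decision function, which keeps the lock policy testable apart from SQLite (illustrative names; the actual statements run inside the `BEGIN IMMEDIATE` transaction):

```typescript
// Sketch of the lock decision logic. Timestamps are ms epoch.
export interface LockRow { owner: string; heartbeatAt: number; }

export type LockAction = 'insert' | 'steal' | 'heartbeat' | 'fail';

export function decideLockAction(
  existing: LockRow | null,
  me: string,  // this run's random owner token (UUIDv4)
  nowMs: number,
  staleLockMinutes = 10,
): LockAction {
  if (existing === null) return 'insert';                     // no lock yet
  const staleMs = staleLockMinutes * 60_000;
  if (nowMs - existing.heartbeatAt > staleMs) return 'steal'; // abandoned lock
  if (existing.owner === me) return 'heartbeat';              // re-entrant
  return 'fail';                                              // active holder
}
```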
Schema Preview:
CREATE TABLE issues (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);
-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER, -- optional (only if available)
project_id INTEGER NOT NULL REFERENCES projects(id),
name TEXT NOT NULL,
color TEXT,
description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);
CREATE TABLE issue_labels (
issue_id INTEGER REFERENCES issues(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
-- Discussion threads for issues (MR discussions added in CP2)
CREATE TABLE discussions (
id INTEGER PRIMARY KEY,
gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
project_id INTEGER NOT NULL REFERENCES projects(id),
issue_id INTEGER REFERENCES issues(id),
merge_request_id INTEGER, -- FK added in CP2 via ALTER TABLE
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
first_note_at INTEGER, -- for ordering discussions
last_note_at INTEGER, -- for "recently active" queries
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
resolvable BOOLEAN, -- MR discussions can be resolved
resolved BOOLEAN,
CHECK (
(noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
(noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
)
);
CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);
-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
discussion_id INTEGER NOT NULL REFERENCES discussions(id),
project_id INTEGER NOT NULL REFERENCES projects(id),
type TEXT, -- 'DiscussionNote' | 'DiffNote' | null
is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.)
author_username TEXT,
body TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
position INTEGER, -- derived from array order in API response (0-indexed)
resolvable BOOLEAN,
resolved BOOLEAN,
resolved_by TEXT,
resolved_at INTEGER,
-- DiffNote position metadata (only populated for MR DiffNotes in CP2)
position_old_path TEXT,
position_new_path TEXT,
position_old_line INTEGER,
position_new_line INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_system ON notes(is_system);
Checkpoint 2: MR Ingestion
Deliverable: All MRs + MR discussions + notes with DiffNote paths captured
Automated Tests (Vitest):
tests/unit/mr-transformer.test.ts
✓ transforms GitLab MR payload to normalized schema
✓ extracts labels from MR payload
✓ handles missing optional fields gracefully
tests/unit/diffnote-transformer.test.ts
✓ extracts DiffNote position metadata (paths and lines)
✓ handles missing position fields gracefully
tests/integration/mr-ingestion.test.ts
✓ inserts MRs into database
✓ creates labels from MR payloads
✓ links MRs to labels via junction table
✓ stores raw payload for each MR
tests/integration/mr-discussion-ingestion.test.ts
✓ fetches discussions for each MR
✓ creates discussion rows with correct MR FK
✓ creates note rows linked to discussions
✓ extracts position_new_path from DiffNotes
✓ captures note-level resolution status
✓ captures note type (DiscussionNote, DiffNote)
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) |
| `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only |
| `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count |
Data Integrity Checks:
- `SELECT COUNT(*) FROM merge_requests` matches the GitLab MR count
- `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero
- DiffNotes have `position_new_path` populated when available
- Discussion `first_note_at` <= `last_note_at` for all rows
Scope:
- MR fetcher with pagination
- MR discussions fetcher:
  - Uses `GET /projects/:id/merge_requests/:iid/discussions`
  - Fetches all discussions for each MR during ingest
  - Captures DiffNote file path/line metadata from the `position` field for filename search
- Relationship linking (discussion → MR, notes → discussion)
- Extended CLI commands for MR display with threads
- Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries
Note: MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.
Schema Additions:
CREATE TABLE merge_requests (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
source_branch TEXT,
target_branch TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
merged_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);
-- MR labels (reuse same labels table from CP1)
CREATE TABLE mr_labels (
merge_request_id INTEGER REFERENCES merge_requests(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
-- Additional indexes for DiffNote queries (tables created in CP1)
CREATE INDEX idx_notes_type ON notes(type);
CREATE INDEX idx_notes_new_path ON notes(position_new_path);
-- Migration: Add FK constraint to discussions table (was deferred from CP1)
-- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK
-- This is handled by the migration system; pseudocode for clarity:
-- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id)
-- 2. INSERT INTO discussions_new SELECT * FROM discussions
-- 3. DROP TABLE discussions
-- 4. ALTER TABLE discussions_new RENAME TO discussions
-- 5. Recreate indexes
MR Discussion Processing Rules:
- DiffNote position data is extracted and stored:
  - `position.old_path`, `position.new_path` for file-level search
  - `position.old_line`, `position.new_line` for line-level context
- MR discussions can be resolvable; resolution status is captured at note level
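The extraction rule above can be sketched as a small pure function. The type and function names here are illustrative (not the actual implementation); the `position` field shape follows the GitLab notes payload described in this section.

```typescript
// Sketch: pulling file/line metadata out of a GitLab DiffNote payload.
// Type and function names are illustrative, not the actual implementation.
interface DiffPosition {
  oldPath: string | null;
  newPath: string | null;
  oldLine: number | null;
  newLine: number | null;
}

// GitLab note payloads carry a `position` object only for DiffNotes.
function extractDiffPosition(note: { type?: string; position?: any }): DiffPosition | null {
  if (note.type !== "DiffNote" || !note.position) return null;
  return {
    oldPath: note.position.old_path ?? null,
    newPath: note.position.new_path ?? null,
    oldLine: note.position.old_line ?? null,
    newLine: note.position.new_line ?? null,
  };
}
```

Returning `null` for non-DiffNotes keeps the caller's upsert logic simple: only rows with a non-null position populate `position_new_path` and friends.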
Checkpoint 3A: Document Generation + FTS (Lexical Search)
Deliverable: Documents generated + FTS5 index; gi search --mode=lexical works end-to-end (no Ollama required)
Automated Tests (Vitest):
tests/unit/document-extractor.test.ts
✓ extracts issue document (title + description)
✓ extracts MR document (title + description)
✓ extracts discussion document with full thread context
✓ includes parent issue/MR title in discussion header
✓ formats notes with author and timestamp
✓ excludes system notes from discussion documents by default
✓ includes system notes only when --include-system-notes enabled (debug)
✓ truncates content exceeding 8000 tokens at note boundaries
✓ preserves first and last notes when truncating middle
✓ computes SHA-256 content hash consistently
tests/integration/document-creation.test.ts
✓ creates document for each issue
✓ creates document for each MR
✓ creates document for each discussion
✓ populates document_labels junction table
✓ computes content_hash for each document
✓ excludes system notes from discussion content
tests/integration/fts-index.test.ts
✓ documents_fts row count matches documents
✓ FTS triggers fire on insert/update/delete
✓ updates propagate via triggers
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ porter stemming works (search/searching)
✓ returns empty for non-matching query
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi generate-docs` | Progress bar, final count | Completes without error |
| `gi generate-docs` (re-run) | `0 documents to regenerate` | Skips unchanged docs |
| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama |
| `gi stats` | Document count stats | Shows document coverage |
Data Integrity Checks:
- `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- `SELECT COUNT(*) FROM documents_fts` = `SELECT COUNT(*) FROM documents` (via FTS triggers)
- `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` logs truncation warnings
- Discussion documents include parent title in content_text
- Discussion documents exclude system notes
Scope:
- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Truncation: content_text capped at 8000 tokens at NOTE boundaries
    - Implementation: use a character budget, not an exact token count
    - `maxChars = 32000` (conservative 4 chars/token estimate)
    - Drop whole notes from the middle, never cut mid-note
    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
  - System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search)
  - Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table
- FTS5 index for lexical search
- `gi search --mode=lexical` CLI command (works without Ollama)
This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
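The hashing and token-estimation rules above are small enough to sketch directly. This is a minimal sketch using Node's built-in `crypto`; the function names are illustrative.

```typescript
import { createHash } from "node:crypto";

// Change-detection hash: SHA-256 over the canonical content_text.
// Hashing the exact stored text means any regeneration that produces
// identical output is skipped for re-embedding.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}

// Token estimate for reporting/logging only (no tokenizer dependency):
// conservative 4 chars per token.
function approxTokens(contentText: string): number {
  return Math.ceil(contentText.length / 4);
}
```

Because the hash is computed over `content_text` itself (not the raw payload), cosmetic ingestion changes that don't alter the generated document never trigger re-embedding.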
Schema Additions (CP3A):
-- Unified searchable documents (derived from issues/MRs/discussions)
-- Note: Full documents table schema is in CP3B section for continuity with embeddings
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT, -- for discussions: first note author
label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER,
updated_at INTEGER,
url TEXT,
title TEXT, -- null for discussions
content_text TEXT NOT NULL, -- canonical text for embedding/snippets
content_hash TEXT NOT NULL, -- SHA-256 for change detection
is_truncated BOOLEAN NOT NULL DEFAULT 0,
truncated_reason TEXT, -- 'token_limit_middle_drop' | null
UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id),
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
-- Fast path filtering for documents (extracted from DiffNote positions)
CREATE TABLE document_paths (
document_id INTEGER NOT NULL REFERENCES documents(id),
path TEXT NOT NULL,
PRIMARY KEY(document_id, path)
);
CREATE INDEX idx_document_paths_path ON document_paths(path);
-- Track sources that require document regeneration (populated during ingestion)
CREATE TABLE dirty_sources (
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id
queued_at INTEGER NOT NULL,
PRIMARY KEY(source_type, source_id)
);
-- Resumable dependent fetches (discussions are per-parent resources)
CREATE TABLE pending_discussion_fetches (
project_id INTEGER NOT NULL REFERENCES projects(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL, -- parent iid (stable human identifier)
queued_at INTEGER NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
ON pending_discussion_fetches(attempt_count, last_attempt_at)
WHERE last_error IS NOT NULL;
-- Full-text search for lexical retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
FTS5 Tokenizer Notes:
- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer
Checkpoint 3B: Embedding Generation (Semantic Search)
Deliverable: Embeddings generated + gi search --mode=semantic works; graceful fallback if Ollama unavailable
Automated Tests (Vitest):
tests/unit/embedding-client.test.ts
✓ connects to Ollama API
✓ generates embedding for text input
✓ returns 768-dimension vector
✓ handles Ollama connection failure gracefully
✓ batches requests (32 documents per batch)
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vec
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
✓ re-embeds when content_hash changes
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
| `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents |
Data Integrity Checks:
- `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- All `embedding_metadata.content_hash` values match the corresponding `documents.content_hash`
Scope:
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline:
- Batch size: 32 documents per batch
- Concurrency: configurable (default 4 workers)
- Retry with exponential backoff for transient failures (max 3 attempts)
- Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vec extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
Ollama API Contract:
// POST http://localhost:11434/api/embed (batch endpoint - preferred)
interface OllamaEmbedRequest {
model: string; // "nomic-embed-text"
input: string[]; // array of texts to embed (up to 32)
}
interface OllamaEmbedResponse {
model: string;
embeddings: number[][]; // array of 768-dim vectors
}
// POST http://localhost:11434/api/embeddings (single text - fallback)
interface OllamaEmbeddingsRequest {
model: string;
prompt: string;
}
interface OllamaEmbeddingsResponse {
embedding: number[];
}
Usage:
- Use `/api/embed` for batching (up to 32 documents per request)
- Fall back to `/api/embeddings` for single documents or if batch fails
- Check Ollama availability with `GET http://localhost:11434/api/tags`
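A batch client against the endpoints above could look like the sketch below. The chunking helper is pure; `embedAll` assumes a local Ollama at the default port and the `/api/embed` request/response shapes documented in this section. Names are illustrative.

```typescript
// Sketch of a batch embedding client for the Ollama API contract above.
const BATCH_SIZE = 32;

// Split an array into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embed all texts in batches of 32 via POST /api/embed.
// Assumes Ollama is running locally with nomic-embed-text pulled.
async function embedAll(texts: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) {
    const res = await fetch("http://localhost:11434/api/embed", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "nomic-embed-text", input: batch }),
    });
    if (!res.ok) throw new Error(`Ollama embed failed: ${res.status}`);
    const data = (await res.json()) as { embeddings: number[][] };
    vectors.push(...data.embeddings);
  }
  return vectors;
}
```

Retry/backoff and per-document failure recording (per the pipeline scope above) would wrap this loop; they are omitted here for brevity.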
Schema Additions (CP3B):
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
embedding float[768]
);
-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
document_id INTEGER PRIMARY KEY REFERENCES documents(id),
model TEXT NOT NULL, -- 'nomic-embed-text'
dims INTEGER NOT NULL, -- 768
content_hash TEXT NOT NULL, -- copied from documents.content_hash
created_at INTEGER NOT NULL,
-- Error tracking for resumable embedding
last_error TEXT, -- error message from last failed attempt
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER -- when last attempt occurred
);
-- Index for finding failed embeddings to retry
CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
Storage Rule (MVP):
- Insert embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid mapping fragility
Document Extraction Rules:
| Source | content_text Construction |
|---|---|
| Issue | title + "\n\n" + description |
| MR | title + "\n\n" + description |
| Discussion | Full thread with context (see below) |
Discussion Document Format:
[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread
--- Thread ---
@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
System Notes Exclusion Rule:
- System notes (is_system=1) are stored in the DB for audit purposes
- System notes are EXCLUDED from discussion documents by default
- This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings
- Debug flag `--include-system-notes` available for troubleshooting
This format preserves:
- Parent context (issue/MR title and number)
- Project path for scoped search
- Direct URL for navigation
- Labels for context
- File paths from DiffNotes (enables immediate file search)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
Truncation (Note-Boundary Aware): If content exceeds 8000 tokens (~32000 chars):
Algorithm:
- Count non-system notes in the discussion
- If total chars ≤ maxChars, no truncation needed
- Otherwise, drop whole notes from the MIDDLE:
- Preserve first N notes and last M notes
- Never cut mid-note (produces unreadable snippets and worse embeddings)
- Continue dropping middle notes until under maxChars
- Insert marker: `\n\n[... N notes omitted for length ...]\n\n`
- Set `documents.is_truncated = 1`
- Set `documents.truncated_reason = 'token_limit_middle_drop'`
- Log a warning with document ID and original/truncated token count
Edge Cases:
- Single note > 32000 chars: truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
- First + last note > 32000 chars: keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
- Only one note in discussion: if it exceeds the limit, truncate at char boundary with `[truncated]`
Why note-boundary truncation:
- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
- First notes contain context/problem statement; last notes contain conclusions
- Middle notes are often back-and-forth that's less critical
Token estimation: approxTokens = ceil(charCount / 4). No tokenizer dependency.
This metadata enables:
- Monitoring truncation frequency in production
- Future investigation of high-value truncated documents
- Debugging when search misses expected content
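The middle-drop algorithm above can be sketched as follows. This is a simplified illustration (notes are plain strings, and the marker rendering is left to the caller); the real implementation would also handle the single-oversized-note edge cases listed above.

```typescript
// Sketch of note-boundary middle-drop truncation: drop whole notes from
// the middle until the joined text fits the character budget, keeping the
// opening context (head) and the closing conclusions (tail).
function truncateMiddle(
  notes: string[],
  maxChars: number
): { head: string[]; tail: string[]; omitted: number } {
  let head = notes.slice(0, Math.ceil(notes.length / 2));
  let tail = notes.slice(Math.ceil(notes.length / 2));
  const fits = () => [...head, ...tail].join("\n\n").length <= maxChars;
  let omitted = 0;
  // Drop the innermost note of whichever half is longer, so the first and
  // last notes survive longest; stop once under budget or only two remain.
  while (!fits() && head.length + tail.length > 2) {
    if (head.length >= tail.length && head.length > 1) head = head.slice(0, -1);
    else tail = tail.slice(1);
    omitted++;
  }
  return { head, tail, omitted };
}
```

The caller would then join `head`, the `[... N notes omitted for length ...]` marker (when `omitted > 0`), and `tail`, and set `is_truncated`/`truncated_reason` accordingly.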
Checkpoint 4: Hybrid Search (Semantic + Lexical)
Deliverable: Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content
Automated Tests (Vitest):
tests/unit/search-query.test.ts
✓ parses filter flags (--type, --author, --after, --label)
✓ validates date format for --after
✓ handles multiple --label flags
tests/unit/rrf-ranking.test.ts
✓ computes RRF score correctly
✓ merges results from vector and FTS retrievers
✓ handles documents appearing in only one retriever
✓ respects k=60 parameter
tests/integration/vector-search.test.ts
✓ returns results for semantic query
✓ ranks similar content higher
✓ returns empty for nonsense query
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ handles porter stemming (search/searching)
✓ returns empty for non-matching query
tests/integration/hybrid-search.test.ts
✓ combines vector and FTS results
✓ applies type filter correctly
✓ applies author filter correctly
✓ applies date filter correctly
✓ applies label filter correctly
✓ falls back to FTS when Ollama unavailable
tests/e2e/golden-queries.test.ts
✓ "authentication redesign" returns known auth-related items
✓ "database migration" returns known migration items
✓ [8 more domain-specific golden queries]
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings |
| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema |
| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" |
| `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
Golden Query Test Suite:
Create tests/fixtures/golden-queries.json with 10 queries and expected URLs:
[
{
"query": "authentication redesign",
"expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
"minResults": 1,
"maxRank": 10
}
]
Each query must have at least one expected URL appear in top 10 results.
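The pass criterion can be expressed as a small predicate over a ranked URL list. This is a sketch; the fixture field names follow the JSON example above, and `passesGolden` is an illustrative name.

```typescript
// Sketch of the golden-query pass criterion: at least `minResults` of the
// expected URLs must appear within the top `maxRank` ranked results.
interface GoldenQuery {
  query: string;
  expectedUrls: string[];
  minResults: number;
  maxRank: number;
}

function passesGolden(golden: GoldenQuery, rankedUrls: string[]): boolean {
  const top = rankedUrls.slice(0, golden.maxRank);
  const hits = golden.expectedUrls.filter((u) => top.includes(u)).length;
  return hits >= golden.minResults;
}
```

Each e2e test would run the real search pipeline for `golden.query` and assert `passesGolden(golden, urlsFromResults)`.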
Data Integrity Checks:
- `documents_fts` row count matches `documents` row count
- Search returns results for known content (not empty)
- JSON output validates against defined schema
- All result URLs are valid GitLab URLs
Scope:
- Hybrid retrieval:
  - Vector recall (sqlite-vec) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Query embedding generation (same Ollama pipeline as documents)
- Result ranking and scoring (document-level)
- Search filters:
  - `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N`
  - `--limit=N` controls result count (default: 20, max: 100)
  - `--path` filters documents by referenced file paths (from DiffNote positions):
    - If `--path` ends with `/`: prefix match (`path LIKE 'src/auth/%'`)
    - Otherwise: exact match OR prefix on directory boundary
    - Example: `--path=src/auth/` matches `src/auth/login.ts`, `src/auth/utils/helpers.ts`
    - Example: `--path=src/auth/login.ts` matches only that exact file
    - Glob patterns deferred to post-MVP
  - Label filtering operates on `document_labels` (indexed, exact-match)
  - Filters work identically in hybrid and lexical modes
- Debug: `--explain` returns rank contributions from vector + FTS + RRF
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI/agent consumption (stable documented schema)
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
- Empty state handling:
  - No documents indexed: `No data indexed. Run 'gi sync' first.`
  - Query returns no results: `No results found for "query".`
  - Filters exclude all results: `No results match the specified filters.`
  - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
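The `--path` matching rules in the Scope can be captured in one small predicate. A sketch with an illustrative function name:

```typescript
// Sketch of the --path filter semantics:
// - trailing "/" means directory prefix match
// - otherwise exact file match OR prefix on a directory boundary
function pathMatches(filter: string, docPath: string): boolean {
  if (filter.endsWith("/")) return docPath.startsWith(filter); // directory prefix
  if (docPath === filter) return true;                          // exact file
  return docPath.startsWith(filter + "/");                      // directory boundary
}
```

The directory-boundary check prevents `--path=src/auth` from matching `src/auth-legacy/...`, which a naive prefix match would allow.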
Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:
- Determine recall size (adaptive based on filters):
  - `baseTopK = 50`
  - If any filters are present (`--project`, `--type`, `--author`, `--label`, `--path`, `--after`): `topK = 200`
topK = 200 - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
- Query both vector index (top topK) and FTS5 (top topK)
- Vector recall via sqlite-vec + FTS lexical recall via fts5
- Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
- Merge results by document_id
- Combine with Reciprocal Rank Fusion (RRF):
  - For each retriever list, assign ranks (1..N)
  - `rrfScore = Σ 1 / (k + rank)` with k=60 (tunable)
  - RRF is simpler than weighted sums and doesn't require score normalization
- Apply remaining filters (date ranges, labels, paths that weren't applied in SQL)
- Return top K results
Why Adaptive Recall:
- Fixed top-50 + filter can easily return 0 results even when relevant docs exist
- Increasing recall when filters are present catches more candidates before filtering
- SQL-level filtering is preferred (faster, uses indexes) but not always possible
Why RRF over Weighted Sums:
- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- Well-established in information retrieval literature
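The RRF merge is a few lines in practice. A minimal sketch (illustrative name; input lists are document ids already ordered by each retriever's ranking):

```typescript
// Reciprocal Rank Fusion: each retriever contributes 1/(k + rank) for every
// document it returned; documents found by both retrievers accumulate both
// contributions and therefore rank higher.
function rrfMerge(vectorIds: number[], ftsIds: number[], k = 60): Map<number, number> {
  const scores = new Map<number, number>();
  for (const list of [vectorIds, ftsIds]) {
    list.forEach((id, i) => {
      const rank = i + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}
```

Sorting the map entries by score descending yields the fused ranking; note that only ranks enter the formula, so BM25 scores and vector distances never need to be normalized against each other.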
Graceful Degradation:
- If Ollama is unreachable during search, automatically fall back to FTS5-only search
- Display warning: "Embedding service unavailable, using lexical search only"
- The `embed` command fails with an actionable error if Ollama is down
CLI Interface:
# Basic semantic search
gi search "authentication redesign"
# Search within specific project
gi search "authentication" --project=group/project-one
# Search by file path (finds discussions/MRs touching this file)
gi search "rate limit" --path=src/client.ts
# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical
# Filtered search
gi search "authentication" --type=mr --after=2024-01-01
# Filter by label
gi search "performance" --label=bug --label=critical
# JSON output for programmatic use
gi search "payment processing" --json
# Explain search (shows RRF contributions)
gi search "auth" --explain
CLI Output Example:
$ gi search "authentication"
Found 23 results (hybrid search, 0.34s)
[1] MR !847 - Refactor auth to use JWT tokens (0.82)
@johndoe · 2024-03-15 · group/project-one
"...moving away from session cookies to JWT for authentication..."
https://gitlab.example.com/group/project-one/-/merge_requests/847
[2] Issue #234 - Authentication redesign discussion (0.79)
@janedoe · 2024-02-28 · group/project-one
"...we need to redesign the authentication flow because..."
https://gitlab.example.com/group/project-one/-/issues/234
[3] Discussion on Issue #234 (0.76)
@johndoe · 2024-03-01 · group/project-one
"I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345
JSON Output Schema (Stable):
For AI/agent consumption, --json output follows this stable schema:
interface SearchResult {
documentId: number;
sourceType: "issue" | "merge_request" | "discussion";
title: string | null;
url: string;
projectPath: string;
author: string | null;
createdAt: string; // ISO 8601
updatedAt: string; // ISO 8601
score: number; // normalized 0-1 (rrfScore / maxRrfScore in this result set)
snippet: string; // truncated content_text
labels: string[];
// Only present with --explain flag
explain?: {
vectorRank?: number; // null if not in vector results
ftsRank?: number; // null if not in FTS results
rrfScore: number; // raw RRF score (rank-based, comparable within a query)
};
}
// Note on score normalization:
// - `score` is normalized 0-1 for UI display convenience
// - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set)
// - Use `explain.rrfScore` for raw scores when comparing across queries
// - Scores are NOT comparable across different queries
interface SearchResponse {
query: string;
mode: "hybrid" | "lexical" | "semantic";
totalResults: number;
results: SearchResult[];
warnings?: string[]; // e.g., "Embedding service unavailable"
}
Schema versioning: Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed.
Checkpoint 5: Incremental Sync
Deliverable: Efficient ongoing synchronization with GitLab
Automated Tests (Vitest):
tests/unit/cursor-management.test.ts
✓ advances cursor after successful page commit
✓ uses tie-breaker id for identical timestamps
✓ does not advance cursor on failure
✓ resets cursor on --full flag
tests/unit/change-detection.test.ts
✓ detects content_hash mismatch
✓ queues document for re-embedding on change
✓ skips re-embedding when hash unchanged
tests/integration/incremental-sync.test.ts
✓ fetches only items updated after cursor
✓ refetches discussions for updated issues
✓ refetches discussions for updated MRs
✓ updates existing records (not duplicates)
✓ creates new records for new items
✓ re-embeds documents with changed content_hash
tests/integration/sync-recovery.test.ts
✓ resumes from cursor after interrupted sync
✓ marks failed run with error message
✓ handles rate limiting (429) with backoff
✓ respects Retry-After header
Manual CLI Smoke Tests:
| Command | Expected Output | Pass Criteria |
|---|---|---|
| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Last sync time, cursor positions | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
End-to-End Sync Verification:
- Note the current `sync_cursors` values
- Create a new comment on an issue in GitLab
- Run `gi sync`
- Verify:
  - Issue's `updated_at` in DB matches GitLab
  - New discussion row exists
  - New note row exists
  - New document row exists for discussion
  - New embedding exists for document
  - `gi search "new comment text"` returns the new discussion
  - Cursor advanced past the updated issue
Data Integrity Checks:
- `sync_cursors` timestamp <= max `updated_at` in the corresponding table
- No orphaned documents (all have valid source_id)
- `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- `sync_runs` has a complete audit trail
Scope:
- Delta sync based on a stable tuple cursor `(updated_at, gitlab_id)`
- Dependent resources sync strategy (discussions refetched when parent updates)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
- Recommended: run via cron every 10 minutes
Correctness Rules (MVP):
- Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
- Cursor is a stable tuple `(updated_at, gitlab_id)`:
  - GitLab API cannot express `(updated_at = X AND id > Y)` server-side.
  - Use cursor rewind + local filtering:
    - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
    - Locally discard items where:
      - `updated_at < cursor_updated_at`, OR
      - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
    - This makes the tuple cursor rule true in practice while keeping API calls simple.
  - Cursor advances only after a successful DB commit for that page
  - When advancing, set the cursor to the last processed item's `(updated_at, gitlab_id)`
- Dependent resources:
  - For each updated issue/MR, refetch ALL its discussions
  - Discussion documents are regenerated and re-embedded if content_hash changes
- Rolling backfill window:
  - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
  - This catches any items whose timestamps were updated without triggering our cursor
- A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
- Sync run is marked 'failed' with an error message if any page fails (can resume from cursor)
Why Dependent Resource Model:
- GitLab Discussions API doesn't provide a global `updated_after` stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct
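The cursor rewind + local filtering rule reduces to one predicate over the tuple ordering. A sketch with illustrative types (camelCase field names here stand in for the snake_case DB columns):

```typescript
// Keep only items strictly after the cursor tuple (updated_at, gitlab_id).
// Applied locally to the page fetched with the rewound updated_after value.
interface Cursor { updatedAt: number; gitlabId: number; }
interface SyncItem { updatedAt: number; gitlabId: number; }

function afterCursor(items: SyncItem[], cursor: Cursor): SyncItem[] {
  return items.filter(
    (it) =>
      it.updatedAt > cursor.updatedAt ||
      (it.updatedAt === cursor.updatedAt && it.gitlabId > cursor.gitlabId)
  );
}
```

Items at exactly the cursor timestamp with an id at or below the cursor id are the ones already processed in a previous run; discarding them locally is what makes the API-level rewind safe to repeat.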
CLI Commands:
# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
gi sync # orchestrates all steps
gi sync --no-embed # skip embedding step (fast ingest/debug)
gi sync --no-docs # skip document regeneration (debug)
# Force full re-sync (resets cursors)
gi sync --full
# Override stale 'running' run after operator review
gi sync --force
# Show sync status
gi sync-status
Orchestration steps (in order):
- Acquire app lock with heartbeat
- Ingest delta (issues, MRs) based on cursors
  - For each upserted issue/MR, enqueue into `pending_discussion_fetches`
  - INSERT into `dirty_sources` for each upserted issue/MR
- Process `pending_discussion_fetches` queue (bounded per run, retryable):
  - Fetch discussions for each queued parent
  - On success: upsert discussions/notes, INSERT into `dirty_sources`, DELETE from queue
  - On failure: increment `attempt_count`, record `last_error`, leave in queue for retry
  - Bound processing: max N parents per sync run to avoid unbounded API calls
- Apply rolling backfill window
- Regenerate documents for entities in `dirty_sources` (process + delete from queue)
- Embed documents with changed content_hash
- FTS triggers auto-sync (no explicit step needed)
- Release lock, record sync_run as succeeded
Why queue-based discussion fetching:
- One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
- Primary resource cursors can advance independently
- Discussions can be retried without re-fetching all issues/MRs
- Bounded processing prevents unbounded API calls per sync run
Individual commands remain available for checkpoint testing and debugging:
- `gi ingest --type=issues`
- `gi ingest --type=merge_requests`
- `gi embed --all`
- `gi embed --retry-failed`
CLI Command Reference
All commands support --help for detailed usage information.
Setup & Diagnostics
| Command | CP | Description |
|---|---|---|
| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
| `gi auth-test` | 0 | Verify GitLab authentication |
| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
| `gi doctor --json` | 0 | JSON output for scripting |
| `gi version` | 0 | Show installed version |
Data Ingestion
| Command | CP | Description |
|---|---|---|
| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
| `gi generate-docs` | 3A | Extract documents from issues/MRs/discussions |
| `gi embed --all` | 3B | Generate embeddings for all documents |
| `gi embed --retry-failed` | 3 | Retry failed embeddings |
| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
| `gi sync --force` | 5 | Override stale lock after operator review |
| `gi sync --no-embed` | 5 | Sync without embedding (faster) |
### Data Inspection

| Command | CP | Description |
|---|---|---|
| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
| `gi list mrs --limit=N` | 2 | List merge requests |
| `gi count issues` | 1 | Count issues |
| `gi count mrs` | 2 | Count merge requests |
| `gi count discussions --type=issue` | 1 | Count issue discussions |
| `gi count discussions` | 2 | Count all discussions |
| `gi count discussions --type=mr` | 2 | Count MR discussions |
| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
| `gi count notes` | 2 | Count all notes (excluding system) |
| `gi show issue <iid> [--project=PATH]` | 1 | Show issue details (prompts if iid is ambiguous across projects) |
| `gi show mr <iid> [--project=PATH]` | 2 | Show MR details with discussions |
| `gi stats` | 3 | Embedding coverage statistics |
| `gi stats --json` | 3 | JSON stats for scripting |
| `gi sync-status` | 1 | Show cursor positions and last sync |
### Search

| Command | CP | Description |
|---|---|---|
| `gi search "query"` | 4 | Hybrid semantic + lexical search |
| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) |
| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type |
| `gi search "query" --author=USERNAME` | 4 | Filter by author |
| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date |
| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) |
| `gi search "query" --project=PATH` | 4 | Filter by project |
| `gi search "query" --path=FILE` | 4 | Filter by file path |
| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) |
| `gi search "query" --json` | 4 | JSON output for scripting |
| `gi search "query" --explain` | 4 | Show ranking breakdown |
### Database Management

| Command | CP | Description |
|---|---|---|
| `gi backup` | 0 | Create timestamped database backup |
| `gi reset --confirm` | 0 | Delete database and reset cursors |
## Error Handling

Common errors and their resolutions:

### Configuration Errors

| Error | Cause | Resolution |
|---|---|---|
| `Config file not found` | No `gi.config.json` | Run `gi init` to create configuration |
| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix `gi.config.json` manually |
| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config |
### Authentication Errors

| Error | Cause | Resolution |
|---|---|---|
| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` |
| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope |
| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope |
### GitLab API Errors

| Error | Cause | Resolution |
|---|---|---|
| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) |
| `429 Too Many Requests` | Rate limited | Wait for `Retry-After` period; sync will auto-retry |
| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity |
### Data Errors

| Error | Cause | Resolution |
|---|---|---|
| `No documents indexed` | Sync not run | Run `gi sync` first |
| `No results found` | Query too specific | Try broader search terms |
| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale |
### Embedding Errors

| Error | Cause | Resolution |
|---|---|---|
| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` |
| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` |
| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` |
### Operational Behavior
| Scenario | Behavior |
|---|---|
| Ctrl+C during sync | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. |
| Disk full during write | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. |
| Stale lock detected | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. |
| Network interruption | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. |
| Embedding permanent failure | After 3 retries, document stays in `embedding_metadata` with `last_error` populated. Use `gi embed --retry-failed` to retry later, or `gi stats` to see failed count. Documents with failed embeddings are excluded from vector search but included in FTS. |
| Orphaned records | MVP: No automatic cleanup. `last_seen_at` field enables future detection of items deleted in GitLab. Post-MVP: `gi gc --dry-run` to identify orphans, `gi gc --confirm` to remove. |
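The stale-lock and Ctrl+C rows can be sketched as follows (illustrative names; the real sync loop would persist the heartbeat in `app_locks` and the cursor in `sync_cursors`):

```typescript
const STALE_LOCK_MS = 10 * 60 * 1000; // 10 minutes without a heartbeat

// A lock whose heartbeat is older than the threshold is treated as stale
// and auto-recovered by the next sync run.
export function isLockStale(
  heartbeatAtMs: number,
  nowMs: number,
  thresholdMs = STALE_LOCK_MS,
): boolean {
  return nowMs - heartbeatAtMs > thresholdMs;
}

// Graceful shutdown: Ctrl+C only sets a flag; the loop finishes the current
// page (fetch + upsert + cursor commit) before exiting cleanly.
let shutdownRequested = false;
process.once("SIGINT", () => { shutdownRequested = true; });

export async function syncPages(fetchPage: () => Promise<boolean>): Promise<void> {
  while (!shutdownRequested) {
    const hasMore = await fetchPage(); // commits the cursor before returning
    if (!hasMore) break;
  }
}
```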
## Database Management

### Database Location

The SQLite database is stored at an XDG-compliant location:

```
~/.local/share/gi/data.db
```

This can be overridden in `gi.config.json`:

```json
{
  "storage": {
    "dbPath": "/custom/path/to/data.db"
  }
}
```
### Backup

Create a timestamped backup of the database:

```bash
gi backup
# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db
```

Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode).
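The timestamped name above can be derived as in this sketch (`backupFileName` is a hypothetical helper; the copy itself would go through SQLite's online backup API):

```typescript
// Build a backup file name like data-2026-01-21T14-30-00.db
export function backupFileName(now: Date): string {
  return `data-${now.toISOString().slice(0, 19).replace(/:/g, "-")}.db`;
}

// With better-sqlite3 (an assumption about the driver), the copy itself is:
//   await db.backup(join(backupDir, backupFileName(new Date())));
// which uses SQLite's online backup API and is safe while WAL writers are active.
```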
### Reset

To completely reset the database and all sync cursors:

```bash
gi reset --confirm
```

This deletes:

- The database file
- All sync cursors
- All embeddings

You'll need to run `gi sync` again to repopulate.
### Schema Migrations

Database schema is version-tracked and migrations auto-apply on startup:

- On first run, schema is created at latest version
- On subsequent runs, pending migrations are applied automatically
- Migration version is stored in `schema_version` table
- Migrations are idempotent and reversible where possible

Manual migration check:

```bash
gi doctor --json | jq '.checks.database'
# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }
```
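The auto-apply behavior can be sketched like this (illustrative shape; the real runner would wrap each migration in a transaction and read/write the `schema_version` table):

```typescript
// Minimal view of the database the migration runner needs.
type Db = { getVersion(): number; setVersion(v: number): void };

interface Migration {
  version: number;
  up: (db: Db) => void; // DDL/DML for this step
}

// Apply pending migrations in order; re-running is a no-op because
// already-applied versions are skipped.
export function migrate(db: Db, migrations: Migration[]): number {
  const sorted = [...migrations].sort((a, b) => a.version - b.version);
  let applied = 0;
  for (const m of sorted) {
    if (m.version <= db.getVersion()) continue; // already applied
    m.up(db);
    db.setVersion(m.version);
    applied++;
  }
  return applied;
}
```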
## Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:
| Feature | Description | Depends On |
|---|---|---|
| File History | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| Personal Dashboard | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| Person Context | Aggregate contributions by author, expertise inference | MVP complete |
| Decision Graph | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| MCP Server | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| Custom Tokenizer | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |
**Checkpoint 6 (File History) Schema Preview:**

```sql
-- Deferred from MVP; added when the file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);

CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
  old_path TEXT,
  new_path TEXT,
  old_line INTEGER,
  new_line INTEGER,
  position_type TEXT -- 'text' | 'image' | etc.
);

CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```
## Verification Strategy
Each checkpoint includes:
- **Automated tests** - Unit tests for data transformations, integration tests for API calls
- **CLI smoke tests** - Manual commands with expected outputs documented
- **Data integrity checks** - Count verification against GitLab, schema validation
- **Search quality tests** - Known queries with expected results (for Checkpoint 4+)
## Risk Mitigation
| Risk | Mitigation |
|---|---|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery |
| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
| Pathological discussions | Queue-based discussion fetching; one bad thread doesn't block entire sync |
| Empty search results with filters | Adaptive recall (topK 50→200 when filtered) |
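The adaptive-recall mitigation can be sketched as follows (hypothetical signature; the 50→200 topK steps come from the Resolved Decisions table):

```typescript
// Re-query with a wider candidate pool when post-filtering starves results.
export async function searchWithAdaptiveRecall(
  query: (topK: number) => Promise<string[]>,   // ranked candidate doc ids
  passesFilters: (id: string) => boolean,       // author/label/date/etc. predicates
  limit: number,
): Promise<string[]> {
  for (const topK of [50, 200]) {
    const hits = (await query(topK)).filter(passesFilters);
    // Return early if we have enough, or if this was the widest pass.
    if (hits.length >= limit || topK === 200) return hits.slice(0, limit);
  }
  return []; // unreachable: the loop always returns on the last step
}
```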
**SQLite Performance Defaults (MVP):**

- Enable `PRAGMA journal_mode=WAL;` on every connection
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources
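A driver-agnostic sketch of these defaults (the `exec` callback stands in for whatever raw-statement API the chosen SQLite driver exposes):

```typescript
// Statements applied to every new connection, per the defaults above.
export const CONNECTION_PRAGMAS = [
  "PRAGMA journal_mode=WAL;",
  "PRAGMA foreign_keys=ON;",
];

// Wrap a page/batch insert in one explicit transaction so a mid-page
// failure rolls back cleanly and the cursor stays at the last commit.
export function inTransaction(exec: (sql: string) => void, work: () => void): void {
  exec("BEGIN");
  try {
    work();
    exec("COMMIT");
  } catch (err) {
    exec("ROLLBACK");
    throw err; // caller records the failure; cursor is untouched
  }
}
```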
## Schema Summary
| Table | Checkpoint | Purpose |
|---|---|---|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations (with heartbeat) |
| app_locks | 0 | Crash-safe single-flight lock |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage (gitlab_id as TEXT) |
| schema_version | 0 | Database migration version tracking |
| issues | 1 | Normalized issues (unique by project+iid) |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) |
| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) |
| merge_requests | 2 | Normalized MRs (unique by project+iid) |
| mr_labels | 2 | MR-label junction |
| documents | 3A | Unified searchable documents with truncation metadata |
| document_labels | 3A | Document-label junction for fast filtering |
| document_paths | 3A | Fast path filtering for documents (DiffNote file paths) |
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |
## Resolved Decisions
| Question | Decision | Rationale |
|---|---|---|
| Comments structure | Discussions as first-class entities | Thread context is essential for decision traceability |
| System notes | Store flagged, exclude from embeddings | Preserves audit trail while avoiding semantic noise |
| DiffNote paths | Capture now | Enables immediate file/path search without full file-history feature |
| MR file linkage | Deferred to post-MVP (CP6) | Only needed for file-history feature |
| Labels | Index as filters | document_labels table enables fast --label=X filtering |
| Labels uniqueness | By (project_id, name) | GitLab API returns labels as strings |
| Sync method | Polling only for MVP | Webhooks add complexity; polling every 10 min is sufficient |
| Sync safety | DB lock + heartbeat + rolling backfill | Prevents race conditions and missed updates |
| Discussions sync | Resumable queue model | Queue-based fetching allows one pathological thread to not block entire sync |
| Hybrid ranking | RRF over weighted sums | Simpler, no score normalization needed |
| Embedding rowid | rowid = documents.id | Eliminates fragile rowid mapping |
| Embedding truncation | Note-boundary aware middle drop | Never cut mid-note; preserves semantic coherence |
| Embedding batching | 32 docs/batch, 4 concurrent workers | Balance throughput, memory, and error isolation |
| FTS5 tokenizer | porter unicode61 | Stemming improves recall |
| Ollama unavailable | Graceful degradation to FTS5 | Search still works without semantic matching |
| JSON output | Stable documented schema | Enables reliable agent/MCP consumption |
| Database location | XDG compliant: `~/.local/share/gi/` | Standard location, user-configurable |
| `gi init` validation | Validate GitLab before writing config | Fail fast, better UX |
| Ctrl+C handling | Graceful shutdown | Finish page, commit cursor, exit cleanly |
| Empty state UX | Actionable messages | Guide user to next step |
| raw_payloads.gitlab_id | TEXT not INTEGER | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | Always scope=all&state=all | Ensures all historical data including closed items |
| Pagination | X-Next-Page headers with empty-page fallback | Headers are more robust than empty-page detection |
| Integration tests | Mocked by default, live tests optional | Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1 |
| Search recall with filters | Adaptive topK (50→200 when filtered) | Prevents "no results" when relevant docs exist outside top-50 |
| RRF score normalization | Per-query normalized 0-1 | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | Trailing / = prefix match | --path=src/auth/ does prefix; otherwise exact match |
| CP3 structure | Split into 3A (FTS) and 3B (embeddings) | Lexical search works before embedding infra risk |
| Vector extension | sqlite-vec (not sqlite-vss) | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere |
| CLI framework | Commander.js | Simple, lightweight, sufficient for single-user CLI tool |
| Logging | pino to stderr | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | Custom error class hierarchy | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | Char-boundary cut for oversized notes | Single notes > 32000 chars truncated at char boundary with [truncated] marker |
| Ollama API | Use /api/embed for batching | Batch up to 32 docs per request; fall back to /api/embeddings for single |
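The hybrid-ranking and normalization decisions above combine as in this sketch (k = 60 is the conventional RRF constant, an assumption here rather than a value the spec fixes):

```typescript
// Reciprocal Rank Fusion over the vector and FTS5 result lists, followed by
// the per-query 0-1 normalization the decisions table specifies.
export function rrfFuse(
  rankedLists: string[][],   // e.g. [vectorResults, ftsResults], best first
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  const fused = [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
  // Normalize so the top hit scores 1.0; the raw score would go in --explain.
  const max = fused[0]?.score ?? 1;
  return fused.map((r) => ({ id: r.id, score: r.score / max }));
}
```

No score normalization across the two retrievers is needed beforehand, which is exactly why RRF was chosen over weighted sums.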
## Next Steps
- User approves this spec
- Generate Checkpoint 0 PRD for project setup
- Implement Checkpoint 0
- Human validates → proceed to Checkpoint 1
- Repeat for each checkpoint