# Gitlore - Spec Document
> **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation. The project was renamed from "gitlab-inbox" to "gitlore" and the CLI from "gi" to "lore".
## Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
---
## Quick Start
### Prerequisites
| Requirement | Version | Notes |
|-------------|---------|-------|
| Node.js | 20+ | LTS recommended |
| npm | 10+ | Comes with Node.js |
| Ollama | Latest | Optional for semantic search; lexical search works without it |
### Installation
```bash
# Clone and install
git clone https://github.com/your-org/gitlab-inbox.git
cd gitlab-inbox
npm install
npm run build
npm link # Makes `gi` available globally
```
### First Run
1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope):
```bash
export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"
```
2. **Run the setup wizard:**
```bash
gi init
```
This creates `gi.config.json` with your GitLab URL and project paths.
3. **Verify your environment:**
```bash
gi doctor
```
All checks should pass (Ollama warning is OK if you only need lexical search).
4. **Sync your data:**
```bash
gi sync
```
Initial sync takes 10-20 minutes depending on repo size and rate limits.
5. **Search:**
```bash
gi search "authentication redesign"
```
### Troubleshooting First Run
| Symptom | Solution |
|---------|----------|
| `Config file not found` | Run `gi init` first |
| `GITLAB_TOKEN not set` | Export the environment variable |
| `401 Unauthorized` | Check token has `read_api` scope |
| `Project not found: group/project` | Verify project path in GitLab URL |
| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search |
---
## Discovery Summary
### Pain Points Identified
1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find
2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions
### Constraints
| Constraint | Detail |
|------------|--------|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |
### Target Use Cases (Priority Order)
1. **MVP: Semantic Search** - "Find discussions about authentication redesign"
2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?"
3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?"
4. **Future: Person Context** - "What's @johndoe's background in this project?"
---
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────┐
│ GitLab API                                                       │
│ (Issues, MRs, Notes)                                             │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
          (Commit-level indexing explicitly post-MVP)
┌──────────────────────────────────────────────────────────────────┐
│ Data Ingestion Layer                                             │
│ - Incremental sync (PAT-based polling)                           │
│ - Rate limiting / backoff                                        │
│ - Raw JSON storage for replay                                    │
│ - Dependent resource fetching (notes, MR changes)                │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
┌──────────────────────────────────────────────────────────────────┐
│ Data Processing Layer                                            │
│ - Normalize artifacts to unified schema                          │
│ - Extract searchable documents (canonical text + metadata)       │
│ - Content hashing for change detection                           │
│ - MVP relationships: parent-child FKs + label/path associations  │
│   (full cross-entity "decision graph" is post-MVP scope)         │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
┌──────────────────────────────────────────────────────────────────┐
│ Storage Layer                                                    │
│ - SQLite + sqlite-vec + FTS5 (hybrid search)                     │
│ - Structured metadata in relational tables                       │
│ - Vector embeddings for semantic search                          │
│ - Full-text index for lexical search fallback                    │
└──────────────────────────────────────────────────────────────────┘
                                 ↓
┌──────────────────────────────────────────────────────────────────┐
│ Query Interface                                                  │
│ - CLI for human testing                                          │
│ - JSON API for AI agent testing                                  │
│ - Semantic search with filters (author, date, type, label)       │
└──────────────────────────────────────────────────────────────────┘
```
### Technology Choices
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js | Simple, lightweight, well-documented |
| Logging | pino | Fast, JSON-structured, low overhead |
| Validation | Zod | TypeScript-first schema validation |
### Alternative Considered: sqlite-vss
- sqlite-vss was the original choice but is now deprecated
- No Apple Silicon support (no prebuilt ARM binaries)
- Replaced by sqlite-vec, which is pure C with no dependencies
- sqlite-vec uses `vec0` virtual table (vs `vss0`)
### Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
- Cons: Requires running Postgres, heavier setup
- Decision: Start with SQLite for simplicity; migration path exists if needed
---
## Project Structure
```
gitlab-inbox/
├── src/
│ ├── cli/
│ │ ├── index.ts # CLI entry point (Commander.js)
│ │ └── commands/ # One file per command group
│ │ ├── init.ts
│ │ ├── sync.ts
│ │ ├── search.ts
│ │ ├── list.ts
│ │ └── doctor.ts
│ ├── core/
│ │ ├── config.ts # Config loading/validation (Zod)
│ │ ├── db.ts # Database connection + migrations
│ │ ├── errors.ts # Custom error classes
│ │ └── logger.ts # pino logger setup
│ ├── gitlab/
│ │ ├── client.ts # GitLab API client with rate limiting
│ │ ├── types.ts # GitLab API response types
│ │ └── transformers/ # Payload → normalized schema
│ │ ├── issue.ts
│ │ ├── merge-request.ts
│ │ └── discussion.ts
│ ├── ingestion/
│ │ ├── issues.ts
│ │ ├── merge-requests.ts
│ │ └── discussions.ts
│ ├── documents/
│ │ ├── extractor.ts # Document generation from entities
│ │ └── truncation.ts # Note-boundary aware truncation
│ ├── embedding/
│ │ ├── ollama.ts # Ollama client
│ │ └── pipeline.ts # Batch embedding orchestration
│ ├── search/
│ │ ├── hybrid.ts # RRF ranking logic
│ │ ├── fts.ts # FTS5 queries
│ │ └── vector.ts # sqlite-vec queries
│ └── types/
│ └── index.ts # Shared TypeScript types
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── live/ # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│ └── fixtures/
│ └── golden-queries.json
├── migrations/ # Numbered SQL migration files
│ ├── 001_initial.sql
│ └── ...
├── gi.config.json # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md
```
---
## Dependencies
### Runtime Dependencies
```json
{
"dependencies": {
"better-sqlite3": "latest",
"sqlite-vec": "latest",
"commander": "latest",
"zod": "latest",
"pino": "latest",
"pino-pretty": "latest",
"ora": "latest",
"chalk": "latest",
"cli-table3": "latest"
}
}
```
| Package | Purpose |
|---------|---------|
| better-sqlite3 | Synchronous SQLite driver (fast, native) |
| sqlite-vec | Vector search extension (pure C, cross-platform) |
| commander | CLI argument parsing |
| zod | Schema validation for config and inputs |
| pino | Structured JSON logging |
| pino-pretty | Dev-mode log formatting |
| ora | CLI spinners for progress indication |
| chalk | Terminal colors |
| cli-table3 | ASCII tables for list output |
### Dev Dependencies
```json
{
"devDependencies": {
"typescript": "latest",
"@types/better-sqlite3": "latest",
"@types/node": "latest",
"vitest": "latest",
"msw": "latest",
"eslint": "latest",
"@typescript-eslint/eslint-plugin": "latest",
"@typescript-eslint/parser": "latest",
"tsx": "latest"
}
}
```
| Package | Purpose |
|---------|---------|
| typescript | TypeScript compiler |
| vitest | Test runner |
| msw | Mock Service Worker for API mocking in tests |
| eslint | Linting |
| tsx | Run TypeScript directly during development |
---
## GitLab API Strategy
### Primary Resources (Bulk Fetch)
Issues and MRs support efficient bulk fetching with incremental sync:
```
GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
```
**Required query params for completeness:**
- `scope=all` - include all issues/MRs, not just authored by current user
- `state=all` - include closed items (GitLab defaults may exclude them)
Without these params, the 2+ years of historical data would be incomplete.
### Dependent Resources (Per-Parent Fetch)
Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
```
GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
```
**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent to avoid silent data loss.
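The paging rule above (follow GitLab's `X-Next-Page` header until it is empty or absent, and fall back to stopping on an empty page if a proxy strips the headers) can be sketched as a small pure helper. The function name and shape are illustrative, not part of the spec:

```typescript
// Illustrative sketch: decide the next page to fetch from GitLab's
// pagination headers, with an empty-page fallback for robustness.
type PaginationHeaders = Record<string, string | undefined>;

function nextPage(
  headers: PaginationHeaders,
  currentPage: number,
  itemsOnPage: number
): number | null {
  const next = headers["x-next-page"];
  if (next !== undefined) {
    // GitLab sets the header to "" on the last page.
    return next.trim() === "" ? null : Number(next);
  }
  // Fallback: if pagination headers were stripped by a proxy,
  // keep fetching until a page comes back empty.
  return itemsOnPage > 0 ? currentPage + 1 : null;
}
```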
### Sync Pattern
**Initial sync:**
1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
2. For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead)
3. Fetch all MRs (paginated, ~60 calls)
4. For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead)
5. Total: thousands of API calls for initial sync
**API Call Estimation Formula:**
```
total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue
+ ceil(mrs/100) + mrs × avg_discussion_pages_per_mr
```
Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent:
- Issue list: 30 calls
- Issue discussions: 3,000 × 1.2 = 3,600 calls
- MR list: 30 calls
- MR discussions: 3,000 × 1.2 = 3,600 calls
- **Total: ~7,260 calls**
This matters for rate limit planning: at the default 10 requests/second, ~7,260 calls take roughly 12 minutes of API time alone, which is what makes the "10-20 minutes" initial-sync estimate realistic.
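The estimation formula above can be checked with a few lines; the function name is ours, for illustration only:

```typescript
// Illustrative: estimate total API calls for an initial sync,
// mirroring the formula in this spec.
function estimateInitialSyncCalls(
  issues: number,
  mrs: number,
  avgDiscussionPagesPerParent: number,
  perPage = 100
): number {
  const issueListCalls = Math.ceil(issues / perPage); // bulk issue pages
  const mrListCalls = Math.ceil(mrs / perPage);       // bulk MR pages
  // One-plus discussion pages per parent, fetched individually.
  const discussionCalls = (issues + mrs) * avgDiscussionPagesPerParent;
  return issueListCalls + mrListCalls + discussionCalls;
}
```

Plugging in the worked example (3K issues, 3K MRs, 1.2 discussion pages per parent) reproduces the ~7,260-call total.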
**Incremental sync:**
1. Fetch issues where `updated_after=cursor` (bulk)
2. For EACH updated issue → refetch ALL its discussions
3. Fetch MRs where `updated_after=cursor` (bulk)
4. For EACH updated MR → refetch ALL its discussions
### Critical Assumption (Softened)
We *expect* that adding a note or discussion updates the parent's `updated_at`, but we do not rely on this exclusively.
**Mitigations (MVP):**
1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor.
2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`).
The backfill window provides 80% of the safety of full resync at <5% of the API cost.
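The tuple-cursor comparison can be isolated as a pure predicate (types and names are ours, for illustration); the same logic maps directly onto a SQL clause like `WHERE updated_at > ? OR (updated_at = ? AND gitlab_id > ?)`:

```typescript
// Illustrative: stable tuple cursor over (updated_at, gitlab_id).
interface SyncCursor {
  updatedAt: number; // ms epoch of last fully processed item
  gitlabId: number;  // tie-breaker for items with equal updated_at
}

function isAfterCursor(item: SyncCursor, cursor: SyncCursor | null): boolean {
  if (cursor === null) return true; // first sync: everything qualifies
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.gitlabId > cursor.gitlabId; // tie: fall back to stable id order
}
```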
### Rate Limiting
- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid thundering herd on retry
- **Separate concurrency limits:**
- `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4)
- `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s)
- Bound concurrency per-project to avoid one repo starving the other
- Initial sync estimate: 10-20 minutes depending on rate limits
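A sketch of the retry-delay computation under these rules, using full jitter; the constants are assumed defaults, not normative:

```typescript
// Illustrative: retry delay with exponential backoff, full jitter,
// and Retry-After taking precedence on 429 responses.
function retryDelayMs(
  attempt: number,        // 0-based retry attempt
  retryAfterSec?: number, // parsed Retry-After header, if present
  baseMs = 500,
  capMs = 30_000
): number {
  if (retryAfterSec !== undefined) return retryAfterSec * 1000; // honor the server
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling; // full jitter spreads concurrent retries apart
}
```

Full jitter (a random delay in `[0, ceiling)`) is what prevents many workers that hit a 429 simultaneously from retrying in lockstep.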
---
## Checkpoint Structure
Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
### Checkpoint 0: Project Setup
**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working
**Automated Tests (Vitest):**
```
tests/unit/config.test.ts
✓ loads config from gi.config.json
✓ throws if config file missing
✓ throws if required fields missing (baseUrl, projects)
✓ validates project paths are non-empty strings
tests/unit/db.test.ts
✓ creates database file if not exists
✓ applies migrations in order
✓ sets WAL journal mode
✓ enables foreign keys
tests/integration/gitlab-client.test.ts
✓ (mocked) authenticates with valid PAT
✓ (mocked) returns 401 for invalid PAT
✓ (mocked) fetches project by path
✓ (mocked) handles rate limiting (429) with retry
tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI)
✓ authenticates with real PAT against configured baseUrl
✓ fetches real project by path
✓ handles actual rate limiting behavior
tests/integration/app-lock.test.ts
✓ acquires lock successfully
✓ updates heartbeat during operation
✓ detects stale lock and recovers
✓ refuses concurrent acquisition
tests/integration/init.test.ts
✓ creates config file with valid structure
✓ validates GitLab URL format
✓ validates GitLab connection before writing config
✓ validates each project path exists in GitLab
✓ fails if token not set
✓ fails if GitLab auth fails
✓ fails if any project path not found
✓ prompts before overwriting existing config
✓ respects --force to skip confirmation
✓ generates gi.config.json with sensible defaults
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
| `gi init` | Interactive prompts | Creates valid gi.config.json |
| `gi init` (config exists) | Confirmation prompt | Warns before overwriting |
| `gi --help` | Command list | Shows all available commands |
| `gi version` | Version number | Shows installed version |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
**Data Integrity Checks:**
- [ ] `projects` table contains rows for each configured project path
- [ ] `gitlab_project_id` matches actual GitLab project IDs
- [ ] `raw_payloads` contains project JSON for each synced project
**Scope:**
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
- Projects table and initial project resolution (no issue/MR ingestion yet)
- DB migrations + WAL + FK enforcement
- Sync tracking with crash-safe single-flight lock (heartbeat-based)
- Rate limit handling with exponential backoff + jitter
- `gi init` command for guided setup:
- Prompts for GitLab base URL
- Prompts for project paths (comma-separated or multiple prompts)
- Prompts for token environment variable name (default: GITLAB_TOKEN)
- **Validates before writing config:**
- Token must be set in environment
- Tests auth with `GET /user` endpoint
- Validates each project path with `GET /projects/:path`
- Only writes config after all validations pass
- Generates `gi.config.json` with sensible defaults
- `gi --help` shows all available commands
- `gi <command> --help` shows command-specific help
- `gi version` shows installed version
- First-run detection: if no config exists, suggest `gi init`
**Configuration (MVP):**
```json
// gi.config.json
{
"gitlab": {
"baseUrl": "https://gitlab.example.com",
"tokenEnvVar": "GITLAB_TOKEN"
},
"projects": [
{ "path": "group/project-one" },
{ "path": "group/project-two" }
],
"sync": {
"backfillDays": 14,
"staleLockMinutes": 10,
"heartbeatIntervalSeconds": 30,
"cursorRewindSeconds": 2,
"primaryConcurrency": 4,
"dependentConcurrency": 2
},
"storage": {
"compressRawPayloads": true
},
"embedding": {
"provider": "ollama",
"model": "nomic-embed-text",
"baseUrl": "http://localhost:11434",
"concurrency": 4
}
}
```
**Raw Payload Compression:**
- When `storage.compressRawPayloads: true` (default), raw JSON payloads are gzip-compressed before storage
- `raw_payloads.content_encoding` indicates `'identity'` (uncompressed) or `'gzip'` (compressed)
- Compression typically reduces storage by 70-80% for JSON payloads
- Decompression is handled transparently when reading payloads
- Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage
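A sketch of the transparent encode/decode path using Node's built-in zlib; the function names are illustrative, not part of the spec:

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

type ContentEncoding = "identity" | "gzip";

// Illustrative: prepare a raw JSON payload for the raw_payloads table,
// recording which content_encoding was used.
function encodePayload(
  json: string,
  compress: boolean
): { payload: Buffer; encoding: ContentEncoding } {
  const raw = Buffer.from(json, "utf8");
  return compress
    ? { payload: gzipSync(raw), encoding: "gzip" }
    : { payload: raw, encoding: "identity" };
}

// Illustrative: decode transparently based on the stored encoding.
function decodePayload(payload: Buffer, encoding: ContentEncoding): string {
  const raw = encoding === "gzip" ? gunzipSync(payload) : payload;
  return raw.toString("utf8");
}
```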
**Error Classes (src/core/errors.ts):**
```typescript
// Base error class with error codes for programmatic handling
export class GiError extends Error {
constructor(message: string, public readonly code: string) {
super(message);
this.name = 'GiError';
}
}
// Config errors
export class ConfigNotFoundError extends GiError {
constructor() {
super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
}
}
export class ConfigValidationError extends GiError {
constructor(details: string) {
super(`Invalid config: ${details}`, 'CONFIG_INVALID');
}
}
// GitLab API errors
export class GitLabAuthError extends GiError {
constructor() {
super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
}
}
export class GitLabNotFoundError extends GiError {
constructor(resource: string) {
super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
}
}
export class GitLabRateLimitError extends GiError {
constructor(public readonly retryAfter: number) {
super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
}
}
// Database errors
export class DatabaseLockError extends GiError {
constructor() {
super('Another sync is running. Use --force to override.', 'DB_LOCKED');
}
}
// Embedding errors
export class OllamaConnectionError extends GiError {
constructor() {
super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
}
}
export class EmbeddingError extends GiError {
constructor(documentId: number, reason: string) {
super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
}
}
```
**Logging Strategy (src/core/logger.ts):**
```typescript
import pino from 'pino';
// Logs go to stderr, results to stdout (allows clean JSON piping)
// pino does not allow combining `transport` with a destination stream,
// so the two modes are constructed separately.
const level = process.env.LOG_LEVEL || 'info';
export const logger =
  process.env.NODE_ENV === 'production'
    ? pino({ level }, pino.destination(2))
    : pino({
        level,
        transport: {
          target: 'pino-pretty',
          options: { colorize: true, destination: 2 } // 2 = stderr
        }
      });
```
**Log Levels:**
| Level | When to use |
|-------|-------------|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |
**Logging Conventions:**
- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include err object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs to stderr so `gi search --json` output stays clean on stdout
**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
- `PRAGMA journal_mode=WAL;`
- `PRAGMA foreign_keys=ON;`
**Schema (Checkpoint 0):**
```sql
-- Projects table (configured targets)
CREATE TABLE projects (
id INTEGER PRIMARY KEY,
gitlab_project_id INTEGER UNIQUE NOT NULL,
path_with_namespace TEXT NOT NULL,
default_branch TEXT,
web_url TEXT,
created_at INTEGER,
updated_at INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);
-- Sync tracking for reliability
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
started_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL,
finished_at INTEGER,
status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc.
error TEXT
);
-- Crash-safe single-flight lock (DB-enforced)
CREATE TABLE app_locks (
name TEXT PRIMARY KEY, -- 'sync'
owner TEXT NOT NULL, -- random run token (UUIDv4)
acquired_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL
);
-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
project_id INTEGER NOT NULL REFERENCES projects(id),
resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests'
updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch)
tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering)
PRIMARY KEY(project_id, resource_type)
);
-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
id INTEGER PRIMARY KEY,
source TEXT NOT NULL, -- 'gitlab'
project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources
resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
gitlab_id TEXT NOT NULL, -- TEXT because discussion IDs are strings; numeric IDs stored as strings
fetched_at INTEGER NOT NULL,
content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);
-- Schema version tracking for migrations
CREATE TABLE schema_version (
version INTEGER PRIMARY KEY,
applied_at INTEGER NOT NULL,
description TEXT
);
```
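The `schema_version` table drives the "applies migrations in order" behavior; a sketch of the selection logic (types and names are ours, for illustration):

```typescript
// Illustrative: choose which numbered migrations still need to run,
// in ascending version order, given what schema_version records.
interface Migration {
  version: number; // e.g. 1 for 001_initial.sql
  sql: string;
}

function pendingMigrations(
  appliedVersions: number[],
  available: Migration[]
): Migration[] {
  const current = appliedVersions.length > 0 ? Math.max(...appliedVersions) : 0;
  return available
    .filter((m) => m.version > current)
    .sort((a, b) => a.version - b.version);
}
```

Each pending migration would then run inside a transaction that also inserts its row into `schema_version`.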
---
### Checkpoint 1: Issue Ingestion
**Deliverable:** All issues + labels + issue discussions from target repos stored locally with resumable cursor-based sync
**Automated Tests (Vitest):**
```
tests/unit/issue-transformer.test.ts
✓ transforms GitLab issue payload to normalized schema
✓ extracts labels from issue payload
✓ handles missing optional fields gracefully
tests/unit/pagination.test.ts
✓ fetches all pages when multiple exist
✓ respects per_page parameter
✓ follows X-Next-Page header until empty/absent
✓ falls back to empty-page stop if headers missing (robustness)
tests/unit/discussion-transformer.test.ts
✓ transforms discussion payload to normalized schema
✓ extracts notes array from discussion
✓ sets individual_note flag correctly
✓ flags system notes with is_system=1
✓ preserves note order via position field
tests/integration/issue-ingestion.test.ts
✓ inserts issues into database
✓ creates labels from issue payloads
✓ links issues to labels via junction table
✓ stores raw payload for each issue
✓ updates cursor after successful page commit
✓ resumes from cursor on subsequent runs
tests/integration/issue-discussion-ingestion.test.ts
✓ fetches discussions for each issue
✓ creates discussion rows with correct issue FK
✓ creates note rows linked to discussions
✓ stores system notes with is_system=1 flag
✓ handles individual_note=true discussions
tests/integration/sync-runs.test.ts
✓ creates sync_run record on start
✓ marks run as succeeded on completion
✓ marks run as failed with error message on failure
✓ refuses concurrent run (single-flight)
✓ allows --force to override stale running status
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL. If multiple projects have issue #123, prompts for clarification or use `--project=PATH` |
| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count |
| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects
- [ ] Every issue has a corresponding `raw_payloads` row
- [ ] Labels in `issue_labels` junction all exist in `labels` table
- [ ] `sync_cursors` has entry for each (project_id, 'issues') pair
- [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)
- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero
- [ ] Every discussion has at least one note
- [ ] `individual_note = true` discussions have exactly one note
**Scope:**
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payload:
- Always persist label names from `labels: string[]`
- Optionally request `with_labels_details=true` to capture color/description when available
- Issue discussions fetcher:
- Uses `GET /projects/:id/issues/:iid/discussions`
- Fetches all discussions for each issue during ingest
- Preserve system notes but flag them with `is_system=1`
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands
**Reliability/Idempotency Rules:**
- Every ingest/sync creates a `sync_runs` row
- Single-flight via DB-enforced app lock:
- On start: acquire lock via transactional compare-and-swap:
- `BEGIN IMMEDIATE` (acquires write lock immediately)
- If no row exists → INSERT new lock
- Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
- Else if `owner` matches current run → UPDATE heartbeat (re-entrant)
- Else → ROLLBACK and fail fast (another run is active)
- `COMMIT`
- During run: update `heartbeat_at` every 30 seconds
- If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire
- `--force` remains as operator override for edge cases, but should rarely be needed
- Cursor advances only after successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
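The branch logic inside the `BEGIN IMMEDIATE` transaction can be isolated as a pure decision function (names and types are ours, for illustration):

```typescript
// Illustrative: decide what the lock-acquisition transaction should do,
// mirroring the compare-and-swap rules above.
interface LockRow {
  owner: string;      // UUIDv4 token of the run holding the lock
  heartbeatAt: number; // ms epoch of last heartbeat
}

type LockAction = "insert" | "steal" | "heartbeat" | "fail";

function decideLockAction(
  row: LockRow | null,
  runOwner: string, // this run's random token
  nowMs: number,
  staleLockMinutes: number
): LockAction {
  if (row === null) return "insert";                                       // no lock held
  if (nowMs - row.heartbeatAt > staleLockMinutes * 60_000) return "steal"; // abandoned lock
  if (row.owner === runOwner) return "heartbeat";                          // re-entrant
  return "fail";                                                           // another run is active
}
```

Running this inside `BEGIN IMMEDIATE` is what makes the check-then-write atomic: the write lock is taken before the row is read, so two concurrent runs cannot both decide to insert.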
**Schema Preview:**
```sql
CREATE TABLE issues (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);
-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER, -- optional (only if available)
project_id INTEGER NOT NULL REFERENCES projects(id),
name TEXT NOT NULL,
color TEXT,
description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);
CREATE TABLE issue_labels (
issue_id INTEGER REFERENCES issues(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
-- Discussion threads for issues (MR discussions added in CP2)
CREATE TABLE discussions (
id INTEGER PRIMARY KEY,
gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
project_id INTEGER NOT NULL REFERENCES projects(id),
issue_id INTEGER REFERENCES issues(id),
merge_request_id INTEGER, -- merge_requests lands in CP2; SQLite cannot add an FK to an existing column via ALTER TABLE, so this reference is enforced in application code (or by table recreation in the CP2 migration)
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
first_note_at INTEGER, -- for ordering discussions
last_note_at INTEGER, -- for "recently active" queries
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
resolvable BOOLEAN, -- MR discussions can be resolved
resolved BOOLEAN,
CHECK (
(noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
(noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
)
);
CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);
-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
discussion_id INTEGER NOT NULL REFERENCES discussions(id),
project_id INTEGER NOT NULL REFERENCES projects(id),
type TEXT, -- 'DiscussionNote' | 'DiffNote' | null
is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.)
author_username TEXT,
body TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
position INTEGER, -- derived from array order in API response (0-indexed)
resolvable BOOLEAN,
resolved BOOLEAN,
resolved_by TEXT,
resolved_at INTEGER,
-- DiffNote position metadata (only populated for MR DiffNotes in CP2)
position_old_path TEXT,
position_new_path TEXT,
position_old_line INTEGER,
position_new_line INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_system ON notes(is_system);
```
---
### Checkpoint 2: MR Ingestion
**Deliverable:** All MRs + MR discussions + notes with DiffNote paths captured
**Automated Tests (Vitest):**
```
tests/unit/mr-transformer.test.ts
✓ transforms GitLab MR payload to normalized schema
✓ extracts labels from MR payload
✓ handles missing optional fields gracefully
tests/unit/diffnote-transformer.test.ts
✓ extracts DiffNote position metadata (paths and lines)
✓ handles missing position fields gracefully
tests/integration/mr-ingestion.test.ts
✓ inserts MRs into database
✓ creates labels from MR payloads
✓ links MRs to labels via junction table
✓ stores raw payload for each MR
tests/integration/mr-discussion-ingestion.test.ts
✓ fetches discussions for each MR
✓ creates discussion rows with correct MR FK
✓ creates note rows linked to discussions
✓ extracts position_new_path from DiffNotes
✓ captures note-level resolution status
✓ captures note type (DiscussionNote, DiffNote)
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) |
| `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only |
| `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero
- [ ] DiffNotes have `position_new_path` populated when available
- [ ] Discussion `first_note_at` <= `last_note_at` for all rows
**Scope:**
- MR fetcher with pagination
- MR discussions fetcher:
- Uses `GET /projects/:id/merge_requests/:iid/discussions`
- Fetches all discussions for each MR during ingest
- Capture DiffNote file path/line metadata from `position` field for filename search
- Relationship linking (discussion → MR, notes → discussion)
- Extended CLI commands for MR display with threads
- Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries
**Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.
**Schema Additions:**
```sql
CREATE TABLE merge_requests (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
source_branch TEXT,
target_branch TEXT,
created_at INTEGER,
updated_at INTEGER,
last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
merged_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);
-- MR labels (reuse same labels table from CP1)
CREATE TABLE mr_labels (
merge_request_id INTEGER REFERENCES merge_requests(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
-- Additional indexes for DiffNote queries (tables created in CP1)
CREATE INDEX idx_notes_type ON notes(type);
CREATE INDEX idx_notes_new_path ON notes(position_new_path);
-- Migration: Add FK constraint to discussions table (was deferred from CP1)
-- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK
-- This is handled by the migration system; pseudocode for clarity:
-- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id)
-- 2. INSERT INTO discussions_new SELECT * FROM discussions
-- 3. DROP TABLE discussions
-- 4. ALTER TABLE discussions_new RENAME TO discussions
-- 5. Recreate indexes
```
**MR Discussion Processing Rules:**
- DiffNote position data is extracted and stored:
- `position.old_path`, `position.new_path` for file-level search
- `position.old_line`, `position.new_line` for line-level context
- MR discussions can be resolvable; resolution status is captured at note level
---
### Checkpoint 3A: Document Generation + FTS (Lexical Search)
**Deliverable:** Documents generated + FTS5 index; `gi search --mode=lexical` works end-to-end (no Ollama required)
**Automated Tests (Vitest):**
```
tests/unit/document-extractor.test.ts
✓ extracts issue document (title + description)
✓ extracts MR document (title + description)
✓ extracts discussion document with full thread context
✓ includes parent issue/MR title in discussion header
✓ formats notes with author and timestamp
✓ excludes system notes from discussion documents by default
✓ includes system notes only when --include-system-notes enabled (debug)
✓ truncates content exceeding 8000 tokens at note boundaries
✓ preserves first and last notes when truncating middle
✓ computes SHA-256 content hash consistently
tests/integration/document-creation.test.ts
✓ creates document for each issue
✓ creates document for each MR
✓ creates document for each discussion
✓ populates document_labels junction table
✓ computes content_hash for each document
✓ excludes system notes from discussion content
tests/integration/fts-index.test.ts
✓ documents_fts row count matches documents
✓ FTS triggers fire on insert/update/delete
✓ updates propagate via triggers
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ porter stemming works (search/searching)
✓ returns empty for non-matching query
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi generate-docs` | Progress bar, final count | Completes without error |
| `gi generate-docs` (re-run) | `0 documents to regenerate` | Skips unchanged docs |
| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama |
| `gi stats` | Document count stats | Shows document coverage |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- [ ] `SELECT COUNT(*) FROM documents_fts` = `SELECT COUNT(*) FROM documents` (via FTS triggers)
- [ ] Every document with `LENGTH(content_text) > 32000` has `is_truncated = 1` and a corresponding logged truncation warning
- [ ] Discussion documents include parent title in content_text
- [ ] Discussion documents exclude system notes
**Scope:**
- Document extraction layer:
- Canonical "search documents" derived from issues/MRs/discussions
- Stable content hashing for change detection (SHA-256 of content_text)
- Truncation: content_text capped at 8000 tokens at NOTE boundaries
- **Implementation:** Use character budget, not exact token count
- `maxChars = 32000` (conservative 4 chars/token estimate)
- Drop whole notes from middle, never cut mid-note
- `approxTokens = ceil(charCount / 4)` for reporting/logging only
- System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search)
- Denormalized metadata for fast filtering (author, labels, dates)
- Fast label filtering via `document_labels` join table
- FTS5 index for lexical search
- `gi search --mode=lexical` CLI command (works without Ollama)
This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
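The change-detection hash in the scope above can be sketched with Node's built-in crypto module; the helper name is illustrative, not from the codebase:

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of the canonical content_text; the result is what
// would be stored in documents.content_hash for change detection.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```

Re-running `gi generate-docs` can then skip any source whose regenerated text hashes to the value already stored.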
**Schema Additions (CP3A):**
```sql
-- Unified searchable documents (derived from issues/MRs/discussions)
-- Note: the embedding-side tables that reference documents (embeddings, embedding_metadata) are added in CP3B
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT, -- for discussions: first note author
label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER,
updated_at INTEGER,
url TEXT,
title TEXT, -- null for discussions
content_text TEXT NOT NULL, -- canonical text for embedding/snippets
content_hash TEXT NOT NULL, -- SHA-256 for change detection
is_truncated BOOLEAN NOT NULL DEFAULT 0,
truncated_reason TEXT, -- 'token_limit_middle_drop' | null
UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id),
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
-- Fast path filtering for documents (extracted from DiffNote positions)
CREATE TABLE document_paths (
document_id INTEGER NOT NULL REFERENCES documents(id),
path TEXT NOT NULL,
PRIMARY KEY(document_id, path)
);
CREATE INDEX idx_document_paths_path ON document_paths(path);
-- Track sources that require document regeneration (populated during ingestion)
CREATE TABLE dirty_sources (
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id
queued_at INTEGER NOT NULL,
PRIMARY KEY(source_type, source_id)
);
-- Resumable dependent fetches (discussions are per-parent resources)
CREATE TABLE pending_discussion_fetches (
project_id INTEGER NOT NULL REFERENCES projects(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL, -- parent iid (stable human identifier)
queued_at INTEGER NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
ON pending_discussion_fetches(attempt_count, last_attempt_at)
WHERE last_error IS NOT NULL;
-- Full-text search for lexical retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
```
**FTS5 Tokenizer Notes:**
- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer
---
### Checkpoint 3B: Embedding Generation (Semantic Search)
**Deliverable:** Embeddings generated + `gi search --mode=semantic` works; graceful fallback if Ollama unavailable
**Automated Tests (Vitest):**
```
tests/unit/embedding-client.test.ts
✓ connects to Ollama API
✓ generates embedding for text input
✓ returns 768-dimension vector
✓ handles Ollama connection failure gracefully
✓ batches requests (32 documents per batch)
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vec
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
✓ re-embeds when content_hash changes
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
| `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash`
**Scope:**
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline:
- Batch size: 32 documents per batch
- Concurrency: configurable (default 4 workers)
- Retry with exponential backoff for transient failures (max 3 attempts)
- Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vec extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
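The retry-with-exponential-backoff rule above could look like this sketch; the helper names and the 500ms base delay are assumptions, not spec values:

```typescript
// Delays applied before attempts 2..maxAttempts: 500ms, 1000ms, ...
// (max 3 attempts per the scope above; base delay is an assumed default).
function backoffScheduleMs(maxAttempts = 3, baseMs = 500): number[] {
  return Array.from({ length: maxAttempts - 1 }, (_, i) => baseMs * 2 ** i);
}

// Wraps a single-document or batch embed call; on final failure the caller
// would record last_error / attempt_count in embedding_metadata.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  const delays = backoffScheduleMs(maxAttempts);
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await new Promise((resolve) => setTimeout(resolve, delays[attempt - 1]));
    }
  }
}
```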
**Ollama API Contract:**
```typescript
// POST http://localhost:11434/api/embed (batch endpoint - preferred)
interface OllamaEmbedRequest {
model: string; // "nomic-embed-text"
input: string[]; // array of texts to embed (up to 32)
}
interface OllamaEmbedResponse {
model: string;
embeddings: number[][]; // array of 768-dim vectors
}
// POST http://localhost:11434/api/embeddings (single text - fallback)
interface OllamaEmbeddingsRequest {
model: string;
prompt: string;
}
interface OllamaEmbeddingsResponse {
embedding: number[];
}
```
**Usage:**
- Use `/api/embed` for batching (up to 32 documents per request)
- Fall back to `/api/embeddings` for single documents or if batch fails
- Check Ollama availability with `GET http://localhost:11434/api/tags`
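The 32-document batching rule can be sketched as a pure helper that shapes `/api/embed` request payloads (function name is illustrative); the HTTP call itself is omitted:

```typescript
const BATCH_SIZE = 32;

// Split document texts into /api/embed request bodies of at most 32 inputs each.
function toEmbedRequests(
  texts: string[],
  model = "nomic-embed-text",
): { model: string; input: string[] }[] {
  const requests: { model: string; input: string[] }[] = [];
  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    requests.push({ model, input: texts.slice(i, i + BATCH_SIZE) });
  }
  return requests;
}
```

Each returned object matches `OllamaEmbedRequest` above and would be POSTed to `http://localhost:11434/api/embed`, falling back to `/api/embeddings` per document if the batch call fails.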
**Schema Additions (CP3B):**
```sql
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
embedding float[768]
);
-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
document_id INTEGER PRIMARY KEY REFERENCES documents(id),
model TEXT NOT NULL, -- 'nomic-embed-text'
dims INTEGER NOT NULL, -- 768
content_hash TEXT NOT NULL, -- copied from documents.content_hash
created_at INTEGER NOT NULL,
-- Error tracking for resumable embedding
last_error TEXT, -- error message from last failed attempt
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER -- when last attempt occurred
);
-- Index for finding failed embeddings to retry
CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
```
**Storage Rule (MVP):**
- Insert embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid mapping fragility
**Document Extraction Rules:**
| Source | content_text Construction |
|--------|--------------------------|
| Issue | `title + "\n\n" + description` |
| MR | `title + "\n\n" + description` |
| Discussion | Full thread with context (see below) |
**Discussion Document Format:**
```
[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread
--- Thread ---
@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...
@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```
**System Notes Exclusion Rule:**
- System notes (is_system=1) are stored in the DB for audit purposes
- System notes are EXCLUDED from discussion documents by default
- This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings
- Debug flag `--include-system-notes` available for troubleshooting
This format preserves:
- Parent context (issue/MR title and number)
- Project path for scoped search
- Direct URL for navigation
- Labels for context
- File paths from DiffNotes (enables immediate file search)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
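A minimal formatter matching the template above might look like this sketch; the interface and field names are illustrative, and system notes are assumed to be filtered out upstream:

```typescript
interface DiscussionDoc {
  parentRef: string;   // e.g. "Issue #234: Authentication redesign"
  projectPath: string;
  url: string;
  labels: string[];
  files: string[];     // from DiffNote positions; may be empty
  notes: { author: string; date: string; body: string }[];
}

function formatDiscussionDocument(doc: DiscussionDoc): string {
  const header = [
    `[[Discussion]] ${doc.parentRef}`,
    `Project: ${doc.projectPath}`,
    `URL: ${doc.url}`,
    `Labels: ${JSON.stringify(doc.labels)}`,
  ];
  // Files line is only present when the thread contains DiffNotes.
  if (doc.files.length > 0) header.push(`Files: ${JSON.stringify(doc.files)}`);
  const thread = doc.notes.map((n) => `@${n.author} (${n.date}):\n${n.body}`);
  return [header.join("\n"), "--- Thread ---", thread.join("\n\n")].join("\n");
}
```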
**Truncation (Note-Boundary Aware):**
If content exceeds 8000 tokens (~32000 chars):
**Algorithm:**
1. Count non-system notes in the discussion
2. If total chars ≤ maxChars, no truncation needed
3. Otherwise, drop whole notes from the MIDDLE:
- Preserve first N notes and last M notes
- Never cut mid-note (produces unreadable snippets and worse embeddings)
- Continue dropping middle notes until under maxChars
4. Insert marker: `\n\n[... N notes omitted for length ...]\n\n`
5. Set `documents.is_truncated = 1`
6. Set `documents.truncated_reason = 'token_limit_middle_drop'`
7. Log a warning with document ID and original/truncated token count
**Edge Cases:**
- **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
- **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
- **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]`
**Why note-boundary truncation:**
- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
- First notes contain context/problem statement; last notes contain conclusions
- Middle notes are often back-and-forth that's less critical
**Token estimation:** `approxTokens = ceil(charCount / 4)`. No tokenizer dependency.
This metadata enables:
- Monitoring truncation frequency in production
- Future investigation of high-value truncated documents
- Debugging when search misses expected content
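The middle-drop algorithm above can be sketched as follows; the omission marker and the single-oversized-note edge cases are left to the caller, and the names are illustrative:

```typescript
interface ThreadNote { author: string; body: string; }

const MAX_CHARS = 32000; // ~8000 tokens at 4 chars/token

function dropMiddleNotes(notes: ThreadNote[], maxChars = MAX_CHARS) {
  const size = (ns: ThreadNote[]) => ns.reduce((s, n) => s + n.body.length, 0);
  const kept = [...notes];
  let omitted = 0;
  // Drop whole notes nearest the middle; first and last notes survive.
  while (kept.length > 2 && size(kept) > maxChars) {
    kept.splice(Math.floor(kept.length / 2), 1);
    omitted++;
  }
  return {
    kept,
    omitted, // N for the "[... N notes omitted for length ...]" marker
    isTruncated: omitted > 0,
    approxTokens: Math.ceil(size(kept) / 4),
  };
}
```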
---
### Checkpoint 4: Hybrid Search (Semantic + Lexical)
**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content
**Automated Tests (Vitest):**
```
tests/unit/search-query.test.ts
✓ parses filter flags (--type, --author, --after, --label)
✓ validates date format for --after
✓ handles multiple --label flags
tests/unit/rrf-ranking.test.ts
✓ computes RRF score correctly
✓ merges results from vector and FTS retrievers
✓ handles documents appearing in only one retriever
✓ respects k=60 parameter
tests/integration/vector-search.test.ts
✓ returns results for semantic query
✓ ranks similar content higher
✓ returns empty for nonsense query
tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ handles porter stemming (search/searching)
✓ returns empty for non-matching query
tests/integration/hybrid-search.test.ts
✓ combines vector and FTS results
✓ applies type filter correctly
✓ applies author filter correctly
✓ applies date filter correctly
✓ applies label filter correctly
✓ falls back to FTS when Ollama unavailable
tests/e2e/golden-queries.test.ts
✓ "authentication redesign" returns known auth-related items
✓ "database migration" returns known migration items
✓ [8 more domain-specific golden queries]
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings |
| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema |
| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" |
| `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
**Golden Query Test Suite:**
Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:
```json
[
{
"query": "authentication redesign",
"expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
"minResults": 1,
"maxRank": 10
}
]
```
Each query must surface at least one of its expected URLs in the top 10 results.
**Data Integrity Checks:**
- [ ] `documents_fts` row count matches `documents` row count
- [ ] Search returns results for known content (not empty)
- [ ] JSON output validates against defined schema
- [ ] All result URLs are valid GitLab URLs
**Scope:**
- Hybrid retrieval:
- Vector recall (sqlite-vec) + FTS lexical recall (fts5)
- Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Query embedding generation (same Ollama pipeline as documents)
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N`
- `--limit=N` controls result count (default: 20, max: 100)
- `--path` filters documents by referenced file paths (from DiffNote positions):
- If `--path` ends with `/`: prefix match (`path LIKE 'src/auth/%'`)
- Otherwise: exact match OR prefix on directory boundary
- Examples: `--path=src/auth/` matches `src/auth/login.ts`, `src/auth/utils/helpers.ts`
- Examples: `--path=src/auth/login.ts` matches only that exact file
- Glob patterns deferred to post-MVP
- Label filtering operates on `document_labels` (indexed, exact-match)
- Filters work identically in hybrid and lexical modes
- Debug: `--explain` returns rank contributions from vector + FTS + RRF
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI/agent consumption (stable documented schema)
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
- Empty state handling:
- No documents indexed: `No data indexed. Run 'gi sync' first.`
- Query returns no results: `No results found for "query".`
- Filters exclude all results: `No results match the specified filters.`
- Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
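The `--path` matching rules above can be sketched as a small predicate (name illustrative; glob patterns remain post-MVP):

```typescript
// True if docPath (from document_paths) satisfies the --path filter.
function pathFilterMatches(filter: string, docPath: string): boolean {
  if (filter.endsWith("/")) {
    return docPath.startsWith(filter); // trailing slash: plain prefix match
  }
  // No trailing slash: exact file, or directory prefix on a path boundary.
  return docPath === filter || docPath.startsWith(filter + "/");
}
```

The boundary check is what keeps `--path=src/auth` from matching `src/authx/login.ts`.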
**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**
1. Determine recall size (adaptive based on filters):
- `baseTopK = 50`
- If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200`
- This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
2. Query both vector index (top topK) and FTS5 (top topK)
- Vector recall via sqlite-vec + FTS lexical recall via fts5
- Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
3. Merge results by document_id
4. Combine with Reciprocal Rank Fusion (RRF):
- For each retriever list, assign ranks (1..N)
- `rrfScore = Σ 1 / (k + rank)` with k=60 (tunable)
- RRF is simpler than weighted sums and doesn't require score normalization
5. Apply remaining filters (date ranges, labels, paths that weren't applied in SQL)
6. Return top K results
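Step 4's rank fusion can be sketched as follows (names illustrative, k = 60 per the algorithm above):

```typescript
// Merge the two retriever result lists into RRF scores keyed by document id.
function rrfMerge(vectorIds: number[], ftsIds: number[], k = 60): Map<number, number> {
  const scores = new Map<number, number>();
  for (const ids of [vectorIds, ftsIds]) {
    ids.forEach((id, i) => {
      const rank = i + 1; // 1-based rank within this retriever
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores; // sort entries descending by score for the final ranking
}
```

A document ranked first by both retrievers scores 2/61, beating a document ranked first by only one (1/61), which is exactly the behavior the fusion is after.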
**Why Adaptive Recall:**
- Fixed top-50 + filter can easily return 0 results even when relevant docs exist
- Increasing recall when filters are present catches more candidates before filtering
- SQL-level filtering is preferred (faster, uses indexes) but not always possible
**Why RRF over Weighted Sums:**
- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in information retrieval literature
**Graceful Degradation:**
- If Ollama is unreachable during search, automatically fall back to FTS5-only search
- Display warning: "Embedding service unavailable, using lexical search only"
- `embed` command fails with actionable error if Ollama is down
**CLI Interface:**
```bash
# Basic semantic search
gi search "authentication redesign"
# Search within specific project
gi search "authentication" --project=group/project-one
# Search by file path (finds discussions/MRs touching this file)
gi search "rate limit" --path=src/client.ts
# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical
# Filtered search
gi search "authentication" --type=mr --after=2024-01-01
# Filter by label
gi search "performance" --label=bug --label=critical
# JSON output for programmatic use
gi search "payment processing" --json
# Explain search (shows RRF contributions)
gi search "auth" --explain
```
**CLI Output Example:**
```
$ gi search "authentication"
Found 23 results (hybrid search, 0.34s)
[1] MR !847 - Refactor auth to use JWT tokens (0.82)
@johndoe · 2024-03-15 · group/project-one
"...moving away from session cookies to JWT for authentication..."
https://gitlab.example.com/group/project-one/-/merge_requests/847
[2] Issue #234 - Authentication redesign discussion (0.79)
@janedoe · 2024-02-28 · group/project-one
"...we need to redesign the authentication flow because..."
https://gitlab.example.com/group/project-one/-/issues/234
[3] Discussion on Issue #234 (0.76)
@johndoe · 2024-03-01 · group/project-one
"I think we should move to JWT-based auth because the session..."
https://gitlab.example.com/group/project-one/-/issues/234#note_12345
```
**JSON Output Schema (Stable):**
For AI/agent consumption, `--json` output follows this stable schema:
```typescript
interface SearchResult {
documentId: number;
sourceType: "issue" | "merge_request" | "discussion";
title: string | null;
url: string;
projectPath: string;
author: string | null;
createdAt: string; // ISO 8601
updatedAt: string; // ISO 8601
score: number; // normalized 0-1 (rrfScore / maxRrfScore in this result set)
snippet: string; // truncated content_text
labels: string[];
// Only present with --explain flag
explain?: {
vectorRank?: number; // null if not in vector results
ftsRank?: number; // null if not in FTS results
rrfScore: number; // raw RRF score (rank-based, comparable within a query)
};
}
// Note on score normalization:
// - `score` is normalized 0-1 for UI display convenience
// - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set)
// - Use `explain.rrfScore` for raw scores when comparing across queries
// - Scores are NOT comparable across different queries
interface SearchResponse {
query: string;
mode: "hybrid" | "lexical" | "semantic";
totalResults: number;
results: SearchResult[];
warnings?: string[]; // e.g., "Embedding service unavailable"
}
```
**Schema versioning:** Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed.
---
### Checkpoint 5: Incremental Sync
**Deliverable:** Efficient ongoing synchronization with GitLab
**Automated Tests (Vitest):**
```
tests/unit/cursor-management.test.ts
✓ advances cursor after successful page commit
✓ uses tie-breaker id for identical timestamps
✓ does not advance cursor on failure
✓ resets cursor on --full flag
tests/unit/change-detection.test.ts
✓ detects content_hash mismatch
✓ queues document for re-embedding on change
✓ skips re-embedding when hash unchanged
tests/integration/incremental-sync.test.ts
✓ fetches only items updated after cursor
✓ refetches discussions for updated issues
✓ refetches discussions for updated MRs
✓ updates existing records (not duplicates)
✓ creates new records for new items
✓ re-embeds documents with changed content_hash
tests/integration/sync-recovery.test.ts
✓ resumes from cursor after interrupted sync
✓ marks failed run with error message
✓ handles rate limiting (429) with backoff
✓ respects Retry-After header
```
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Last sync time, cursor positions | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
**End-to-End Sync Verification:**
1. Note the current `sync_cursors` values
2. Create a new comment on an issue in GitLab
3. Run `gi sync`
4. Verify:
- [ ] Issue's `updated_at` in DB matches GitLab
- [ ] New discussion row exists
- [ ] New note row exists
- [ ] New document row exists for discussion
- [ ] New embedding exists for document
- [ ] `gi search "new comment text"` returns the new discussion
- [ ] Cursor advanced past the updated issue
**Data Integrity Checks:**
- [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table
- [ ] No orphaned documents (all have valid source_id)
- [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- [ ] `sync_runs` has complete audit trail
**Scope:**
- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
- Dependent resources sync strategy (discussions refetched when parent updates)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
- Recommended: run via cron every 10 minutes
**Correctness Rules (MVP):**
1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
- **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
- Use **cursor rewind + local filtering**:
- Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
- Locally discard items where:
- `updated_at < cursor_updated_at`, OR
- `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
- This makes the tuple cursor rule true in practice while keeping API calls simple.
- Cursor advances only after successful DB commit for that page
- When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
3. Dependent resources:
- For each updated issue/MR, refetch ALL its discussions
- Discussion documents are regenerated and re-embedded if content_hash changes
4. Rolling backfill window:
- After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
- This catches any items whose timestamps were updated without triggering our cursor
5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
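Rule 2's local discard predicate can be sketched as a pure function (names illustrative; timestamps as epoch seconds):

```typescript
interface SyncCursor { updatedAt: number; gitlabId: number; }

// True if an item fetched with the rewound updated_after window lies
// strictly past the (updated_at, gitlab_id) tuple cursor and must be kept.
function isPastCursor(item: SyncCursor, cursor: SyncCursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.gitlabId > cursor.gitlabId; // tie-break on id at equal timestamps
}
```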
**Why Dependent Resource Model:**
- GitLab Discussions API doesn't provide a global `updated_after` stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct
**CLI Commands:**
```bash
# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
gi sync # orchestrates all steps
gi sync --no-embed # skip embedding step (fast ingest/debug)
gi sync --no-docs # skip document regeneration (debug)
# Force full re-sync (resets cursors)
gi sync --full
# Override stale 'running' run after operator review
gi sync --force
# Show sync status
gi sync-status
```
**Orchestration steps (in order):**
1. Acquire app lock with heartbeat
2. Ingest delta (issues, MRs) based on cursors
- For each upserted issue/MR, enqueue into `pending_discussion_fetches`
- INSERT into `dirty_sources` for each upserted issue/MR
3. Process `pending_discussion_fetches` queue (bounded per run, retryable):
- Fetch discussions for each queued parent
- On success: upsert discussions/notes, INSERT into `dirty_sources`, DELETE from queue
- On failure: increment `attempt_count`, record `last_error`, leave in queue for retry
- Bound processing: max N parents per sync run to avoid unbounded API calls
4. Apply rolling backfill window
5. Regenerate documents for entities in `dirty_sources` (process + delete from queue)
6. Embed documents with changed content_hash
7. FTS triggers auto-sync (no explicit step needed)
8. Release lock, record sync_run as succeeded
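The steps above can be sketched as a single guarded loop. Names here are illustrative, not the real module API; the point is the lock-release guarantee around the ordered steps:

```typescript
// Sketch only: step names and the lock interface are illustrative.
type Step = [name: string, run: () => Promise<void>];

async function runSync(
  lock: { acquire(): void; release(): void },
  steps: Step[]
): Promise<void> {
  lock.acquire(); // app_locks row with heartbeat
  try {
    for (const [, run] of steps) {
      await run(); // each step commits its own cursors/transactions
    }
  } finally {
    lock.release(); // always released, even when a step throws
  }
}
```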
**Why queue-based discussion fetching:**
- One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
- Primary resource cursors can advance independently
- Discussions can be retried without re-fetching all issues/MRs
- Bounded processing prevents unbounded API calls per sync run
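Under this model, the per-parent queue bookkeeping amounts to two statements. This is a sketch: column names follow the description above, and the real DDL is defined in Checkpoint 3A:

```sql
-- On successful fetch of a parent's discussions:
DELETE FROM pending_discussion_fetches WHERE id = :queue_id;

-- On failure: record the error and leave the row for a later retry:
UPDATE pending_discussion_fetches
SET attempt_count = attempt_count + 1,
    last_error = :error
WHERE id = :queue_id;
```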
Individual commands remain available for checkpoint testing and debugging:
- `gi ingest --type=issues`
- `gi ingest --type=merge_requests`
- `gi embed --all`
- `gi embed --retry-failed`
---
## CLI Command Reference
All commands support `--help` for detailed usage information.
### Setup & Diagnostics
| Command | CP | Description |
|---------|-----|-------------|
| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
| `gi auth-test` | 0 | Verify GitLab authentication |
| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
| `gi doctor --json` | 0 | JSON output for scripting |
| `gi version` | 0 | Show installed version |
### Data Ingestion
| Command | CP | Description |
|---------|-----|-------------|
| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
| `gi generate-docs` | 3A | Extract documents from issues/MRs/discussions |
| `gi embed --all` | 3B | Generate embeddings for all documents |
| `gi embed --retry-failed` | 3 | Retry failed embeddings |
| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
| `gi sync --force` | 5 | Override stale lock after operator review |
| `gi sync --no-embed` | 5 | Sync without embedding (faster) |
### Data Inspection
| Command | CP | Description |
|---------|-----|-------------|
| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
| `gi list mrs --limit=N` | 2 | List merge requests |
| `gi count issues` | 1 | Count issues |
| `gi count mrs` | 2 | Count merge requests |
| `gi count discussions --type=issue` | 1 | Count issue discussions |
| `gi count discussions` | 2 | Count all discussions |
| `gi count discussions --type=mr` | 2 | Count MR discussions |
| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
| `gi count notes` | 2 | Count all notes (excluding system) |
| `gi show issue <iid> [--project=PATH]` | 1 | Show issue details (prompts if iid ambiguous across projects) |
| `gi show mr <iid> [--project=PATH]` | 2 | Show MR details with discussions |
| `gi stats` | 3 | Embedding coverage statistics |
| `gi stats --json` | 3 | JSON stats for scripting |
| `gi sync-status` | 1 | Show cursor positions and last sync |
### Search
| Command | CP | Description |
|---------|-----|-------------|
| `gi search "query"` | 4 | Hybrid semantic + lexical search |
| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) |
| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type |
| `gi search "query" --author=USERNAME` | 4 | Filter by author |
| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date |
| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) |
| `gi search "query" --project=PATH` | 4 | Filter by project |
| `gi search "query" --path=FILE` | 4 | Filter by file path |
| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) |
| `gi search "query" --json` | 4 | JSON output for scripting |
| `gi search "query" --explain` | 4 | Show ranking breakdown |
### Database Management
| Command | CP | Description |
|---------|-----|-------------|
| `gi backup` | 0 | Create timestamped database backup |
| `gi reset --confirm` | 0 | Delete database and reset cursors |
---
## Error Handling
Common errors and their resolutions:
### Configuration Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `Config file not found` | No gi.config.json | Run `gi init` to create configuration |
| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually |
| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config |
### Authentication Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` |
| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope |
| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope |
### GitLab API Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) |
| `429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry |
| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity |
### Data Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `No documents indexed` | Sync not run | Run `gi sync` first |
| `No results found` | Query too specific | Try broader search terms |
| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale |
### Embedding Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` |
| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` |
| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` |
### Operational Behavior
| Scenario | Behavior |
|----------|----------|
| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. |
| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. |
| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. |
| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. |
| **Embedding permanent failure** | After 3 retries, document stays in `embedding_metadata` with `last_error` populated. Use `gi embed --retry-failed` to retry later, or `gi stats` to see failed count. Documents with failed embeddings are excluded from vector search but included in FTS. |
| **Orphaned records** | MVP: No automatic cleanup. `last_seen_at` field enables future detection of items deleted in GitLab. Post-MVP: `gi gc --dry-run` to identify orphans, `gi gc --confirm` to remove. |
---
## Database Management
### Database Location
The SQLite database is stored at an XDG-compliant location:
```
~/.local/share/gi/data.db
```
This can be overridden in `gi.config.json`:
```json
{
"storage": {
"dbPath": "/custom/path/to/data.db"
}
}
```
### Backup
Create a timestamped backup of the database:
```bash
gi backup
# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db
```
Backups are created via SQLite's online backup mechanism (`.backup`), which is safe even while the database is being written thanks to WAL mode.
### Reset
To completely reset the database and all sync cursors:
```bash
gi reset --confirm
```
This deletes:
- The database file
- All sync cursors
- All embeddings
You'll need to run `gi sync` again to repopulate.
### Schema Migrations
Database schema is version-tracked and migrations auto-apply on startup:
1. On first run, schema is created at latest version
2. On subsequent runs, pending migrations are applied automatically
3. Migration version is stored in `schema_version` table
4. Migrations are idempotent and reversible where possible
**Manual migration check:**
```bash
gi doctor --json | jq '.checks.database'
# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }
```
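The version check that drives step 2 can be sketched as pure selection logic. Names are illustrative; the real runner would also wrap each migration in a transaction and update `schema_version`:

```typescript
// Sketch: choose which migrations to apply given the stored schema version.
interface Migration {
  version: number;
  up: string; // SQL to apply
}

function pendingMigrations(current: number, all: Migration[]): Migration[] {
  return all
    .filter((m) => m.version > current)
    .sort((a, b) => a.version - b.version); // apply in ascending order
}
```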
---
## Future Work (Post-MVP)
The following features are explicitly deferred to keep MVP scope focused:
| Feature | Description | Depends On |
|---------|-------------|------------|
| **File History** | Query "what decisions were made about src/auth/login.ts?"; requires `mr_files` table (MR→file linkage) and commit-level indexing | MVP complete |
| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |
**Checkpoint 6 (File History) Schema Preview:**
```sql
-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER REFERENCES merge_requests(id),
old_path TEXT,
new_path TEXT,
new_file BOOLEAN,
deleted_file BOOLEAN,
renamed_file BOOLEAN,
UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);
-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
note_id INTEGER PRIMARY KEY REFERENCES notes(id),
old_path TEXT,
new_path TEXT,
old_line INTEGER,
new_line INTEGER,
position_type TEXT -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```
---
## Verification Strategy
Each checkpoint includes:
1. **Automated tests** - Unit tests for data transformations, integration tests for API calls
2. **CLI smoke tests** - Manual commands with expected outputs documented
3. **Data integrity checks** - Count verification against GitLab, schema validation
4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+)
---
## Risk Mitigation
| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery |
| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
| Pathological discussions | Queue-based discussion fetching; one bad thread doesn't block entire sync |
| Empty search results with filters | Adaptive recall (topK 50→200 when filtered) |
**SQLite Performance Defaults (MVP):**
- Enable `PRAGMA journal_mode=WAL;` on every connection
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources
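In practice these defaults amount to the following statements (a sketch; driver-level setup may express the pragmas differently):

```sql
-- On every connection open:
PRAGMA journal_mode=WAL;
PRAGMA foreign_keys=ON;

-- Page/batch inserts wrapped in one explicit transaction:
BEGIN;
-- ...upserts for one API page...
COMMIT;
```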
---
## Schema Summary
| Table | Checkpoint | Purpose |
|-------|------------|---------|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations (with heartbeat) |
| app_locks | 0 | Crash-safe single-flight lock |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage (gitlab_id as TEXT) |
| schema_version | 0 | Database migration version tracking |
| issues | 1 | Normalized issues (unique by project+iid) |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) |
| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) |
| merge_requests | 2 | Normalized MRs (unique by project+iid) |
| mr_labels | 2 | MR-label junction |
| documents | 3A | Unified searchable documents with truncation metadata |
| document_labels | 3A | Document-label junction for fast filtering |
| document_paths | 3A | Fast path filtering for documents (DiffNote file paths) |
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |
---
## Resolved Decisions
| Question | Decision | Rationale |
|----------|----------|-----------|
| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability |
| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise |
| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature |
| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature |
| Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 min is sufficient |
| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
| Discussions sync | **Resumable queue model** | Queue-based fetching allows one pathological thread to not block entire sync |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
| Embedding truncation | **Note-boundary aware middle drop** | Never cut mid-note; preserves semantic coherence |
| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation |
| FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
| Empty state UX | **Actionable messages** | Guide user to next step |
| raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items |
| Pagination | **X-Next-Page headers with empty-page fallback** | Headers are more robust than empty-page detection |
| Integration tests | **Mocked by default, live tests optional** | Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1 |
| Search recall with filters | **Adaptive topK (50→200 when filtered)** | Prevents "no results" when relevant docs exist outside top-50 |
| RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match |
| CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before embedding infra risk |
| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere |
| CLI framework | **Commander.js** | Simple, lightweight, sufficient for single-user CLI tool |
| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars truncated at char boundary with `[truncated]` marker |
| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single |
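For reference, the RRF and normalization decisions above can be sketched in a few lines. The constant `k = 60` is the conventional RRF default, assumed here rather than specified by this spec:

```typescript
// Reciprocal Rank Fusion over two ranked lists of document ids, with the
// per-query 0-1 normalization decided above (score = rrfScore / max(rrfScore)).
function rrfFuse(
  vectorIds: number[],
  lexicalIds: number[],
  k = 60
): Map<number, number> {
  const raw = new Map<number, number>();
  for (const list of [vectorIds, lexicalIds]) {
    list.forEach((id, rank) => {
      // rank is 0-based, so the top result contributes 1 / (k + 1)
      raw.set(id, (raw.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  const max = Math.max(...raw.values());
  return new Map([...raw.entries()].map(([id, s]) => [id, s / max]));
}
```

With this normalization the top-ranked document always scores exactly 1.0, and `--explain` would surface the raw `rrfScore` alongside it.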
---
## Next Steps
1. User approves this spec
2. Generate Checkpoint 0 PRD for project setup
3. Implement Checkpoint 0
4. Human validates → proceed to Checkpoint 1
5. Repeat for each checkpoint