# GitLab Knowledge Engine - Spec Document

> **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation.

## Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.

---

## Quick Start

### Prerequisites

| Requirement | Version | Notes |
|-------------|---------|-------|
| Node.js | 20+ | LTS recommended |
| npm | 10+ | Comes with Node.js |
| Ollama | Latest | Optional for semantic search; lexical search works without it |

### Installation

```bash
# Clone and install
git clone https://github.com/your-org/gitlab-inbox.git
cd gitlab-inbox
npm install
npm run build
npm link   # Makes `gi` available globally
```

### First Run

1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope):

   ```bash
   export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"
   ```

2. **Run the setup wizard:**

   ```bash
   gi init
   ```

   This creates `gi.config.json` with your GitLab URL and project paths.

3. **Verify your environment:**

   ```bash
   gi doctor
   ```

   All checks should pass (the Ollama warning is OK if you only need lexical search).

4. **Sync your data:**

   ```bash
   gi sync
   ```

   The initial sync takes 10-20 minutes depending on repo size and rate limits.

5. **Search:**

   ```bash
   gi search "authentication redesign"
   ```

### Troubleshooting First Run

| Symptom | Solution |
|---------|----------|
| `Config file not found` | Run `gi init` first |
| `GITLAB_TOKEN not set` | Export the environment variable |
| `401 Unauthorized` | Check that the token has `read_api` scope |
| `Project not found: group/project` | Verify the project path in the GitLab URL |
| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search |

---

## Discovery Summary

### Pain Points Identified

1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find
2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions

### Constraints

| Constraint | Detail |
|------------|--------|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is a TypeScript expert for review |

### Target Use Cases (Priority Order)

1. **MVP: Semantic Search** - "Find discussions about authentication redesign"
2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?"
3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?"
4. **Future: Person Context** - "What's @johndoe's background in this project?"

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                           GitLab API                            │
│                      (Issues, MRs, Notes)                       │
└─────────────────────────────────────────────────────────────────┘
        (Commit-level indexing explicitly post-MVP)
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Ingestion Layer                       │
│  - Incremental sync (PAT-based polling)                         │
│  - Rate limiting / backoff                                      │
│  - Raw JSON storage for replay                                  │
│  - Dependent resource fetching (notes, MR changes)              │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Data Processing Layer                      │
│  - Normalize artifacts to unified schema                        │
│  - Extract searchable documents (canonical text + metadata)     │
│  - Content hashing for change detection                         │
│  - MVP relationships: parent-child FKs + label/path associations│
│    (full cross-entity "decision graph" is post-MVP scope)       │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          Storage Layer                          │
│  - SQLite + sqlite-vec + FTS5 (hybrid search)                   │
│  - Structured metadata in relational tables                     │
│  - Vector embeddings for semantic search                        │
│  - Full-text index for lexical search fallback                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Query Interface                         │
│  - CLI for human testing                                        │
│  - JSON API for AI agent testing                                │
│  - Semantic search with filters (author, date, type, label)     │
└─────────────────────────────────────────────────────────────────┘
```

### Technology Choices

| Component | Choice | Rationale |
|-----------|--------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js | Simple, lightweight, well-documented |
| Logging | pino | Fast, JSON-structured, low overhead |
| Validation | Zod | TypeScript-first schema validation |

### Alternative Considered: sqlite-vss

- sqlite-vss was the original choice but is now deprecated
- No Apple Silicon support (no prebuilt ARM binaries)
- Replaced by sqlite-vec, which is pure C with no dependencies
- sqlite-vec uses the `vec0` virtual table (vs `vss0`)

### Alternative Considered: Postgres + pgvector

- Pros: more scalable, better for production multi-user
- Cons: requires running Postgres, heavier setup
- Decision: start with SQLite for simplicity; a migration path exists if needed

---

## Project Structure

```
gitlab-inbox/
├── src/
│   ├── cli/
│   │   ├── index.ts          # CLI entry point (Commander.js)
│   │   └── commands/         # One file per command group
│   │       ├── init.ts
│   │       ├── sync.ts
│   │       ├── search.ts
│   │       ├── list.ts
│   │       └── doctor.ts
│   ├── core/
│   │   ├── config.ts         # Config loading/validation (Zod)
│   │   ├── db.ts             # Database connection + migrations
│   │   ├── errors.ts         # Custom error classes
│   │   └── logger.ts         # pino logger setup
│   ├── gitlab/
│   │   ├── client.ts         # GitLab API client with rate limiting
│   │   ├── types.ts          # GitLab API response types
│   │   └── transformers/     # Payload → normalized schema
│   │       ├── issue.ts
│   │       ├── merge-request.ts
│   │       └── discussion.ts
│   ├── ingestion/
│   │   ├── issues.ts
│   │   ├── merge-requests.ts
│   │   └── discussions.ts
│   ├── documents/
│   │   ├── extractor.ts      # Document generation from entities
│   │   └── truncation.ts     # Note-boundary aware truncation
│   ├── embedding/
│   │   ├── ollama.ts         # Ollama client
│   │   └── pipeline.ts       # Batch embedding orchestration
│   ├── search/
│   │   ├── hybrid.ts         # RRF ranking logic
│   │   ├── fts.ts            # FTS5 queries
│   │   └── vector.ts         # sqlite-vec queries
│   └── types/
│       └── index.ts          # Shared TypeScript types
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── live/                 # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│   └── fixtures/
│       └── golden-queries.json
├── migrations/               # Numbered SQL migration files
│   ├── 001_initial.sql
│   └── ...
├── gi.config.json            # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md
```

---

## Dependencies

### Runtime Dependencies

```json
{
  "dependencies": {
    "better-sqlite3": "latest",
    "sqlite-vec": "latest",
    "commander": "latest",
    "zod": "latest",
    "pino": "latest",
    "pino-pretty": "latest",
    "ora": "latest",
    "chalk": "latest",
    "cli-table3": "latest"
  }
}
```

| Package | Purpose |
|---------|---------|
| better-sqlite3 | Synchronous SQLite driver (fast, native) |
| sqlite-vec | Vector search extension (pure C, cross-platform) |
| commander | CLI argument parsing |
| zod | Schema validation for config and inputs |
| pino | Structured JSON logging |
| pino-pretty | Dev-mode log formatting |
| ora | CLI spinners for progress indication |
| chalk | Terminal colors |
| cli-table3 | ASCII tables for list output |

### Dev Dependencies

```json
{
  "devDependencies": {
    "typescript": "latest",
    "@types/better-sqlite3": "latest",
    "@types/node": "latest",
    "vitest": "latest",
    "msw": "latest",
    "eslint": "latest",
    "@typescript-eslint/eslint-plugin": "latest",
    "@typescript-eslint/parser": "latest",
    "tsx": "latest"
  }
}
```

| Package | Purpose |
|---------|---------|
| typescript | TypeScript compiler |
| vitest | Test runner |
| msw | Mock Service Worker for API mocking in tests |
| eslint | Linting |
| tsx | Run TypeScript directly during development |

---

## GitLab API Strategy

### Primary Resources (Bulk Fetch)

Issues and MRs support efficient bulk fetching with incremental sync:

```
GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
```

**Required query params for completeness:**
- `scope=all` - include all issues/MRs, not just those authored by the current user
- `state=all` - include closed items (GitLab defaults may exclude them)

Without these params, the 2+ years of historical data would be incomplete.

### Dependent Resources (Per-Parent Fetch)

Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:

```
GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
```

**Pagination:** Discussion endpoints return paginated results. Fetch all pages per parent to avoid silent data loss.
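
As a sketch, the fetch-all-pages loop might look like the following. The `fetchPage` callback and `Page` shape are illustrative assumptions, not the actual client API; GitLab reports the next page via the `x-next-page` response header.

```typescript
// Sketch of header-driven pagination. `fetchPage` is a hypothetical callback
// that returns one parsed page plus the `x-next-page` header value.
type Page<T> = { items: T[]; nextPage: string | null };

async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  for (;;) {
    const { items, nextPage } = await fetchPage(page);
    all.push(...items);
    // GitLab sets `x-next-page` to the next page number and leaves it
    // empty/absent on the last page; an empty page is a fallback stop
    // in case the header is missing.
    if (!nextPage || items.length === 0) break;
    page = Number(nextPage);
  }
  return all;
}
```

The empty-page fallback mirrors the planned pagination tests ("falls back to empty-page stop if headers missing").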

### Sync Pattern

**Initial sync:**
1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
2. For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead)
3. Fetch all MRs (paginated, ~60 calls)
4. For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead)
5. Total: thousands of API calls for initial sync

**API Call Estimation Formula:**
```
total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue
            + ceil(mrs/100)   + mrs × avg_discussion_pages_per_mr
```

Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent:
- Issue list: 30 calls
- Issue discussions: 3,000 × 1.2 = 3,600 calls
- MR list: 30 calls
- MR discussions: 3,000 × 1.2 = 3,600 calls
- **Total: ~7,260 calls**

This matters for rate-limit planning and for setting realistic "10-20 minutes" expectations.
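
The formula can be sketched as a small helper; the parameter names below are illustrative, not part of the planned codebase:

```typescript
// Hypothetical helper implementing the call-estimation formula above.
interface SyncEstimateInput {
  issues: number;
  mrs: number;
  avgDiscussionPagesPerIssue: number;
  avgDiscussionPagesPerMr: number;
  perPage?: number; // list-endpoint page size, defaults to 100
}

function estimateSyncCalls(input: SyncEstimateInput): number {
  const perPage = input.perPage ?? 100;
  // One list call per page of issues/MRs...
  const listCalls =
    Math.ceil(input.issues / perPage) + Math.ceil(input.mrs / perPage);
  // ...plus roughly avg_pages discussion calls per parent.
  const discussionCalls =
    input.issues * input.avgDiscussionPagesPerIssue +
    input.mrs * input.avgDiscussionPagesPerMr;
  return Math.round(listCalls + discussionCalls);
}
```

For the worked example (3K issues, 3K MRs, 1.2 pages per parent) this yields the ~7,260 figure quoted above.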

**Incremental sync:**
1. Fetch issues where `updated_after=cursor` (bulk)
2. For EACH updated issue → refetch ALL its discussions
3. Fetch MRs where `updated_after=cursor` (bulk)
4. For EACH updated MR → refetch ALL its discussions

### Critical Assumption (Softened)

We *expect* that adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively.

**Mitigations (MVP):**
1. **Tuple cursor semantics:** The cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly: process all items with equal `updated_at` before advancing the cursor.
2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`).

The backfill window provides roughly 80% of the safety of a full resync at under 5% of the API cost.
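
A minimal sketch of the tuple-cursor comparison and the backfill lower bound, under the assumptions above (names are illustrative; the real sync would express the same predicate in SQL and in the `updated_after` query param):

```typescript
// Cursor is a stable (updated_at, gitlab_id) tuple; timestamps are ms epoch.
interface Cursor { updatedAt: number; tieBreakerId: number }
interface Item { updatedAt: number; gitlabId: number }

// An item is unprocessed iff it sorts strictly after the cursor in
// (updated_at ASC, gitlab_id ASC) order — ties on updated_at fall back
// to the gitlab_id tie-breaker.
function isAfterCursor(item: Item, cursor: Cursor | null): boolean {
  if (cursor === null) return true; // no cursor yet: everything is new
  return (
    item.updatedAt > cursor.updatedAt ||
    (item.updatedAt === cursor.updatedAt && item.gitlabId > cursor.tieBreakerId)
  );
}

// The effective sync lower bound combines the cursor with the rolling
// backfill window: anything updated in the last `backfillDays` is
// re-fetched even if the cursor has already passed it.
function effectiveCursorTimestamp(
  cursor: Cursor | null,
  nowMs: number,
  backfillDays: number,
): number {
  if (cursor === null) return 0;
  const backfillStart = nowMs - backfillDays * 24 * 60 * 60 * 1000;
  return Math.min(cursor.updatedAt, backfillStart);
}
```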

### Rate Limiting

- Default: 10 requests/second with exponential backoff
- Respect `Retry-After` headers on 429 responses
- Add jitter to avoid a thundering herd on retry
- **Separate concurrency limits:**
  - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4)
  - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s)
  - Bound concurrency per-project to avoid one repo starving the other
- Initial sync estimate: 10-20 minutes depending on rate limits
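
A sketch of the delay calculation, assuming full jitter; the defaults (`baseMs`, `capMs`) are illustrative, not decided values:

```typescript
// Exponential backoff with full jitter; a server-supplied Retry-After
// (from a 429 response) always wins over the computed delay.
function backoffDelayMs(
  attempt: number,            // 0-based retry attempt
  retryAfterSeconds?: number, // parsed Retry-After header, if present
  baseMs = 500,               // illustrative default
  capMs = 30_000,             // illustrative default
  random: () => number = Math.random, // injectable for testing
): number {
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  // Full jitter: uniform in [0, min(cap, base * 2^attempt)).
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Full jitter spreads simultaneous retries across the whole window, which is what avoids the thundering-herd effect the bullet above refers to.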

---

## Checkpoint Structure

Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.

### Checkpoint 0: Project Setup

**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working

**Automated Tests (Vitest):**
```
tests/unit/config.test.ts
  ✓ loads config from gi.config.json
  ✓ throws if config file missing
  ✓ throws if required fields missing (baseUrl, projects)
  ✓ validates project paths are non-empty strings

tests/unit/db.test.ts
  ✓ creates database file if not exists
  ✓ applies migrations in order
  ✓ sets WAL journal mode
  ✓ enables foreign keys

tests/integration/gitlab-client.test.ts
  ✓ (mocked) authenticates with valid PAT
  ✓ (mocked) returns 401 for invalid PAT
  ✓ (mocked) fetches project by path
  ✓ (mocked) handles rate limiting (429) with retry

tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI)
  ✓ authenticates with real PAT against configured baseUrl
  ✓ fetches real project by path
  ✓ handles actual rate limiting behavior

tests/integration/app-lock.test.ts
  ✓ acquires lock successfully
  ✓ updates heartbeat during operation
  ✓ detects stale lock and recovers
  ✓ refuses concurrent acquisition

tests/integration/init.test.ts
  ✓ creates config file with valid structure
  ✓ validates GitLab URL format
  ✓ validates GitLab connection before writing config
  ✓ validates each project path exists in GitLab
  ✓ fails if token not set
  ✓ fails if GitLab auth fails
  ✓ fails if any project path not found
  ✓ prompts before overwriting existing config
  ✓ respects --force to skip confirmation
  ✓ generates gi.config.json with sensible defaults
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi auth-test` | `Authenticated as @username (User Name)` | Shows GitLab username and display name |
| `gi doctor` | Status table with ✓/✗ for each check | All checks pass (or Ollama shows a warning if not running) |
| `gi doctor --json` | JSON object with check results | Valid JSON, `success: true` for required checks |
| `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
| `gi init` | Interactive prompts | Creates valid gi.config.json |
| `gi init` (config exists) | Confirmation prompt | Warns before overwriting |
| `gi --help` | Command list | Shows all available commands |
| `gi version` | Version number | Shows installed version |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |

**Data Integrity Checks:**
- [ ] `projects` table contains rows for each configured project path
- [ ] `gitlab_project_id` matches actual GitLab project IDs
- [ ] `raw_payloads` contains project JSON for each synced project

**Scope:**
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
- Projects table and initial project resolution (no issue/MR ingestion yet)
- DB migrations + WAL + FK enforcement
- Sync tracking with crash-safe single-flight lock (heartbeat-based)
- Rate limit handling with exponential backoff + jitter
- `gi init` command for guided setup:
  - Prompts for GitLab base URL
  - Prompts for project paths (comma-separated or multiple prompts)
  - Prompts for token environment variable name (default: GITLAB_TOKEN)
  - **Validates before writing config:**
    - Token must be set in environment
    - Tests auth with `GET /user` endpoint
    - Validates each project path with `GET /projects/:path`
    - Only writes config after all validations pass
  - Generates `gi.config.json` with sensible defaults
- `gi --help` shows all available commands
- `gi <command> --help` shows command-specific help
- `gi version` shows installed version
- First-run detection: if no config exists, suggest `gi init`

**Configuration (MVP):**
```json
// gi.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "sync": {
    "backfillDays": 14,
    "staleLockMinutes": 10,
    "heartbeatIntervalSeconds": 30,
    "cursorRewindSeconds": 2,
    "primaryConcurrency": 4,
    "dependentConcurrency": 2
  },
  "storage": {
    "compressRawPayloads": true
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434",
    "concurrency": 4
  }
}
```

**Raw Payload Compression:**
- When `storage.compressRawPayloads: true` (default), raw JSON payloads are gzip-compressed before storage
- `raw_payloads.content_encoding` indicates `'identity'` (uncompressed) or `'gzip'` (compressed)
- Compression typically reduces storage by 70-80% for JSON payloads
- Decompression is handled transparently when reading payloads
- Tradeoff: slightly higher CPU on write/read, significantly lower disk usage
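
A minimal sketch of the transparent encode/decode helpers using Node's built-in `zlib`; the function names are illustrative, not the planned storage API:

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

type ContentEncoding = 'identity' | 'gzip';

// Encode a raw JSON payload for the raw_payloads table, recording the
// content_encoding alongside the (possibly compressed) bytes.
function encodePayload(
  json: string,
  compress: boolean,
): { encoding: ContentEncoding; payload: Buffer } {
  return compress
    ? { encoding: 'gzip', payload: gzipSync(Buffer.from(json, 'utf8')) }
    : { encoding: 'identity', payload: Buffer.from(json, 'utf8') };
}

// Reading is transparent: the stored content_encoding decides whether
// to gunzip before returning the JSON string.
function decodePayload(encoding: ContentEncoding, payload: Buffer): string {
  return encoding === 'gzip'
    ? gunzipSync(payload).toString('utf8')
    : payload.toString('utf8');
}
```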

**Error Classes (src/core/errors.ts):**

```typescript
// Base error class with error codes for programmatic handling
export class GiError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
    this.name = 'GiError';
  }
}

// Config errors
export class ConfigNotFoundError extends GiError {
  constructor() {
    super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
  }
}

export class ConfigValidationError extends GiError {
  constructor(details: string) {
    super(`Invalid config: ${details}`, 'CONFIG_INVALID');
  }
}

// GitLab API errors
export class GitLabAuthError extends GiError {
  constructor() {
    super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
  }
}

export class GitLabNotFoundError extends GiError {
  constructor(resource: string) {
    super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
  }
}

export class GitLabRateLimitError extends GiError {
  constructor(public readonly retryAfter: number) {
    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
  }
}

// Database errors
export class DatabaseLockError extends GiError {
  constructor() {
    super('Another sync is running. Use --force to override.', 'DB_LOCKED');
  }
}

// Embedding errors
export class OllamaConnectionError extends GiError {
  constructor() {
    super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
  }
}

export class EmbeddingError extends GiError {
  constructor(documentId: number, reason: string) {
    super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
  }
}
```

**Logging Strategy (src/core/logger.ts):**

```typescript
import pino from 'pino';

// Logs go to stderr, results to stdout (allows clean JSON piping).
// Note: pino rejects a destination stream when `transport` is also set,
// so in dev the pretty transport writes to stderr via its own
// `destination` option instead.
const isProduction = process.env.NODE_ENV === 'production';

export const logger = isProduction
  ? pino({ level: process.env.LOG_LEVEL || 'info' }, pino.destination(2))
  : pino({
      level: process.env.LOG_LEVEL || 'info',
      transport: {
        target: 'pino-pretty',
        options: { colorize: true, destination: 2 } // 2 = stderr
      }
    });
```

**Log Levels:**

| Level | When to use |
|-------|-------------|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |

**Logging Conventions:**
- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include an `err` object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs go to stderr so `gi search --json` output stays clean on stdout

**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
  - `PRAGMA journal_mode=WAL;`
  - `PRAGMA foreign_keys=ON;`

**Schema (Checkpoint 0):**
```sql
-- Projects table (configured targets)
CREATE TABLE projects (
  id INTEGER PRIMARY KEY,
  gitlab_project_id INTEGER UNIQUE NOT NULL,
  path_with_namespace TEXT NOT NULL,
  default_branch TEXT,
  web_url TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);

-- Sync tracking for reliability
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  heartbeat_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL,        -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL,       -- 'ingest issues' | 'sync' | etc.
  error TEXT
);

-- Crash-safe single-flight lock (DB-enforced)
CREATE TABLE app_locks (
  name TEXT PRIMARY KEY,       -- 'sync'
  owner TEXT NOT NULL,         -- random run token (UUIDv4)
  acquired_at INTEGER NOT NULL,
  heartbeat_at INTEGER NOT NULL
);

-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests'
  updated_at_cursor INTEGER,   -- last fully processed updated_at (ms epoch)
  tie_breaker_id INTEGER,      -- last fully processed gitlab_id (for stable ordering)
  PRIMARY KEY(project_id, resource_type)
);

-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL,        -- 'gitlab'
  project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources
  resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
  gitlab_id TEXT NOT NULL,     -- TEXT because discussion IDs are strings; numeric IDs stored as strings
  fetched_at INTEGER NOT NULL,
  content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
  payload BLOB NOT NULL        -- raw JSON or gzip-compressed JSON
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);

-- Schema version tracking for migrations
CREATE TABLE schema_version (
  version INTEGER PRIMARY KEY,
  applied_at INTEGER NOT NULL,
  description TEXT
);
```

---

### Checkpoint 1: Issue Ingestion

**Deliverable:** All issues + labels + issue discussions from target repos stored locally with resumable cursor-based sync

**Automated Tests (Vitest):**
```
tests/unit/issue-transformer.test.ts
  ✓ transforms GitLab issue payload to normalized schema
  ✓ extracts labels from issue payload
  ✓ handles missing optional fields gracefully

tests/unit/pagination.test.ts
  ✓ fetches all pages when multiple exist
  ✓ respects per_page parameter
  ✓ follows X-Next-Page header until empty/absent
  ✓ falls back to empty-page stop if headers missing (robustness)

tests/unit/discussion-transformer.test.ts
  ✓ transforms discussion payload to normalized schema
  ✓ extracts notes array from discussion
  ✓ sets individual_note flag correctly
  ✓ flags system notes with is_system=1
  ✓ preserves note order via position field

tests/integration/issue-ingestion.test.ts
  ✓ inserts issues into database
  ✓ creates labels from issue payloads
  ✓ links issues to labels via junction table
  ✓ stores raw payload for each issue
  ✓ updates cursor after successful page commit
  ✓ resumes from cursor on subsequent runs

tests/integration/issue-discussion-ingestion.test.ts
  ✓ fetches discussions for each issue
  ✓ creates discussion rows with correct issue FK
  ✓ creates note rows linked to discussions
  ✓ stores system notes with is_system=1 flag
  ✓ handles individual_note=true discussions

tests/integration/sync-runs.test.ts
  ✓ creates sync_run record on start
  ✓ marks run as succeeded on completion
  ✓ marks run as failed with error message on failure
  ✓ refuses concurrent run (single-flight)
  ✓ allows --force to override stale running status
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=issues` | Progress bar, final count | Completes without error |
| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL. If multiple projects have issue #123, prompts for clarification or use `--project=PATH` |
| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count |
| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |

**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects
- [ ] Every issue has a corresponding `raw_payloads` row
- [ ] Labels in `issue_labels` junction all exist in `labels` table
- [ ] `sync_cursors` has an entry for each (project_id, 'issues') pair
- [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current)
- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero
- [ ] Every discussion has at least one note
- [ ] `individual_note = true` discussions have exactly one note

**Scope:**
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payload:
  - Always persist label names from `labels: string[]`
  - Optionally request `with_labels_details=true` to capture color/description when available
- Issue discussions fetcher:
  - Uses `GET /projects/:id/issues/:iid/discussions`
  - Fetches all discussions for each issue during ingest
  - Preserve system notes but flag them with `is_system=1`
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands

**Reliability/Idempotency Rules:**
- Every ingest/sync creates a `sync_runs` row
- Single-flight via DB-enforced app lock:
  - On start: acquire lock via transactional compare-and-swap:
    - `BEGIN IMMEDIATE` (acquires write lock immediately)
    - If no row exists → INSERT new lock
    - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
    - Else if `owner` matches current run → UPDATE heartbeat (re-entrant)
    - Else → ROLLBACK and fail fast (another run is active)
    - `COMMIT`
  - During run: update `heartbeat_at` every 30 seconds
  - If an existing lock's `heartbeat_at` is stale (> 10 minutes), treat it as abandoned and acquire
  - `--force` remains as an operator override for edge cases, but should rarely be needed
- Cursor advances only after successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
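
The compare-and-swap branch logic above can be sketched as a pure decision function, factored out of the `BEGIN IMMEDIATE` transaction so it is unit-testable without a database. Names here are illustrative assumptions, not the planned module API:

```typescript
// State of the app_locks row, if one exists (timestamps in ms epoch).
interface LockRow { owner: string; heartbeatAt: number }

// The four outcomes of the transactional compare-and-swap.
type LockDecision = 'insert' | 'steal-stale' | 'reenter' | 'fail';

function decideLock(
  existing: LockRow | null,
  self: string,            // this run's random UUID token
  nowMs: number,
  staleLockMinutes: number,
): LockDecision {
  if (existing === null) return 'insert';            // no row → INSERT new lock
  const staleMs = staleLockMinutes * 60 * 1000;
  if (nowMs - existing.heartbeatAt > staleMs) return 'steal-stale'; // abandoned → take over
  if (existing.owner === self) return 'reenter';     // our own lock → refresh heartbeat
  return 'fail';                                     // live lock held by another run
}
```

The caller would map `'fail'` to a `DatabaseLockError` and `ROLLBACK`, and the other three outcomes to the corresponding INSERT/UPDATE before `COMMIT`.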

**Schema Preview:**

```sql
CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);

-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER, -- optional (only if available)
  project_id INTEGER NOT NULL REFERENCES projects(id),
  name TEXT NOT NULL,
  color TEXT,
  description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);

CREATE TABLE issue_labels (
  issue_id INTEGER REFERENCES issues(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);

-- Discussion threads for issues (MR discussions added in CP2)
CREATE TABLE discussions (
  id INTEGER PRIMARY KEY,
  gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER, -- FK constraint added in CP2 via table rebuild
  noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL, -- standalone comment vs threaded discussion
  first_note_at INTEGER, -- for ordering discussions
  last_note_at INTEGER, -- for "recently active" queries
  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
  resolvable BOOLEAN, -- MR discussions can be resolved
  resolved BOOLEAN,
  CHECK (
    (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
    (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
  )
);
CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id);
CREATE INDEX idx_discussions_issue ON discussions(issue_id);
CREATE INDEX idx_discussions_mr ON discussions(merge_request_id);
CREATE INDEX idx_discussions_last_note ON discussions(last_note_at);

-- Notes belong to discussions (preserving thread context)
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT, -- 'DiscussionNote' | 'DiffNote' | null
  is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
  position INTEGER, -- derived from array order in API response (0-indexed)
  resolvable BOOLEAN,
  resolved BOOLEAN,
  resolved_by TEXT,
  resolved_at INTEGER,
  -- DiffNote position metadata (only populated for MR DiffNotes in CP2)
  position_old_path TEXT,
  position_new_path TEXT,
  position_old_line INTEGER,
  position_new_line INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_system ON notes(is_system);
```

---

### Checkpoint 2: MR Ingestion
**Deliverable:** All MRs + MR discussions + notes with DiffNote paths captured

**Automated Tests (Vitest):**

```
tests/unit/mr-transformer.test.ts
  ✓ transforms GitLab MR payload to normalized schema
  ✓ extracts labels from MR payload
  ✓ handles missing optional fields gracefully

tests/unit/diffnote-transformer.test.ts
  ✓ extracts DiffNote position metadata (paths and lines)
  ✓ handles missing position fields gracefully

tests/integration/mr-ingestion.test.ts
  ✓ inserts MRs into database
  ✓ creates labels from MR payloads
  ✓ links MRs to labels via junction table
  ✓ stores raw payload for each MR

tests/integration/mr-discussion-ingestion.test.ts
  ✓ fetches discussions for each MR
  ✓ creates discussion rows with correct MR FK
  ✓ creates note rows linked to discussions
  ✓ extracts position_new_path from DiffNotes
  ✓ captures note-level resolution status
  ✓ captures note type (DiscussionNote, DiffNote)
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi ingest --type=merge_requests` | Progress bar, final count | Completes without error |
| `gi list mrs --limit=10` | Table of 10 MRs | Shows iid, title, state, author, branch |
| `gi count mrs` | `Merge Requests: 567` (example) | Count matches GitLab UI |
| `gi show mr 123` | MR detail with discussions | Shows title, description, discussion threads |
| `gi count discussions` | `Discussions: 12,345` | Total count (issue + MR) |
| `gi count discussions --type=mr` | `MR Discussions: 6,789` | MR discussions only |
| `gi count notes` | `Notes: 45,678 (excluding 8,901 system)` | Total with system note count |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='MergeRequest'` is non-zero
- [ ] DiffNotes have `position_new_path` populated when available
- [ ] Discussion `first_note_at` <= `last_note_at` for all rows

**Scope:**

- MR fetcher with pagination
- MR discussions fetcher:
  - Uses `GET /projects/:id/merge_requests/:iid/discussions`
  - Fetches all discussions for each MR during ingest
  - Capture DiffNote file path/line metadata from `position` field for filename search
- Relationship linking (discussion → MR, notes → discussion)
- Extended CLI commands for MR display with threads
- Add `idx_notes_type` and `idx_notes_new_path` indexes for DiffNote queries

**Note:** MR file changes (mr_files) are deferred to Checkpoint 6 (File History) since they're only needed for "what MRs touched this file?" queries.

**Schema Additions:**

```sql
CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  source_branch TEXT,
  target_branch TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
  merged_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);

-- MR labels (reuse same labels table from CP1)
CREATE TABLE mr_labels (
  merge_request_id INTEGER REFERENCES merge_requests(id),
  label_id INTEGER REFERENCES labels(id),
  PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);

-- Additional indexes for DiffNote queries (tables created in CP1)
CREATE INDEX idx_notes_type ON notes(type);
CREATE INDEX idx_notes_new_path ON notes(position_new_path);

-- Migration: Add FK constraint to discussions table (was deferred from CP1)
-- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK
-- This is handled by the migration system; pseudocode for clarity:
-- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id)
-- 2. INSERT INTO discussions_new SELECT * FROM discussions
-- 3. DROP TABLE discussions
-- 4. ALTER TABLE discussions_new RENAME TO discussions
-- 5. Recreate indexes
```

**MR Discussion Processing Rules:**

- DiffNote position data is extracted and stored:
  - `position.old_path`, `position.new_path` for file-level search
  - `position.old_line`, `position.new_line` for line-level context
- MR discussions can be resolvable; resolution status is captured at note level

---

### Checkpoint 3A: Document Generation + FTS (Lexical Search)
**Deliverable:** Documents generated + FTS5 index; `gi search --mode=lexical` works end-to-end (no Ollama required)

**Automated Tests (Vitest):**

```
tests/unit/document-extractor.test.ts
  ✓ extracts issue document (title + description)
  ✓ extracts MR document (title + description)
  ✓ extracts discussion document with full thread context
  ✓ includes parent issue/MR title in discussion header
  ✓ formats notes with author and timestamp
  ✓ excludes system notes from discussion documents by default
  ✓ includes system notes only when --include-system-notes enabled (debug)
  ✓ truncates content exceeding 8000 tokens at note boundaries
  ✓ preserves first and last notes when truncating middle
  ✓ computes SHA-256 content hash consistently

tests/integration/document-creation.test.ts
  ✓ creates document for each issue
  ✓ creates document for each MR
  ✓ creates document for each discussion
  ✓ populates document_labels junction table
  ✓ computes content_hash for each document
  ✓ excludes system notes from discussion content

tests/integration/fts-index.test.ts
  ✓ documents_fts row count matches documents
  ✓ FTS triggers fire on insert/update/delete
  ✓ updates propagate via triggers

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ porter stemming works (search/searching)
  ✓ returns empty for non-matching query
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi generate-docs` | Progress bar, final count | Completes without error |
| `gi generate-docs` (re-run) | `0 documents to regenerate` | Skips unchanged docs |
| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama |
| `gi stats` | Document count stats | Shows document coverage |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- [ ] `SELECT COUNT(*) FROM documents_fts` = `SELECT COUNT(*) FROM documents` (via FTS triggers)
- [ ] `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` returns 0, and a warning was logged for each truncated document
- [ ] Discussion documents include parent title in content_text
- [ ] Discussion documents exclude system notes

**Scope:**

- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Truncation: content_text capped at 8000 tokens at NOTE boundaries
    - **Implementation:** Use character budget, not exact token count
    - `maxChars = 32000` (conservative 4 chars/token estimate)
    - Drop whole notes from middle, never cut mid-note
    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
  - System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search)
- Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table
- FTS5 index for lexical search
- `gi search --mode=lexical` CLI command (works without Ollama)

This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
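
The stable content hash used for change detection can be sketched with Node's standard `crypto` module: SHA-256 over `content_text`, hex-encoded, so identical input always yields an identical hash.

```typescript
import { createHash } from "node:crypto";

// Sketch of the change-detection hash: SHA-256 of content_text, hex-encoded.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```

A document is regenerated/re-embedded only when this hash differs from the stored `content_hash`.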

**Schema Additions (CP3A):**

```sql
-- Unified searchable documents (derived from issues/MRs/discussions)
-- (embedding-related tables are added in CP3B)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL, -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT, -- for discussions: first note author
  label_names TEXT, -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT, -- null for discussions
  content_text TEXT NOT NULL, -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL, -- SHA-256 for change detection
  is_truncated BOOLEAN NOT NULL DEFAULT 0,
  truncated_reason TEXT, -- 'token_limit_middle_drop' | null
  UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);

-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- Fast path filtering for documents (extracted from DiffNote positions)
CREATE TABLE document_paths (
  document_id INTEGER NOT NULL REFERENCES documents(id),
  path TEXT NOT NULL,
  PRIMARY KEY(document_id, path)
);
CREATE INDEX idx_document_paths_path ON document_paths(path);

-- Track sources that require document regeneration (populated during ingestion)
CREATE TABLE dirty_sources (
  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL, -- local DB id
  queued_at INTEGER NOT NULL,
  PRIMARY KEY(source_type, source_id)
);

-- Resumable dependent fetches (discussions are per-parent resources)
CREATE TABLE pending_discussion_fetches (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
  noteable_iid INTEGER NOT NULL, -- parent iid (stable human identifier)
  queued_at INTEGER NOT NULL,
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER,
  last_error TEXT,
  PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
  ON pending_discussion_fetches(attempt_count, last_attempt_at)
  WHERE last_error IS NOT NULL;

-- Full-text search for lexical retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61'
);

-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;
```

**FTS5 Tokenizer Notes:**

- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; a custom tokenizer is a post-MVP consideration

---

### Checkpoint 3B: Embedding Generation (Semantic Search)
**Deliverable:** Embeddings generated + `gi search --mode=semantic` works; graceful fallback if Ollama unavailable

**Automated Tests (Vitest):**

```
tests/unit/embedding-client.test.ts
  ✓ connects to Ollama API
  ✓ generates embedding for text input
  ✓ returns 768-dimension vector
  ✓ handles Ollama connection failure gracefully
  ✓ batches requests (32 documents per batch)

tests/integration/embedding-storage.test.ts
  ✓ stores embedding in sqlite-vec
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
  ✓ re-embeds when content_hash changes
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
| `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents |

**Data Integrity Checks:**

- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash`

**Scope:**

- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline:
  - Batch size: 32 documents per batch
  - Concurrency: configurable (default 4 workers)
  - Retry with exponential backoff for transient failures (max 3 attempts)
  - Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vec extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
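
The retry policy above can be sketched as a delay schedule. The spec fixes only "max 3 attempts"; the 1-second doubling base here is an assumed illustration, and the function name is hypothetical.

```typescript
// Sketch: exponential backoff delay for transient embedding failures.
// Assumed base of 1s doubling per attempt; only "max 3 attempts" is from the spec.
function backoffDelayMs(
  attempt: number,      // 0-based attempt index of the attempt that just failed
  baseMs = 1000,
  maxAttempts = 3,
): number | null {
  if (attempt >= maxAttempts - 1) return null; // give up; record last_error for targeted re-runs
  return baseMs * 2 ** attempt;                // attempt 0 → 1s, attempt 1 → 2s
}
```

A `null` return means the document is left in `embedding_metadata` with `last_error` set, so a later `gi embed` run can retry just the failures.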
|
||
|
||
**Ollama API Contract:**
|
||
|
||
```typescript
|
||
// POST http://localhost:11434/api/embed (batch endpoint - preferred)
|
||
interface OllamaEmbedRequest {
|
||
model: string; // "nomic-embed-text"
|
||
input: string[]; // array of texts to embed (up to 32)
|
||
}
|
||
|
||
interface OllamaEmbedResponse {
|
||
model: string;
|
||
embeddings: number[][]; // array of 768-dim vectors
|
||
}
|
||
|
||
// POST http://localhost:11434/api/embeddings (single text - fallback)
|
||
interface OllamaEmbeddingsRequest {
|
||
model: string;
|
||
prompt: string;
|
||
}
|
||
|
||
interface OllamaEmbeddingsResponse {
|
||
embedding: number[];
|
||
}
|
||
```
|
||
|
||
**Usage:**
|
||
- Use `/api/embed` for batching (up to 32 documents per request)
|
||
- Fall back to `/api/embeddings` for single documents or if batch fails
|
||
- Check Ollama availability with `GET http://localhost:11434/api/tags`
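
Splitting documents into `/api/embed` payloads of at most 32 inputs can be sketched as a pure function over the request shape defined above (`buildEmbedBatches` is an illustrative name, not part of the spec):

```typescript
// Sketch: chunk texts into OllamaEmbedRequest payloads of at most 32 inputs.
interface OllamaEmbedRequest {
  model: string;
  input: string[];
}

function buildEmbedBatches(
  texts: string[],
  model = "nomic-embed-text",
  batchSize = 32,
): OllamaEmbedRequest[] {
  const batches: OllamaEmbedRequest[] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push({ model, input: texts.slice(i, i + batchSize) });
  }
  return batches;
}
```

Each payload would then be POSTed to `/api/embed`, with the response `embeddings` array aligned index-for-index with `input`.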

**Schema Additions (CP3B):**

```sql
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
  embedding float[768]
);

-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id),
  model TEXT NOT NULL, -- 'nomic-embed-text'
  dims INTEGER NOT NULL, -- 768
  content_hash TEXT NOT NULL, -- copied from documents.content_hash
  created_at INTEGER NOT NULL,
  -- Error tracking for resumable embedding
  last_error TEXT, -- error message from last failed attempt
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER -- when last attempt occurred
);

-- Index for finding failed embeddings to retry
CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
```

**Storage Rule (MVP):**

- Insert embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid mapping fragility

**Document Extraction Rules:**

| Source | content_text Construction |
|--------|--------------------------|
| Issue | `title + "\n\n" + description` |
| MR | `title + "\n\n" + description` |
| Discussion | Full thread with context (see below) |

**Discussion Document Format:**

```
[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread

--- Thread ---

@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```

**System Notes Exclusion Rule:**

- System notes (is_system=1) are stored in the DB for audit purposes
- System notes are EXCLUDED from discussion documents by default
- This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings
- Debug flag `--include-system-notes` available for troubleshooting

This format preserves:

- Parent context (issue/MR title and number)
- Project path for scoped search
- Direct URL for navigation
- Labels for context
- File paths from DiffNotes (enables immediate file search)
- Author attribution for each note
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
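
The per-note body of the thread section above (`@author (date):` followed by the note text, system notes skipped) can be sketched as a small formatter. The `Note` shape and function name are illustrative only:

```typescript
// Sketch: render the "--- Thread ---" body of a discussion document.
// System notes are filtered out, matching the exclusion rule above.
interface Note {
  author: string;
  createdAt: string; // e.g. "2024-03-15"
  body: string;
  isSystem: boolean;
}

function formatThread(notes: Note[]): string {
  return notes
    .filter((n) => !n.isSystem)
    .map((n) => `@${n.author} (${n.createdAt}):\n${n.body}`)
    .join("\n\n");
}
```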

**Truncation (Note-Boundary Aware):**

If content exceeds 8000 tokens (~32000 chars):

**Algorithm:**

1. Count non-system notes in the discussion
2. If total chars ≤ maxChars, no truncation needed
3. Otherwise, drop whole notes from the MIDDLE:
   - Preserve first N notes and last M notes
   - Never cut mid-note (produces unreadable snippets and worse embeddings)
   - Continue dropping middle notes until under maxChars
4. Insert marker: `\n\n[... N notes omitted for length ...]\n\n`
5. Set `documents.is_truncated = 1`
6. Set `documents.truncated_reason = 'token_limit_middle_drop'`
7. Log a warning with document ID and original/truncated token count

**Edge Cases:**

- **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
- **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
- **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]`

**Why note-boundary truncation:**

- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
- First notes contain context/problem statement; last notes contain conclusions
- Middle notes are often back-and-forth that's less critical

**Token estimation:** `approxTokens = ceil(charCount / 4)`. No tokenizer dependency.
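
The middle-drop algorithm above can be sketched as follows. The alternate-from-the-center dropping order is one possible choice (the spec only requires first/last preservation and whole-note drops), and the oversized-note edge cases are omitted here for brevity:

```typescript
// Sketch: drop whole notes from the middle until the joined text fits maxChars.
// Always keeps at least the first and last note; never cuts inside a note.
function truncateMiddle(
  notes: string[],
  maxChars: number,
): { text: string; omitted: number } {
  const render = (head: string[], tail: string[], omitted: number) => {
    const marker = omitted > 0 ? [`[... ${omitted} notes omitted for length ...]`] : [];
    return [...head, ...marker, ...tail].join("\n\n");
  };
  let head = notes.slice(0, Math.ceil(notes.length / 2));
  let tail = notes.slice(head.length);
  let omitted = 0;
  // Drop notes closest to the middle of the thread, alternating sides.
  while (render(head, tail, omitted).length > maxChars && head.length + tail.length > 2) {
    if (head.length >= tail.length) head = head.slice(0, -1);
    else tail = tail.slice(1);
    omitted++;
  }
  return { text: render(head, tail, omitted), omitted };
}
```

When `omitted > 0`, the caller would also set `is_truncated = 1` and `truncated_reason = 'token_limit_middle_drop'` and log the warning from step 7.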

This metadata enables:

- Monitoring truncation frequency in production
- Future investigation of high-value truncated documents
- Debugging when search misses expected content

---

### Checkpoint 4: Hybrid Search (Semantic + Lexical)
**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content

**Automated Tests (Vitest):**

```
tests/unit/search-query.test.ts
  ✓ parses filter flags (--type, --author, --after, --label)
  ✓ validates date format for --after
  ✓ handles multiple --label flags

tests/unit/rrf-ranking.test.ts
  ✓ computes RRF score correctly
  ✓ merges results from vector and FTS retrievers
  ✓ handles documents appearing in only one retriever
  ✓ respects k=60 parameter

tests/integration/vector-search.test.ts
  ✓ returns results for semantic query
  ✓ ranks similar content higher
  ✓ returns empty for nonsense query

tests/integration/fts-search.test.ts
  ✓ returns exact keyword matches
  ✓ handles porter stemming (search/searching)
  ✓ returns empty for non-matching query

tests/integration/hybrid-search.test.ts
  ✓ combines vector and FTS results
  ✓ applies type filter correctly
  ✓ applies author filter correctly
  ✓ applies date filter correctly
  ✓ applies label filter correctly
  ✓ falls back to FTS when Ollama unavailable

tests/e2e/golden-queries.test.ts
  ✓ "authentication redesign" returns known auth-related items
  ✓ "database migration" returns known migration items
  ✓ [8 more domain-specific golden queries]
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings |
| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema |
| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" |
| `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |

**Golden Query Test Suite:**

Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:

```json
[
  {
    "query": "authentication redesign",
    "expectedUrls": [".../-/issues/234", ".../-/merge_requests/847"],
    "minResults": 1,
    "maxRank": 10
  }
]
```

Each query must have at least one expected URL appear in the top 10 results.

**Data Integrity Checks:**

- [ ] `documents_fts` row count matches `documents` row count
- [ ] Search returns results for known content (not empty)
- [ ] JSON output validates against defined schema
- [ ] All result URLs are valid GitLab URLs

**Scope:**

- Hybrid retrieval:
  - Vector recall (sqlite-vec) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Query embedding generation (same Ollama pipeline as documents)
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N`
  - `--limit=N` controls result count (default: 20, max: 100)
  - `--path` filters documents by referenced file paths (from DiffNote positions):
    - If `--path` ends with `/`: prefix match (`path LIKE 'src/auth/%'`)
    - Otherwise: exact match OR prefix on directory boundary
    - Example: `--path=src/auth/` matches `src/auth/login.ts` and `src/auth/utils/helpers.ts`
    - Example: `--path=src/auth/login.ts` matches only that exact file
    - Glob patterns deferred to post-MVP
  - Label filtering operates on `document_labels` (indexed, exact-match)
  - Filters work identically in hybrid and lexical modes
- Debug: `--explain` returns rank contributions from vector + FTS + RRF
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI/agent consumption (stable documented schema)
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
- Empty state handling:
  - No documents indexed: `No data indexed. Run 'gi sync' first.`
  - Query returns no results: `No results found for "query".`
  - Filters exclude all results: `No results match the specified filters.`
  - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
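
The `--path` matching rules above (trailing `/` means directory prefix; otherwise exact match or a prefix ending on a directory boundary) can be sketched as a predicate. The function name is illustrative:

```typescript
// Sketch: does a document-referenced file path satisfy a --path filter?
function pathMatches(filter: string, docPath: string): boolean {
  if (filter.endsWith("/")) {
    return docPath.startsWith(filter);               // directory prefix match
  }
  return docPath === filter || docPath.startsWith(filter + "/"); // exact, or directory boundary
}
```

The directory-boundary check is what keeps `--path=src/auth` from matching `src/authz/policy.ts`.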

**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**

1. Determine recall size (adaptive based on filters):
   - `baseTopK = 50`
   - If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200`
   - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
2. Query both vector index (top topK) and FTS5 (top topK)
   - Vector recall via sqlite-vec + FTS lexical recall via fts5
   - Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
3. Merge results by document_id
4. Combine with Reciprocal Rank Fusion (RRF):
   - For each retriever list, assign ranks (1..N)
   - `rrfScore = Σ 1 / (k + rank)` with k=60 (tunable)
   - RRF is simpler than weighted sums and doesn't require score normalization
5. Apply remaining filters (date ranges, labels, paths that weren't applied in SQL)
6. Return top K results
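
Steps 3-4 can be sketched directly from the formula above, fusing two ranked lists of document ids (`rrfMerge` is an illustrative name):

```typescript
// Sketch: Reciprocal Rank Fusion over two ranked retriever lists.
// rrfScore = sum over retrievers of 1 / (k + rank), with 1-based ranks.
function rrfMerge(
  vectorIds: number[],
  ftsIds: number[],
  k = 60,
): Array<{ id: number; score: number }> {
  const scores = new Map<number, number>();
  for (const list of [vectorIds, ftsIds]) {
    list.forEach((id, i) => {
      const rank = i + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A document appearing in both lists accumulates two terms and naturally outranks documents found by only one retriever at similar ranks.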
|
||
|
||
**Why Adaptive Recall:**
|
||
- Fixed top-50 + filter can easily return 0 results even when relevant docs exist
|
||
- Increasing recall when filters are present catches more candidates before filtering
|
||
- SQL-level filtering is preferred (faster, uses indexes) but not always possible
|
||
|
||
**Why RRF over Weighted Sums:**
|
||
- FTS5 BM25 scores and vector distances use different scales
|
||
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
|
||
- RRF operates on ranks, not scores, making it robust to scale differences
|
||
- Well-established in information retrieval literature
|
||
|
||
**Graceful Degradation:**
|
||
- If Ollama is unreachable during search, automatically fall back to FTS5-only search
|
||
- Display warning: "Embedding service unavailable, using lexical search only"
|
||
- `embed` command fails with actionable error if Ollama is down
|
||
|
||
**CLI Interface:**

```bash
# Basic semantic search
gi search "authentication redesign"

# Search within specific project
gi search "authentication" --project=group/project-one

# Search by file path (finds discussions/MRs touching this file)
gi search "rate limit" --path=src/client.ts

# Pure FTS search (fallback if embeddings unavailable)
gi search "redis" --mode=lexical

# Filtered search
gi search "authentication" --type=mr --after=2024-01-01

# Filter by label
gi search "performance" --label=bug --label=critical

# JSON output for programmatic use
gi search "payment processing" --json

# Explain search (shows RRF contributions)
gi search "auth" --explain
```

**CLI Output Example:**
```
$ gi search "authentication"

Found 23 results (hybrid search, 0.34s)

[1] MR !847 - Refactor auth to use JWT tokens (0.82)
    @johndoe · 2024-03-15 · group/project-one
    "...moving away from session cookies to JWT for authentication..."
    https://gitlab.example.com/group/project-one/-/merge_requests/847

[2] Issue #234 - Authentication redesign discussion (0.79)
    @janedoe · 2024-02-28 · group/project-one
    "...we need to redesign the authentication flow because..."
    https://gitlab.example.com/group/project-one/-/issues/234

[3] Discussion on Issue #234 (0.76)
    @johndoe · 2024-03-01 · group/project-one
    "I think we should move to JWT-based auth because the session..."
    https://gitlab.example.com/group/project-one/-/issues/234#note_12345
```

**JSON Output Schema (Stable):**

For AI/agent consumption, `--json` output follows this stable schema:

```typescript
interface SearchResult {
  documentId: number;
  sourceType: "issue" | "merge_request" | "discussion";
  title: string | null;
  url: string;
  projectPath: string;
  author: string | null;
  createdAt: string;   // ISO 8601
  updatedAt: string;   // ISO 8601
  score: number;       // normalized 0-1 (rrfScore / maxRrfScore in this result set)
  snippet: string;     // truncated content_text
  labels: string[];
  // Only present with --explain flag
  explain?: {
    vectorRank?: number; // absent if not in vector results
    ftsRank?: number;    // absent if not in FTS results
    rrfScore: number;    // raw RRF score (rank-based, comparable within a query)
  };
}

// Note on score normalization:
// - `score` is normalized 0-1 for UI display convenience
// - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set)
// - Use `explain.rrfScore` for raw scores when comparing across queries
// - Scores are NOT comparable across different queries

interface SearchResponse {
  query: string;
  mode: "hybrid" | "lexical" | "semantic";
  totalResults: number;
  results: SearchResult[];
  warnings?: string[]; // e.g., "Embedding service unavailable"
}
```

**Schema versioning:** Breaking changes require a major version bump in the CLI. Non-breaking additions (new optional fields) are allowed.

---

### Checkpoint 5: Incremental Sync
**Deliverable:** Efficient ongoing synchronization with GitLab

**Automated Tests (Vitest):**

```
tests/unit/cursor-management.test.ts
  ✓ advances cursor after successful page commit
  ✓ uses tie-breaker id for identical timestamps
  ✓ does not advance cursor on failure
  ✓ resets cursor on --full flag

tests/unit/change-detection.test.ts
  ✓ detects content_hash mismatch
  ✓ queues document for re-embedding on change
  ✓ skips re-embedding when hash unchanged

tests/integration/incremental-sync.test.ts
  ✓ fetches only items updated after cursor
  ✓ refetches discussions for updated issues
  ✓ refetches discussions for updated MRs
  ✓ updates existing records (not duplicates)
  ✓ creates new records for new items
  ✓ re-embeds documents with changed content_hash

tests/integration/sync-recovery.test.ts
  ✓ resumes from cursor after interrupted sync
  ✓ marks failed run with error message
  ✓ handles rate limiting (429) with backoff
  ✓ respects Retry-After header
```

**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Last sync time, cursor positions | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |

**End-to-End Sync Verification:**
1. Note the current `sync_cursors` values
2. Create a new comment on an issue in GitLab
3. Run `gi sync`
4. Verify:
   - [ ] Issue's `updated_at` in DB matches GitLab
   - [ ] New discussion row exists
   - [ ] New note row exists
   - [ ] New document row exists for discussion
   - [ ] New embedding exists for document
   - [ ] `gi search "new comment text"` returns the new discussion
   - [ ] Cursor advanced past the updated issue

**Data Integrity Checks:**
- [ ] `sync_cursors` timestamp <= max `updated_at` in corresponding table
- [ ] No orphaned documents (all have valid source_id)
- [ ] `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- [ ] `sync_runs` has complete audit trail

**Scope:**
- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
- Dependent resources sync strategy (discussions refetched when parent updates)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
- Recommended: run via cron every 10 minutes
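
The re-embedding rule reduces to a hash comparison. A sketch under assumed row shapes (the real code reads `documents` and `embedding_metadata` rows):

```typescript
// A document needs (re-)embedding iff its current content_hash differs
// from the hash recorded at embed time, or it was never embedded at all.
interface DocRow {
  id: number;
  contentHash: string;
}

function needsEmbedding(
  docs: DocRow[],
  embeddedHashes: Map<number, string>, // document id -> embedding_metadata.content_hash
): number[] {
  return docs
    .filter((d) => embeddedHashes.get(d.id) !== d.contentHash)
    .map((d) => d.id);
}
```

Note that a document missing from `embedding_metadata` fails the equality check too, so new documents are queued by the same predicate.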

**Correctness Rules (MVP):**
1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
   - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
   - Use **cursor rewind + local filtering**:
     - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
     - Locally discard items where:
       - `updated_at < cursor_updated_at`, OR
       - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
     - This makes the tuple cursor rule true in practice while keeping API calls simple.
   - Cursor advances only after successful DB commit for that page
   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
3. Dependent resources:
   - For each updated issue/MR, refetch ALL its discussions
   - Discussion documents are regenerated and re-embedded if content_hash changes
4. Rolling backfill window:
   - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
   - This catches any items whose timestamps were updated without triggering our cursor
5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
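
The local-filtering half of rule 2 can be sketched as a predicate over the tuple. This assumes timestamps are ISO 8601 strings in a uniform timezone, so lexicographic comparison matches chronological order; the names are illustrative:

```typescript
// Keep only items strictly after the cursor tuple (updated_at, gitlab_id).
// Items fetched via the rewound updated_after window are filtered here.
interface Cursor {
  updatedAt: string; // ISO 8601
  gitlabId: number;
}

function afterCursor<T extends Cursor>(items: T[], cursor: Cursor): T[] {
  return items.filter(
    (it) =>
      it.updatedAt > cursor.updatedAt ||
      (it.updatedAt === cursor.updatedAt && it.gitlabId > cursor.gitlabId),
  );
}
```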

**Why Dependent Resource Model:**
- GitLab Discussions API doesn't provide a global `updated_after` stream
- Discussions are listed per-issue or per-MR, not as a top-level resource
- Treating discussions as dependent resources (refetch when parent updates) is simpler and more correct

**CLI Commands:**
```bash
# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
gi sync            # orchestrates all steps
gi sync --no-embed # skip embedding step (fast ingest/debug)
gi sync --no-docs  # skip document regeneration (debug)

# Force full re-sync (resets cursors)
gi sync --full

# Override stale 'running' run after operator review
gi sync --force

# Show sync status
gi sync-status
```

**Orchestration steps (in order):**
1. Acquire app lock with heartbeat
2. Ingest delta (issues, MRs) based on cursors
   - For each upserted issue/MR, enqueue into `pending_discussion_fetches`
   - INSERT into `dirty_sources` for each upserted issue/MR
3. Process `pending_discussion_fetches` queue (bounded per run, retryable):
   - Fetch discussions for each queued parent
   - On success: upsert discussions/notes, INSERT into `dirty_sources`, DELETE from queue
   - On failure: increment `attempt_count`, record `last_error`, leave in queue for retry
   - Bound processing: max N parents per sync run to avoid unbounded API calls
4. Apply rolling backfill window
5. Regenerate documents for entities in `dirty_sources` (process + delete from queue)
6. Embed documents with changed content_hash
7. FTS triggers auto-sync (no explicit step needed)
8. Release lock, record sync_run as succeeded
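
Step 3's bounded, retryable queue drain might look like this; the function and field names are illustrative assumptions, while the real implementation works against the `pending_discussion_fetches` table:

```typescript
// Drain at most maxPerRun entries; failures stay queued with bookkeeping.
interface PendingFetch {
  parentId: number;
  attemptCount: number;
  lastError?: string;
}

async function drainQueue(
  queue: PendingFetch[],
  fetchDiscussions: (parentId: number) => Promise<void>,
  maxPerRun = 100, // bounds API calls per sync run
): Promise<{ done: number[]; retry: PendingFetch[] }> {
  const done: number[] = [];
  const retry: PendingFetch[] = [];
  for (const entry of queue.slice(0, maxPerRun)) {
    try {
      await fetchDiscussions(entry.parentId); // on success: upsert rows...
      done.push(entry.parentId);              // ...then DELETE from the queue
    } catch (err) {
      // On failure: keep the entry for the next run with error bookkeeping
      retry.push({ ...entry, attemptCount: entry.attemptCount + 1, lastError: String(err) });
    }
  }
  return { done, retry };
}
```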

**Why queue-based discussion fetching:**
- One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
- Primary resource cursors can advance independently
- Discussions can be retried without re-fetching all issues/MRs
- Bounded processing prevents unbounded API calls per sync run

Individual commands remain available for checkpoint testing and debugging:
- `gi ingest --type=issues`
- `gi ingest --type=merge_requests`
- `gi embed --all`
- `gi embed --retry-failed`

---

## CLI Command Reference

All commands support `--help` for detailed usage information.

### Setup & Diagnostics

| Command | CP | Description |
|---------|-----|-------------|
| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
| `gi auth-test` | 0 | Verify GitLab authentication |
| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
| `gi doctor --json` | 0 | JSON output for scripting |
| `gi version` | 0 | Show installed version |

### Data Ingestion

| Command | CP | Description |
|---------|-----|-------------|
| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
| `gi generate-docs` | 3A | Extract documents from issues/MRs/discussions |
| `gi embed --all` | 3B | Generate embeddings for all documents |
| `gi embed --retry-failed` | 3B | Retry failed embeddings |
| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
| `gi sync --force` | 5 | Override stale lock after operator review |
| `gi sync --no-embed` | 5 | Sync without embedding (faster) |

### Data Inspection

| Command | CP | Description |
|---------|-----|-------------|
| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
| `gi list mrs --limit=N` | 2 | List merge requests |
| `gi count issues` | 1 | Count issues |
| `gi count mrs` | 2 | Count merge requests |
| `gi count discussions --type=issue` | 1 | Count issue discussions |
| `gi count discussions` | 2 | Count all discussions |
| `gi count discussions --type=mr` | 2 | Count MR discussions |
| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
| `gi count notes` | 2 | Count all notes (excluding system) |
| `gi show issue <iid> [--project=PATH]` | 1 | Show issue details (prompts if iid is ambiguous across projects) |
| `gi show mr <iid> [--project=PATH]` | 2 | Show MR details with discussions |
| `gi stats` | 3 | Embedding coverage statistics |
| `gi stats --json` | 3 | JSON stats for scripting |
| `gi sync-status` | 1 | Show cursor positions and last sync |

### Search

| Command | CP | Description |
|---------|-----|-------------|
| `gi search "query"` | 4 | Hybrid semantic + lexical search |
| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) |
| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type |
| `gi search "query" --author=USERNAME` | 4 | Filter by author |
| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date |
| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) |
| `gi search "query" --project=PATH` | 4 | Filter by project |
| `gi search "query" --path=FILE` | 4 | Filter by file path |
| `gi search "query" --limit=N` | 4 | Limit results (default: 20, max: 100) |
| `gi search "query" --json` | 4 | JSON output for scripting |
| `gi search "query" --explain` | 4 | Show ranking breakdown |

### Database Management

| Command | CP | Description |
|---------|-----|-------------|
| `gi backup` | 0 | Create timestamped database backup |
| `gi reset --confirm` | 0 | Delete database and reset cursors |

---

## Error Handling

Common errors and their resolutions:

### Configuration Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Config file not found` | No gi.config.json | Run `gi init` to create configuration |
| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually |
| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config |

### Authentication Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` |
| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope |
| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope |

### GitLab API Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) |
| `429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry |
| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity |

### Data Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `No documents indexed` | Sync not run | Run `gi sync` first |
| `No results found` | Query too specific | Try broader search terms |
| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale |

### Embedding Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` |
| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` |
| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` |

### Operational Behavior

| Scenario | Behavior |
|----------|----------|
| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. |
| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. |
| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. |
| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. |
| **Embedding permanent failure** | After 3 retries, document stays in `embedding_metadata` with `last_error` populated. Use `gi embed --retry-failed` to retry later, or `gi stats` to see failed count. Documents with failed embeddings are excluded from vector search but included in FTS. |
| **Orphaned records** | MVP: No automatic cleanup. `last_seen_at` field enables future detection of items deleted in GitLab. Post-MVP: `gi gc --dry-run` to identify orphans, `gi gc --confirm` to remove. |

---

## Database Management

### Database Location

The SQLite database is stored at an XDG-compliant location:

```
~/.local/share/gi/data.db
```

This can be overridden in `gi.config.json`:

```json
{
  "storage": {
    "dbPath": "/custom/path/to/data.db"
  }
}
```

### Backup

Create a timestamped backup of the database:

```bash
gi backup
# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db
```

Backups are created with SQLite's `.backup` command (safe even during active writes due to WAL mode).

### Reset

To completely reset the database and all sync cursors:

```bash
gi reset --confirm
```

This deletes:
- The database file
- All sync cursors
- All embeddings

You'll need to run `gi sync` again to repopulate.

### Schema Migrations

Database schema is version-tracked and migrations auto-apply on startup:

1. On first run, schema is created at latest version
2. On subsequent runs, pending migrations are applied automatically
3. Migration version is stored in `schema_version` table
4. Migrations are idempotent and reversible where possible
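
A sketch of the forward-only application loop; the names are illustrative, and in practice each step would run inside a transaction and be recorded in `schema_version`:

```typescript
// Apply pending migrations in ascending version order.
interface Migration {
  version: number;
  up: () => void; // DDL statements in practice
}

function applyPending(currentVersion: number, migrations: Migration[]): number {
  const pending = migrations
    .filter((m) => m.version > currentVersion)
    .sort((a, b) => a.version - b.version);
  for (const m of pending) {
    m.up();                     // run the migration
    currentVersion = m.version; // record progress after each successful step
  }
  return currentVersion;
}
```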

**Manual migration check:**
```bash
gi doctor --json | jq '.checks.database'
# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }
```

---

## Future Work (Post-MVP)

The following features are explicitly deferred to keep MVP scope focused:

| Feature | Description | Depends On |
|---------|-------------|------------|
| **File History** | Query "what decisions were made about src/auth/login.ts?" Requires mr_files table (MR→file linkage), commit-level indexing | MVP complete |
| **Personal Dashboard** | Filter by assigned/mentioned, integrate with gitlab-inbox tool | MVP complete |
| **Person Context** | Aggregate contributions by author, expertise inference | MVP complete |
| **Decision Graph** | LLM-assisted decision extraction, relationship visualization | MVP + LLM integration |
| **MCP Server** | Expose search as MCP tool for Claude Code integration | Checkpoint 4 |
| **Custom Tokenizer** | Better handling of code identifiers (snake_case, paths) | Checkpoint 4 |

**Checkpoint 6 (File History) Schema Preview:**
```sql
-- Deferred from MVP; added when file-history feature is built
CREATE TABLE mr_files (
  id INTEGER PRIMARY KEY,
  merge_request_id INTEGER REFERENCES merge_requests(id),
  old_path TEXT,
  new_path TEXT,
  new_file BOOLEAN,
  deleted_file BOOLEAN,
  renamed_file BOOLEAN,
  UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);

-- DiffNote position data (for "show me comments on this file" queries)
-- Populated from notes.type='DiffNote' position object in GitLab API
CREATE TABLE note_positions (
  note_id INTEGER PRIMARY KEY REFERENCES notes(id),
  old_path TEXT,
  new_path TEXT,
  old_line INTEGER,
  new_line INTEGER,
  position_type TEXT -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```

---

## Verification Strategy

Each checkpoint includes:

1. **Automated tests** - Unit tests for data transformations, integration tests for API calls
2. **CLI smoke tests** - Manual commands with expected outputs documented
3. **Data integrity checks** - Count verification against GitLab, schema validation
4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+)

---

## Risk Mitigation

| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery |
| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
| Pathological discussions | Queue-based discussion fetching; one bad thread doesn't block entire sync |
| Empty search results with filters | Adaptive recall (topK 50→200 when filtered) |

**SQLite Performance Defaults (MVP):**
- Enable `PRAGMA journal_mode=WAL;` on every connection
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources

---

## Schema Summary

| Table | Checkpoint | Purpose |
|-------|------------|---------|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations (with heartbeat) |
| app_locks | 0 | Crash-safe single-flight lock |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage (gitlab_id as TEXT) |
| schema_version | 0 | Database migration version tracking |
| issues | 1 | Normalized issues (unique by project+iid) |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) |
| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) |
| merge_requests | 2 | Normalized MRs (unique by project+iid) |
| mr_labels | 2 | MR-label junction |
| documents | 3A | Unified searchable documents with truncation metadata |
| document_labels | 3A | Document-label junction for fast filtering |
| document_paths | 3A | Fast path filtering for documents (DiffNote file paths) |
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |

---

## Resolved Decisions

| Question | Decision | Rationale |
|----------|----------|-----------|
| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability |
| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise |
| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature |
| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature |
| Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 min is sufficient |
| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
| Discussions sync | **Resumable queue model** | Queue-based fetching ensures one pathological thread doesn't block the entire sync |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
| Embedding truncation | **Note-boundary aware middle drop** | Never cut mid-note; preserves semantic coherence |
| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balances throughput, memory, and error isolation |
| FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
| Empty state UX | **Actionable messages** | Guide user to next step |
| raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items |
| Pagination | **X-Next-Page headers with empty-page fallback** | Headers are more robust than empty-page detection |
| Integration tests | **Mocked by default, live tests optional** | Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1 |
| Search recall with filters | **Adaptive topK (50→200 when filtered)** | Prevents "no results" when relevant docs exist outside top-50 |
| RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match |
| CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before taking on embedding-infrastructure risk |
| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss is deprecated with no Apple Silicon support; sqlite-vec is pure C and runs anywhere |
| CLI framework | **Commander.js** | Simple, lightweight, sufficient for single-user CLI tool |
| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars truncated at char boundary with `[truncated]` marker |
| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single |

---

## Next Steps

1. User approves this spec
2. Generate Checkpoint 0 PRD for project setup
3. Implement Checkpoint 0
4. Human validates → proceed to Checkpoint 1
5. Repeat for each checkpoint