docs: Add comprehensive documentation and planning artifacts

README.md provides complete user documentation:
- Installation via cargo install or build from source
- Quick start guide with example commands
- Configuration file format with all options documented
- Full command reference for init, auth-test, doctor, ingest,
  list, show, count, sync-status, migrate, and version
- Database schema overview covering projects, issues, milestones,
  assignees, labels, discussions, notes, and raw payloads
- Development setup with test, lint, and debug commands

SPEC.md updated from original TypeScript planning document:
- Added note clarifying this is historical (implementation uses Rust)
- Replaced sqlite-vss references with sqlite-vec (sqlite-vss is deprecated)
- Added architecture overview with Technology Choices rationale
- Expanded project structure showing all planned modules

docs/prd/ contains detailed checkpoint planning:
- checkpoint-0.md: Initial project vision and requirements
- checkpoint-1.md: Revised planning after technology decisions

These documents capture the evolution from initial concept through
the decision to use Rust for performance and type safety.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Taylor Eernisse
2026-01-26 11:27:40 -05:00
parent e065862f81
commit 986bc59f6a
4 changed files with 3377 additions and 14 deletions

292
SPEC.md

@@ -1,5 +1,7 @@
# GitLab Knowledge Engine - Spec Document
> **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation.
## Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
@@ -122,7 +124,7 @@ npm link # Makes `gi` available globally
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ - SQLite + sqlite-vss + FTS5 (hybrid search) │
│ - SQLite + sqlite-vec + FTS5 (hybrid search) │
│ - Structured metadata in relational tables │
│ - Vector embeddings for semantic search │
│ - Full-text index for lexical search fallback │
@@ -139,12 +141,20 @@ npm link # Makes `gi` available globally
### Technology Choices
| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in |
| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |
| CLI Framework | Commander.js | Simple, lightweight, well-documented |
| Logging | pino | Fast, JSON-structured, low overhead |
| Validation | Zod | TypeScript-first schema validation |
### Alternative Considered: sqlite-vss
- sqlite-vss was the original choice but is now deprecated
- No Apple Silicon support (no prebuilt ARM binaries)
- Replaced by sqlite-vec, which is pure C with no dependencies
- sqlite-vec uses the `vec0` virtual table module (sqlite-vss used `vss0`)
### Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
@@ -153,6 +163,126 @@ npm link # Makes `gi` available globally
---
## Project Structure
```
gitlab-inbox/
├── src/
│ ├── cli/
│ │ ├── index.ts # CLI entry point (Commander.js)
│ │ └── commands/ # One file per command group
│ │ ├── init.ts
│ │ ├── sync.ts
│ │ ├── search.ts
│ │ ├── list.ts
│ │ └── doctor.ts
│ ├── core/
│ │ ├── config.ts # Config loading/validation (Zod)
│ │ ├── db.ts # Database connection + migrations
│ │ ├── errors.ts # Custom error classes
│ │ └── logger.ts # pino logger setup
│ ├── gitlab/
│ │ ├── client.ts # GitLab API client with rate limiting
│ │ ├── types.ts # GitLab API response types
│ │ └── transformers/ # Payload → normalized schema
│ │ ├── issue.ts
│ │ ├── merge-request.ts
│ │ └── discussion.ts
│ ├── ingestion/
│ │ ├── issues.ts
│ │ ├── merge-requests.ts
│ │ └── discussions.ts
│ ├── documents/
│ │ ├── extractor.ts # Document generation from entities
│ │ └── truncation.ts # Note-boundary aware truncation
│ ├── embedding/
│ │ ├── ollama.ts # Ollama client
│ │ └── pipeline.ts # Batch embedding orchestration
│ ├── search/
│ │ ├── hybrid.ts # RRF ranking logic
│ │ ├── fts.ts # FTS5 queries
│ │ └── vector.ts # sqlite-vec queries
│ └── types/
│ └── index.ts # Shared TypeScript types
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── live/ # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│ └── fixtures/
│ └── golden-queries.json
├── migrations/ # Numbered SQL migration files
│ ├── 001_initial.sql
│ └── ...
├── gi.config.json # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md
```
---
## Dependencies
### Runtime Dependencies
```json
{
  "dependencies": {
    "better-sqlite3": "latest",
    "sqlite-vec": "latest",
    "commander": "latest",
    "zod": "latest",
    "pino": "latest",
    "pino-pretty": "latest",
    "ora": "latest",
    "chalk": "latest",
    "cli-table3": "latest"
  }
}
```
| Package | Purpose |
|---------|---------|
| better-sqlite3 | Synchronous SQLite driver (fast, native) |
| sqlite-vec | Vector search extension (pure C, cross-platform) |
| commander | CLI argument parsing |
| zod | Schema validation for config and inputs |
| pino | Structured JSON logging |
| pino-pretty | Dev-mode log formatting |
| ora | CLI spinners for progress indication |
| chalk | Terminal colors |
| cli-table3 | ASCII tables for list output |
### Dev Dependencies
```json
{
  "devDependencies": {
    "typescript": "latest",
    "@types/better-sqlite3": "latest",
    "@types/node": "latest",
    "vitest": "latest",
    "msw": "latest",
    "eslint": "latest",
    "@typescript-eslint/eslint-plugin": "latest",
    "@typescript-eslint/parser": "latest",
    "tsx": "latest"
  }
}
```
| Package | Purpose |
|---------|---------|
| typescript | TypeScript compiler |
| vitest | Test runner |
| msw | Mock Service Worker for API mocking in tests |
| eslint | Linting |
| tsx | Run TypeScript directly during development |
---
## GitLab API Strategy
### Primary Resources (Bulk Fetch)
@@ -368,6 +498,98 @@ tests/integration/init.test.ts
- Decompression is handled transparently when reading payloads
- Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage
**Error Classes (src/core/errors.ts):**
```typescript
// Base error class with error codes for programmatic handling
export class GiError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
    this.name = 'GiError';
  }
}

// Config errors
export class ConfigNotFoundError extends GiError {
  constructor() {
    super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
  }
}

export class ConfigValidationError extends GiError {
  constructor(details: string) {
    super(`Invalid config: ${details}`, 'CONFIG_INVALID');
  }
}

// GitLab API errors
export class GitLabAuthError extends GiError {
  constructor() {
    super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
  }
}

export class GitLabNotFoundError extends GiError {
  constructor(resource: string) {
    super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
  }
}

export class GitLabRateLimitError extends GiError {
  constructor(public readonly retryAfter: number) {
    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
  }
}

// Database errors
export class DatabaseLockError extends GiError {
  constructor() {
    super('Another sync is running. Use --force to override.', 'DB_LOCKED');
  }
}

// Embedding errors
export class OllamaConnectionError extends GiError {
  constructor() {
    super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
  }
}

export class EmbeddingError extends GiError {
  constructor(documentId: number, reason: string) {
    super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
  }
}
```
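
The error codes above are meant for programmatic handling. As a minimal sketch of what that could look like at the CLI boundary, the snippet below maps a thrown value to a process exit code. The `EXIT_CODES` table and the `exitCodeFor` helper are hypothetical and not part of the spec; `GiError` is repeated in reduced form only so the example runs on its own.

```typescript
// Reduced copy of the spec's base class, repeated so this sketch is self-contained.
class GiError extends Error {
  readonly code: string;
  constructor(message: string, code: string) {
    super(message);
    this.name = 'GiError';
    this.code = code;
  }
}

// Hypothetical exit-code table; the numbers are illustrative, not defined by the spec.
const EXIT_CODES: Record<string, number> = {
  CONFIG_NOT_FOUND: 2,
  CONFIG_INVALID: 2,
  GITLAB_AUTH_FAILED: 3,
  GITLAB_RATE_LIMITED: 4,
  DB_LOCKED: 5,
};

// Map any thrown value to an exit code; unknown codes and non-GiError values become 1.
function exitCodeFor(err: unknown): number {
  if (err instanceof GiError) {
    return EXIT_CODES[err.code] ?? 1;
  }
  return 1;
}
```

A top-level `catch` in the CLI entry point could then log the error and call `process.exit(exitCodeFor(err))`, keeping the code-to-exit-status mapping in one place.
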
**Logging Strategy (src/core/logger.ts):**
```typescript
import pino from 'pino';

// Logs go to stderr, results to stdout (allows clean JSON piping)
export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: process.env.NODE_ENV === 'production' ? undefined : {
    target: 'pino-pretty',
    options: { colorize: true, destination: 2 } // 2 = stderr
  }
}, pino.destination(2));
```
**Log Levels:**
| Level | When to use |
|-------|-------------|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |
**Logging Conventions:**
- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include err object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs to stderr so `gi search --json` output stays clean on stdout
**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
- `PRAGMA journal_mode=WAL;`
@@ -930,7 +1152,7 @@ tests/unit/embedding-client.test.ts
✓ batches requests (32 documents per batch)
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vss
✓ stores embedding in sqlite-vec
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
@@ -959,16 +1181,46 @@ tests/integration/embedding-storage.test.ts
- Concurrency: configurable (default 4 workers)
- Retry with exponential backoff for transient failures (max 3 attempts)
- Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vss extension)
- Vector storage in SQLite (sqlite-vec extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
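
The retry rule in the scope list (exponential backoff, max 3 attempts) could be sketched as follows. Only the attempt limit comes from the spec; the 500 ms base delay and the names `backoffDelayMs` and `withRetry` are assumptions for illustration. This sketch also retries every failure, whereas the spec limits retries to transient ones, so a real version would need a check for that.

```typescript
const MAX_ATTEMPTS = 3;    // from the spec
const BASE_DELAY_MS = 500; // assumed value, not specified

// Delay before retry number `retry` (1-based): base * 2^(retry - 1).
function backoffDelayMs(retry: number, baseMs: number = BASE_DELAY_MS): number {
  return baseMs * 2 ** (retry - 1);
}

// Run `task`, retrying on failure until `maxAttempts` total attempts are used.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts: number = MAX_ATTEMPTS,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Sleep before the next attempt; no sleep after the final failure.
        await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
      }
    }
  }
  throw lastError;
}
```

With these assumed values the waits are 500 ms and then 1000 ms before the third and final attempt.
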
**Ollama API Contract:**
```typescript
// POST http://localhost:11434/api/embed (batch endpoint - preferred)
interface OllamaEmbedRequest {
  model: string;   // "nomic-embed-text"
  input: string[]; // array of texts to embed (up to 32)
}

interface OllamaEmbedResponse {
  model: string;
  embeddings: number[][]; // array of 768-dim vectors
}

// POST http://localhost:11434/api/embeddings (single text - fallback)
interface OllamaEmbeddingsRequest {
  model: string;
  prompt: string;
}

interface OllamaEmbeddingsResponse {
  embedding: number[];
}
```
**Usage:**
- Use `/api/embed` for batching (up to 32 documents per request)
- Fall back to `/api/embeddings` for single documents or if batch fails
- Check Ollama availability with `GET http://localhost:11434/api/tags`
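
The batching rule above (up to 32 documents per `/api/embed` request) can be illustrated with a small pure function that splits a list of texts into request bodies. The helper name `buildEmbedRequests` is hypothetical, and the sketch covers only the batching step, not the HTTP call or the `/api/embeddings` fallback.

```typescript
// Request shape from the spec's Ollama API contract.
interface OllamaEmbedRequest {
  model: string;
  input: string[];
}

const BATCH_SIZE = 32; // batch limit stated in the spec

// Split `texts` into /api/embed request bodies of at most `batchSize` inputs each.
function buildEmbedRequests(
  texts: string[],
  model: string = 'nomic-embed-text',
  batchSize: number = BATCH_SIZE,
): OllamaEmbedRequest[] {
  const requests: OllamaEmbedRequest[] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    requests.push({ model, input: texts.slice(i, i + batchSize) });
  }
  return requests;
}
```

For 70 documents this yields three request bodies holding 32, 32, and 6 texts, each of which would be sent as the JSON body of a `POST` to `/api/embed`.
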
**Schema Additions (CP3B):**
```sql
-- sqlite-vss virtual table for vector search
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
embedding(768)
CREATE VIRTUAL TABLE embeddings USING vec0(
embedding float[768]
);
-- Embedding provenance + change detection
@@ -1053,6 +1305,11 @@ If content exceeds 8000 tokens (~32000 chars):
6. Set `documents.truncated_reason = 'token_limit_middle_drop'`
7. Log a warning with document ID and original/truncated token count
**Edge Cases:**
- **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
- **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
- **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]`
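
As a rough sketch of the single-note case above, the function below cuts an oversized note and appends the `[truncated]` marker. It assumes the marker counts toward the 32000-character limit and that "character boundary" means not splitting a UTF-16 surrogate pair; neither detail is stated in the spec, and `truncateSingleNote` is a hypothetical name.

```typescript
const MAX_CHARS = 32000;      // limit from the spec
const MARKER = '[truncated]'; // marker from the spec

interface TruncationResult {
  text: string;
  truncatedReason: string | null;
}

// Truncate one oversized note so that text plus marker fits within `maxChars`.
// Assumes `maxChars` is larger than the marker length.
function truncateSingleNote(note: string, maxChars: number = MAX_CHARS): TruncationResult {
  if (note.length <= maxChars) {
    return { text: note, truncatedReason: null };
  }
  let cut = maxChars - MARKER.length;
  // If the last kept code unit is a high surrogate, back up one unit so the pair is not split.
  const last = note.charCodeAt(cut - 1);
  if (last >= 0xd800 && last <= 0xdbff) {
    cut -= 1;
  }
  return { text: note.slice(0, cut) + MARKER, truncatedReason: 'single_note_oversized' };
}
```

The returned `truncatedReason` corresponds to the value the spec stores in `documents.truncated_reason`.
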
**Why note-boundary truncation:**
- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
@@ -1148,7 +1405,7 @@ Each query must have at least one expected URL appear in top 10 results.
**Scope:**
- Hybrid retrieval:
- Vector recall (sqlite-vss) + FTS lexical recall (fts5)
- Vector recall (sqlite-vec) + FTS lexical recall (fts5)
- Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Query embedding generation (same Ollama pipeline as documents)
- Result ranking and scoring (document-level)
@@ -1178,6 +1435,7 @@ Each query must have at least one expected URL appear in top 10 results.
- If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200`
- This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
2. Query both vector index (top topK) and FTS5 (top topK)
- Vector recall via sqlite-vec + FTS lexical recall via fts5
- Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
3. Merge results by document_id
4. Combine with Reciprocal Rank Fusion (RRF):
@@ -1318,7 +1576,7 @@ tests/integration/incremental-sync.test.ts
✓ refetches discussions for updated MRs
✓ updates existing records (not duplicates)
✓ creates new records for new items
✓ re-embeds documents with changed content
✓ re-embeds documents with changed content_hash
tests/integration/sync-recovery.test.ts
✓ resumes from cursor after interrupted sync
@@ -1731,7 +1989,7 @@ Each checkpoint includes:
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vss, rowid=document_id) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |
@@ -1759,7 +2017,7 @@ Each checkpoint includes:
| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
| Empty state UX | **Actionable messages** | Guide user to next step |
| raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items |
@@ -1769,6 +2027,12 @@ Each checkpoint includes:
| RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match |
| CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before embedding infra risk |
| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere |
| CLI framework | **Commander.js** | Simple, lightweight, sufficient for single-user CLI tool |
| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars truncated at char boundary with `[truncated]` marker |
| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single |
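
The RRF normalization decision in the table above (`score = rrfScore / max(rrfScore)`) can be sketched as a small fusion function over two ranked ID lists. The constant `k = 60` is a commonly used RRF default and an assumption here, since the spec's value is not shown in this excerpt; `fuseRankings` and `RankedResult` are hypothetical names.

```typescript
const RRF_K = 60; // assumed constant, not taken from the spec

interface RankedResult {
  documentId: number;
  rrfScore: number; // raw RRF score
  score: number;    // normalized per query to the 0-1 range
}

// Fuse two ranked lists of document IDs (best first) with Reciprocal Rank Fusion.
function fuseRankings(vectorIds: number[], ftsIds: number[], k: number = RRF_K): RankedResult[] {
  const raw = new Map<number, number>();
  for (const ids of [vectorIds, ftsIds]) {
    ids.forEach((id, index) => {
      // Ranks are 1-based, so the top hit in a list contributes 1 / (k + 1).
      raw.set(id, (raw.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  const max = Math.max(...Array.from(raw.values()));
  return Array.from(raw.entries())
    .map(([documentId, rrfScore]) => ({ documentId, rrfScore, score: rrfScore / max }))
    .sort((a, b) => b.score - a.score);
}
```

A document that appears in both lists accumulates two reciprocal-rank terms, so it tends to outrank documents found by only one retriever; the best document always gets a normalized score of 1.
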
---