docs: Add comprehensive documentation and planning artifacts

README.md provides complete user documentation:
- Installation via cargo install or build from source
- Quick start guide with example commands
- Configuration file format with all options documented
- Full command reference for init, auth-test, doctor, ingest, list, show, count, sync-status, migrate, and version
- Database schema overview covering projects, issues, milestones, assignees, labels, discussions, notes, and raw payloads
- Development setup with test, lint, and debug commands

SPEC.md updated from original TypeScript planning document:
- Added note clarifying this is historical (implementation uses Rust)
- Updated sqlite-vss references to sqlite-vec (sqlite-vss is deprecated)
- Added architecture overview with Technology Choices rationale
- Expanded project structure showing all planned modules

docs/prd/ contains detailed checkpoint planning:
- checkpoint-0.md: Initial project vision and requirements
- checkpoint-1.md: Revised planning after technology decisions

These documents capture the evolution from initial concept through the decision to use Rust for performance and type safety.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 11:27:40 -05:00
parent e065862f81
commit 986bc59f6a
4 changed files with 3377 additions and 14 deletions
--- a/SPEC.md
+++ b/SPEC.md
@@ -1,5 +1,7 @@
 # GitLab Knowledge Engine - Spec Document

+> **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation.
+
 ## Executive Summary

 A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
@@ -122,7 +124,7 @@ npm link  # Makes `gi` available globally
                              ▼
 ┌─────────────────────────────────────────────────────────────────┐
 │                      Storage Layer                               │
-│  - SQLite + sqlite-vss + FTS5 (hybrid search)                   │
+│  - SQLite + sqlite-vec + FTS5 (hybrid search)                   │
 │  - Structured metadata in relational tables                      │
 │  - Vector embeddings for semantic search                         │
 │  - Full-text index for lexical search fallback                  │
@@ -139,12 +141,20 @@ npm link  # Makes `gi` available globally

 ### Technology Choices

-| Component | Recommendation | Rationale |
-|-----------|---------------|-----------|
+| Component | Choice | Rationale |
+|-----------|--------|-----------|
 | Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
-| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in |
+| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
 | Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
-| CLI Framework | Commander.js or oclif | Standard, well-documented |
+| CLI Framework | Commander.js | Simple, lightweight, well-documented |
+| Logging | pino | Fast, JSON-structured, low overhead |
+| Validation | Zod | TypeScript-first schema validation |
+
+### Alternative Considered: sqlite-vss
+- sqlite-vss was the original choice but is now deprecated
+- No Apple Silicon support (no prebuilt ARM binaries)
+- Replaced by sqlite-vec, which is pure C with no dependencies
+- sqlite-vec uses `vec0` virtual table (vs `vss0`)

 ### Alternative Considered: Postgres + pgvector
 - Pros: More scalable, better for production multi-user
@@ -153,6 +163,126 @@ npm link  # Makes `gi` available globally

 ---

+## Project Structure
+
+```
+gitlab-inbox/
+├── src/
+│   ├── cli/
+│   │   ├── index.ts          # CLI entry point (Commander.js)
+│   │   └── commands/         # One file per command group
+│   │       ├── init.ts
+│   │       ├── sync.ts
+│   │       ├── search.ts
+│   │       ├── list.ts
+│   │       └── doctor.ts
+│   ├── core/
+│   │   ├── config.ts         # Config loading/validation (Zod)
+│   │   ├── db.ts             # Database connection + migrations
+│   │   ├── errors.ts         # Custom error classes
+│   │   └── logger.ts         # pino logger setup
+│   ├── gitlab/
+│   │   ├── client.ts         # GitLab API client with rate limiting
+│   │   ├── types.ts          # GitLab API response types
+│   │   └── transformers/     # Payload → normalized schema
+│   │       ├── issue.ts
+│   │       ├── merge-request.ts
+│   │       └── discussion.ts
+│   ├── ingestion/
+│   │   ├── issues.ts
+│   │   ├── merge-requests.ts
+│   │   └── discussions.ts
+│   ├── documents/
+│   │   ├── extractor.ts      # Document generation from entities
+│   │   └── truncation.ts     # Note-boundary aware truncation
+│   ├── embedding/
+│   │   ├── ollama.ts         # Ollama client
+│   │   └── pipeline.ts       # Batch embedding orchestration
+│   ├── search/
+│   │   ├── hybrid.ts         # RRF ranking logic
+│   │   ├── fts.ts            # FTS5 queries
+│   │   └── vector.ts         # sqlite-vec queries
+│   └── types/
+│       └── index.ts          # Shared TypeScript types
+├── tests/
+│   ├── unit/
+│   ├── integration/
+│   ├── live/                 # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
+│   └── fixtures/
+│       └── golden-queries.json
+├── migrations/               # Numbered SQL migration files
+│   ├── 001_initial.sql
+│   └── ...
+├── gi.config.json           # User config (gitignored)
+├── package.json
+├── tsconfig.json
+├── vitest.config.ts
+├── eslint.config.js
+└── README.md
+```
+
+---
+
+## Dependencies
+
+### Runtime Dependencies
+
+```json
+{
+  "dependencies": {
+    "better-sqlite3": "latest",
+    "sqlite-vec": "latest",
+    "commander": "latest",
+    "zod": "latest",
+    "pino": "latest",
+    "pino-pretty": "latest",
+    "ora": "latest",
+    "chalk": "latest",
+    "cli-table3": "latest"
+  }
+}
+```
+
+| Package | Purpose |
+|---------|---------|
+| better-sqlite3 | Synchronous SQLite driver (fast, native) |
+| sqlite-vec | Vector search extension (pure C, cross-platform) |
+| commander | CLI argument parsing |
+| zod | Schema validation for config and inputs |
+| pino | Structured JSON logging |
+| pino-pretty | Dev-mode log formatting |
+| ora | CLI spinners for progress indication |
+| chalk | Terminal colors |
+| cli-table3 | ASCII tables for list output |
+
+### Dev Dependencies
+
+```json
+{
+  "devDependencies": {
+    "typescript": "latest",
+    "@types/better-sqlite3": "latest",
+    "@types/node": "latest",
+    "vitest": "latest",
+    "msw": "latest",
+    "eslint": "latest",
+    "@typescript-eslint/eslint-plugin": "latest",
+    "@typescript-eslint/parser": "latest",
+    "tsx": "latest"
+  }
+}
+```
+
+| Package | Purpose |
+|---------|---------|
+| typescript | TypeScript compiler |
+| vitest | Test runner |
+| msw | Mock Service Worker for API mocking in tests |
+| eslint | Linting |
+| tsx | Run TypeScript directly during development |
+
+---
+
 ## GitLab API Strategy

 ### Primary Resources (Bulk Fetch)
@@ -368,6 +498,98 @@ tests/integration/init.test.ts
 - Decompression is handled transparently when reading payloads
 - Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage

+**Error Classes (src/core/errors.ts):**
+
+```typescript
+// Base error class with error codes for programmatic handling
+export class GiError extends Error {
+  constructor(message: string, public readonly code: string) {
+    super(message);
+    this.name = 'GiError';
+  }
+}
+
+// Config errors
+export class ConfigNotFoundError extends GiError {
+  constructor() {
+    super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
+  }
+}
+
+export class ConfigValidationError extends GiError {
+  constructor(details: string) {
+    super(`Invalid config: ${details}`, 'CONFIG_INVALID');
+  }
+}
+
+// GitLab API errors
+export class GitLabAuthError extends GiError {
+  constructor() {
+    super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
+  }
+}
+
+export class GitLabNotFoundError extends GiError {
+  constructor(resource: string) {
+    super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
+  }
+}
+
+export class GitLabRateLimitError extends GiError {
+  constructor(public readonly retryAfter: number) {
+    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
+  }
+}
+
+// Database errors
+export class DatabaseLockError extends GiError {
+  constructor() {
+    super('Another sync is running. Use --force to override.', 'DB_LOCKED');
+  }
+}
+
+// Embedding errors
+export class OllamaConnectionError extends GiError {
+  constructor() {
+    super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
+  }
+}
+
+export class EmbeddingError extends GiError {
+  constructor(documentId: number, reason: string) {
+    super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
+  }
+}
+```
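One way the error codes above could be consumed at the CLI boundary is to convert any thrown value into a small report with a code, a message, and an exit status. The sketch below is illustrative only: `toErrorReport`, `ErrorReport`, and the `EXIT_CODES` table are hypothetical (the spec does not define exit codes), and the two classes are trimmed stand-ins for those in `src/core/errors.ts` so the block runs on its own.

```typescript
// Trimmed stand-ins for the classes in src/core/errors.ts.
class GiError extends Error {
  readonly code: string;
  constructor(message: string, code: string) {
    super(message);
    this.name = 'GiError';
    this.code = code;
  }
}

class GitLabRateLimitError extends GiError {
  readonly retryAfter: number;
  constructor(retryAfter: number) {
    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
    this.retryAfter = retryAfter;
  }
}

// Hypothetical mapping: these exit codes are an assumption, not part of the spec.
const EXIT_CODES: Record<string, number> = {
  CONFIG_NOT_FOUND: 2,
  CONFIG_INVALID: 2,
  GITLAB_AUTH_FAILED: 3,
  GITLAB_RATE_LIMITED: 4,
  DB_LOCKED: 5,
};

interface ErrorReport {
  code: string;
  message: string;
  exitCode: number;
}

// Convert any thrown value into a report the CLI can print as JSON.
function toErrorReport(err: unknown): ErrorReport {
  if (err instanceof GiError) {
    return { code: err.code, message: err.message, exitCode: EXIT_CODES[err.code] ?? 1 };
  }
  const message = err instanceof Error ? err.message : String(err);
  return { code: 'UNKNOWN', message, exitCode: 1 };
}
```

Because every `GiError` carries a stable `code`, callers can branch on the code instead of matching message text, which keeps messages free to change.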
+
+**Logging Strategy (src/core/logger.ts):**
+
+```typescript
+import pino from 'pino';
+
+// Logs go to stderr, results to stdout (allows clean JSON piping)
+export const logger = pino({
+  level: process.env.LOG_LEVEL || 'info',
+  transport: process.env.NODE_ENV === 'production' ? undefined : {
+    target: 'pino-pretty',
+    options: { colorize: true, destination: 2 }  // 2 = stderr
+  }
+}, pino.destination(2));
+```
+
+**Log Levels:**
+| Level | When to use |
+|-------|-------------|
+| debug | Detailed sync progress, API calls, SQL queries |
+| info | Sync start/complete, document counts, search timing |
+| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
+| error | Failures that stop operations |
+
+**Logging Conventions:**
+- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
+- Errors include err object: `logger.error({ err, documentId }, 'Embedding failed')`
+- All logs to stderr so `gi search --json` output stays clean on stdout
+
 **DB Runtime Defaults (Checkpoint 0):**
 - On every connection:
  - `PRAGMA journal_mode=WAL;`
@@ -930,7 +1152,7 @@ tests/unit/embedding-client.test.ts
  ✓ batches requests (32 documents per batch)

 tests/integration/embedding-storage.test.ts
-  ✓ stores embedding in sqlite-vss
+  ✓ stores embedding in sqlite-vec
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
@@ -959,16 +1181,46 @@ tests/integration/embedding-storage.test.ts
  - Concurrency: configurable (default 4 workers)
  - Retry with exponential backoff for transient failures (max 3 attempts)
  - Per-document failure recording to enable targeted re-runs
-- Vector storage in SQLite (sqlite-vss extension)
+- Vector storage in SQLite (sqlite-vec extension)
 - Progress tracking and resumability
 - `gi search --mode=semantic` CLI command

+**Ollama API Contract:**
+
+```typescript
+// POST http://localhost:11434/api/embed (batch endpoint - preferred)
+interface OllamaEmbedRequest {
+  model: string;      // "nomic-embed-text"
+  input: string[];    // array of texts to embed (up to 32)
+}
+
+interface OllamaEmbedResponse {
+  model: string;
+  embeddings: number[][];  // array of 768-dim vectors
+}
+
+// POST http://localhost:11434/api/embeddings (single text - fallback)
+interface OllamaEmbeddingsRequest {
+  model: string;
+  prompt: string;
+}
+
+interface OllamaEmbeddingsResponse {
+  embedding: number[];
+}
+```
+
+**Usage:**
+- Use `/api/embed` for batching (up to 32 documents per request)
+- Fall back to `/api/embeddings` for single documents or if batch fails
+- Check Ollama availability with `GET http://localhost:11434/api/tags`
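The batching rule above could be sketched as follows. This is a minimal illustration, not the project's client: `chunk`, `embedAll`, and the injected `PostJson` transport are hypothetical names. The request and response shapes follow the `/api/embed` contract listed above. The transport is passed in so that the sketch does not depend on a particular HTTP library; a thin wrapper around `fetch` would satisfy it.

```typescript
interface OllamaEmbedResponse {
  model: string;
  embeddings: number[][];
}

// Hypothetical transport: POSTs a JSON body and resolves with the parsed JSON reply.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

const BATCH_SIZE = 32; // batch limit stated in the spec

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed all texts in order, one /api/embed request per batch.
async function embedAll(
  texts: string[],
  postJson: PostJson,
  baseUrl: string = 'http://localhost:11434',
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) {
    const reply = (await postJson(`${baseUrl}/api/embed`, {
      model: 'nomic-embed-text',
      input: batch,
    })) as OllamaEmbedResponse;
    // Guard against a partial reply so vectors stay aligned with their documents.
    if (reply.embeddings.length !== batch.length) {
      throw new Error(`Expected ${batch.length} embeddings, got ${reply.embeddings.length}`);
    }
    vectors.push(...reply.embeddings);
  }
  return vectors;
}
```

Checking the reply length per batch matters here because the storage rule ties each embedding to a document id; a silently short reply would shift every later vector onto the wrong document.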
+
 **Schema Additions (CP3B):**
 ```sql
--- sqlite-vss virtual table for vector search
+-- sqlite-vec virtual table for vector search
 -- Storage rule: embeddings.rowid = documents.id
-CREATE VIRTUAL TABLE embeddings USING vss0(
-  embedding(768)
+CREATE VIRTUAL TABLE embeddings USING vec0(
+  embedding float[768]
 );

 -- Embedding provenance + change detection
@@ -1053,6 +1305,11 @@ If content exceeds 8000 tokens (~32000 chars):
 6. Set `documents.truncated_reason = 'token_limit_middle_drop'`
 7. Log a warning with document ID and original/truncated token count

+**Edge Cases:**
+- **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
+- **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
+- **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]`
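The single-note edge case could look roughly like this. It is a sketch under stated assumptions: `truncateOversizedNote` and `TruncationResult` are hypothetical names, and the choice to make the result (marker included) fit within the limit is an assumption, since the spec only says to cut at a character boundary and append `[truncated]`.

```typescript
const MAX_CHARS = 32000;      // character limit from the spec (~8000 tokens)
const MARKER = '[truncated]'; // marker text from the spec

interface TruncationResult {
  text: string;
  truncatedReason: string | null;
}

// Truncate one oversized note so the result, marker included, fits in maxChars.
// Assumes maxChars is larger than the marker length.
function truncateOversizedNote(note: string, maxChars: number = MAX_CHARS): TruncationResult {
  if (note.length <= maxChars) {
    return { text: note, truncatedReason: null };
  }
  let end = maxChars - MARKER.length;
  // Do not cut between the two halves of a UTF-16 surrogate pair.
  const lastUnit = note.charCodeAt(end - 1);
  if (lastUnit >= 0xd800 && lastUnit <= 0xdbff) {
    end -= 1;
  }
  return {
    text: note.slice(0, end) + MARKER,
    truncatedReason: 'single_note_oversized',
  };
}
```

The surrogate check is what "character boundary" is taken to mean in this sketch: JavaScript string indices count UTF-16 code units, so a naive `slice` can split an emoji and leave an invalid half-character at the cut.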
+
 **Why note-boundary truncation:**
 - Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
 - Keeping whole notes preserves semantic coherence for embeddings
@@ -1148,7 +1405,7 @@ Each query must have at least one expected URL appear in top 10 results.

 **Scope:**
 - Hybrid retrieval:
-  - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
+  - Vector recall (sqlite-vec) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
 - Query embedding generation (same Ollama pipeline as documents)
 - Result ranking and scoring (document-level)
@@ -1178,6 +1435,7 @@ Each query must have at least one expected URL appear in top 10 results.
   - If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200`
   - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
 2. Query both vector index (top topK) and FTS5 (top topK)
+   - Vector recall via sqlite-vec + FTS lexical recall via fts5
   - Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
 3. Merge results by document_id
 4. Combine with Reciprocal Rank Fusion (RRF):
@@ -1318,7 +1576,7 @@ tests/integration/incremental-sync.test.ts
  ✓ refetches discussions for updated MRs
  ✓ updates existing records (not duplicates)
  ✓ creates new records for new items
-  ✓ re-embeds documents with changed content
+  ✓ re-embeds documents with changed content_hash

 tests/integration/sync-recovery.test.ts
  ✓ resumes from cursor after interrupted sync
@@ -1731,7 +1989,7 @@ Each checkpoint includes:
 | dirty_sources | 3A | Queue for incremental document regeneration |
 | pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
 | documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
-| embeddings | 3B | Vector embeddings (sqlite-vss, rowid=document_id) |
+| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
 | embedding_metadata | 3B | Embedding provenance + error tracking |
 | mr_files | 6 | MR file changes (deferred to post-MVP) |

@@ -1759,7 +2017,7 @@ Each checkpoint includes:
 | JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
 | Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
 | `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
-| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
+| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exits cleanly |
 | Empty state UX | **Actionable messages** | Guide user to next step |
 | raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
 | GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items |
@@ -1769,6 +2027,12 @@ Each checkpoint includes:
 | RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain |
 | --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match |
 | CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before embedding infra risk |
+| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere |
+| CLI framework | **Commander.js** | Simple, lightweight, sufficient for single-user CLI tool |
+| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
+| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
+| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars truncated at char boundary with `[truncated]` marker |
+| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single |

 ---
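The RRF merge and the per-query normalization recorded in the decisions table (`score = rrfScore / max(rrfScore)`) could be sketched as below. The names `fuseWithRrf` and `FusedHit` are hypothetical, and both the constant `k = 60` and the 1-based ranks are assumptions borrowed from common RRF practice rather than values stated in this excerpt.

```typescript
interface FusedHit {
  documentId: number;
  rrfScore: number; // raw fused score (what an explain view could show)
  score: number;    // normalized to 0-1 within this query
}

// Fuse two ranked lists of document ids (best first) with Reciprocal Rank Fusion.
function fuseWithRrf(vectorIds: number[], ftsIds: number[], k: number = 60): FusedHit[] {
  const raw = new Map<number, number>();
  for (const ranked of [vectorIds, ftsIds]) {
    ranked.forEach((documentId, index) => {
      const rank = index + 1; // 1-based rank (assumption)
      raw.set(documentId, (raw.get(documentId) ?? 0) + 1 / (k + rank));
    });
  }
  const maxScore = Math.max(0, ...Array.from(raw.values()));
  return Array.from(raw.entries())
    .map(([documentId, rrfScore]) => ({
      documentId,
      rrfScore,
      score: maxScore > 0 ? rrfScore / maxScore : 0,
    }))
    .sort((a, b) => b.rrfScore - a.rrfScore);
}
```

For example, `fuseWithRrf([1, 2, 3], [2, 4])` places document 2 first because it is the only id present in both lists, and its normalized `score` is 1.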