docs: Add comprehensive documentation and planning artifacts

README.md provides complete user documentation:
- Installation via cargo install or build from source
- Quick start guide with example commands
- Configuration file format with all options documented
- Full command reference for init, auth-test, doctor, ingest, list, show, count, sync-status, migrate, and version
- Database schema overview covering projects, issues, milestones, assignees, labels, discussions, notes, and raw payloads
- Development setup with test, lint, and debug commands

SPEC.md updated from original TypeScript planning document:
- Added note clarifying this is historical (implementation uses Rust)
- Replaced references to the deprecated sqlite-vss library with sqlite-vec
- Added architecture overview with Technology Choices rationale
- Expanded project structure showing all planned modules

docs/prd/ contains detailed checkpoint planning:
- checkpoint-0.md: Initial project vision and requirements
- checkpoint-1.md: Revised planning after technology decisions

These documents capture the evolution from initial concept through the decision to use Rust for performance and type safety.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
292
SPEC.md
@@ -1,5 +1,7 @@
# GitLab Knowledge Engine - Spec Document

> **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation.

## Executive Summary

A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.
@@ -122,7 +124,7 @@ npm link # Makes `gi` available globally
▼
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ - SQLite + sqlite-vss + FTS5 (hybrid search) │
│ - SQLite + sqlite-vec + FTS5 (hybrid search) │
│ - Structured metadata in relational tables │
│ - Vector embeddings for semantic search │
│ - Full-text index for lexical search fallback │
@@ -139,12 +141,20 @@ npm link # Makes `gi` available globally

### Technology Choices

| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Component | Choice | Rationale |
|-----------|--------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in |
| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |
| CLI Framework | Commander.js | Simple, lightweight, well-documented |
| Logging | pino | Fast, JSON-structured, low overhead |
| Validation | Zod | TypeScript-first schema validation |

### Alternative Considered: sqlite-vss
- sqlite-vss was the original choice but is now deprecated
- No Apple Silicon support (no prebuilt ARM binaries)
- Replaced by sqlite-vec, which is pure C with no dependencies
- sqlite-vec uses `vec0` virtual table (vs `vss0`)

### Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
@@ -153,6 +163,126 @@ npm link # Makes `gi` available globally

---

## Project Structure

```
gitlab-inbox/
├── src/
│   ├── cli/
│   │   ├── index.ts  # CLI entry point (Commander.js)
│   │   └── commands/  # One file per command group
│   │       ├── init.ts
│   │       ├── sync.ts
│   │       ├── search.ts
│   │       ├── list.ts
│   │       └── doctor.ts
│   ├── core/
│   │   ├── config.ts  # Config loading/validation (Zod)
│   │   ├── db.ts  # Database connection + migrations
│   │   ├── errors.ts  # Custom error classes
│   │   └── logger.ts  # pino logger setup
│   ├── gitlab/
│   │   ├── client.ts  # GitLab API client with rate limiting
│   │   ├── types.ts  # GitLab API response types
│   │   └── transformers/  # Payload → normalized schema
│   │       ├── issue.ts
│   │       ├── merge-request.ts
│   │       └── discussion.ts
│   ├── ingestion/
│   │   ├── issues.ts
│   │   ├── merge-requests.ts
│   │   └── discussions.ts
│   ├── documents/
│   │   ├── extractor.ts  # Document generation from entities
│   │   └── truncation.ts  # Note-boundary aware truncation
│   ├── embedding/
│   │   ├── ollama.ts  # Ollama client
│   │   └── pipeline.ts  # Batch embedding orchestration
│   ├── search/
│   │   ├── hybrid.ts  # RRF ranking logic
│   │   ├── fts.ts  # FTS5 queries
│   │   └── vector.ts  # sqlite-vec queries
│   └── types/
│       └── index.ts  # Shared TypeScript types
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── live/  # Optional GitLab live tests (GITLAB_LIVE_TESTS=1)
│   └── fixtures/
│       └── golden-queries.json
├── migrations/  # Numbered SQL migration files
│   ├── 001_initial.sql
│   └── ...
├── gi.config.json  # User config (gitignored)
├── package.json
├── tsconfig.json
├── vitest.config.ts
├── eslint.config.js
└── README.md
```

---

## Dependencies

### Runtime Dependencies

```json
{
  "dependencies": {
    "better-sqlite3": "latest",
    "sqlite-vec": "latest",
    "commander": "latest",
    "zod": "latest",
    "pino": "latest",
    "pino-pretty": "latest",
    "ora": "latest",
    "chalk": "latest",
    "cli-table3": "latest"
  }
}
```

| Package | Purpose |
|---------|---------|
| better-sqlite3 | Synchronous SQLite driver (fast, native) |
| sqlite-vec | Vector search extension (pure C, cross-platform) |
| commander | CLI argument parsing |
| zod | Schema validation for config and inputs |
| pino | Structured JSON logging |
| pino-pretty | Dev-mode log formatting |
| ora | CLI spinners for progress indication |
| chalk | Terminal colors |
| cli-table3 | ASCII tables for list output |

### Dev Dependencies

```json
{
  "devDependencies": {
    "typescript": "latest",
    "@types/better-sqlite3": "latest",
    "@types/node": "latest",
    "vitest": "latest",
    "msw": "latest",
    "eslint": "latest",
    "@typescript-eslint/eslint-plugin": "latest",
    "@typescript-eslint/parser": "latest",
    "tsx": "latest"
  }
}
```

| Package | Purpose |
|---------|---------|
| typescript | TypeScript compiler |
| vitest | Test runner |
| msw | Mock Service Worker for API mocking in tests |
| eslint | Linting |
| tsx | Run TypeScript directly during development |

---

## GitLab API Strategy

### Primary Resources (Bulk Fetch)
@@ -368,6 +498,98 @@ tests/integration/init.test.ts
- Decompression is handled transparently when reading payloads
- Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage

**Error Classes (src/core/errors.ts):**

```typescript
// Base error class with error codes for programmatic handling
export class GiError extends Error {
  constructor(message: string, public readonly code: string) {
    super(message);
    this.name = 'GiError';
  }
}

// Config errors
export class ConfigNotFoundError extends GiError {
  constructor() {
    super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND');
  }
}

export class ConfigValidationError extends GiError {
  constructor(details: string) {
    super(`Invalid config: ${details}`, 'CONFIG_INVALID');
  }
}

// GitLab API errors
export class GitLabAuthError extends GiError {
  constructor() {
    super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED');
  }
}

export class GitLabNotFoundError extends GiError {
  constructor(resource: string) {
    super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND');
  }
}

export class GitLabRateLimitError extends GiError {
  constructor(public readonly retryAfter: number) {
    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
  }
}

// Database errors
export class DatabaseLockError extends GiError {
  constructor() {
    super('Another sync is running. Use --force to override.', 'DB_LOCKED');
  }
}

// Embedding errors
export class OllamaConnectionError extends GiError {
  constructor() {
    super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE');
  }
}

export class EmbeddingError extends GiError {
  constructor(documentId: number, reason: string) {
    super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED');
  }
}
```
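
A caller could branch on the error class and its `code` field to produce a stable, machine-readable report. The sketch below is illustrative only and assumes nothing beyond the class shapes shown above: it redeclares two of the classes (with explicit fields) so it runs on its own, and `toErrorReport`, `ErrorReport`, and the `UNKNOWN` fallback code are hypothetical names that do not appear in the spec.

```typescript
// Stand-ins for two classes from src/core/errors.ts, redeclared so the sketch runs alone.
class GiError extends Error {
  readonly code: string;
  constructor(message: string, code: string) {
    super(message);
    this.name = 'GiError';
    this.code = code;
    // Keeps `instanceof` reliable if the code is compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class GitLabRateLimitError extends GiError {
  readonly retryAfter: number;
  constructor(retryAfter: number) {
    super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED');
    this.retryAfter = retryAfter;
  }
}

// Hypothetical JSON-friendly shape for reporting failures.
interface ErrorReport {
  code: string;
  message: string;
  retryAfterSeconds?: number;
}

// Hypothetical helper: convert any thrown value into an ErrorReport.
function toErrorReport(err: unknown): ErrorReport {
  if (err instanceof GitLabRateLimitError) {
    return { code: err.code, message: err.message, retryAfterSeconds: err.retryAfter };
  }
  if (err instanceof GiError) {
    return { code: err.code, message: err.message };
  }
  // 'UNKNOWN' is an assumed fallback code, not one defined in the spec.
  const message = err instanceof Error ? err.message : String(err);
  return { code: 'UNKNOWN', message };
}
```

Checking the most specific class first matters here, because every subclass also satisfies `instanceof GiError`.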

**Logging Strategy (src/core/logger.ts):**

```typescript
import pino from 'pino';

// Logs go to stderr, results to stdout (allows clean JSON piping)
export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  transport: process.env.NODE_ENV === 'production' ? undefined : {
    target: 'pino-pretty',
    options: { colorize: true, destination: 2 } // 2 = stderr
  }
}, pino.destination(2));
```

**Log Levels:**
| Level | When to use |
|-------|-------------|
| debug | Detailed sync progress, API calls, SQL queries |
| info | Sync start/complete, document counts, search timing |
| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries |
| error | Failures that stop operations |

**Logging Conventions:**
- Always include structured context: `logger.info({ project, count }, 'Fetched issues')`
- Errors include err object: `logger.error({ err, documentId }, 'Embedding failed')`
- All logs to stderr so `gi search --json` output stays clean on stdout

**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
  - `PRAGMA journal_mode=WAL;`
@@ -930,7 +1152,7 @@ tests/unit/embedding-client.test.ts
  ✓ batches requests (32 documents per batch)

tests/integration/embedding-storage.test.ts
  ✓ stores embedding in sqlite-vss
  ✓ stores embedding in sqlite-vec
  ✓ embedding rowid matches document id
  ✓ creates embedding_metadata record
  ✓ skips re-embedding when content_hash unchanged
@@ -959,16 +1181,46 @@ tests/integration/embedding-storage.test.ts
- Concurrency: configurable (default 4 workers)
- Retry with exponential backoff for transient failures (max 3 attempts)
- Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vss extension)
- Vector storage in SQLite (sqlite-vec extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
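
The retry bullet above could be realized along these lines. This is a minimal sketch, not the project's implementation: `withRetry`, `Sleep`, and the 500 ms base delay are assumptions made for illustration; only the limit of 3 attempts comes from the list. For brevity it retries on every error, whereas the spec limits retries to transient failures.

```typescript
// Hypothetical sleep signature, injectable so tests do not have to wait.
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `operation`, retrying with exponential backoff.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3, // from the spec: max 3 attempts
  baseDelayMs = 500, // assumed value; the spec does not state a base delay
  sleep: Sleep = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Delay doubles after each failed attempt: base, 2 * base, 4 * base, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  throw lastError;
}
```

A production version would likely inspect the error before sleeping and rethrow immediately for non-transient failures such as authentication errors.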

**Ollama API Contract:**

```typescript
// POST http://localhost:11434/api/embed (batch endpoint - preferred)
interface OllamaEmbedRequest {
  model: string; // "nomic-embed-text"
  input: string[]; // array of texts to embed (up to 32)
}

interface OllamaEmbedResponse {
  model: string;
  embeddings: number[][]; // array of 768-dim vectors
}

// POST http://localhost:11434/api/embeddings (single text - fallback)
interface OllamaEmbeddingsRequest {
  model: string;
  prompt: string;
}

interface OllamaEmbeddingsResponse {
  embedding: number[];
}
```

**Usage:**
- Use `/api/embed` for batching (up to 32 documents per request)
- Fall back to `/api/embeddings` for single documents or if batch fails
- Check Ollama availability with `GET http://localhost:11434/api/tags`
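
The batching and fallback rules above might be wired together roughly as follows. This is a sketch under stated assumptions rather than the planned client: `chunk`, `embedBatch`, `embedAll`, and the injected `PostJson` transport are hypothetical names, and the interfaces are repeated from the contract so the block stands alone. The transport is injected so the logic can be exercised without a running Ollama instance.

```typescript
// Shapes repeated from the API contract above.
interface OllamaEmbedRequest { model: string; input: string[]; }
interface OllamaEmbedResponse { model: string; embeddings: number[][]; }
interface OllamaEmbeddingsRequest { model: string; prompt: string; }
interface OllamaEmbeddingsResponse { embedding: number[]; }

// Hypothetical transport: POSTs a JSON body and resolves with the parsed JSON response.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

const MODEL = 'nomic-embed-text';
const BATCH_SIZE = 32; // from the spec: up to 32 documents per request

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Try the batch endpoint first; if it fails, embed each text individually.
async function embedBatch(batch: string[], postJson: PostJson, baseUrl: string): Promise<number[][]> {
  try {
    const request: OllamaEmbedRequest = { model: MODEL, input: batch };
    const response = (await postJson(`${baseUrl}/api/embed`, request)) as OllamaEmbedResponse;
    return response.embeddings;
  } catch {
    const vectors: number[][] = [];
    for (const prompt of batch) {
      const request: OllamaEmbeddingsRequest = { model: MODEL, prompt };
      const response = (await postJson(`${baseUrl}/api/embeddings`, request)) as OllamaEmbeddingsResponse;
      vectors.push(response.embedding);
    }
    return vectors;
  }
}

// Embed all texts, preserving input order across batches.
async function embedAll(
  texts: string[],
  postJson: PostJson,
  baseUrl = 'http://localhost:11434',
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) {
    const embedded = await embedBatch(batch, postJson, baseUrl);
    if (embedded.length !== batch.length) {
      throw new Error(`Expected ${batch.length} embeddings, got ${embedded.length}`);
    }
    vectors.push(...embedded);
  }
  return vectors;
}
```

In real use, `postJson` could be a thin wrapper around the global `fetch` that sets a JSON content type, rejects on non-2xx responses, and returns the parsed body.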

**Schema Additions (CP3B):**
```sql
-- sqlite-vss virtual table for vector search
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
CREATE VIRTUAL TABLE embeddings USING vec0(
  embedding float[768]
);

-- Embedding provenance + change detection
@@ -1053,6 +1305,11 @@ If content exceeds 8000 tokens (~32000 chars):
6. Set `documents.truncated_reason = 'token_limit_middle_drop'`
7. Log a warning with document ID and original/truncated token count

**Edge Cases:**
- **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'`
- **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'`
- **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]`
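
The single-note case could look roughly like the sketch below. It rests on two assumptions the text above does not settle: length is counted in Unicode code points (so a surrogate pair is never split), and the `[truncated]` marker counts toward the 32000-character limit. `truncateSingleNote` and `TruncationResult` are hypothetical names.

```typescript
const MAX_CHARS = 32000; // from the spec (~8000 tokens)
const MARKER = '[truncated]'; // from the spec

// Hypothetical result shape for the single-note edge case.
interface TruncationResult {
  text: string;
  truncatedReason: 'single_note_oversized' | null;
}

// Hypothetical helper: cut an oversized note at a character boundary and append the marker.
function truncateSingleNote(note: string, maxChars: number = MAX_CHARS): TruncationResult {
  // Array.from iterates by code point, so emoji and other astral characters stay intact.
  const chars = Array.from(note);
  if (chars.length <= maxChars) {
    return { text: note, truncatedReason: null };
  }
  // Assumption: reserve room for the marker so the result still fits the limit.
  const keep = Math.max(0, maxChars - MARKER.length);
  return {
    text: chars.slice(0, keep).join('') + MARKER,
    truncatedReason: 'single_note_oversized',
  };
}
```

If the limit is instead meant in UTF-16 code units, the same structure applies, but the cut point needs a check that it does not land between the two halves of a surrogate pair.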

**Why note-boundary truncation:**
- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
@@ -1148,7 +1405,7 @@ Each query must have at least one expected URL appear in top 10 results.

**Scope:**
- Hybrid retrieval:
  - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
  - Vector recall (sqlite-vec) + FTS lexical recall (fts5)
  - Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Query embedding generation (same Ollama pipeline as documents)
- Result ranking and scoring (document-level)
@@ -1178,6 +1435,7 @@ Each query must have at least one expected URL appear in top 10 results.
   - If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200`
   - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
2. Query both vector index (top topK) and FTS5 (top topK)
   - Vector recall via sqlite-vec + FTS lexical recall via fts5
   - Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
3. Merge results by document_id
4. Combine with Reciprocal Rank Fusion (RRF):
@@ -1318,7 +1576,7 @@ tests/integration/incremental-sync.test.ts
  ✓ refetches discussions for updated MRs
  ✓ updates existing records (not duplicates)
  ✓ creates new records for new items
  ✓ re-embeds documents with changed content
  ✓ re-embeds documents with changed content_hash

tests/integration/sync-recovery.test.ts
  ✓ resumes from cursor after interrupted sync
@@ -1731,7 +1989,7 @@ Each checkpoint includes:
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vss, rowid=document_id) |
| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |

@@ -1759,7 +2017,7 @@ Each checkpoint includes:
| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exits cleanly |
| Empty state UX | **Actionable messages** | Guide user to next step |
| raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items |
@@ -1769,6 +2027,12 @@ Each checkpoint includes:
| RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match |
| CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before embedding infra risk |
| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere |
| CLI framework | **Commander.js** | Simple, lightweight, sufficient for single-user CLI tool |
| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping |
| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors |
| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars truncated at char boundary with `[truncated]` marker |
| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single |
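
The "RRF score normalization" row in the table above can be illustrated with a short sketch. It assumes the standard Reciprocal Rank Fusion formula, where each list contributes `1 / (k + rank)` per document, with `k = 60`; that constant is a widely used default and is an assumption here, since the visible lines do not state it. `fuseRankings` and `RankedResult` are hypothetical names, and each input list is assumed to contain unique document IDs ordered best-first.

```typescript
const RRF_K = 60; // assumed constant; not stated in the lines shown above

// Hypothetical result shape: raw RRF score plus the per-query normalized score.
interface RankedResult {
  documentId: number;
  rrfScore: number;
  score: number; // rrfScore / max(rrfScore), in the range 0-1
}

// Hypothetical helper: fuse vector and FTS rankings (document IDs, best first).
function fuseRankings(vectorIds: number[], ftsIds: number[], k: number = RRF_K): RankedResult[] {
  const rrf = new Map<number, number>();
  for (const ids of [vectorIds, ftsIds]) {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      rrf.set(id, (rrf.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const entries = Array.from(rrf.entries());
  const maxScore = entries.reduce((max, entry) => Math.max(max, entry[1]), 0);
  return entries
    .map(([documentId, rrfScore]) => ({
      documentId,
      rrfScore,
      score: maxScore > 0 ? rrfScore / maxScore : 0,
    }))
    .sort((a, b) => b.score - a.score);
}
```

A document that appears in both lists accumulates two contributions, so it tends to outrank documents found by only one retriever; the top result always has a normalized score of 1.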

---
