diff --git a/README.md b/README.md new file mode 100644 index 0000000..7338e16 --- /dev/null +++ b/README.md @@ -0,0 +1,252 @@ +# gi - GitLab Inbox + +A command-line tool for managing GitLab issues locally. Syncs issues, discussions, and notes from GitLab to a local SQLite database for fast, offline-capable querying and filtering. + +## Features + +- **Local-first**: All data stored in SQLite for instant queries +- **Incremental sync**: Cursor-based sync only fetches changes since last sync +- **Multi-project**: Track issues across multiple GitLab projects +- **Rich filtering**: Filter by state, author, assignee, labels, milestone, due date +- **Raw payload storage**: Preserves original GitLab API responses for debugging + +## Installation + +```bash +cargo install --path . +``` + +Or build from source: + +```bash +cargo build --release +./target/release/gi --help +``` + +## Quick Start + +```bash +# Initialize configuration (interactive) +gi init + +# Verify authentication +gi auth-test + +# Sync issues from GitLab +gi ingest --type issues + +# List recent issues +gi list issues --limit 10 + +# Show issue details +gi show issue 123 --project group/repo +``` + +## Configuration + +Configuration is stored in `~/.config/gi/config.json` (or `$XDG_CONFIG_HOME/gi/config.json`). 
+ +### Example Configuration + +```json +{ + "gitlab": { + "baseUrl": "https://gitlab.com", + "tokenEnvVar": "GITLAB_TOKEN" + }, + "projects": [ + { "path": "group/project" }, + { "path": "other-group/other-project" } + ], + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10 + }, + "storage": { + "compressRawPayloads": true + } +} +``` + +### Configuration Options + +| Section | Field | Default | Description | +|---------|-------|---------|-------------| +| `gitlab` | `baseUrl` | — | GitLab instance URL (required) | +| `gitlab` | `tokenEnvVar` | `GITLAB_TOKEN` | Environment variable containing API token | +| `projects` | `path` | — | Project path (e.g., `group/project`) | +| `sync` | `backfillDays` | `14` | Days to backfill on initial sync | +| `sync` | `staleLockMinutes` | `10` | Minutes before sync lock considered stale | +| `sync` | `cursorRewindSeconds` | `2` | Seconds to rewind cursor for overlap safety | +| `storage` | `dbPath` | `~/.local/share/gi/gi.db` | Database file path | +| `storage` | `compressRawPayloads` | `true` | Compress stored API responses | + +### GitLab Token + +Create a personal access token with `read_api` scope: + +1. Go to GitLab → Settings → Access Tokens +2. Create token with `read_api` scope +3. Export it: `export GITLAB_TOKEN=glpat-xxxxxxxxxxxx` + +## Commands + +### `gi init` + +Initialize configuration and database interactively. + +```bash +gi init # Interactive setup +gi init --force # Overwrite existing config +gi init --non-interactive # Fail if prompts needed +``` + +### `gi auth-test` + +Verify GitLab authentication is working. + +```bash +gi auth-test +# Authenticated as @username (Full Name) +# GitLab: https://gitlab.com +``` + +### `gi doctor` + +Check environment health and configuration. + +```bash +gi doctor # Human-readable output +gi doctor --json # JSON output for scripting +``` + +### `gi ingest` + +Sync data from GitLab to local database. 
+ +```bash +gi ingest --type issues # Sync all projects +gi ingest --type issues --project group/repo # Single project +gi ingest --type issues --force # Override stale lock +``` + +### `gi list issues` + +Query issues from local database. + +```bash +gi list issues # Recent issues (default 50) +gi list issues --limit 100 # More results +gi list issues --state opened # Only open issues +gi list issues --state closed # Only closed issues +gi list issues --author username # By author +gi list issues --assignee username # By assignee +gi list issues --label bug # By label (AND logic) +gi list issues --label bug --label urgent # Multiple labels +gi list issues --milestone "v1.0" # By milestone title +gi list issues --since 7d # Updated in last 7 days +gi list issues --since 2w # Updated in last 2 weeks +gi list issues --since 2024-01-01 # Updated since date +gi list issues --due-before 2024-12-31 # Due before date +gi list issues --has-due-date # Only issues with due dates +gi list issues --project group/repo # Filter by project +gi list issues --sort created --order asc # Sort options +gi list issues --open # Open first result in browser +gi list issues --json # JSON output +``` + +### `gi show issue` + +Display detailed issue information. + +```bash +gi show issue 123 # Show issue #123 +gi show issue 123 --project group/repo # Disambiguate if needed +``` + +### `gi count` + +Count entities in local database. + +```bash +gi count issues # Total issues +gi count discussions # Total discussions +gi count discussions --type issue # Issue discussions only +gi count notes # Total notes +``` + +### `gi sync-status` + +Show current sync state and watermarks. + +```bash +gi sync-status +``` + +### `gi migrate` + +Run pending database migrations. + +```bash +gi migrate +``` + +### `gi version` + +Show version information. 
+ +```bash +gi version +``` + +## Database Schema + +Data is stored in SQLite with the following main tables: + +- **projects**: Tracked GitLab projects +- **issues**: Issue metadata (title, state, author, assignee info, due date, milestone) +- **milestones**: Project milestones with state and due dates +- **issue_assignees**: Many-to-many issue-assignee relationships +- **labels**: Project labels with colors +- **issue_labels**: Many-to-many issue-label relationships +- **discussions**: Issue/MR discussions +- **notes**: Individual notes within discussions +- **raw_payloads**: Compressed original API responses + +The database is stored at `~/.local/share/gi/gi.db` by default. + +## Global Options + +```bash +gi --config /path/to/config.json # Use alternate config +``` + +## Development + +```bash +# Run tests +cargo test + +# Run with debug logging +RUST_LOG=gi=debug gi list issues + +# Check formatting +cargo fmt --check + +# Lint +cargo clippy +``` + +## Tech Stack + +- **Rust** (2024 edition) +- **SQLite** via rusqlite (bundled) +- **clap** for CLI parsing +- **reqwest** for HTTP +- **tokio** for async runtime +- **serde** for serialization +- **tracing** for logging + +## License + +MIT diff --git a/SPEC.md b/SPEC.md index f5a1e14..44fc3b2 100644 --- a/SPEC.md +++ b/SPEC.md @@ -1,5 +1,7 @@ # GitLab Knowledge Engine - Spec Document +> **Note:** This is a historical planning document. The actual implementation uses Rust instead of TypeScript/Node.js. See [README.md](README.md) for current documentation. + ## Executive Summary A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). 
Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability. @@ -122,7 +124,7 @@ npm link # Makes `gi` available globally ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Storage Layer │ -│ - SQLite + sqlite-vss + FTS5 (hybrid search) │ +│ - SQLite + sqlite-vec + FTS5 (hybrid search) │ │ - Structured metadata in relational tables │ │ - Vector embeddings for semantic search │ │ - Full-text index for lexical search fallback │ @@ -139,12 +141,20 @@ npm link # Makes `gi` available globally ### Technology Choices -| Component | Recommendation | Rationale | -|-----------|---------------|-----------| +| Component | Choice | Rationale | +|-----------|--------|-----------| | Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly | -| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in | +| Database | SQLite + sqlite-vec + FTS5 | Zero-config, portable, vector search via pure-C extension | | Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors | -| CLI Framework | Commander.js or oclif | Standard, well-documented | +| CLI Framework | Commander.js | Simple, lightweight, well-documented | +| Logging | pino | Fast, JSON-structured, low overhead | +| Validation | Zod | TypeScript-first schema validation | + +### Alternative Considered: sqlite-vss +- sqlite-vss was the original choice but is now deprecated +- No Apple Silicon support (no prebuilt ARM binaries) +- Replaced by sqlite-vec, which is pure C with no dependencies +- sqlite-vec uses `vec0` virtual table (vs `vss0`) ### Alternative Considered: Postgres + pgvector - Pros: More scalable, better for production multi-user @@ -153,6 +163,126 @@ npm link # Makes `gi` available globally --- +## Project Structure + +``` +gitlab-inbox/ +├── src/ +│ ├── cli/ +│ │ ├── index.ts # CLI entry point (Commander.js) +│ │ └── commands/ # One file per 
command group +│ │ ├── init.ts +│ │ ├── sync.ts +│ │ ├── search.ts +│ │ ├── list.ts +│ │ └── doctor.ts +│ ├── core/ +│ │ ├── config.ts # Config loading/validation (Zod) +│ │ ├── db.ts # Database connection + migrations +│ │ ├── errors.ts # Custom error classes +│ │ └── logger.ts # pino logger setup +│ ├── gitlab/ +│ │ ├── client.ts # GitLab API client with rate limiting +│ │ ├── types.ts # GitLab API response types +│ │ └── transformers/ # Payload → normalized schema +│ │ ├── issue.ts +│ │ ├── merge-request.ts +│ │ └── discussion.ts +│ ├── ingestion/ +│ │ ├── issues.ts +│ │ ├── merge-requests.ts +│ │ └── discussions.ts +│ ├── documents/ +│ │ ├── extractor.ts # Document generation from entities +│ │ └── truncation.ts # Note-boundary aware truncation +│ ├── embedding/ +│ │ ├── ollama.ts # Ollama client +│ │ └── pipeline.ts # Batch embedding orchestration +│ ├── search/ +│ │ ├── hybrid.ts # RRF ranking logic +│ │ ├── fts.ts # FTS5 queries +│ │ └── vector.ts # sqlite-vec queries +│ └── types/ +│ └── index.ts # Shared TypeScript types +├── tests/ +│ ├── unit/ +│ ├── integration/ +│ ├── live/ # Optional GitLab live tests (GITLAB_LIVE_TESTS=1) +│ └── fixtures/ +│ └── golden-queries.json +├── migrations/ # Numbered SQL migration files +│ ├── 001_initial.sql +│ └── ... 
+├── gi.config.json # User config (gitignored) +├── package.json +├── tsconfig.json +├── vitest.config.ts +├── eslint.config.js +└── README.md +``` + +--- + +## Dependencies + +### Runtime Dependencies + +```json +{ + "dependencies": { + "better-sqlite3": "latest", + "sqlite-vec": "latest", + "commander": "latest", + "zod": "latest", + "pino": "latest", + "pino-pretty": "latest", + "ora": "latest", + "chalk": "latest", + "cli-table3": "latest" + } +} +``` + +| Package | Purpose | +|---------|---------| +| better-sqlite3 | Synchronous SQLite driver (fast, native) | +| sqlite-vec | Vector search extension (pure C, cross-platform) | +| commander | CLI argument parsing | +| zod | Schema validation for config and inputs | +| pino | Structured JSON logging | +| pino-pretty | Dev-mode log formatting | +| ora | CLI spinners for progress indication | +| chalk | Terminal colors | +| cli-table3 | ASCII tables for list output | + +### Dev Dependencies + +```json +{ + "devDependencies": { + "typescript": "latest", + "@types/better-sqlite3": "latest", + "@types/node": "latest", + "vitest": "latest", + "msw": "latest", + "eslint": "latest", + "@typescript-eslint/eslint-plugin": "latest", + "@typescript-eslint/parser": "latest", + "tsx": "latest" + } +} +``` + +| Package | Purpose | +|---------|---------| +| typescript | TypeScript compiler | +| vitest | Test runner | +| msw | Mock Service Worker for API mocking in tests | +| eslint | Linting | +| tsx | Run TypeScript directly during development | + +--- + ## GitLab API Strategy ### Primary Resources (Bulk Fetch) @@ -368,6 +498,98 @@ tests/integration/init.test.ts - Decompression is handled transparently when reading payloads - Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage +**Error Classes (src/core/errors.ts):** + +```typescript +// Base error class with error codes for programmatic handling +export class GiError extends Error { + constructor(message: string, public readonly code: string) { + 
super(message); + this.name = 'GiError'; + } +} + +// Config errors +export class ConfigNotFoundError extends GiError { + constructor() { + super('Config file not found. Run "gi init" first.', 'CONFIG_NOT_FOUND'); + } +} + +export class ConfigValidationError extends GiError { + constructor(details: string) { + super(`Invalid config: ${details}`, 'CONFIG_INVALID'); + } +} + +// GitLab API errors +export class GitLabAuthError extends GiError { + constructor() { + super('GitLab authentication failed. Check your token.', 'GITLAB_AUTH_FAILED'); + } +} + +export class GitLabNotFoundError extends GiError { + constructor(resource: string) { + super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND'); + } +} + +export class GitLabRateLimitError extends GiError { + constructor(public readonly retryAfter: number) { + super(`Rate limited. Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED'); + } +} + +// Database errors +export class DatabaseLockError extends GiError { + constructor() { + super('Another sync is running. Use --force to override.', 'DB_LOCKED'); + } +} + +// Embedding errors +export class OllamaConnectionError extends GiError { + constructor() { + super('Cannot connect to Ollama. Is it running?', 'OLLAMA_UNAVAILABLE'); + } +} + +export class EmbeddingError extends GiError { + constructor(documentId: number, reason: string) { + super(`Failed to embed document ${documentId}: ${reason}`, 'EMBEDDING_FAILED'); + } +} +``` + +**Logging Strategy (src/core/logger.ts):** + +```typescript +import pino from 'pino'; + +// Logs go to stderr, results to stdout (allows clean JSON piping) +export const logger = pino({ + level: process.env.LOG_LEVEL || 'info', + transport: process.env.NODE_ENV === 'production' ? 
undefined : { + target: 'pino-pretty', + options: { colorize: true, destination: 2 } // 2 = stderr + } +}, pino.destination(2)); +``` + +**Log Levels:** +| Level | When to use | +|-------|-------------| +| debug | Detailed sync progress, API calls, SQL queries | +| info | Sync start/complete, document counts, search timing | +| warn | Rate limits hit, Ollama unavailable (fallback to FTS), retries | +| error | Failures that stop operations | + +**Logging Conventions:** +- Always include structured context: `logger.info({ project, count }, 'Fetched issues')` +- Errors include err object: `logger.error({ err, documentId }, 'Embedding failed')` +- All logs to stderr so `gi search --json` output stays clean on stdout + **DB Runtime Defaults (Checkpoint 0):** - On every connection: - `PRAGMA journal_mode=WAL;` @@ -930,7 +1152,7 @@ tests/unit/embedding-client.test.ts ✓ batches requests (32 documents per batch) tests/integration/embedding-storage.test.ts - ✓ stores embedding in sqlite-vss + ✓ stores embedding in sqlite-vec ✓ embedding rowid matches document id ✓ creates embedding_metadata record ✓ skips re-embedding when content_hash unchanged @@ -959,16 +1181,46 @@ tests/integration/embedding-storage.test.ts - Concurrency: configurable (default 4 workers) - Retry with exponential backoff for transient failures (max 3 attempts) - Per-document failure recording to enable targeted re-runs -- Vector storage in SQLite (sqlite-vss extension) +- Vector storage in SQLite (sqlite-vec extension) - Progress tracking and resumability - `gi search --mode=semantic` CLI command +**Ollama API Contract:** + +```typescript +// POST http://localhost:11434/api/embed (batch endpoint - preferred) +interface OllamaEmbedRequest { + model: string; // "nomic-embed-text" + input: string[]; // array of texts to embed (up to 32) +} + +interface OllamaEmbedResponse { + model: string; + embeddings: number[][]; // array of 768-dim vectors +} + +// POST http://localhost:11434/api/embeddings (single text 
- fallback) +interface OllamaEmbeddingsRequest { + model: string; + prompt: string; +} + +interface OllamaEmbeddingsResponse { + embedding: number[]; +} +``` + +**Usage:** +- Use `/api/embed` for batching (up to 32 documents per request) +- Fall back to `/api/embeddings` for single documents or if batch fails +- Check Ollama availability with `GET http://localhost:11434/api/tags` + **Schema Additions (CP3B):** ```sql --- sqlite-vss virtual table for vector search +-- sqlite-vec virtual table for vector search -- Storage rule: embeddings.rowid = documents.id -CREATE VIRTUAL TABLE embeddings USING vss0( - embedding(768) +CREATE VIRTUAL TABLE embeddings USING vec0( + embedding float[768] ); -- Embedding provenance + change detection @@ -1053,6 +1305,11 @@ If content exceeds 8000 tokens (~32000 chars): 6. Set `documents.truncated_reason = 'token_limit_middle_drop'` 7. Log a warning with document ID and original/truncated token count +**Edge Cases:** +- **Single note > 32000 chars:** Truncate at character boundary, append `[truncated]`, set `truncated_reason = 'single_note_oversized'` +- **First + last note > 32000 chars:** Keep only first note (truncated if needed), set `truncated_reason = 'first_last_oversized'` +- **Only one note in discussion:** If it exceeds limit, truncate at char boundary with `[truncated]` + **Why note-boundary truncation:** - Cutting mid-note produces unreadable snippets ("...the authentication flow because--") - Keeping whole notes preserves semantic coherence for embeddings @@ -1148,7 +1405,7 @@ Each query must have at least one expected URL appear in top 10 results. 
**Scope:** - Hybrid retrieval: - - Vector recall (sqlite-vss) + FTS lexical recall (fts5) + - Vector recall (sqlite-vec) + FTS lexical recall (fts5) - Merge + rerank results using Reciprocal Rank Fusion (RRF) - Query embedding generation (same Ollama pipeline as documents) - Result ranking and scoring (document-level) @@ -1178,6 +1435,7 @@ Each query must have at least one expected URL appear in top 10 results. - If any filters present (--project, --type, --author, --label, --path, --after): `topK = 200` - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall 2. Query both vector index (top topK) and FTS5 (top topK) + - Vector recall via sqlite-vec + FTS lexical recall via fts5 - Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type) 3. Merge results by document_id 4. Combine with Reciprocal Rank Fusion (RRF): @@ -1318,7 +1576,7 @@ tests/integration/incremental-sync.test.ts ✓ refetches discussions for updated MRs ✓ updates existing records (not duplicates) ✓ creates new records for new items - ✓ re-embeds documents with changed content + ✓ re-embeds documents with changed content_hash tests/integration/sync-recovery.test.ts ✓ resumes from cursor after interrupted sync @@ -1731,7 +1989,7 @@ Each checkpoint includes: | dirty_sources | 3A | Queue for incremental document regeneration | | pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching | | documents_fts | 3A | Full-text search index (fts5 with porter stemmer) | -| embeddings | 3B | Vector embeddings (sqlite-vss, rowid=document_id) | +| embeddings | 3B | Vector embeddings (sqlite-vec vec0, rowid=document_id) | | embedding_metadata | 3B | Embedding provenance + error tracking | | mr_files | 6 | MR file changes (deferred to post-MVP) | @@ -1759,7 +2017,7 @@ Each checkpoint includes: | JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption | | Database location | **XDG compliant: 
`~/.local/share/gi/`** | Standard location, user-configurable | | `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX | -| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly | +| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exits cleanly | | Empty state UX | **Actionable messages** | Guide user to next step | | raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings | | GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items | @@ -1769,6 +2027,12 @@ Each checkpoint includes: | RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain | | --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match | | CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before embedding infra risk | +| Vector extension | **sqlite-vec (not sqlite-vss)** | sqlite-vss deprecated, no Apple Silicon support; sqlite-vec is pure C, runs anywhere | +| CLI framework | **Commander.js** | Simple, lightweight, sufficient for single-user CLI tool | +| Logging | **pino to stderr** | JSON-structured, fast; stderr keeps stdout clean for JSON output piping | +| Error handling | **Custom error class hierarchy** | GiError base with codes; specific classes for config/gitlab/db/embedding errors | +| Truncation edge cases | **Char-boundary cut for oversized notes** | Single notes > 32000 chars truncated at char boundary with `[truncated]` marker | +| Ollama API | **Use /api/embed for batching** | Batch up to 32 docs per request; fall back to /api/embeddings for single | --- diff --git a/docs/prd/checkpoint-0.md b/docs/prd/checkpoint-0.md new file mode 100644 index 0000000..12f99d1 --- /dev/null +++ b/docs/prd/checkpoint-0.md @@ -0,0 +1,1164 @@ +# Checkpoint 0: Project Setup - PRD + 
+**Version:** 1.0 +**Status:** Ready for Implementation +**Depends On:** None (first checkpoint) +**Enables:** Checkpoint 1 (Issue Ingestion) + +--- + +## Overview + +### Objective + +Scaffold the `gi` CLI tool with verified GitLab API connectivity, database infrastructure, and foundational CLI commands. This checkpoint establishes the project foundation that all subsequent checkpoints build upon. + +### Success Criteria + +| Criterion | Validation | +|-----------|------------| +| `gi init` writes config and validates against GitLab | `gi doctor` shows GitLab OK | +| `gi auth-test` succeeds with real PAT | Shows username and display name | +| Database migrations apply correctly | `gi doctor` shows DB OK | +| SQLite pragmas set correctly | WAL, FK, busy_timeout verified | +| App lock mechanism works | Concurrent runs blocked | +| Config resolves from XDG paths | Works from any directory | + +--- + +## Deliverables + +### 1. Project Structure + +Create the following directory structure: + +``` +gitlab-inbox/ +├── src/ +│ ├── cli/ +│ │ ├── index.ts # CLI entry point (Commander.js) +│ │ └── commands/ +│ │ ├── init.ts # gi init +│ │ ├── auth-test.ts # gi auth-test +│ │ ├── doctor.ts # gi doctor +│ │ ├── sync-status.ts # gi sync-status (stub for CP0) +│ │ ├── backup.ts # gi backup +│ │ └── reset.ts # gi reset +│ ├── core/ +│ │ ├── config.ts # Config loading/validation (Zod) +│ │ ├── db.ts # Database connection + migrations +│ │ ├── errors.ts # Custom error classes +│ │ ├── logger.ts # pino logger setup +│ │ └── paths.ts # XDG path resolution +│ ├── gitlab/ +│ │ ├── client.ts # GitLab API client with rate limiting +│ │ └── types.ts # GitLab API response types +│ └── types/ +│ └── index.ts # Shared TypeScript types +├── tests/ +│ ├── unit/ +│ │ ├── config.test.ts +│ │ ├── db.test.ts +│ │ ├── paths.test.ts +│ │ └── errors.test.ts +│ ├── integration/ +│ │ ├── gitlab-client.test.ts +│ │ ├── app-lock.test.ts +│ │ └── init.test.ts +│ ├── live/ # Gated by GITLAB_LIVE_TESTS=1 +│ 
│ └── gitlab-client.live.test.ts +│ └── fixtures/ +│ └── mock-responses/ +├── migrations/ +│ └── 001_initial.sql +├── package.json +├── tsconfig.json +├── vitest.config.ts +├── eslint.config.js +└── .gitignore +``` + +### 2. Config + Data Locations (XDG Compliant) + +| Location | Default Path | Override | +|----------|--------------|----------| +| Config | `~/.config/gi/config.json` | `GI_CONFIG_PATH` env var or `--config` flag | +| Database | `~/.local/share/gi/data.db` | `storage.dbPath` in config | +| Backups | `~/.local/share/gi/backups/` | `storage.backupDir` in config | +| Logs | stderr (not persisted) | `LOG_PATH` env var | + +**Config Resolution Order:** +1. `--config /path/to/config.json` (explicit CLI flag) +2. `GI_CONFIG_PATH` environment variable +3. `~/.config/gi/config.json` (XDG default) +4. `./gi.config.json` (local development fallback - useful during dev) + +**Implementation (`src/core/paths.ts`):** + +```typescript +import { homedir } from 'node:os'; +import { join } from 'node:path'; +import { existsSync } from 'node:fs'; + +export function getConfigPath(cliOverride?: string): string { + // 1. CLI flag override + if (cliOverride) return cliOverride; + + // 2. Environment variable + if (process.env.GI_CONFIG_PATH) return process.env.GI_CONFIG_PATH; + + // 3. XDG default + const xdgConfig = process.env.XDG_CONFIG_HOME || join(homedir(), '.config'); + const xdgPath = join(xdgConfig, 'gi', 'config.json'); + if (existsSync(xdgPath)) return xdgPath; + + // 4. 
Local fallback (for development) + const localPath = join(process.cwd(), 'gi.config.json'); + if (existsSync(localPath)) return localPath; + + // Return XDG path (will trigger not-found error if missing) + return xdgPath; +} + +export function getDataDir(): string { + const xdgData = process.env.XDG_DATA_HOME || join(homedir(), '.local', 'share'); + return join(xdgData, 'gi'); +} + +export function getDbPath(configOverride?: string): string { + if (configOverride) return configOverride; + return join(getDataDir(), 'data.db'); +} + +export function getBackupDir(configOverride?: string): string { + if (configOverride) return configOverride; + return join(getDataDir(), 'backups'); +} +``` + +### 3. Timestamp Convention (Global) + +**All `*_at` integer columns are milliseconds since Unix epoch (UTC).** + +| Context | Format | Example | +|---------|--------|---------| +| Database columns | INTEGER (ms epoch) | `1706313600000` | +| GitLab API responses | ISO 8601 string | `"2024-01-27T00:00:00.000Z"` | +| CLI display | ISO 8601 or relative | `2024-01-27` or `3 days ago` | +| Config durations | Seconds (with suffix in name) | `staleLockMinutes: 10` | + +**Conversion utilities (`src/core/time.ts`):** + +```typescript +// GitLab API → Database +export function isoToMs(isoString: string): number { + return new Date(isoString).getTime(); +} + +// Database → Display +export function msToIso(ms: number): string { + return new Date(ms).toISOString(); +} + +// Current time for database storage +export function nowMs(): number { + return Date.now(); +} +``` + +--- + +## Dependencies + +### Runtime Dependencies + +```json +{ + "dependencies": { + "better-sqlite3": "^11.0.0", + "sqlite-vec": "^0.1.0", + "commander": "^12.0.0", + "zod": "^3.23.0", + "pino": "^9.0.0", + "pino-pretty": "^11.0.0", + "ora": "^8.0.0", + "chalk": "^5.3.0", + "cli-table3": "^0.6.0", + "inquirer": "^9.0.0" + } +} +``` + +### Dev Dependencies + +```json +{ + "devDependencies": { + "typescript": "^5.4.0", + 
"@types/better-sqlite3": "^7.6.0", + "@types/node": "^20.0.0", + "vitest": "^1.6.0", + "msw": "^2.3.0", + "eslint": "^9.0.0", + "@typescript-eslint/eslint-plugin": "^7.0.0", + "@typescript-eslint/parser": "^7.0.0", + "tsx": "^4.0.0" + } +} +``` + +--- + +## Configuration Schema + +### Config File Structure + +```typescript +// src/types/config.ts +import { z } from 'zod'; + +export const ConfigSchema = z.object({ + gitlab: z.object({ + baseUrl: z.string().url(), + tokenEnvVar: z.string().default('GITLAB_TOKEN'), + }), + projects: z.array(z.object({ + path: z.string().min(1), + })).min(1), + sync: z.object({ + backfillDays: z.number().int().positive().default(14), + staleLockMinutes: z.number().int().positive().default(10), + heartbeatIntervalSeconds: z.number().int().positive().default(30), + cursorRewindSeconds: z.number().int().nonnegative().default(2), + primaryConcurrency: z.number().int().positive().default(4), + dependentConcurrency: z.number().int().positive().default(2), + }).default({}), + storage: z.object({ + dbPath: z.string().optional(), + backupDir: z.string().optional(), + compressRawPayloads: z.boolean().default(true), + }).default({}), + embedding: z.object({ + provider: z.literal('ollama').default('ollama'), + model: z.string().default('nomic-embed-text'), + baseUrl: z.string().url().default('http://localhost:11434'), + concurrency: z.number().int().positive().default(4), + }).default({}), +}); + +export type Config = z.infer<typeof ConfigSchema>; +``` + +### Example Config File + +```json +{ + "gitlab": { + "baseUrl": "https://gitlab.example.com", + "tokenEnvVar": "GITLAB_TOKEN" + }, + "projects": [ + { "path": "group/project-one" }, + { "path": "group/project-two" } + ], + "sync": { + "backfillDays": 14, + "staleLockMinutes": 10, + "heartbeatIntervalSeconds": 30, + "cursorRewindSeconds": 2, + "primaryConcurrency": 4, + "dependentConcurrency": 2 + }, + "storage": { + "compressRawPayloads": true + }, + "embedding": { + "provider": "ollama", + "model":
"nomic-embed-text", + "baseUrl": "http://localhost:11434", + "concurrency": 4 + } +} +``` + +--- + +## Database Schema + +### Migration 001_initial.sql + +```sql +-- Schema version tracking +CREATE TABLE schema_version ( + version INTEGER PRIMARY KEY, + applied_at INTEGER NOT NULL, -- ms epoch UTC + description TEXT +); + +INSERT INTO schema_version (version, applied_at, description) +VALUES (1, strftime('%s', 'now') * 1000, 'Initial schema'); + +-- Projects table (configured targets) +CREATE TABLE projects ( + id INTEGER PRIMARY KEY, + gitlab_project_id INTEGER UNIQUE NOT NULL, + path_with_namespace TEXT NOT NULL, + default_branch TEXT, + web_url TEXT, + created_at INTEGER, -- ms epoch UTC + updated_at INTEGER, -- ms epoch UTC + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_projects_path ON projects(path_with_namespace); + +-- Sync tracking for reliability +CREATE TABLE sync_runs ( + id INTEGER PRIMARY KEY, + started_at INTEGER NOT NULL, -- ms epoch UTC + heartbeat_at INTEGER NOT NULL, -- ms epoch UTC + finished_at INTEGER, -- ms epoch UTC + status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' + command TEXT NOT NULL, -- 'init' | 'ingest issues' | 'sync' | etc. 
+ error TEXT, + metrics_json TEXT -- JSON blob of per-run counters/timing +); + +-- metrics_json schema (informational, not enforced): +-- { +-- "apiCalls": number, +-- "rateLimitHits": number, +-- "pagesFetched": number, +-- "entitiesUpserted": number, +-- "discussionsFetched": number, +-- "notesUpserted": number, +-- "docsRegenerated": number, +-- "embeddingsCreated": number, +-- "durationMs": number +-- } + +-- Crash-safe single-flight lock (DB-enforced) +CREATE TABLE app_locks ( + name TEXT PRIMARY KEY, -- 'sync' + owner TEXT NOT NULL, -- random run token (UUIDv4) + acquired_at INTEGER NOT NULL, -- ms epoch UTC + heartbeat_at INTEGER NOT NULL -- ms epoch UTC +); + +-- Sync cursors for primary resources only +CREATE TABLE sync_cursors ( + project_id INTEGER NOT NULL REFERENCES projects(id), + resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests' + updated_at_cursor INTEGER, -- ms epoch UTC, last fully processed + tie_breaker_id INTEGER, -- last fully processed gitlab_id + PRIMARY KEY(project_id, resource_type) +); + +-- Raw payload storage (decoupled from entity tables) +CREATE TABLE raw_payloads ( + id INTEGER PRIMARY KEY, + source TEXT NOT NULL, -- 'gitlab' + project_id INTEGER REFERENCES projects(id), + resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion' + gitlab_id TEXT NOT NULL, -- TEXT: discussion IDs are strings + fetched_at INTEGER NOT NULL, -- ms epoch UTC + content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip' + payload_hash TEXT NOT NULL, -- SHA-256 of decoded JSON bytes (pre-compression) + payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON +); +CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id); +CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at); +CREATE UNIQUE INDEX uq_raw_payloads_dedupe + ON raw_payloads(project_id, resource_type, gitlab_id, payload_hash); +``` + +### SQLite Runtime Pragmas + 
+Set on every database connection: + +```typescript +// src/core/db.ts +import Database from 'better-sqlite3'; + +export function createConnection(dbPath: string): Database.Database { + const db = new Database(dbPath); + + // Production-grade defaults for single-user CLI + db.pragma('journal_mode = WAL'); + db.pragma('synchronous = NORMAL'); // Safe for WAL on local disk + db.pragma('foreign_keys = ON'); + db.pragma('busy_timeout = 5000'); // 5s wait on lock contention + db.pragma('temp_store = MEMORY'); // Small speed win + + return db; +} +``` + +--- + +## Error Classes + +```typescript +// src/core/errors.ts + +export class GiError extends Error { + constructor( + message: string, + public readonly code: string, + public readonly cause?: Error + ) { + super(message); + this.name = 'GiError'; + } +} + +// Config errors +export class ConfigNotFoundError extends GiError { + constructor(searchedPath: string) { + super( + `Config file not found at ${searchedPath}. Run "gi init" first.`, + 'CONFIG_NOT_FOUND' + ); + } +} + +export class ConfigValidationError extends GiError { + constructor(details: string) { + super(`Invalid config: ${details}`, 'CONFIG_INVALID'); + } +} + +// GitLab API errors +export class GitLabAuthError extends GiError { + constructor() { + super( + 'GitLab authentication failed. Check your token has read_api scope.', + 'GITLAB_AUTH_FAILED' + ); + } +} + +export class GitLabNotFoundError extends GiError { + constructor(resource: string) { + super(`GitLab resource not found: ${resource}`, 'GITLAB_NOT_FOUND'); + } +} + +export class GitLabRateLimitError extends GiError { + constructor(public readonly retryAfter: number) { + super(`Rate limited. 
Retry after ${retryAfter}s`, 'GITLAB_RATE_LIMITED'); + } +} + +export class GitLabNetworkError extends GiError { + constructor(baseUrl: string, cause?: Error) { + super( + `Cannot connect to GitLab at ${baseUrl}`, + 'GITLAB_NETWORK_ERROR', + cause + ); + } +} + +// Database errors +export class DatabaseLockError extends GiError { + constructor(owner: string, acquiredAt: number) { + super( + `Another sync is running (owner: ${owner}, started: ${new Date(acquiredAt).toISOString()}). Use --force to override if stale.`, + 'DB_LOCKED' + ); + } +} + +export class MigrationError extends GiError { + constructor(version: number, cause: Error) { + super( + `Migration ${version} failed: ${cause.message}`, + 'MIGRATION_FAILED', + cause + ); + } +} + +// Token errors +export class TokenNotSetError extends GiError { + constructor(envVar: string) { + super( + `GitLab token not set. Export ${envVar} environment variable.`, + 'TOKEN_NOT_SET' + ); + } +} +``` + +--- + +## Logging Configuration + +```typescript +// src/core/logger.ts +import pino from 'pino'; + +// Logs go to stderr, results to stdout (allows clean JSON piping) +export const logger = pino({ + level: process.env.LOG_LEVEL || 'info', + transport: process.env.NODE_ENV === 'production' ? 
{ target: 'pino/file', options: { destination: 2 } } : {
+    target: 'pino-pretty',
+    options: {
+      colorize: true,
+      destination: 2, // stderr
+      translateTime: 'SYS:standard',
+      ignore: 'pid,hostname'
+    }
+  }
+});
+
+// Create child loggers for components
+export const dbLogger = logger.child({ component: 'db' });
+export const gitlabLogger = logger.child({ component: 'gitlab' });
+export const configLogger = logger.child({ component: 'config' });
+```
+
+**Log Levels:**
+
+| Level | When to use |
+|-------|-------------|
+| `debug` | Detailed API calls, SQL queries, config resolution |
+| `info` | Sync start/complete, project counts, major milestones |
+| `warn` | Rate limits hit, retries, Ollama unavailable |
+| `error` | Failures that stop operations |
+
+---
+
+## CLI Commands (Checkpoint 0)
+
+### `gi init`
+
+Interactive setup wizard that creates config at the XDG path.
+
+**Flow:**
+1. Check if config already exists → prompt to overwrite
+2. Prompt for GitLab base URL
+3. Prompt for project paths (comma-separated or one at a time)
+4. Prompt for token env var name (default: GITLAB_TOKEN)
+5. **Validate before writing:**
+   - Token must be set in environment
+   - Test auth with `GET /api/v4/user`
+   - Validate each project path with `GET /api/v4/projects/:path`
+6. Write config file
+7. Initialize database with migrations
+8. Insert validated projects into `projects` table
+
+**Flags:**
+- `--config <path>`: Write config to a specific path
+- `--force`: Skip overwrite confirmation
+- `--non-interactive`: Fail if prompts would be shown (for scripting)
+
+**Exit codes:**
+- `0`: Success
+- `1`: Validation failed (token, auth, project not found)
+- `2`: User cancelled
+
+### `gi auth-test`
+
+Verify GitLab authentication.
+
+**Output:**
+```
+Authenticated as @johndoe (John Doe)
+GitLab: https://gitlab.example.com (v16.8.0)
+```
+
+**Exit codes:**
+- `0`: Auth successful
+- `1`: Auth failed
+
+### `gi doctor`
+
+Check environment health. 
+ +**Output:** +``` +gi doctor + + Config ✓ Loaded from ~/.config/gi/config.json + Database ✓ ~/.local/share/gi/data.db (schema v1) + GitLab ✓ https://gitlab.example.com (authenticated as @johndoe) + Projects ✓ 2 configured, 2 resolved + Ollama ⚠ Not running (semantic search unavailable) + +Status: Ready (lexical search available, semantic search requires Ollama) +``` + +**Flags:** +- `--json`: Output as JSON for scripting + +**JSON output schema:** +```typescript +interface DoctorResult { + success: boolean; // All required checks passed + checks: { + config: { status: 'ok' | 'error'; path?: string; error?: string }; + database: { status: 'ok' | 'error'; path?: string; schemaVersion?: number; error?: string }; + gitlab: { status: 'ok' | 'error'; url?: string; username?: string; error?: string }; + projects: { status: 'ok' | 'error'; configured?: number; resolved?: number; error?: string }; + ollama: { status: 'ok' | 'warning' | 'error'; url?: string; model?: string; error?: string }; + }; +} +``` + +### `gi version` + +Show version information. + +**Output:** +``` +gi version 0.1.0 +``` + +### `gi backup` + +Create timestamped database backup. + +**Output:** +``` +Created backup: ~/.local/share/gi/backups/data-2026-01-24T10-30-00.db +``` + +### `gi reset --confirm` + +Delete database and reset all state. + +**Output:** +``` +This will delete: + - Database: ~/.local/share/gi/data.db + - All sync cursors + - All cached data + +Type 'yes' to confirm: yes +Database reset. Run 'gi sync' to repopulate. +``` + +### `gi sync-status` + +Show sync state (stub in CP0, full implementation in CP1). + +**Output (CP0 stub):** +``` +No sync runs yet. Run 'gi sync' to start. 
+```
+
+---
+
+## GitLab Client
+
+### Core Client Implementation
+
+```typescript
+// src/gitlab/client.ts
+import { GitLabAuthError, GitLabNotFoundError, GitLabRateLimitError, GitLabNetworkError } from '../core/errors';
+import { gitlabLogger } from '../core/logger';
+
+interface GitLabClientOptions {
+  baseUrl: string;
+  token: string;
+  requestsPerSecond?: number;
+}
+
+interface GitLabUser {
+  id: number;
+  username: string;
+  name: string;
+}
+
+interface GitLabProject {
+  id: number;
+  path_with_namespace: string;
+  default_branch: string;
+  web_url: string;
+  created_at: string;
+  updated_at: string;
+}
+
+export class GitLabClient {
+  private baseUrl: string;
+  private token: string;
+  private rateLimiter: RateLimiter;
+
+  constructor(options: GitLabClientOptions) {
+    this.baseUrl = options.baseUrl.replace(/\/$/, '');
+    this.token = options.token;
+    this.rateLimiter = new RateLimiter(options.requestsPerSecond ?? 10);
+  }
+
+  async getCurrentUser(): Promise<GitLabUser> {
+    return this.request<GitLabUser>('/api/v4/user');
+  }
+
+  async getProject(pathWithNamespace: string): Promise<GitLabProject> {
+    const encoded = encodeURIComponent(pathWithNamespace);
+    return this.request<GitLabProject>(`/api/v4/projects/${encoded}`);
+  }
+
+  private async request<T>(path: string, options: RequestInit = {}): Promise<T> {
+    await this.rateLimiter.acquire();
+
+    const url = `${this.baseUrl}${path}`;
+    gitlabLogger.debug({ url }, 'GitLab request');
+
+    let response: Response;
+    try {
+      response = await fetch(url, {
+        ...options,
+        headers: {
+          'PRIVATE-TOKEN': this.token,
+          'Accept': 'application/json',
+          ...options.headers,
+        },
+      });
+    } catch (err) {
+      throw new GitLabNetworkError(this.baseUrl, err as Error);
+    }
+
+    if (response.status === 401) {
+      throw new GitLabAuthError();
+    }
+
+    if (response.status === 404) {
+      throw new GitLabNotFoundError(path);
+    }
+
+    if (response.status === 429) {
+      const retryAfter = parseInt(response.headers.get('Retry-After') || '60', 10);
+      throw new GitLabRateLimitError(retryAfter);
+    }
+
+    if (!response.ok) {
+      throw new Error(`GitLab API error: ${response.status} ${response.statusText}`);
+    }
+
+    return response.json() as Promise<T>;
+  }
+}
+
+// Simple rate limiter with jitter
+class RateLimiter {
+  private lastRequest = 0;
+  private minInterval: number;
+
+  constructor(requestsPerSecond: number) {
+    this.minInterval = 1000 / requestsPerSecond;
+  }
+
+  async acquire(): Promise<void> {
+    const now = Date.now();
+    const elapsed = now - this.lastRequest;
+
+    if (elapsed < this.minInterval) {
+      const jitter = Math.random() * 50; // 0-50ms jitter
+      await new Promise<void>(resolve => setTimeout(resolve, this.minInterval - elapsed + jitter));
+    }
+
+    this.lastRequest = Date.now();
+  }
+}
+```
+
+---
+
+## App Lock Mechanism
+
+A crash-safe, single-flight lock using a heartbeat pattern.
+
+```typescript
+// src/core/lock.ts
+import { randomUUID } from 'node:crypto';
+import Database from 'better-sqlite3';
+import { DatabaseLockError } from './errors';
+import { dbLogger } from './logger';
+import { nowMs } from './time';
+
+interface LockOptions {
+  name: string;
+  staleLockMinutes: number;
+  heartbeatIntervalSeconds: number;
+}
+
+export class AppLock {
+  private db: Database.Database;
+  private owner: string;
+  private name: string;
+  private staleLockMs: number;
+  private heartbeatIntervalMs: number;
+  private heartbeatTimer?: NodeJS.Timeout;
+  private released = false;
+
+  constructor(db: Database.Database, options: LockOptions) {
+    this.db = db;
+    this.owner = randomUUID();
+    this.name = options.name;
+    this.staleLockMs = options.staleLockMinutes * 60 * 1000;
+    this.heartbeatIntervalMs = options.heartbeatIntervalSeconds * 1000;
+  }
+
+  acquire(force = false): boolean {
+    const now = nowMs();
+
+    return this.db.transaction(() => {
+      const existing = this.db.prepare(
+        'SELECT owner, acquired_at, heartbeat_at FROM app_locks WHERE name = ?'
+ ).get(this.name) as { owner: string; acquired_at: number; heartbeat_at: number } | undefined; + + if (!existing) { + // No lock exists, acquire it + this.db.prepare( + 'INSERT INTO app_locks (name, owner, acquired_at, heartbeat_at) VALUES (?, ?, ?, ?)' + ).run(this.name, this.owner, now, now); + this.startHeartbeat(); + dbLogger.info({ owner: this.owner }, 'Lock acquired (new)'); + return true; + } + + const isStale = (now - existing.heartbeat_at) > this.staleLockMs; + + if (isStale || force) { + // Lock is stale or force override, take it + this.db.prepare( + 'UPDATE app_locks SET owner = ?, acquired_at = ?, heartbeat_at = ? WHERE name = ?' + ).run(this.owner, now, now, this.name); + this.startHeartbeat(); + dbLogger.info({ owner: this.owner, previousOwner: existing.owner, wasStale: isStale }, 'Lock acquired (override)'); + return true; + } + + if (existing.owner === this.owner) { + // Re-entrant, update heartbeat + this.db.prepare( + 'UPDATE app_locks SET heartbeat_at = ? WHERE name = ?' + ).run(now, this.name); + return true; + } + + // Lock held by another active process + throw new DatabaseLockError(existing.owner, existing.acquired_at); + })(); + } + + release(): void { + if (this.released) return; + this.released = true; + + if (this.heartbeatTimer) { + clearInterval(this.heartbeatTimer); + } + + this.db.prepare('DELETE FROM app_locks WHERE name = ? AND owner = ?') + .run(this.name, this.owner); + + dbLogger.info({ owner: this.owner }, 'Lock released'); + } + + private startHeartbeat(): void { + this.heartbeatTimer = setInterval(() => { + if (this.released) return; + + this.db.prepare('UPDATE app_locks SET heartbeat_at = ? WHERE name = ? 
AND owner = ?') + .run(nowMs(), this.name, this.owner); + + dbLogger.debug({ owner: this.owner }, 'Heartbeat updated'); + }, this.heartbeatIntervalMs); + + // Don't prevent process from exiting + this.heartbeatTimer.unref(); + } +} +``` + +--- + +## Raw Payload Handling + +### Compression and Deduplication + +```typescript +// src/core/payloads.ts +import { createHash } from 'node:crypto'; +import { gzipSync, gunzipSync } from 'node:zlib'; +import Database from 'better-sqlite3'; +import { nowMs } from './time'; + +interface StorePayloadOptions { + projectId: number | null; + resourceType: string; + gitlabId: string; + payload: unknown; + compress: boolean; +} + +export function storePayload( + db: Database.Database, + options: StorePayloadOptions +): number | null { + const jsonBytes = Buffer.from(JSON.stringify(options.payload)); + const payloadHash = createHash('sha256').update(jsonBytes).digest('hex'); + + // Check for duplicate (same content already stored) + const existing = db.prepare(` + SELECT id FROM raw_payloads + WHERE project_id IS ? AND resource_type = ? AND gitlab_id = ? AND payload_hash = ? + `).get(options.projectId, options.resourceType, options.gitlabId, payloadHash) as { id: number } | undefined; + + if (existing) { + // Duplicate content, return existing ID + return existing.id; + } + + const encoding = options.compress ? 'gzip' : 'identity'; + const payloadBytes = options.compress ? gzipSync(jsonBytes) : jsonBytes; + + const result = db.prepare(` + INSERT INTO raw_payloads + (source, project_id, resource_type, gitlab_id, fetched_at, content_encoding, payload_hash, payload) + VALUES ('gitlab', ?, ?, ?, ?, ?, ?, ?) 
+ `).run( + options.projectId, + options.resourceType, + options.gitlabId, + nowMs(), + encoding, + payloadHash, + payloadBytes + ); + + return result.lastInsertRowid as number; +} + +export function readPayload( + db: Database.Database, + id: number +): unknown { + const row = db.prepare( + 'SELECT content_encoding, payload FROM raw_payloads WHERE id = ?' + ).get(id) as { content_encoding: string; payload: Buffer } | undefined; + + if (!row) return null; + + const jsonBytes = row.content_encoding === 'gzip' + ? gunzipSync(row.payload) + : row.payload; + + return JSON.parse(jsonBytes.toString()); +} +``` + +--- + +## Automated Tests + +### Unit Tests + +**`tests/unit/config.test.ts`** +```typescript +describe('Config', () => { + it('loads config from file path'); + it('throws ConfigNotFoundError if file missing'); + it('throws ConfigValidationError if required fields missing'); + it('validates project paths are non-empty strings'); + it('applies default values for optional fields'); + it('loads from XDG path by default'); + it('respects GI_CONFIG_PATH override'); + it('respects --config flag override'); +}); +``` + +**`tests/unit/db.test.ts`** +```typescript +describe('Database', () => { + it('creates database file if not exists'); + it('applies migrations in order'); + it('sets WAL journal mode'); + it('enables foreign keys'); + it('sets busy_timeout=5000'); + it('sets synchronous=NORMAL'); + it('sets temp_store=MEMORY'); + it('tracks schema version'); +}); +``` + +**`tests/unit/paths.test.ts`** +```typescript +describe('Path Resolution', () => { + it('uses XDG_CONFIG_HOME if set'); + it('falls back to ~/.config/gi if XDG not set'); + it('prefers --config flag over environment'); + it('prefers environment over XDG default'); + it('falls back to local gi.config.json in dev'); +}); +``` + +### Integration Tests + +**`tests/integration/gitlab-client.test.ts`** (mocked) +```typescript +describe('GitLab Client', () => { + it('authenticates with valid PAT'); + 
it('returns 401 for invalid PAT'); + it('fetches project by path'); + it('handles rate limiting (429) with Retry-After'); + it('respects rate limit (requests per second)'); + it('adds jitter to rate limiting'); +}); +``` + +**`tests/integration/app-lock.test.ts`** +```typescript +describe('App Lock', () => { + it('acquires lock successfully'); + it('updates heartbeat during operation'); + it('detects stale lock and recovers'); + it('refuses concurrent acquisition'); + it('allows force override'); + it('releases lock on completion'); +}); +``` + +**`tests/integration/init.test.ts`** +```typescript +describe('gi init', () => { + it('creates config file with valid structure'); + it('validates GitLab URL format'); + it('validates GitLab connection before writing config'); + it('validates each project path exists in GitLab'); + it('fails if token not set'); + it('fails if GitLab auth fails'); + it('fails if any project path not found'); + it('prompts before overwriting existing config'); + it('respects --force to skip confirmation'); + it('generates config with sensible defaults'); + it('creates data directory if missing'); +}); +``` + +### Live Tests (Gated) + +**`tests/live/gitlab-client.live.test.ts`** +```typescript +// Only runs when GITLAB_LIVE_TESTS=1 +describe('GitLab Client (Live)', () => { + it('authenticates with real PAT'); + it('fetches real project by path'); + it('handles actual rate limiting'); +}); +``` + +--- + +## Manual Smoke Tests + +| Command | Expected Output | Pass Criteria | +|---------|-----------------|---------------| +| `gi --help` | Command list | Shows all available commands | +| `gi version` | Version number | Shows installed version | +| `gi init` | Interactive prompts | Creates valid config | +| `gi init` (config exists) | Confirmation prompt | Warns before overwriting | +| `gi init --force` | No prompt | Overwrites without asking | +| `gi auth-test` | `Authenticated as @username` | Shows GitLab username | +| `GITLAB_TOKEN=invalid gi 
auth-test` | Error message | Non-zero exit, clear error | +| `gi doctor` | Status table | All required checks pass | +| `gi doctor --json` | JSON object | Valid JSON, `success: true` | +| `gi backup` | Backup path | Creates timestamped backup | +| `gi sync-status` | No runs message | Stub output works | + +--- + +## Definition of Done + +### Gate (Must Pass) + +- [ ] `gi init` writes config to XDG path and validates projects against GitLab +- [ ] `gi auth-test` succeeds with real PAT (live test, can be manual) +- [ ] `gi doctor` reports DB ok + GitLab ok (Ollama may warn if not running) +- [ ] DB migrations apply; WAL + FK enabled; busy_timeout + synchronous set +- [ ] App lock mechanism works (concurrent runs blocked) +- [ ] All unit tests pass +- [ ] All integration tests pass (mocked) +- [ ] ESLint passes with no errors +- [ ] TypeScript compiles with strict mode + +### Hardening (Optional Before CP1) + +- [ ] Additional negative-path tests (overwrite prompts, JSON outputs) +- [ ] Edge cases: empty project list, invalid URLs, network timeouts +- [ ] Config migration from old paths (if upgrading) +- [ ] Live tests pass against real GitLab instance + +--- + +## Implementation Order + +1. **Project scaffold** (5 min) + - package.json, tsconfig.json, vitest.config.ts, eslint.config.js + - Directory structure + - .gitignore + +2. **Core utilities** (30 min) + - `src/core/paths.ts` - XDG path resolution + - `src/core/time.ts` - Timestamp utilities + - `src/core/errors.ts` - Error classes + - `src/core/logger.ts` - pino setup + +3. **Config loading** (30 min) + - `src/core/config.ts` - Zod schema, load/validate + - Unit tests for config + +4. **Database** (45 min) + - `src/core/db.ts` - Connection, pragmas, migrations + - `migrations/001_initial.sql` + - Unit tests for DB + - App lock mechanism + +5. **GitLab client** (30 min) + - `src/gitlab/client.ts` - API client with rate limiting + - `src/gitlab/types.ts` - Response types + - Integration tests (mocked) + +6. 
**Raw payload handling** (20 min) + - `src/core/payloads.ts` - Compression, deduplication, storage + +7. **CLI commands** (60 min) + - `src/cli/index.ts` - Commander setup + - `gi init` - Full implementation + - `gi auth-test` - Simple + - `gi doctor` - Health checks + - `gi version` - Version display + - `gi backup` - Database backup + - `gi reset` - Database reset + - `gi sync-status` - Stub + +8. **Final validation** (15 min) + - Run all tests + - Manual smoke tests + - ESLint + TypeScript check + +--- + +## Risks & Mitigations + +| Risk | Mitigation | +|------|------------| +| sqlite-vec installation fails | Document manual install steps; degrade to FTS-only | +| better-sqlite3 native compilation | Provide prebuilt binaries in package | +| XDG paths not writable | Fall back to cwd; show clear error | +| GitLab API changes | Pin to known API version; document tested version | + +--- + +## References + +- [SPEC.md](../SPEC.md) - Full system specification +- [GitLab API Docs](https://docs.gitlab.com/ee/api/) - API reference +- [better-sqlite3](https://github.com/WiseLibs/better-sqlite3) - SQLite driver +- [sqlite-vec](https://github.com/asg017/sqlite-vec) - Vector extension +- [Commander.js](https://github.com/tj/commander.js) - CLI framework +- [Zod](https://zod.dev) - Schema validation +- [pino](https://getpino.io) - Logging diff --git a/docs/prd/checkpoint-1.md b/docs/prd/checkpoint-1.md new file mode 100644 index 0000000..6f59080 --- /dev/null +++ b/docs/prd/checkpoint-1.md @@ -0,0 +1,1683 @@ +# Checkpoint 1: Issue Ingestion - PRD + +**Version:** 2.0 +**Status:** Ready for Implementation +**Depends On:** Checkpoint 0 (Project Setup) +**Enables:** Checkpoint 2 (MR Ingestion) + +--- + +## Overview + +### Objective + +Ingest all issues, labels, and issue discussions from configured GitLab repositories with resumable cursor-based incremental sync. This checkpoint establishes the core data ingestion pattern that will be reused for MRs in Checkpoint 2. 
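The cursor arithmetic behind "resumable" can be sketched in a few lines. The following is an illustrative TypeScript sketch, not CP1 code (the CP1 implementation is Rust); the function and field names are hypothetical, while the `backfillDays = 14` and `cursorRewindSeconds = 2` defaults come from the Checkpoint 0 configuration.

```typescript
// Illustrative sketch (names assumed): derive the `updated_after` value for an
// incremental fetch from the stored sync cursor.
interface SyncCursor {
  updatedAtCursor: number | null; // ms epoch UTC, last fully processed
  tieBreakerId: number | null;    // last fully processed gitlab_id
}

const BACKFILL_MS = 14 * 24 * 60 * 60 * 1000; // sync.backfillDays default
const REWIND_MS = 2 * 1000;                   // sync.cursorRewindSeconds default

function resolveUpdatedAfter(cursor: SyncCursor | undefined, nowMs: number): number {
  if (!cursor || cursor.updatedAtCursor === null) {
    // First sync for this (project, resource): fetch only the backfill window.
    return nowMs - BACKFILL_MS;
  }
  // Rewind slightly so items updated at the same instant as the cursor are
  // re-fetched; upserts make the overlap idempotent, which is what makes
  // re-running a sync after a crash safe.
  return cursor.updatedAtCursor - REWIND_MS;
}
```

Because every fetched page is upserted, an overlapping `updated_after` is harmless; the rewind only bounds how much work is redone on a re-run.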
+ +### Success Criteria + +| Criterion | Validation | +|-----------|------------| +| `gi ingest --type=issues` fetches all issues | `gi count issues` matches GitLab UI | +| Labels extracted from issue payloads (name-only) | `labels` table populated | +| Label linkage reflects current GitLab state | Removed labels are unlinked on re-sync | +| Issue discussions fetched per-issue (dependent sync) | For issues whose `updated_at` advanced, discussions and notes upserted | +| Cursor-based sync is resumable | Re-running fetches 0 new items | +| Discussion sync skips unchanged issues | Per-issue watermark prevents redundant fetches | +| Sync tracking records all runs | `sync_runs` table has complete audit trail | +| Single-flight lock prevents concurrent runs | Second sync fails with clear error | + +--- + +## Internal Gates + +CP1 is validated incrementally via internal gates: + +| Gate | Scope | Validation | +|------|-------|------------| +| **Gate A** | Issues only | Cursor + upsert + raw payloads + list/count/show working | +| **Gate B** | Labels correct | Stale-link removal verified; label count matches GitLab | +| **Gate C** | Dependent discussion sync | Watermark prevents redundant refetch; concurrency bounded | +| **Gate D** | Resumability proof | Kill mid-run, rerun; confirm bounded redo and no redundant discussion refetch | + +--- + +## Deliverables + +### 1. 
Project Structure Additions + +Add the following to the existing Rust structure from Checkpoint 0: + +``` +gitlab-inbox/ +├── src/ +│ ├── cli/ +│ │ └── commands/ +│ │ ├── ingest.rs # gi ingest --type=issues|merge_requests +│ │ ├── list.rs # gi list issues|mrs +│ │ ├── count.rs # gi count issues|mrs|discussions|notes +│ │ └── show.rs # gi show issue|mr +│ ├── gitlab/ +│ │ ├── types.rs # Add GitLabIssue, GitLabDiscussion, GitLabNote +│ │ └── transformers/ +│ │ ├── mod.rs +│ │ ├── issue.rs # GitLab → normalized issue +│ │ └── discussion.rs # GitLab → normalized discussion/notes +│ └── ingestion/ +│ ├── mod.rs +│ ├── orchestrator.rs # Coordinates issue + dependent discussion sync +│ ├── issues.rs # Issue fetcher with pagination +│ └── discussions.rs # Discussion fetcher (per-issue) +├── tests/ +│ ├── issue_transformer_tests.rs +│ ├── discussion_transformer_tests.rs +│ ├── pagination_tests.rs +│ ├── issue_ingestion_tests.rs +│ ├── label_linkage_tests.rs # Verifies stale link removal +│ ├── discussion_watermark_tests.rs +│ └── fixtures/ +│ ├── gitlab_issue.json +│ ├── gitlab_issues_page.json +│ ├── gitlab_discussion.json +│ └── gitlab_discussions_page.json +└── migrations/ + └── 002_issues.sql +``` + +### 2. GitLab API Endpoints + +**Issues (Bulk Fetch):** +``` +GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100 +``` + +**Issue Discussions (Per-Issue Fetch):** +``` +GET /projects/:id/issues/:iid/discussions?per_page=100&page=N +``` + +**Required Query Parameters:** +- `scope=all` - Include all issues, not just authored by current user +- `state=all` - Include closed issues (GitLab may default to open only) + +**MVP Note (Labels):** +- CP1 stores labels by **name only** for maximum compatibility and stability. +- Label color/description ingestion is deferred (post-CP1) via Labels API if needed. +- This avoids relying on optional/variant payload shapes that differ across GitLab versions. 
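As a concrete illustration of the query parameters above, the bulk-fetch URL can be assembled like this (TypeScript sketch; `issuesUrl` is a hypothetical helper, not part of the CP1 Rust client):

```typescript
// Illustrative sketch: assemble the issues bulk-fetch URL.
// scope=all and state=all are required; GitLab would otherwise limit results
// to the token user's issues and may default to open issues only.
function issuesUrl(
  baseUrl: string,        // e.g. "https://gitlab.com" (no trailing slash)
  projectPath: string,    // e.g. "group/project"
  updatedAfterIso: string,
  page: number,
): string {
  const params = new URLSearchParams({
    scope: 'all',
    state: 'all',
    updated_after: updatedAfterIso,
    order_by: 'updated_at',
    sort: 'asc',
    per_page: '100',
    page: String(page),
  });
  // Project paths must be URL-encoded ("group/project" -> "group%2Fproject").
  return `${baseUrl}/api/v4/projects/${encodeURIComponent(projectPath)}/issues?${params}`;
}
```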
+ +**Pagination:** +- Follow `x-next-page` header until empty/absent +- Fall back to empty-page detection if headers missing (robustness) +- Per-page maximum: 100 + +--- + +## Database Schema + +### Migration 002_issues.sql + +```sql +-- Issues table +CREATE TABLE issues ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, -- 'opened' | 'closed' + author_username TEXT, + created_at INTEGER, -- ms epoch UTC + updated_at INTEGER, -- ms epoch UTC + last_seen_at INTEGER NOT NULL, -- ms epoch UTC, updated on every upsert + -- Prevents re-fetching discussions on cursor rewind / reruns unless issue changed. + -- Set to issue.updated_at after successfully syncing all discussions for this issue. + discussions_synced_for_updated_at INTEGER, + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); +CREATE INDEX idx_issues_author ON issues(author_username); +CREATE INDEX idx_issues_discussions_sync ON issues(project_id, discussions_synced_for_updated_at); +CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid); + +-- Labels (derived from issue payloads) +-- CP1: Name-only for stability. Color/description deferred to Labels API integration. +-- Uniqueness is (project_id, name) since gitlab_id isn't always available. 
+CREATE TABLE labels ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER, -- optional (populated if Labels API used later) + project_id INTEGER NOT NULL REFERENCES projects(id), + name TEXT NOT NULL, + color TEXT, -- nullable, populated later if needed + description TEXT -- nullable, populated later if needed +); +CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name); +CREATE INDEX idx_labels_name ON labels(name); + +-- Issue-Label junction +-- IMPORTANT: On issue update, DELETE existing links then INSERT current set. +-- This ensures removed labels are unlinked (not just added). +CREATE TABLE issue_labels ( + issue_id INTEGER REFERENCES issues(id) ON DELETE CASCADE, + label_id INTEGER REFERENCES labels(id) ON DELETE CASCADE, + PRIMARY KEY(issue_id, label_id) +); +CREATE INDEX idx_issue_labels_label ON issue_labels(label_id); + +-- Discussion threads for issues +CREATE TABLE discussions ( + id INTEGER PRIMARY KEY, + gitlab_discussion_id TEXT NOT NULL, -- GitLab's string ID (e.g., "6a9c1750b37d...") + project_id INTEGER NOT NULL REFERENCES projects(id), + issue_id INTEGER REFERENCES issues(id), + merge_request_id INTEGER, -- FK added in CP2 via ALTER TABLE + noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' + individual_note INTEGER NOT NULL, -- 1 = standalone comment, 0 = threaded + first_note_at INTEGER, -- ms epoch UTC, for ordering discussions + last_note_at INTEGER, -- ms epoch UTC, for "recently active" queries + last_seen_at INTEGER NOT NULL, -- ms epoch UTC, updated on every upsert + resolvable INTEGER, -- MR discussions can be resolved + resolved INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id), + CHECK ( + (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR + (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) + ) +); +CREATE UNIQUE INDEX uq_discussions_project_discussion_id ON discussions(project_id, gitlab_discussion_id); +CREATE INDEX idx_discussions_issue ON 
discussions(issue_id); +CREATE INDEX idx_discussions_mr ON discussions(merge_request_id); +CREATE INDEX idx_discussions_last_note ON discussions(last_note_at); + +-- Notes belong to discussions (preserving thread context) +CREATE TABLE notes ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + discussion_id INTEGER NOT NULL REFERENCES discussions(id), + project_id INTEGER NOT NULL REFERENCES projects(id), + note_type TEXT, -- 'DiscussionNote' | 'DiffNote' | null + is_system INTEGER NOT NULL DEFAULT 0, -- 1 for system notes (assignments, label changes) + author_username TEXT, + body TEXT, + created_at INTEGER, -- ms epoch UTC + updated_at INTEGER, -- ms epoch UTC + last_seen_at INTEGER NOT NULL, -- ms epoch UTC, updated on every upsert + position INTEGER, -- derived from array order in API response (0-indexed) + resolvable INTEGER, + resolved INTEGER, + resolved_by TEXT, + resolved_at INTEGER, -- ms epoch UTC + -- DiffNote position metadata (populated for MR DiffNotes in CP2) + position_old_path TEXT, + position_new_path TEXT, + position_old_line INTEGER, + position_new_line INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_notes_discussion ON notes(discussion_id); +CREATE INDEX idx_notes_author ON notes(author_username); +CREATE INDEX idx_notes_system ON notes(is_system); + +-- Update schema version +INSERT INTO schema_version (version, applied_at, description) +VALUES (2, strftime('%s', 'now') * 1000, 'Issues, labels, discussions, notes'); +``` + +--- + +## GitLab Types + +### Type Definitions + +```rust +// src/gitlab/types.rs (additions) + +use serde::Deserialize; + +/// GitLab issue from the API. 
+#[derive(Debug, Clone, Deserialize)]
+pub struct GitLabIssue {
+    pub id: i64,                   // GitLab global ID
+    pub iid: i64,                  // Project-scoped issue number
+    pub project_id: i64,
+    pub title: String,
+    pub description: Option<String>,
+    pub state: String,             // "opened" | "closed"
+    pub created_at: String,        // ISO 8601
+    pub updated_at: String,        // ISO 8601
+    pub closed_at: Option<String>,
+    pub author: GitLabAuthor,
+    pub labels: Vec<String>,       // Array of label names (CP1 canonical)
+    pub web_url: String,
+    // NOTE: labels_details is intentionally NOT modeled for CP1.
+    // The field name and shape vary across GitLab versions.
+    // Color/description can be fetched via Labels API if needed later.
+}
+
+#[derive(Debug, Clone, Deserialize)]
+pub struct GitLabAuthor {
+    pub id: i64,
+    pub username: String,
+    pub name: String,
+}
+
+/// GitLab discussion (thread of notes).
+#[derive(Debug, Clone, Deserialize)]
+pub struct GitLabDiscussion {
+    pub id: String,                // String ID like "6a9c1750b37d..."
+    pub individual_note: bool,     // true = standalone comment
+    pub notes: Vec<GitLabNote>,
+}
+
+/// GitLab note (comment). 
+#[derive(Debug, Clone, Deserialize)]
+pub struct GitLabNote {
+    pub id: i64,
+    #[serde(rename = "type")]
+    pub note_type: Option<String>, // "DiscussionNote" | "DiffNote" | null
+    pub body: String,
+    pub author: GitLabAuthor,
+    pub created_at: String,        // ISO 8601
+    pub updated_at: String,        // ISO 8601
+    pub system: bool,              // true for system-generated notes
+    #[serde(default)]
+    pub resolvable: bool,
+    #[serde(default)]
+    pub resolved: bool,
+    pub resolved_by: Option<GitLabAuthor>,
+    pub resolved_at: Option<String>,
+    /// DiffNote specific (None for non-DiffNote)
+    pub position: Option<GitLabNotePosition>,
+}
+
+#[derive(Debug, Clone, Deserialize)]
+pub struct GitLabNotePosition {
+    pub old_path: Option<String>,
+    pub new_path: Option<String>,
+    pub old_line: Option<i64>,
+    pub new_line: Option<i64>,
+}
+```
+
+---
+
+## Transformers
+
+### Issue Transformer
+
+```rust
+// src/gitlab/transformers/issue.rs
+
+use crate::core::time::{iso_to_ms, now_ms};
+use crate::gitlab::types::GitLabIssue;
+
+/// Normalized issue ready for database insertion.
+#[derive(Debug, Clone)]
+pub struct NormalizedIssue {
+    pub gitlab_id: i64,
+    pub project_id: i64,           // Local DB project ID
+    pub iid: i64,
+    pub title: String,
+    pub description: Option<String>,
+    pub state: String,
+    pub author_username: String,
+    pub created_at: i64,           // ms epoch
+    pub updated_at: i64,           // ms epoch
+    pub last_seen_at: i64,         // ms epoch
+    pub web_url: String,
+}
+
+/// Normalized label ready for database insertion.
+/// CP1: Name-only for stability.
+#[derive(Debug, Clone)]
+pub struct NormalizedLabel {
+    pub project_id: i64,
+    pub name: String,
+}
+
+/// Transform GitLab issue to normalized schema. 
+pub fn transform_issue(gitlab_issue: &GitLabIssue, local_project_id: i64) -> NormalizedIssue { + NormalizedIssue { + gitlab_id: gitlab_issue.id, + project_id: local_project_id, + iid: gitlab_issue.iid, + title: gitlab_issue.title.clone(), + description: gitlab_issue.description.clone(), + state: gitlab_issue.state.clone(), + author_username: gitlab_issue.author.username.clone(), + created_at: iso_to_ms(&gitlab_issue.created_at), + updated_at: iso_to_ms(&gitlab_issue.updated_at), + last_seen_at: now_ms(), + web_url: gitlab_issue.web_url.clone(), + } +} + +/// Extract labels from GitLab issue (CP1: name-only). +pub fn extract_labels(gitlab_issue: &GitLabIssue, local_project_id: i64) -> Vec<NormalizedLabel> { + gitlab_issue + .labels + .iter() + .map(|name| NormalizedLabel { + project_id: local_project_id, + name: name.clone(), + }) + .collect() +} +``` + +### Discussion Transformer + +```rust +// src/gitlab/transformers/discussion.rs + +use crate::core::time::{iso_to_ms, now_ms}; +use crate::gitlab::types::GitLabDiscussion; + +/// Normalized discussion ready for database insertion. +#[derive(Debug, Clone)] +pub struct NormalizedDiscussion { + pub gitlab_discussion_id: String, + pub project_id: i64, + pub issue_id: i64, + pub noteable_type: String, // "Issue" + pub individual_note: bool, + pub first_note_at: Option<i64>, + pub last_note_at: Option<i64>, + pub last_seen_at: i64, + pub resolvable: bool, + pub resolved: bool, +} + +/// Normalized note ready for database insertion. +#[derive(Debug, Clone)] +pub struct NormalizedNote { + pub gitlab_id: i64, + pub project_id: i64, + pub note_type: Option<String>, + pub is_system: bool, + pub author_username: String, + pub body: String, + pub created_at: i64, + pub updated_at: i64, + pub last_seen_at: i64, + pub position: i32, // Array index in notes[] + pub resolvable: bool, + pub resolved: bool, + pub resolved_by: Option<String>, + pub resolved_at: Option<i64>, +} + +/// Transform GitLab discussion to normalized schema.
+pub fn transform_discussion( + gitlab_discussion: &GitLabDiscussion, + local_project_id: i64, + local_issue_id: i64, +) -> NormalizedDiscussion { + let note_times: Vec<i64> = gitlab_discussion + .notes + .iter() + .map(|n| iso_to_ms(&n.created_at)) + .collect(); + + // Check if any note is resolvable + let resolvable = gitlab_discussion.notes.iter().any(|n| n.resolvable); + let resolved = resolvable + && gitlab_discussion + .notes + .iter() + .all(|n| !n.resolvable || n.resolved); + + NormalizedDiscussion { + gitlab_discussion_id: gitlab_discussion.id.clone(), + project_id: local_project_id, + issue_id: local_issue_id, + noteable_type: "Issue".to_string(), + individual_note: gitlab_discussion.individual_note, + first_note_at: note_times.iter().min().copied(), + last_note_at: note_times.iter().max().copied(), + last_seen_at: now_ms(), + resolvable, + resolved, + } +} + +/// Transform GitLab notes to normalized schema. +pub fn transform_notes( + gitlab_discussion: &GitLabDiscussion, + local_project_id: i64, +) -> Vec<NormalizedNote> { + gitlab_discussion + .notes + .iter() + .enumerate() + .map(|(index, note)| NormalizedNote { + gitlab_id: note.id, + project_id: local_project_id, + note_type: note.note_type.clone(), + is_system: note.system, + author_username: note.author.username.clone(), + body: note.body.clone(), + created_at: iso_to_ms(&note.created_at), + updated_at: iso_to_ms(&note.updated_at), + last_seen_at: now_ms(), + position: index as i32, + resolvable: note.resolvable, + resolved: note.resolved, + resolved_by: note.resolved_by.as_ref().map(|a| a.username.clone()), + resolved_at: note.resolved_at.as_ref().map(|s| iso_to_ms(s)), + }) + .collect() +} +``` + +--- + +## GitLab Client Additions + +### Pagination with Async Streams + +```rust +// src/gitlab/client.rs (additions) + +use crate::gitlab::types::{GitLabDiscussion, GitLabIssue}; +use reqwest::header::HeaderMap; +use std::pin::Pin; +use futures::Stream; + +impl GitLabClient { + /// Paginate through issues for a project.
+ /// Returns a stream of issues that handles pagination automatically. + pub fn paginate_issues( + &self, + gitlab_project_id: i64, + updated_after: Option<i64>, + cursor_rewind_seconds: u32, + ) -> Pin<Box<dyn Stream<Item = Result<GitLabIssue>> + Send + '_>> { + Box::pin(async_stream::try_stream! { + let mut page = 1u32; + let per_page = 100u32; + + loop { + let mut params = vec![ + ("scope", "all".to_string()), + ("state", "all".to_string()), + ("order_by", "updated_at".to_string()), + ("sort", "asc".to_string()), + ("per_page", per_page.to_string()), + ("page", page.to_string()), + ]; + + if let Some(ts) = updated_after { + // Apply cursor rewind for safety, clamping to 0 to avoid underflow + let rewind_ms = (cursor_rewind_seconds as i64) * 1000; + let rewound = (ts - rewind_ms).max(0); + if let Some(dt) = chrono::DateTime::from_timestamp_millis(rewound) { + params.push(("updated_after", dt.to_rfc3339())); + } + // If conversion fails (shouldn't happen with max(0)), omit the param + // and fetch all issues (safe fallback). + } + + let (issues, headers) = self + .request_with_headers::<Vec<GitLabIssue>>( + &format!("/api/v4/projects/{gitlab_project_id}/issues"), + &params, + ) + .await?; + + for issue in issues.iter() { + yield issue.clone(); + } + + // Check for next page + let next_page = headers + .get("x-next-page") + .and_then(|v| v.to_str().ok()) + .and_then(|s| s.parse::<u32>().ok()); + + match next_page { + Some(np) if !issues.is_empty() => page = np, + _ => break, + } + } + }) + } + + /// Paginate through discussions for an issue. + pub fn paginate_issue_discussions( + &self, + gitlab_project_id: i64, + issue_iid: i64, + ) -> Pin<Box<dyn Stream<Item = Result<GitLabDiscussion>> + Send + '_>> { + Box::pin(async_stream::try_stream!
{ + let mut page = 1u32; + let per_page = 100u32; + + loop { + let params = vec![ + ("per_page", per_page.to_string()), + ("page", page.to_string()), + ]; + + let (discussions, headers) = self + .request_with_headers::<Vec<GitLabDiscussion>>( + &format!("/api/v4/projects/{gitlab_project_id}/issues/{issue_iid}/discussions"), + &params, + ) + .await?; + + for discussion in discussions.iter() { + yield discussion.clone(); + } + + // Check for next page + let next_page = headers + .get("x-next-page") + .and_then(|v| v.to_str().ok()) + .and_then(|s| s.parse::<u32>().ok()); + + match next_page { + Some(np) if !discussions.is_empty() => page = np, + _ => break, + } + } + }) + } + + /// Make request and return response with headers for pagination. + async fn request_with_headers<T: serde::de::DeserializeOwned>( + &self, + path: &str, + params: &[(&str, String)], + ) -> Result<(T, HeaderMap)> { + self.rate_limiter.lock().await.acquire().await; + + let url = format!("{}{}", self.base_url, path); + tracing::debug!(url = %url, "GitLab request"); + + let response = self + .client + .get(&url) + .header("PRIVATE-TOKEN", &self.token) + .query(params) + .send() + .await + .map_err(|e| GiError::GitLabNetworkError { + base_url: self.base_url.clone(), + source: Some(e), + })?; + + let headers = response.headers().clone(); + let data = self.handle_response(response, path).await?; + Ok((data, headers)) + } +} +``` + +**Note:** Requires adding `async-stream` and `futures` to Cargo.toml: + +```toml +# Cargo.toml additions +async-stream = "0.3" +futures = "0.3" +``` + +--- + +## Orchestration: Dependent Discussion Sync + +### Canonical Pattern (CP1) + +When `gi ingest --type=issues` runs, it follows this orchestration: + +1. **Ingest issues** (cursor-based, with incremental cursor updates per page) + +2. **Collect touched issues** - For each issue that passed cursor tuple filtering, record: + - `local_issue_id` + - `issue_iid` + - `issue_updated_at` + - `discussions_synced_for_updated_at` (from DB) + +3.
**Filter for discussion sync** - Enqueue issues where: + ``` + issue.updated_at > issues.discussions_synced_for_updated_at + ``` + This prevents re-fetching discussions for issues that haven't changed, even with cursor rewind. + +4. **Execute discussion sync** with bounded concurrency (`dependent_concurrency` from config) + +5. **Update watermark** - After each issue's discussions are successfully ingested: + ```sql + UPDATE issues SET discussions_synced_for_updated_at = ? WHERE id = ? + ``` + +**Invariant:** A rerun MUST NOT refetch discussions for issues whose `updated_at` has not advanced, even with cursor rewind. + +--- + +## Ingestion Logic + +### Runtime Strategy + +**Decision:** Use single-threaded Tokio runtime (`flavor = "current_thread"`) for CP1. + +**Rationale:** +- `rusqlite::Connection` is `!Sync`, so borrows of it (including `Transaction` guards) are `!Send`, which conflicts with multi-threaded runtimes +- Single-threaded runtime avoids Send bounds entirely +- Concurrency for discussion fetches uses `tokio::task::spawn_local` + `LocalSet` +- Keeps code simple; can upgrade to channel-based DB writer in CP2 if needed + +```rust +// src/main.rs +#[tokio::main(flavor = "current_thread")] +async fn main() -> Result<()> { + // ... +} +``` + +### Issue Ingestion + +```rust +// src/ingestion/issues.rs + +use futures::StreamExt; +use rusqlite::Connection; +use tracing::{debug, info}; + +use crate::core::config::Config; +use crate::core::error::Result; +use crate::core::payloads::store_payload; +use crate::core::time::now_ms; +use crate::gitlab::client::GitLabClient; +use crate::gitlab::transformers::issue::{extract_labels, transform_issue}; + +/// Result of issue ingestion. +#[derive(Debug, Default)] +pub struct IngestIssuesResult { + pub fetched: usize, + pub upserted: usize, + pub labels_created: usize, + /// Issues that need discussion sync (updated_at advanced) + pub issues_needing_discussion_sync: Vec<IssueForDiscussionSync>, +} + +/// Info needed to sync discussions for an issue.
+#[derive(Debug, Clone)] +pub struct IssueForDiscussionSync { + pub local_issue_id: i64, + pub iid: i64, + pub updated_at: i64, +} + +/// Ingest issues for a project. +/// Returns list of issues that need discussion sync. +pub async fn ingest_issues( + conn: &Connection, + client: &GitLabClient, + config: &Config, + project_id: i64, // Local DB project ID + gitlab_project_id: i64, // GitLab project ID +) -> Result<IngestIssuesResult> { + let mut result = IngestIssuesResult::default(); + + // Get current cursor + let cursor = get_cursor(conn, project_id)?; + let cursor_updated_at = cursor.0; + let cursor_gitlab_id = cursor.1; + + let mut last_updated_at: Option<i64> = None; + let mut last_gitlab_id: Option<i64> = None; + let mut issues_in_page: Vec<(i64, i64, i64)> = Vec::new(); // (local_id, iid, updated_at) + + // Fetch issues with pagination + let mut stream = client.paginate_issues( + gitlab_project_id, + cursor_updated_at, + config.sync.cursor_rewind_seconds, + ); + + while let Some(issue_result) = stream.next().await { + let issue = issue_result?; + result.fetched += 1; + + let issue_updated_at = crate::core::time::iso_to_ms(&issue.updated_at); + + // Apply cursor filtering for tuple semantics + if let (Some(cursor_ts), Some(cursor_id)) = (cursor_updated_at, cursor_gitlab_id) { + if issue_updated_at < cursor_ts { + continue; + } + if issue_updated_at == cursor_ts && issue.id <= cursor_id { + continue; + } + } + + // Begin transaction for this issue (atomicity + performance) + let tx = conn.unchecked_transaction()?; + + // Store raw payload + let payload_id = store_payload( + &tx, + project_id, + "issue", + &issue.id.to_string(), + &issue, + config.storage.compress_raw_payloads, + )?; + + // Transform and upsert issue + let normalized = transform_issue(&issue, project_id); + let changes = upsert_issue(&tx, &normalized, payload_id)?; + if changes > 0 { + result.upserted += 1; + } + + // Get local issue ID for label linking + let local_issue_id = get_local_issue_id(&tx,
normalized.gitlab_id)?; + + // Clear existing label links (ensures removed labels are unlinked) + clear_issue_labels(&tx, local_issue_id)?; + + // Extract and upsert labels (name-only for CP1) + let labels = extract_labels(&issue, project_id); + for label in &labels { + let created = upsert_label(&tx, label)?; + if created { + result.labels_created += 1; + } + + // Link issue to label + let label_id = get_label_id(&tx, project_id, &label.name)?; + link_issue_label(&tx, local_issue_id, label_id)?; + } + + tx.commit()?; + + // Track for discussion sync eligibility + issues_in_page.push((local_issue_id, issue.iid, issue_updated_at)); + + // Track for cursor update + last_updated_at = Some(issue_updated_at); + last_gitlab_id = Some(issue.id); + + // Incremental cursor update every 100 issues (page boundary) + // This ensures crashes don't cause massive refetch + if result.fetched % 100 == 0 { + if let (Some(updated_at), Some(gitlab_id)) = (last_updated_at, last_gitlab_id) { + update_cursor(conn, project_id, "issues", updated_at, gitlab_id)?; + } + } + } + + // Final cursor update + if let (Some(updated_at), Some(gitlab_id)) = (last_updated_at, last_gitlab_id) { + update_cursor(conn, project_id, "issues", updated_at, gitlab_id)?; + } + + // Determine which issues need discussion sync (updated_at advanced) + for (local_issue_id, iid, updated_at) in issues_in_page { + let synced_at = get_discussions_synced_at(conn, local_issue_id)?; + if synced_at.is_none() || updated_at > synced_at.unwrap() { + result.issues_needing_discussion_sync.push(IssueForDiscussionSync { + local_issue_id, + iid, + updated_at, + }); + } + } + + info!( + project_id, + fetched = result.fetched, + upserted = result.upserted, + labels_created = result.labels_created, + need_discussion_sync = result.issues_needing_discussion_sync.len(), + "Issue ingestion complete" + ); + + Ok(result) +} + +fn get_cursor(conn: &Connection, project_id: i64) -> Result<(Option<i64>, Option<i64>)> { + let mut stmt = conn.prepare(
"SELECT updated_at_cursor, tie_breaker_id FROM sync_cursors + WHERE project_id = ? AND resource_type = 'issues'" + )?; + + let result = stmt.query_row([project_id], |row| { + Ok((row.get::<_, Option<i64>>(0)?, row.get::<_, Option<i64>>(1)?)) + }); + + match result { + Ok(cursor) => Ok(cursor), + Err(rusqlite::Error::QueryReturnedNoRows) => Ok((None, None)), + Err(e) => Err(e.into()), + } +} + +fn get_discussions_synced_at(conn: &Connection, issue_id: i64) -> Result<Option<i64>> { + let result = conn.query_row( + "SELECT discussions_synced_for_updated_at FROM issues WHERE id = ?", + [issue_id], + |row| row.get(0), + ); + + match result { + Ok(ts) => Ok(ts), + Err(rusqlite::Error::QueryReturnedNoRows) => Ok(None), + Err(e) => Err(e.into()), + } +} + +fn upsert_issue( + conn: &Connection, + issue: &crate::gitlab::transformers::issue::NormalizedIssue, + payload_id: Option<i64>, +) -> Result<usize> { + let changes = conn.execute( + "INSERT INTO issues ( + gitlab_id, project_id, iid, title, description, state, + author_username, created_at, updated_at, last_seen_at, web_url, raw_payload_id + ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12) + ON CONFLICT(gitlab_id) DO UPDATE SET + title = excluded.title, + description = excluded.description, + state = excluded.state, + updated_at = excluded.updated_at, + last_seen_at = excluded.last_seen_at, + raw_payload_id = excluded.raw_payload_id", + rusqlite::params![ + issue.gitlab_id, + issue.project_id, + issue.iid, + issue.title, + issue.description, + issue.state, + issue.author_username, + issue.created_at, + issue.updated_at, + issue.last_seen_at, + issue.web_url, + payload_id, + ], + )?; + Ok(changes) +} + +fn get_local_issue_id(conn: &Connection, gitlab_id: i64) -> Result<i64> { + Ok(conn.query_row( + "SELECT id FROM issues WHERE gitlab_id = ?", + [gitlab_id], + |row| row.get(0), + )?)
+} + +fn clear_issue_labels(conn: &Connection, issue_id: i64) -> Result<()> { + conn.execute("DELETE FROM issue_labels WHERE issue_id = ?", [issue_id])?; + Ok(()) +} + +fn upsert_label( + conn: &Connection, + label: &crate::gitlab::transformers::issue::NormalizedLabel, +) -> Result<bool> { + // CP1: Name-only labels. Color/description columns remain NULL. + let changes = conn.execute( + "INSERT INTO labels (project_id, name) + VALUES (?1, ?2) + ON CONFLICT(project_id, name) DO NOTHING", + rusqlite::params![label.project_id, label.name], + )?; + Ok(changes > 0) +} + +fn get_label_id(conn: &Connection, project_id: i64, name: &str) -> Result<i64> { + Ok(conn.query_row( + "SELECT id FROM labels WHERE project_id = ? AND name = ?", + rusqlite::params![project_id, name], + |row| row.get(0), + )?) +} + +fn link_issue_label(conn: &Connection, issue_id: i64, label_id: i64) -> Result<()> { + conn.execute( + "INSERT OR IGNORE INTO issue_labels (issue_id, label_id) VALUES (?, ?)", + [issue_id, label_id], + )?; + Ok(()) +} + +fn update_cursor( + conn: &Connection, + project_id: i64, + resource_type: &str, + updated_at: i64, + gitlab_id: i64, +) -> Result<()> { + conn.execute( + "INSERT INTO sync_cursors (project_id, resource_type, updated_at_cursor, tie_breaker_id) + VALUES (?1, ?2, ?3, ?4) + ON CONFLICT(project_id, resource_type) DO UPDATE SET + updated_at_cursor = excluded.updated_at_cursor, + tie_breaker_id = excluded.tie_breaker_id", + rusqlite::params![project_id, resource_type, updated_at, gitlab_id], + )?; + Ok(()) +} +``` + +### Discussion Ingestion + +```rust +// src/ingestion/discussions.rs + +use futures::StreamExt; +use rusqlite::Connection; +use tracing::debug; + +use crate::core::config::Config; +use crate::core::error::Result; +use crate::core::payloads::store_payload; +use crate::gitlab::client::GitLabClient; +use crate::gitlab::transformers::discussion::{transform_discussion, transform_notes}; + +/// Result of discussion ingestion for a single issue.
+#[derive(Debug, Default)] +pub struct IngestDiscussionsResult { + pub discussions_fetched: usize, + pub discussions_upserted: usize, + pub notes_upserted: usize, + pub system_notes_count: usize, +} + +/// Ingest discussions for a single issue. +/// Called only when issue.updated_at > discussions_synced_for_updated_at. +pub async fn ingest_issue_discussions( + conn: &Connection, + client: &GitLabClient, + config: &Config, + project_id: i64, + gitlab_project_id: i64, + issue_iid: i64, + local_issue_id: i64, + issue_updated_at: i64, +) -> Result<IngestDiscussionsResult> { + let mut result = IngestDiscussionsResult::default(); + + let mut stream = client.paginate_issue_discussions(gitlab_project_id, issue_iid); + + while let Some(discussion_result) = stream.next().await { + let discussion = discussion_result?; + result.discussions_fetched += 1; + + // Begin transaction for this discussion (atomicity + performance) + let tx = conn.unchecked_transaction()?; + + // Store raw payload for discussion + let discussion_payload_id = store_payload( + &tx, + project_id, + "discussion", + &discussion.id, + &discussion, + config.storage.compress_raw_payloads, + )?; + + // Transform and upsert discussion + let normalized = transform_discussion(&discussion, project_id, local_issue_id); + upsert_discussion(&tx, &normalized, discussion_payload_id)?; + result.discussions_upserted += 1; + + // Get local discussion ID + let local_discussion_id = get_local_discussion_id(&tx, project_id, &discussion.id)?; + + // Transform and upsert notes + let notes = transform_notes(&discussion, project_id); + for note in &notes { + // Store raw payload for note + let gitlab_note = discussion.notes.iter().find(|n| n.id == note.gitlab_id); + let note_payload_id = if let Some(gn) = gitlab_note { + store_payload( + &tx, + project_id, + "note", + &note.gitlab_id.to_string(), + gn, + config.storage.compress_raw_payloads, + )?
+ } else { + None + }; + + upsert_note(&tx, local_discussion_id, note, note_payload_id)?; + result.notes_upserted += 1; + + if note.is_system { + result.system_notes_count += 1; + } + } + + tx.commit()?; + } + + // Mark discussions as synced for this issue version + mark_discussions_synced(conn, local_issue_id, issue_updated_at)?; + + debug!( + project_id, + issue_iid, + discussions = result.discussions_fetched, + notes = result.notes_upserted, + "Issue discussions ingested" + ); + + Ok(result) +} + +fn upsert_discussion( + conn: &Connection, + discussion: &crate::gitlab::transformers::discussion::NormalizedDiscussion, + payload_id: Option<i64>, +) -> Result<()> { + conn.execute( + "INSERT INTO discussions ( + gitlab_discussion_id, project_id, issue_id, noteable_type, + individual_note, first_note_at, last_note_at, last_seen_at, + resolvable, resolved, raw_payload_id + ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11) + ON CONFLICT(project_id, gitlab_discussion_id) DO UPDATE SET + first_note_at = excluded.first_note_at, + last_note_at = excluded.last_note_at, + last_seen_at = excluded.last_seen_at, + resolvable = excluded.resolvable, + resolved = excluded.resolved, + raw_payload_id = excluded.raw_payload_id", + rusqlite::params![ + discussion.gitlab_discussion_id, + discussion.project_id, + discussion.issue_id, + discussion.noteable_type, + discussion.individual_note as i32, + discussion.first_note_at, + discussion.last_note_at, + discussion.last_seen_at, + discussion.resolvable as i32, + discussion.resolved as i32, + payload_id, + ], + )?; + Ok(()) +} + +fn get_local_discussion_id(conn: &Connection, project_id: i64, gitlab_id: &str) -> Result<i64> { + Ok(conn.query_row( + "SELECT id FROM discussions WHERE project_id = ? AND gitlab_discussion_id = ?", + rusqlite::params![project_id, gitlab_id], + |row| row.get(0), + )?)
+} + +fn upsert_note( + conn: &Connection, + discussion_id: i64, + note: &crate::gitlab::transformers::discussion::NormalizedNote, + payload_id: Option<i64>, +) -> Result<()> { + conn.execute( + "INSERT INTO notes ( + gitlab_id, discussion_id, project_id, note_type, is_system, + author_username, body, created_at, updated_at, last_seen_at, + position, resolvable, resolved, resolved_by, resolved_at, + raw_payload_id + ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13, ?14, ?15, ?16) + ON CONFLICT(gitlab_id) DO UPDATE SET + body = excluded.body, + updated_at = excluded.updated_at, + last_seen_at = excluded.last_seen_at, + resolved = excluded.resolved, + resolved_by = excluded.resolved_by, + resolved_at = excluded.resolved_at, + raw_payload_id = excluded.raw_payload_id", + rusqlite::params![ + note.gitlab_id, + discussion_id, + note.project_id, + note.note_type, + note.is_system as i32, + note.author_username, + note.body, + note.created_at, + note.updated_at, + note.last_seen_at, + note.position, + note.resolvable as i32, + note.resolved as i32, + note.resolved_by, + note.resolved_at, + payload_id, + ], + )?; + Ok(()) +} + +fn mark_discussions_synced(conn: &Connection, issue_id: i64, issue_updated_at: i64) -> Result<()> { + conn.execute( + "UPDATE issues SET discussions_synced_for_updated_at = ? WHERE id = ?", + rusqlite::params![issue_updated_at, issue_id], + )?; + Ok(()) +} +``` + +--- + +## CLI Commands + +### `gi ingest --type=issues` + +Fetch and store all issues from configured projects. + +**Clap Definition:** + +```rust +// src/cli/mod.rs (addition to Commands enum) + +#[derive(Subcommand)] +pub enum Commands { + // ... existing commands ...
+ + /// Ingest data from GitLab + Ingest { + /// Resource type to ingest + #[arg(long, value_parser = ["issues", "merge_requests"])] + r#type: String, + + /// Filter to single project + #[arg(long)] + project: Option<String>, + + /// Override stale sync lock + #[arg(long)] + force: bool, + }, + + /// List entities + List { + /// Entity type to list + #[arg(value_parser = ["issues", "mrs"])] + entity: String, + + /// Maximum results + #[arg(long, default_value = "20")] + limit: usize, + + /// Filter by project path + #[arg(long)] + project: Option<String>, + + /// Filter by state + #[arg(long, value_parser = ["opened", "closed", "all"])] + state: Option<String>, + }, + + /// Count entities + Count { + /// Entity type to count + #[arg(value_parser = ["issues", "mrs", "discussions", "notes"])] + entity: String, + + /// Filter by noteable type + #[arg(long, value_parser = ["issue", "mr"])] + r#type: Option<String>, + }, + + /// Show entity details + Show { + /// Entity type + #[arg(value_parser = ["issue", "mr"])] + entity: String, + + /// Entity IID + iid: i64, + + /// Project path (required if ambiguous) + #[arg(long)] + project: Option<String>, + }, +} +``` + +**Output:** +``` +Ingesting issues... + + group/project-one: 1,234 issues fetched, 45 new labels + +Fetching discussions (312 issues with updates)... + + group/project-one: 312 issues → 1,234 discussions, 5,678 notes + +Total: 1,234 issues, 1,234 discussions, 5,678 notes (excluding 1,234 system notes) +Skipped discussion sync for 922 unchanged issues. +``` + +### `gi list issues` + +**Output:** +``` +Issues (showing 20 of 3,801) + + #1234 Authentication redesign opened @johndoe 3 days ago + #1233 Fix memory leak in cache closed @janedoe 5 days ago + #1232 Add dark mode support opened @bobsmith 1 week ago + ...
+``` + +### `gi count issues` + +**Output:** +``` +Issues: 3,801 +``` + +### `gi show issue <iid>` + +**Output:** +``` +Issue #1234: Authentication redesign +━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ + +Project: group/project-one +State: opened +Author: @johndoe +Created: 2024-01-15 +Updated: 2024-03-20 +Labels: enhancement, auth +URL: https://gitlab.example.com/group/project-one/-/issues/1234 + +Description: + We need to redesign the authentication flow to support... + +Discussions (5): + + @janedoe (2024-01-16): + I agree we should move to JWT-based auth... + + @johndoe (2024-01-16): + What about refresh token strategy? + + @bobsmith (2024-01-17): + Have we considered OAuth2? +``` + +--- + +## Automated Tests + +### Unit Tests + +```rust +// tests/issue_transformer_tests.rs + +#[cfg(test)] +mod tests { + use gi::gitlab::transformers::issue::*; + use gi::gitlab::types::*; + + #[test] + fn transforms_gitlab_issue_to_normalized_schema() { /* ... */ } + + #[test] + fn extracts_labels_from_issue_payload() { /* ... */ } + + #[test] + fn handles_missing_optional_fields_gracefully() { /* ... */ } + + #[test] + fn converts_iso_timestamps_to_ms_epoch() { /* ... */ } + + #[test] + fn sets_last_seen_at_to_current_time() { /* ... */ } +} +``` + +```rust +// tests/discussion_transformer_tests.rs + +#[cfg(test)] +mod tests { + use gi::gitlab::transformers::discussion::*; + + #[test] + fn transforms_discussion_payload_to_normalized_schema() { /* ... */ } + + #[test] + fn extracts_notes_array_from_discussion() { /* ... */ } + + #[test] + fn sets_individual_note_flag_correctly() { /* ... */ } + + #[test] + fn flags_system_notes_with_is_system_true() { /* ... */ } + + #[test] + fn preserves_note_order_via_position_field() { /* ... */ } + + #[test] + fn computes_first_note_at_and_last_note_at_correctly() { /* ... */ } + + #[test] + fn computes_resolvable_and_resolved_status() { /* ...
*/ } +} +``` + +```rust +// tests/pagination_tests.rs + +#[cfg(test)] +mod tests { + #[tokio::test] + async fn fetches_all_pages_when_multiple_exist() { /* ... */ } + + #[tokio::test] + async fn respects_per_page_parameter() { /* ... */ } + + #[tokio::test] + async fn follows_x_next_page_header_until_empty() { /* ... */ } + + #[tokio::test] + async fn falls_back_to_empty_page_stop_if_headers_missing() { /* ... */ } + + #[tokio::test] + async fn applies_cursor_rewind_for_tuple_semantics() { /* ... */ } + + #[tokio::test] + async fn clamps_negative_rewind_to_zero() { /* ... */ } +} +``` + +```rust +// tests/label_linkage_tests.rs + +#[cfg(test)] +mod tests { + #[test] + fn clears_existing_labels_before_linking_new_set() { /* ... */ } + + #[test] + fn removes_stale_label_links_on_issue_update() { /* ... */ } + + #[test] + fn handles_issue_with_all_labels_removed() { /* ... */ } + + #[test] + fn preserves_labels_that_still_exist() { /* ... */ } +} +``` + +```rust +// tests/discussion_watermark_tests.rs + +#[cfg(test)] +mod tests { + #[tokio::test] + async fn skips_discussion_fetch_when_updated_at_unchanged() { /* ... */ } + + #[tokio::test] + async fn fetches_discussions_when_updated_at_advanced() { /* ... */ } + + #[tokio::test] + async fn updates_watermark_after_successful_discussion_sync() { /* ... */ } + + #[tokio::test] + async fn does_not_update_watermark_on_discussion_sync_failure() { /* ... */ } +} +``` + +### Integration Tests + +```rust +// tests/issue_ingestion_tests.rs + +#[cfg(test)] +mod tests { + use tempfile::TempDir; + use wiremock::{MockServer, Mock, ResponseTemplate}; + use wiremock::matchers::{method, path_regex}; + + #[tokio::test] + async fn inserts_issues_into_database() { /* ... */ } + + #[tokio::test] + async fn creates_labels_from_issue_payloads() { /* ... */ } + + #[tokio::test] + async fn links_issues_to_labels_via_junction_table() { /* ... */ } + + #[tokio::test] + async fn removes_stale_label_links_on_resync() { /* ... 
*/ } + + #[tokio::test] + async fn stores_raw_payload_for_each_issue() { /* ... */ } + + #[tokio::test] + async fn stores_raw_payload_for_each_discussion() { /* ... */ } + + #[tokio::test] + async fn updates_cursor_incrementally_per_page() { /* ... */ } + + #[tokio::test] + async fn resumes_from_cursor_on_subsequent_runs() { /* ... */ } + + #[tokio::test] + async fn handles_issues_with_no_labels() { /* ... */ } + + #[tokio::test] + async fn upserts_existing_issues_on_refetch() { /* ... */ } + + #[tokio::test] + async fn skips_discussion_refetch_for_unchanged_issues() { /* ... */ } +} +``` + +--- + +## Manual Smoke Tests + +| Command | Expected Output | Pass Criteria | +|---------|-----------------|---------------| +| `gi ingest --type=issues` | Progress bar, final count | Completes without error | +| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author | +| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project | +| `gi count issues` | `Issues: N` | Count matches GitLab UI | +| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL | +| `gi show issue 123` (ambiguous) | Prompt or error | Asks for `--project` clarification | +| `gi count discussions --type=issue` | `Issue Discussions: N` | Non-zero count | +| `gi count notes --type=issue` | `Issue Notes: N (excluding M system)` | Non-zero count | +| `gi sync-status` | Last sync time, cursor positions | Shows successful last run | +| `gi ingest --type=issues` (re-run) | `0 new issues` | Cursor prevents re-fetch | +| `gi ingest --type=issues` (re-run) | `Skipped discussion sync for N unchanged issues` | Watermark prevents refetch | +| `gi ingest --type=issues` (concurrent) | Lock error | Second run fails with clear message | +| Remove label from issue in GitLab, re-sync | Label link removed | Junction table reflects GitLab state | + +--- + +## Data Integrity Checks + +After successful ingestion, verify: + +- 
[ ] `SELECT COUNT(*) FROM issues` matches GitLab issue count for configured projects +- [ ] Every issue has a corresponding `raw_payloads` row +- [ ] Every discussion has a corresponding `raw_payloads` row +- [ ] Labels in `issue_labels` junction all exist in `labels` table +- [ ] `issue_labels` count per issue matches GitLab UI label count +- [ ] `sync_cursors` has entry for each `(project_id, 'issues')` pair +- [ ] Re-running `gi ingest --type=issues` fetches 0 new items (cursor is current) +- [ ] Re-running skips discussion sync for unchanged issues (watermark works) +- [ ] `SELECT COUNT(*) FROM discussions WHERE noteable_type='Issue'` is non-zero +- [ ] Every discussion has at least one note +- [ ] `individual_note = 1` discussions have exactly one note +- [ ] `SELECT COUNT(*) FROM notes WHERE is_system = 1` matches system note count in CLI output +- [ ] After removing a label in GitLab and re-syncing, the link is removed from `issue_labels` + +--- + +## Definition of Done + +### Gate A: Issues Only (Must Pass First) + +- [ ] `gi ingest --type=issues` fetches all issues from configured projects +- [ ] Issues stored with correct schema, including `last_seen_at` +- [ ] Cursor-based sync is resumable (re-run fetches only new/updated) +- [ ] Incremental cursor updates every 100 issues +- [ ] Raw payloads stored for each issue +- [ ] `gi list issues` and `gi count issues` work + +### Gate B: Labels Correct (Must Pass) + +- [ ] Labels extracted and stored (name-only) +- [ ] Label links created correctly +- [ ] **Stale label links removed on re-sync** (verified with test) +- [ ] Label count per issue matches GitLab + +### Gate C: Dependent Discussion Sync (Must Pass) + +- [ ] Discussions fetched for issues with `updated_at` advancement +- [ ] Notes stored with `is_system` flag correctly set +- [ ] Raw payloads stored for discussions and notes +- [ ] `discussions_synced_for_updated_at` watermark updated after sync +- [ ] **Unchanged issues skip discussion refetch** 
(verified with test) +- [ ] Bounded concurrency (`dependent_concurrency` respected) + +### Gate D: Resumability Proof (Must Pass) + +- [ ] Kill mid-run, rerun; bounded redo (cursor progress preserved) +- [ ] No redundant discussion refetch after crash recovery +- [ ] Single-flight lock prevents concurrent runs + +### Final Gate (Must Pass) + +- [ ] All unit tests pass (`cargo test`) +- [ ] All integration tests pass (mocked with wiremock) +- [ ] `cargo clippy` passes with no warnings +- [ ] `cargo fmt --check` passes +- [ ] Compiles with `--release` + +### Hardening (Optional Before CP2) + +- [ ] Edge cases: issues with 0 labels, 0 discussions +- [ ] Large pagination (100+ pages) +- [ ] Rate limit handling under sustained load +- [ ] Live tests pass against real GitLab instance +- [ ] Performance: 1000+ issues ingested in <5 min + +--- + +## Implementation Order + +1. **Runtime decision** (5 min) + - Confirm `#[tokio::main(flavor = "current_thread")]` + - Add note about upgrade path for CP2 if needed + +2. **Cargo.toml updates** (5 min) + - Add `async-stream = "0.3"` and `futures = "0.3"` + +3. **Database migration** (15 min) + - `migrations/002_issues.sql` with `discussions_synced_for_updated_at` column + - `raw_payload_id` on discussions table + - Update `MIGRATIONS` const in `src/core/db.rs` + +4. **GitLab types** (15 min) + - Add types to `src/gitlab/types.rs` (no `labels_details`) + - Test deserialization with fixtures + +5. **Transformers** (25 min) + - `src/gitlab/transformers/mod.rs` + - `src/gitlab/transformers/issue.rs` (simplified NormalizedLabel) + - `src/gitlab/transformers/discussion.rs` + - Unit tests + +6. **GitLab client pagination** (25 min) + - Add `paginate_issues()` with underflow protection + - Add `paginate_issue_discussions()` + - Add `request_with_headers()` helper + +7. 
**Issue ingestion** (45 min) + - `src/ingestion/mod.rs` + - `src/ingestion/issues.rs` with: + - Transaction batching + - `clear_issue_labels()` before linking + - Incremental cursor updates + - Return `issues_needing_discussion_sync` + - Unit + integration tests including label stale-link removal + +8. **Discussion ingestion** (30 min) + - `src/ingestion/discussions.rs` with: + - Transaction batching + - `raw_payload_id` storage + - `mark_discussions_synced()` watermark update + - Integration tests including watermark behavior + +9. **Orchestrator** (30 min) + - `src/ingestion/orchestrator.rs` + - Coordinates issue sync → filter for discussion needs → bounded discussion sync + - Integration tests + +10. **CLI commands** (45 min) + - `gi ingest --type=issues` + - `gi list issues` + - `gi count issues|discussions|notes` + - `gi show issue ` + - Enhanced `gi sync-status` + +11. **Final validation** (20 min) + - `cargo test` + - `cargo clippy` + - Gate A/B/C/D verification + - Manual smoke tests + - Data integrity checks + +--- + +## Risks & Mitigations + +| Risk | Mitigation | +|------|------------| +| GitLab rate limiting during large sync | Respect `Retry-After`, exponential backoff, configurable concurrency | +| Discussion API N+1 problem (thousands of calls) | `dependent_concurrency` config limits parallel requests; watermark prevents refetch | +| Cursor drift if GitLab timestamp behavior changes | Rolling backfill window catches missed items | +| Large issues with 100+ discussions | Paginate discussions, bound memory usage | +| System notes pollute data | `is_system` flag allows filtering | +| Label deduplication across projects | Unique constraint on `(project_id, name)` | +| Stale label links accumulate | `clear_issue_labels()` before linking ensures correctness | +| Async stream complexity | Use `async-stream` crate for ergonomic generators | +| rusqlite + async runtime Send/locking pitfalls | Single-threaded runtime (`current_thread`) avoids Send bounds | +| 
Crash causes massive refetch | Incremental cursor updates every 100 issues | +| Cursor rewind causes discussion refetch | Per-issue watermark (`discussions_synced_for_updated_at`) | +| Timestamp underflow on rewind | Clamp to 0 with `.max(0)` | + +--- + +## API Call Estimation + +For a project with 3,000 issues: +- Issue list: `ceil(3000/100) = 30` calls +- Issue discussions (first run): `3000 × 1.2 average pages = 3,600` calls +- Issue discussions (subsequent runs, 10% updates): `300 × 1.2 = 360` calls +- **Total first run: ~3,630 calls per project** +- **Total subsequent run: ~390 calls per project** (90% savings from watermark) + +At 10 requests/second: +- First run: ~6 minutes per project +- Subsequent run: ~40 seconds per project + +--- + +## References + +- [SPEC.md](../../SPEC.md) - Full system specification +- [checkpoint-0.md](checkpoint-0.md) - Project setup PRD +- [GitLab Issues API](https://docs.gitlab.com/ee/api/issues.html) +- [GitLab Discussions API](https://docs.gitlab.com/ee/api/discussions.html) +- [async-stream crate](https://docs.rs/async-stream) - Async generators for Rust +- [wiremock](https://docs.rs/wiremock) - HTTP mocking for tests
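
The two refetch-avoidance mechanisms that recur throughout this plan — the per-issue `discussions_synced_for_updated_at` watermark (Gate C) and the cursor rewind with underflow clamping (Risks table) — reduce to two small predicates. A minimal sketch with illustrative names; `needs_discussion_sync` and `rewind_cursor` are not the actual gi codebase API, just a restatement of the behavior described above:

```rust
/// An issue's discussions are refetched only when its `updated_at` has
/// advanced past the stored `discussions_synced_for_updated_at` watermark.
/// (Illustrative signature; timestamps shown as Unix seconds.)
fn needs_discussion_sync(updated_at: i64, watermark: Option<i64>) -> bool {
    match watermark {
        None => true,               // discussions never synced for this issue
        Some(w) => updated_at > w,  // refetch only if the issue changed since
    }
}

/// The sync cursor is rewound by `cursorRewindSeconds` for overlap safety,
/// clamped with `.max(0)` so a rewind near epoch cannot underflow.
fn rewind_cursor(cursor_ts: i64, rewind_seconds: i64) -> i64 {
    (cursor_ts - rewind_seconds).max(0)
}

fn main() {
    // First run: no watermark, so discussions are fetched.
    assert!(needs_discussion_sync(100, None));
    // Unchanged issue: watermark equals updated_at, refetch skipped.
    assert!(!needs_discussion_sync(100, Some(100)));
    // Updated issue: watermark is behind, refetch happens.
    assert!(needs_discussion_sync(200, Some(100)));
    // Rewind clamps at zero instead of underflowing.
    assert_eq!(rewind_cursor(1, 2), 0);
}
```

This is why a re-run after a crash is bounded: the cursor caps how many issues are re-listed, and the watermark makes the expensive per-issue discussion calls a no-op for anything that did not change.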