From 7702d2a493aa2d374d5e2da4226800138d2ca49d Mon Sep 17 00:00:00 2001 From: teernisse Date: Tue, 20 Jan 2026 13:11:36 -0500 Subject: [PATCH] initial --- PRD.md | 563 +++++++++++++++++++++++++++++++++++++++++++++++++ SPEC.md | 641 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 1204 insertions(+) create mode 100644 PRD.md create mode 100644 SPEC.md diff --git a/PRD.md b/PRD.md new file mode 100644 index 0000000..1239cf8 --- /dev/null +++ b/PRD.md @@ -0,0 +1,563 @@ +# GitLab Inbox - Product Requirements Document + +## Overview + +**Product Name**: GitLab Inbox +**Version**: 1.0 +**Author**: Taylor Eernisse +**Date**: January 16, 2026 + +### Problem Statement + +Managing GitLab activity with ADHD is overwhelming. The native GitLab interface creates cognitive overload through: + +- **Information scatter**: Issues, MRs, and activity are spread across multiple pages +- **Missing reply awareness**: Hard to know when someone has responded to your question (not fully covered by /todos alone) +- **Context loss**: Difficult to find the right tab or remember which conversation you were tracking +- **No unified "what's next"**: Multiple clicks required to understand what needs attention + +### Solution + +A local, always-open "inbox" application that presents GitLab notifications in an ADHD-friendly interface with explicit "handled" tracking, snooze capabilities, watchlist for awaiting replies, and progress visibility. + +--- + +## Target User + +**Primary Persona**: Software developer with ADHD working on 1-2 GitLab projects who needs to track conversations and respond to mentions, reviews, and assignments without cognitive overload. 
+ +**Key Characteristics**: +- Needs clear "what's next" visibility +- Benefits from external accountability (seeing who's waiting) +- Motivated by progress tracking (watching a list shrink) +- Prefers always-open tools over on-demand checks +- Struggles with context switching and finding the right place +- Needs a "not now but not forgotten" path that doesn't require willpower + +--- + +## Goals + +### User Goals +1. Know immediately when someone has replied or needs my attention +2. Quickly navigate to the right place in GitLab to respond +3. Track what I've handled today for satisfaction and progress awareness +4. Reduce cognitive load of manually tracking conversations +5. Defer items temporarily without losing accountability (snooze) +6. Know when someone has replied to something I'm waiting on + +### Product Goals +1. Reduce time-to-awareness for GitLab notifications +2. Eliminate the need to manually poll GitLab for updates +3. Provide ADHD-friendly UX patterns (clear actions, progress visibility, minimal decisions) +4. Enable keyboard-first operation to reduce friction + +### Non-Goals (v1.0) +- Replacing GitLab for any write operations (commenting, reviewing, merging) +- Supporting multiple GitLab instances +- Team/shared usage +- Mobile support + +--- + +## Core Features + +### 1. Inbox View (Primary) + +**Description**: Display all GitLab todos (notifications) that need attention. + +**Data Source**: GitLab `/todos` API endpoint + +**Display Elements** (per item): +| Element | Description | +|---------|-------------| +| Action Badge | Type indicator: mentioned, assigned, review_requested, build_failed, etc. 
| +| Target Title | MR or Issue title | +| Author | Who triggered this todo (name + avatar) | +| Time | Relative time since created ("2h ago", "3 days") | +| Project | Project name for context | + +**Interactions**: +- **Click item / Enter** → Opens target URL in browser (GitLab) +- **Mark Handled** → Moves item to Done Today (local state only) +- **Snooze** → Hides item until a chosen time (local state only) +- **Dismiss** → `POST /todos/:id/mark_as_done` (marks as done in GitLab) + +**Filtering**: Items marked as "handled" or "snoozed" locally are hidden from Inbox. + +### 2. Snoozed View + +**Description**: Items temporarily deferred until their wake time. + +**Purpose**: +- "Not now but not forgotten" path +- Reduces inbox dread by shrinking the visible list +- Enables focus sessions: clear the deck, then pull from Snoozed intentionally + +**Snooze Options**: +- Later today (3 hours) +- Tomorrow morning (9am local) +- Next weekday (Mon-Fri, 9am) +- Custom date/time + +**Behavior**: +- Snoozed items are hidden from Inbox +- When wake time passes, item returns to Inbox with a "Woke up" indicator +- Snoozed view shows all snoozed items with their wake times + +### 3. Watchlist (Awaiting Reply) + +**Description**: Targets you're explicitly waiting on (MRs/Issues/etc.). Alerts when there is new activity since you last checked. + +**Purpose**: +- GitLab todos don't guarantee "someone replied" notifications +- Explicit watch semantics for "I'm waiting on Bob" tracking +- Gain "external accountability" symmetry + +**Data Sources**: +- Primary: /todos (fast path for items that generate new todos) +- Secondary: per-target `updated_at`/notes polling for watched items (small set) + +**Interactions**: +- Mark Handled → optionally "Add to Watchlist" toggle +- Watch item shows "Last seen" timestamp and "New activity" indicator +- Click to open target in GitLab +- Remove from watchlist when no longer waiting + +### 4. 
Done Today View + +**Description**: Items marked as handled during the current day. + +**Purpose**: +- ADHD-friendly progress visibility +- Satisfaction from watching list shrink +- Review of daily accomplishments + +**Behavior**: +- Stored as date-bucketed ledger keyed by local date (YYYY-MM-DD) +- "Done Today" shows bucket for current local date +- Option to clear today's bucket only +- Historical buckets retained for potential "Done Yesterday" or weekly views + +### 5. Manual Refresh + +**Description**: Button to fetch latest todos on demand. + +**Purpose**: Immediate update when user knows something changed. + +### 6. Background Polling (v1.1) + +**Description**: Automatic periodic refresh of todos. + +**Configuration**: +- Base interval (default: 60s) +- Backoff on failure (exponential, capped at 15m) with jitter +- 429 handling (respect `Retry-After` header; otherwise back off) + +**Indicator**: +- Last successful refresh time +- Next scheduled refresh +- Current backoff state (if any) + +### 7. Keyboard Shortcuts (v1.0) + +**Description**: Keyboard-first operation for reduced friction. + +| Key | Action | +|-----|--------| +| `j` / `k` | Navigate down / up | +| `Enter` | Open selected item in GitLab | +| `h` | Mark handled | +| `s` | Snooze (opens snooze picker) | +| `d` | Dismiss (mark as done in GitLab) | +| `w` | Add to / remove from watchlist | +| `/` | Focus search/filter | + +### 8. Focus Mode (Optional) + +**Description**: Show only the next N items (default 3) to reduce decision load. 
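A minimal sketch of how the Focus Mode queue might be derived, assuming each todo carries an ISO `created_at` string and that locally handled or snoozed ids are available as a set. The names `FocusItem` and `focusQueue` are hypothetical, and the newest-first ordering is an assumption; flip the comparator if the queue should surface the oldest items first.

```typescript
// Hypothetical minimal shape; the real todo object carries more fields.
interface FocusItem {
  id: number;
  created_at: string; // ISO timestamp
}

// Return the items to show in Focus Mode (default: 3).
// Ids in `hidden` (handled or snoozed locally) are skipped.
function focusQueue<T extends FocusItem>(
  todos: readonly T[],
  hidden: ReadonlySet<number>,
  n: number = 3,
): T[] {
  return todos
    .filter((t) => !hidden.has(t.id)) // filter() copies, so the input is not mutated
    // Assumption: newest first. Date.parse returns milliseconds since the epoch.
    .sort((a, b) => Date.parse(b.created_at) - Date.parse(a.created_at))
    .slice(0, Math.max(0, n));
}
```

Because the queue is recomputed from the full list each time, marking an item handled or snoozed makes the next one appear without any extra bookkeeping.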
+ +**Purpose**: +- Convert "overwhelm" into "sequence" +- Reduce choices, increase throughput +- ADHD-optimized: work the queue, don't manage the list + +**Behavior**: +- Primary action emphasized ("Open", then "Handled"/"Snooze") +- Toggle: Focus / All Items +- Focus queue is top N unhandled items by creation date + +--- + +## Technical Architecture + +### Tech Stack + +| Component | Technology | +|-----------|------------| +| Framework | TanStack Start | +| Styling | Tailwind CSS | +| Runtime | Node.js (local) | +| State Persistence | JSON file with atomic writes (local) | +| Secret Storage | OS keychain (preferred) or encrypted local store | +| GitLab Integration | REST API with Personal Access Token | + +### Deployment + +- **Local only**: Runs on localhost +- **No external hosting**: No cloud deployment, no auth flows +- **Single user**: No multi-tenancy + +### GitLab API + +**Authentication**: Personal Access Token (PAT). The `read_api` scope covers fetching todos; the Dismiss action (`mark_as_done`) is a write operation and requires the `api` scope + +**Primary Endpoints**: +``` +GET /api/v4/todos?state=pending&per_page=100 + +POST /api/v4/todos/:id/mark_as_done +``` + +**Response Structure** (relevant fields): +```typescript +interface GitLabTodo { + id: number; + action_name: + | 'assigned' + | 'mentioned' + | 'build_failed' + | 'marked' + | 'approval_required' + | 'unmergeable' + | 'directly_addressed' + | 'merge_train_removed' + | 'member_access_requested' + | string; // forward-compatible for new action types + target_type: 'MergeRequest' | 'Issue' | 'Commit' | 'Epic' | 'DesignManagement::Design' | string; + target: { + id: number; + iid: number; + title: string; + web_url?: string; // optional; may not be present for all target types + }; + target_url: string; // canonical "Open" URL - use this for navigation + author: { + id: number; + name: string; + avatar_url: string; + }; + project: { + id: number; + name: string; + path_with_namespace: string; + }; + created_at: string; +} +``` + +### Local State + +```typescript +interface LocalState { + schemaVersion: 
number; // for migrations + + handledByDate: { + [localDate: string]: { // YYYY-MM-DD in local time + [todoId: number]: { + handledAt: string; // ISO timestamp + todo: GitLabTodo; // Snapshot for Done Today display + } + } + }; + + snoozedTodos: { + [todoId: number]: { + wakeAt: string; // ISO timestamp + snoozedAt: string; // ISO timestamp + todo: GitLabTodo; // snapshot + } + }; + + watchlist: { + [watchKey: string]: { // e.g., "MergeRequest:123" or "Issue:456" + targetType: string; + projectId?: number; + targetId: number; + targetIid?: number; + targetUrl: string; + lastSeenUpdatedAt?: string; // ISO - when we last observed the target + lastCheckedAt?: string; // ISO - when we last polled + addedAt: string; // ISO + muted?: boolean; + } + }; +} +``` + +**Storage**: `~/.config/gitlab-inbox/state.json` + +**Persistence Strategy**: +- Atomic writes: write to `state.json.tmp`, then rename to `state.json` +- Keep `state.json.bak` as last-known-good before each write +- Validate JSON schema on load; if invalid, fall back to backup and surface warning +- Schema version for forward migrations + +--- + +## User Interface + +### Layout + +``` ++--------------------------------------------------+ +| GitLab Inbox [Focus] [Refresh] 🟢 2m ago | +| [Inbox] [Snoozed] [Watchlist] [Done Today] | ++--------------------------------------------------+ +| | +| > [mentioned] Fix login bug | +| Alice Smith · infra-frontend · 2h ago | +| [Snooze] [Handle] [Open] | +| | +| [review_requested] Add caching layer | +| Bob Jones · api-service · 1d ago | +| [Snooze] [Handle] [Open] | +| | +| [assigned] Update documentation | +| Carol White · docs · 3d ago | +| [Snooze] [Handle] [Open] | +| | ++--------------------------------------------------+ +| j/k: navigate Enter: open h: handle s: snooze | ++--------------------------------------------------+ +``` + +### Action Badge Colors + +| Action | Color | Meaning | +|--------|-------|---------| +| mentioned | Blue | Someone mentioned you | +| 
assigned | Purple | Assigned to you | +| approval_required | Yellow | Needs your approval | +| build_failed | Red | Pipeline failure | +| directly_addressed | Cyan | Direct @ mention | +| unmergeable | Orange | MR has conflicts | +| marked | Gray | Marked as todo | +| merge_train_removed | Red | Removed from merge train | +| member_access_requested | Teal | Access request | +| (unknown) | Gray | Forward-compatible fallback | + +### States + +- **Loading**: Skeleton cards while fetching +- **Empty**: "All clear! No pending items." message +- **Error**: Connection error with retry button +- **Stale**: Visual indicator if data is old (> 5 min since last *successful* refresh) +- **Backoff**: Indicator showing retry status when experiencing errors + +--- + +## User Flows + +### Flow 1: Morning Check-in +1. Open GitLab Inbox (already running in background tab) +2. See list of todos sorted by newest first +3. Press `Enter` to open in GitLab +4. Handle the item (reply, review, etc.) +5. Return to Inbox, press `h` to mark handled +6. Item moves to Done Today +7. Repeat until Inbox is empty + +### Flow 2: Triage with Snooze +1. See inbox with 12 items +2. Quickly triage: handle 3, snooze 5 until tomorrow, dismiss 2 already-resolved +3. Inbox now shows 2 items to focus on +4. Tomorrow: snoozed items wake up and return to inbox + +### Flow 3: Awaiting Reply +1. Handle a todo (you replied to someone's question) +2. Toggle "Add to Watchlist" when marking handled +3. Item appears in Watchlist view +4. Later: see "New activity" indicator when they respond +5. Open, read response, remove from watchlist + +### Flow 4: Focus Session +1. Enable Focus Mode +2. See only top 3 items +3. Work through them sequentially +4. As items complete, next ones appear +5. Reduced decision fatigue + +### Flow 5: End-of-Day Review +1. Navigate to Done Today view +2. See all items handled today +3. 
Satisfaction from visible progress + +--- + +## Success Metrics + +| Metric | Target | Measurement | +|--------|--------|-------------| +| Time to awareness | < 2 min | Time from GitLab event to user seeing it | +| Daily items handled | Increased | Compare to baseline (manual tracking) | +| Context switches | Reduced | Fewer GitLab tabs open simultaneously | +| Snooze usage | Regular | Items snoozed vs dismissed (healthy ratio = snooze used) | +| Reply awareness | High | Watchlist items caught before manual check | +| User satisfaction | Qualitative | Does this reduce ADHD-related friction? | + +--- + +## Risks and Mitigations + +| Risk | Impact | Mitigation | +|------|--------|------------| +| GitLab API rate limits | Polling blocked | Configurable interval, backoff + jitter, respect 429/Retry-After | +| Token expiration/rotation | App stops working | Clear error state + setup flow; surface expiry guidance and re-auth path | +| State file corruption | Lose handled/snoozed/watch state | Atomic writes (tmp+rename), schema validation on load, keep last-known-good backup | +| GitLab API changes | App breaks | Pin to known API version, monitor deprecations, forward-compatible types | +| Token leakage | Security incident | Store in OS keychain, not in repo-adjacent files | + +--- + +## Future Considerations (Post v1.0) + +- **Grouping**: By project, by action type +- **Stale highlighting**: Visual alert for items waiting > X days +- **Desktop notifications**: OS-level alerts for new high-priority items +- **Quick actions**: Approve MR, close issue directly from app +- **Multiple GitLab instances**: Connect to both gitlab.com and self-hosted +- **Done history**: View handled items from yesterday, this week + +--- + +## Implementation Phases + +### Phase 0: Setup & Auth +- First-run setup wizard (GitLab URL + token) +- Token storage implementation (keychain/encrypted local) +- Connectivity check (`/todos`, auth failure UX) +- Clear error states for invalid/expired tokens 
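A minimal sketch of how the Phase 0 connectivity check might map the outcome of a `/todos` request onto distinct error states. The type `ConnectivityResult` and the function `classifyTodosCheck` are hypothetical names, and the status-code interpretation reflects typical GitLab behavior rather than a guarantee. Only the integer-seconds form of `Retry-After` is handled here.

```typescript
// Hypothetical result type for the first-run connectivity check.
type ConnectivityResult =
  | { kind: "ok" }
  | { kind: "auth_failed"; hint: string }
  | { kind: "rate_limited"; retryAfterSeconds: number | null }
  | { kind: "unreachable"; hint: string };

// `status` is the HTTP status of GET /api/v4/todos, or null if the request
// never completed (DNS failure, refused connection, timeout).
function classifyTodosCheck(
  status: number | null,
  retryAfter?: string | null,
): ConnectivityResult {
  if (status === null) {
    return { kind: "unreachable", hint: "Could not reach the GitLab URL." };
  }
  if (status >= 200 && status < 300) {
    return { kind: "ok" };
  }
  if (status === 401) {
    // Typically an invalid, revoked, or expired token.
    return { kind: "auth_failed", hint: "Token is invalid or expired." };
  }
  if (status === 403) {
    // Typically a token that lacks the required scope.
    return { kind: "auth_failed", hint: "Token lacks the required scope." };
  }
  if (status === 429) {
    const trimmed = (retryAfter ?? "").trim();
    // Assumption: only the delta-seconds form is parsed; HTTP dates yield null.
    const seconds = /^\d+$/.test(trimmed) ? Number(trimmed) : null;
    return { kind: "rate_limited", retryAfterSeconds: seconds };
  }
  return { kind: "unreachable", hint: `Unexpected HTTP status ${status}.` };
}
```

Keeping the classification a pure function separates it from the HTTP client, so the setup wizard and the later polling loop could share it and it can be unit tested without mocking the network.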
+ +### Phase 1: Foundation +- Initialize TanStack Start project +- Set up Tailwind CSS +- Create GitLab API client with PAT auth +- Fetch and display todos in basic list (using `target_url` for navigation) +- Implement click-to-open + +### Phase 2: Core Workflow +- Add local storage with atomic writes + backup +- Implement date-bucketed handled state +- Implement "Mark Handled" action +- Create Done Today view +- Add keyboard shortcuts (minimal set: j/k/Enter/h/s/d) +- Add Snooze + Snoozed view +- Filter handled/snoozed todos from Inbox + +### Phase 3: Reliability & Awareness +- Background polling with configurable interval +- Backoff/jitter + 429 handling +- Last successful refresh tracking +- Watchlist ("Awaiting Reply") implementation +- Per-target polling for watched items (small set) +- Add manual refresh button +- Relative time display +- Action type badges with colors +- Loading and error states +- Connection status indicator + +### Phase 4: Polish +- Focus Mode implementation +- Snooze time picker refinement +- Keyboard shortcut help overlay +- State migration handling (schemaVersion) +- Edge case handling (DST, timezone changes) + +--- + +## Appendix + +### Environment Configuration + +**Primary configuration**: +- URL + settings in: `~/.config/gitlab-inbox/config.json` +- Token stored in OS keychain (preferred) + +**Optional (dev-only) `.env.local` support**: +```env +GITLAB_URL=https://gitlab.yourcompany.com +GITLAB_TOKEN=glpat-xxxxxxxxxxxx +``` + +**Config file structure**: +```json +{ + "gitlabUrl": "https://gitlab.yourcompany.com", + "pollingInterval": 60, + "focusModeCount": 3 +} +``` + +### Creating a GitLab PAT + +1. Go to GitLab → User Settings → Access Tokens +2. Create token with `read_api` scope (use `api` instead if Dismiss is needed, since `mark_as_done` is a write operation) +3. Set expiration (note: tokens expire at midnight UTC on expiry date) +4. Save token via setup wizard (stored in keychain) +5. 
Token never leaves local machine + +### Project Structure + +``` +gitlab-inbox/ +├── app/ +│ ├── routes/ +│ │ ├── __root.tsx +│ │ ├── index.tsx # Inbox view +│ │ ├── snoozed.tsx # Snoozed view +│ │ ├── watchlist.tsx # Watchlist view +│ │ ├── done.tsx # Done Today view +│ │ └── setup.tsx # First-run setup +│ ├── components/ +│ │ ├── TodoCard.tsx +│ │ ├── TodoList.tsx +│ │ ├── ActionBadge.tsx +│ │ ├── Header.tsx +│ │ ├── SnoozePicker.tsx +│ │ ├── FocusMode.tsx +│ │ └── KeyboardHelp.tsx +│ ├── lib/ +│ │ ├── gitlab.ts # API client +│ │ ├── storage.ts # Atomic state persistence +│ │ ├── keychain.ts # Token storage +│ │ ├── polling.ts # Polling state machine +│ │ ├── snooze.ts # Snooze logic + wake checking +│ │ ├── watchlist.ts # Watchlist polling +│ │ └── types.ts +│ └── app.tsx +├── package.json +├── tailwind.config.ts +└── vite.config.ts +``` + +### Test Strategy + +**Unit Tests**: +- State normalization and migration +- Snooze wake time calculations +- Date bucketing logic (timezone handling) +- Polling backoff calculations + +**Integration Tests** (mocked GitLab API): +- `/todos` response parsing +- `mark_as_done` endpoint calls +- Error handling (401, 429, network errors) +- State persistence round-trip (write + read) +- Backup recovery on corruption + +**Manual Testing**: +- First-run setup flow +- Keyboard navigation +- Snooze + wake cycle +- Watchlist activity detection diff --git a/SPEC.md b/SPEC.md new file mode 100644 index 0000000..3279dc8 --- /dev/null +++ b/SPEC.md @@ -0,0 +1,641 @@ +# GitLab Knowledge Engine - Spec Document + +## Executive Summary + +A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP. 
+ +--- + +## Discovery Summary + +### Pain Points Identified +1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find +2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions + +### Constraints +| Constraint | Detail | +|------------|--------| +| Hosting | Self-hosted only, no external APIs | +| Compute | Local dev machine (M-series Mac assumed) | +| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) | +| Build Method | AI agents will implement; user is TypeScript expert for review | + +### Target Use Cases (Priority Order) +1. **MVP: Semantic Search** - "Find discussions about authentication redesign" +2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?" +3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?" +4. **Future: Person Context** - "What's @johndoe's background in this project?" + +--- + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ GitLab API │ +│ (Issues, MRs, Notes) │ +└─────────────────────────────────────────────────────────────────┘ + (Commit-level indexing explicitly post-MVP) + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Data Ingestion Layer │ +│ - Incremental sync (PAT-based polling) │ +│ - Rate limiting / backoff │ +│ - Raw JSON storage for replay │ +│ - Dependent resource fetching (notes, MR changes) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Data Processing Layer │ +│ - Normalize artifacts to unified schema │ +│ - Extract searchable documents (canonical text + metadata) │ +│ - Content hashing for change detection │ +│ - Build relationship graph (issue↔MR↔note↔file) │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ 
+┌─────────────────────────────────────────────────────────────────┐ +│ Storage Layer │ +│ - SQLite + sqlite-vss + FTS5 (hybrid search) │ +│ - Structured metadata in relational tables │ +│ - Vector embeddings for semantic search │ +│ - Full-text index for lexical search fallback │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Query Interface │ +│ - CLI for human testing │ +│ - JSON API for AI agent testing │ +│ - Semantic search with filters (author, date, type, label) │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Technology Choices + +| Component | Recommendation | Rationale | +|-----------|---------------|-----------| +| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly | +| Database | SQLite + sqlite-vss | Zero-config, portable, vector search built-in | +| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors | +| CLI Framework | Commander.js or oclif | Standard, well-documented | + +### Alternative Considered: Postgres + pgvector +- Pros: More scalable, better for production multi-user +- Cons: Requires running Postgres, heavier setup +- Decision: Start with SQLite for simplicity; migration path exists if needed + +--- + +## Checkpoint Structure + +Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding. + +### Checkpoint 0: Project Setup +**Deliverable:** Scaffolded project with GitLab API connection verified + +**Tests:** +1. Run `gitlab-engine auth-test` → returns authenticated user info +2. 
Run `gitlab-engine doctor` → verifies: + - Can reach GitLab baseUrl + - PAT is present and can read configured projects + - SQLite opens DB and migrations apply + - Ollama reachable OR embedding disabled with clear warning + +**Scope:** +- Project structure (TypeScript, ESLint, Vitest) +- GitLab API client with PAT authentication +- Environment and project configuration +- Basic CLI scaffold with `auth-test` command +- `doctor` command for environment verification +- Projects table and initial sync + +**Configuration (MVP):** +```json +// gitlab-engine.config.json +{ + "gitlab": { + "baseUrl": "https://gitlab.example.com", + "tokenEnvVar": "GITLAB_TOKEN" + }, + "projects": [ + { "path": "group/project-one" }, + { "path": "group/project-two" } + ], + "embedding": { + "provider": "ollama", + "model": "nomic-embed-text", + "baseUrl": "http://localhost:11434" + } +} +``` + +**DB Runtime Defaults (Checkpoint 0):** +- On every connection: + - `PRAGMA journal_mode=WAL;` + - `PRAGMA foreign_keys=ON;` + +**Schema (Checkpoint 0):** +```sql +-- Projects table (configured targets) +CREATE TABLE projects ( + id INTEGER PRIMARY KEY, + gitlab_project_id INTEGER UNIQUE NOT NULL, + path_with_namespace TEXT NOT NULL, + default_branch TEXT, + web_url TEXT, + created_at INTEGER, + updated_at INTEGER, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_projects_path ON projects(path_with_namespace); + +-- Sync tracking for reliability +CREATE TABLE sync_runs ( + id INTEGER PRIMARY KEY, + started_at INTEGER NOT NULL, + finished_at INTEGER, + status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed' + command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc. 
+ error TEXT +); + +-- Sync cursors for primary resources only +-- Notes and MR changes are dependent resources (fetched via parent updates) +CREATE TABLE sync_cursors ( + project_id INTEGER NOT NULL REFERENCES projects(id), + resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests' + updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch) + tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering) + PRIMARY KEY(project_id, resource_type) +); + +-- Raw payload storage (decoupled from entity tables) +CREATE TABLE raw_payloads ( + id INTEGER PRIMARY KEY, + source TEXT NOT NULL, -- 'gitlab' + resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' + gitlab_id INTEGER NOT NULL, + fetched_at INTEGER NOT NULL, + json TEXT NOT NULL +); +CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id); +``` + +--- + +### Checkpoint 1: Issue Ingestion +**Deliverable:** All issues from target repos stored locally + +**Test:** Run `gitlab-engine ingest --type=issues` → count matches GitLab; run `gitlab-engine list issues --limit=10` → displays issues correctly + +**Scope:** +- Issue fetcher with pagination handling +- Raw JSON storage in raw_payloads table +- Normalized issue schema in SQLite +- Labels ingestion derived from issue payload: + - Always persist label names from `labels: string[]` + - Optionally request `with_labels_details=true` to capture color/description when available +- Incremental sync support (run tracking + per-project cursor) +- Basic list/count CLI commands + +**Reliability/Idempotency Rules:** +- Every ingest/sync creates a `sync_runs` row +- Single-flight: refuse to start if an existing run is `running` (unless `--force`) +- Cursor advances only after successful transaction commit per page/batch +- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC` +- Use explicit transactions for batch inserts + +**Schema Preview:** +```sql +CREATE TABLE issues ( + id INTEGER PRIMARY KEY, + 
gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, + author_username TEXT, + created_at INTEGER, + updated_at INTEGER, + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at); +CREATE INDEX idx_issues_author ON issues(author_username); + +-- Labels are derived from issue payloads (string array) +-- Uniqueness is (project_id, name) since gitlab_id isn't always available +CREATE TABLE labels ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER, -- optional (only if available) + project_id INTEGER NOT NULL REFERENCES projects(id), + name TEXT NOT NULL, + color TEXT, + description TEXT +); +CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name); +CREATE INDEX idx_labels_name ON labels(name); + +CREATE TABLE issue_labels ( + issue_id INTEGER REFERENCES issues(id), + label_id INTEGER REFERENCES labels(id), + PRIMARY KEY(issue_id, label_id) +); +CREATE INDEX idx_issue_labels_label ON issue_labels(label_id); +``` + +--- + +### Checkpoint 2: MR + Comments + File Links Ingestion +**Deliverable:** All MRs, discussion threads, and file-change links stored locally + +**Test:** Run `gitlab-engine ingest --type=merge_requests` → count matches; run `gitlab-engine show mr 1234` → displays MR with comments and files changed + +**Scope:** +- MR fetcher with pagination +- Notes fetcher (issue notes + MR notes) as a dependent resource: + - During initial ingest: fetch notes for every issue/MR + - During sync: refetch notes only for issues/MRs updated since cursor +- MR changes/diffs fetcher as a dependent resource: + - During initial ingest: fetch changes for every MR + - During sync: refetch changes only for MRs updated since cursor +- Relationship linking (note → parent issue/MR via foreign keys, MR → files) +- Extended CLI commands for MR display + +**Schema Additions:** 
+```sql +CREATE TABLE merge_requests ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + iid INTEGER NOT NULL, + title TEXT, + description TEXT, + state TEXT, + author_username TEXT, + source_branch TEXT, + target_branch TEXT, + created_at INTEGER, + updated_at INTEGER, + merged_at INTEGER, + web_url TEXT, + raw_payload_id INTEGER REFERENCES raw_payloads(id) +); +CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at); +CREATE INDEX idx_mrs_author ON merge_requests(author_username); + +-- Notes with explicit parent foreign keys for referential integrity +CREATE TABLE notes ( + id INTEGER PRIMARY KEY, + gitlab_id INTEGER UNIQUE NOT NULL, + project_id INTEGER NOT NULL REFERENCES projects(id), + issue_id INTEGER REFERENCES issues(id), + merge_request_id INTEGER REFERENCES merge_requests(id), + noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest' + noteable_iid INTEGER NOT NULL, -- parent IID (from API path) + author_username TEXT, + body TEXT, + created_at INTEGER, + updated_at INTEGER, + system BOOLEAN, + raw_payload_id INTEGER REFERENCES raw_payloads(id), + -- Exactly one parent FK must be set + CHECK ( + (noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR + (noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL) + ) +); +CREATE INDEX idx_notes_issue ON notes(issue_id); +CREATE INDEX idx_notes_mr ON notes(merge_request_id); +CREATE INDEX idx_notes_author ON notes(author_username); + +-- File linkage for "what MRs touched this file?" 
queries (with rename support) +CREATE TABLE mr_files ( + id INTEGER PRIMARY KEY, + merge_request_id INTEGER REFERENCES merge_requests(id), + old_path TEXT, + new_path TEXT, + new_file BOOLEAN, + deleted_file BOOLEAN, + renamed_file BOOLEAN, + UNIQUE(merge_request_id, old_path, new_path) +); +CREATE INDEX idx_mr_files_old_path ON mr_files(old_path); +CREATE INDEX idx_mr_files_new_path ON mr_files(new_path); + +-- MR labels (reuse same labels table) +CREATE TABLE mr_labels ( + merge_request_id INTEGER REFERENCES merge_requests(id), + label_id INTEGER REFERENCES labels(id), + PRIMARY KEY(merge_request_id, label_id) +); +CREATE INDEX idx_mr_labels_label ON mr_labels(label_id); +``` + +--- + +### Checkpoint 3: Embedding Generation +**Deliverable:** Vector embeddings generated for all text content + +**Test:** Run `gitlab-engine embed --all` → progress indicator; run `gitlab-engine stats` → shows embedding coverage percentage + +**Scope:** +- Ollama integration (nomic-embed-text model) +- Embedding generation pipeline (batch processing) +- Vector storage in SQLite (sqlite-vss extension) +- Progress tracking and resumability +- Document extraction layer: + - Canonical "search documents" derived from issues/MRs/notes + - Stable content hashing for change detection (SHA-256 of content_text) + - Single embedding per document (chunking deferred to post-MVP) +- Denormalized metadata for fast filtering (author, labels, dates) +- Fast label filtering via `document_labels` join table + +**Schema Additions:** +```sql +-- Unified searchable documents (derived from issues/MRs/notes) +CREATE TABLE documents ( + id INTEGER PRIMARY KEY, + source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'note' + source_id INTEGER NOT NULL, -- local DB id in the source table + project_id INTEGER NOT NULL REFERENCES projects(id), + author_username TEXT, + label_names TEXT, -- JSON array (display/debug only) + created_at INTEGER, + updated_at INTEGER, + url TEXT, + title TEXT, -- null for notes + 
content_text TEXT NOT NULL, -- canonical text for embedding/snippets + content_hash TEXT NOT NULL, -- SHA-256 for change detection + UNIQUE(source_type, source_id) +); +CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at); +CREATE INDEX idx_documents_author ON documents(author_username); +CREATE INDEX idx_documents_source ON documents(source_type, source_id); + +-- Fast label filtering for documents (indexed exact-match) +CREATE TABLE document_labels ( + document_id INTEGER NOT NULL REFERENCES documents(id), + label_name TEXT NOT NULL, + PRIMARY KEY(document_id, label_name) +); +CREATE INDEX idx_document_labels_label ON document_labels(label_name); + +-- sqlite-vss virtual table +-- Storage rule: embeddings.rowid = documents.id +CREATE VIRTUAL TABLE embeddings USING vss0( + embedding(768) +); + +-- Embedding provenance + change detection +-- document_id is PRIMARY KEY and equals embeddings.rowid +CREATE TABLE embedding_metadata ( + document_id INTEGER PRIMARY KEY REFERENCES documents(id), + model TEXT NOT NULL, -- 'nomic-embed-text' + dims INTEGER NOT NULL, -- 768 + content_hash TEXT NOT NULL, -- copied from documents.content_hash + created_at INTEGER NOT NULL +); +``` + +**Storage Rule (MVP):** +- Insert embedding with `rowid = documents.id` +- Upsert `embedding_metadata` by `document_id` +- This alignment simplifies joins and eliminates rowid mapping fragility + +**Document Extraction Rules:** +- Issue → title + "\n\n" + description +- MR → title + "\n\n" + description +- Note → body (skip system notes unless they contain meaningful content) + +--- + +### Checkpoint 4: Semantic Search +**Deliverable:** Working semantic search across all indexed content + +**Tests:** +1. Run `gitlab-engine search "authentication redesign"` → returns ranked results with snippets +2. Golden queries: curated list of 10 queries with expected result *containment* (e.g., "at least one of these 3 known URLs appears in top 10") +3. `gitlab-engine search "..." 
--json` validates against JSON schema (stable fields present) + +**Scope:** +- Hybrid retrieval: + - Vector recall (sqlite-vss) + FTS lexical recall (fts5) + - Merge + rerank results using Reciprocal Rank Fusion (RRF) +- Result ranking and scoring (document-level) +- Search filters: `--type=issue|mr|note`, `--author=username`, `--after=date`, `--label=name` + - Label filtering operates on `document_labels` (indexed, exact-match) +- Output formatting: ranked list with title, snippet, score, URL +- JSON output mode for AI agent consumption + +**Schema Additions:** +```sql +-- Full-text search for hybrid retrieval +CREATE VIRTUAL TABLE documents_fts USING fts5( + title, + content_text, + content='documents', + content_rowid='id' +); + +-- Triggers to keep FTS in sync +CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN + INSERT INTO documents_fts(rowid, title, content_text) + VALUES (new.id, new.title, new.content_text); +END; + +CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN + INSERT INTO documents_fts(documents_fts, rowid, title, content_text) + VALUES('delete', old.id, old.title, old.content_text); +END; + +CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN + INSERT INTO documents_fts(documents_fts, rowid, title, content_text) + VALUES('delete', old.id, old.title, old.content_text); + INSERT INTO documents_fts(rowid, title, content_text) + VALUES (new.id, new.title, new.content_text); +END; +``` + +**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:** +1. Query both vector index (top 50) and FTS5 (top 50) +2. Merge results by document_id +3. Combine with Reciprocal Rank Fusion (RRF): + - For each retriever list, assign ranks (1..N) + - `rrf_score = Σ 1 / (k + rank)` with k=60 (tunable) + - RRF is simpler than weighted sums and doesn't require score normalization +4. Apply filters (type, author, date, label) +5. 
Return top K + +**Why RRF over Weighted Sums:** +- FTS5 BM25 scores and vector distances use different scales +- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization +- RRF operates on ranks, not scores, making it robust to scale differences +- Well-established in information retrieval literature + +**CLI Interface:** +```bash +# Basic semantic search +gitlab-engine search "why did we choose Redis" + +# Pure FTS search (fallback if embeddings unavailable) +gitlab-engine search "redis" --mode=lexical + +# Filtered search +gitlab-engine search "authentication" --type=mr --after=2024-01-01 + +# Filter by label +gitlab-engine search "performance" --label=bug --label=critical + +# JSON output for programmatic use +gitlab-engine search "payment processing" --json +``` + +--- + +### Checkpoint 5: Incremental Sync +**Deliverable:** Efficient ongoing synchronization with GitLab + +**Test:** Make a change in GitLab; run `gitlab-engine sync` → only fetches changed items; verify change appears in search + +**Scope:** +- Delta sync based on stable cursor (updated_at + tie-breaker id) +- Dependent resources sync strategy (notes, MR changes) +- Webhook handler (optional, if webhook access granted) +- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash) +- Sync status reporting + +**Correctness Rules (MVP):** +1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC` +2. Cursor advances only after successful DB commit for that page +3. Dependent resources: + - For each updated issue/MR, refetch its notes (sorted by `updated_at`) + - For each updated MR, refetch its file changes +4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash` +5. 
Sync run is marked 'failed' with error message if any page fails (can resume from cursor) + +**Why Dependent Resource Model:** +- GitLab Notes API doesn't provide a clean global `updated_after` stream +- Notes are listed per-issue or per-MR, not as a top-level resource +- Treating notes as dependent resources (refetch when parent updates) is simpler and more correct +- Same applies to MR changes/diffs + +**CLI Commands:** +```bash +# Full sync (respects cursors, only fetches new/updated) +gitlab-engine sync + +# Force full re-sync (resets cursors) +gitlab-engine sync --full + +# Override stale 'running' run after operator review +gitlab-engine sync --force + +# Show sync status +gitlab-engine sync-status +``` + +--- + +## Future Checkpoints (Post-MVP) + +### Checkpoint 6: File/Feature History View +- Map commits to MRs to discussions +- Query: "Show decision history for src/auth/login.ts" +- Ship `gitlab-engine file-history ` as a first-class feature here +- This command is deferred from MVP to sharpen checkpoint focus + +### Checkpoint 7: Personal Dashboard +- Filter by assigned/mentioned +- Integrate with existing gitlab-inbox tool + +### Checkpoint 8: Person Context +- Aggregate contributions by author +- Expertise inference from activity + +### Checkpoint 9: Decision Graph +- Extract decisions from discussions (LLM-assisted) +- Visualize decision relationships + +--- + +## Verification Strategy + +Each checkpoint includes: + +1. **Automated tests** - Unit tests for data transformations, integration tests for API calls +2. **CLI smoke tests** - Manual commands with expected outputs documented +3. **Data integrity checks** - Count verification against GitLab, schema validation +4. 
**Search quality tests** - Known queries with expected results (for Checkpoint 4+) + +--- + +## Risk Mitigation + +| Risk | Mitigation | +|------|------------| +| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync | +| Embedding model quality | Start with nomic-embed-text; architecture allows model swap | +| SQLite scale limits | Monitor performance; Postgres migration path documented | +| Stale data | Incremental sync with change detection | +| Mid-sync failures | Cursor-based resumption, sync_runs audit trail | +| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite | +| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) | + +**SQLite Performance Defaults (MVP):** +- Enable `PRAGMA journal_mode=WAL;` on every connection +- Enable `PRAGMA foreign_keys=ON;` on every connection +- Use explicit transactions for page/batch inserts +- Targeted indexes on `(project_id, updated_at)` for primary resources + +--- + +## Schema Summary + +| Table | Checkpoint | Purpose | +|-------|------------|---------| +| projects | 0 | Configured GitLab projects | +| sync_runs | 0 | Audit trail of sync operations | +| sync_cursors | 0 | Resumable sync state per primary resource | +| raw_payloads | 0 | Decoupled raw JSON storage | +| issues | 1 | Normalized issues | +| labels | 1 | Label definitions (unique by project + name) | +| issue_labels | 1 | Issue-label junction | +| merge_requests | 2 | Normalized MRs | +| notes | 2 | Issue and MR comments (with parent FKs) | +| mr_files | 2 | MR file changes (with rename tracking) | +| mr_labels | 2 | MR-label junction | +| documents | 3 | Unified searchable documents | +| document_labels | 3 | Document-label junction for fast filtering | +| embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) | +| embedding_metadata | 3 | Embedding provenance + change detection | +| documents_fts | 4 | Full-text search index (fts5) | + 
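
The SQLite performance defaults listed under Risk Mitigation could be applied through a small connection helper. The following is a minimal sketch in Python, assuming the standard-library `sqlite3` module; `open_db`, `insert_page`, and the `demo_items` table are hypothetical names used only for illustration and are not part of this spec's schema.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    """Open a connection and apply the MVP pragmas (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # WAL is persisted in the database file, but re-issuing it is harmless.
    conn.execute("PRAGMA journal_mode=WAL;")
    # foreign_keys is per-connection and defaults to OFF, so set it every time.
    conn.execute("PRAGMA foreign_keys=ON;")
    return conn


def insert_page(conn: sqlite3.Connection, rows) -> None:
    """Insert one page of rows atomically.

    `demo_items` is an illustrative table, not one defined in this spec.
    """
    # The connection context manager commits on success and rolls back
    # on any exception, so a failed page leaves no partial rows behind.
    with conn:
        conn.executemany(
            "INSERT INTO demo_items (id, name) VALUES (?, ?)", rows
        )
```

Keeping the pragmas in one helper means every code path that opens the database gets the same settings, which matters most for `foreign_keys` because it is not persisted.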
+--- + +## Resolved Decisions + +| Question | Decision | Rationale | +|----------|----------|-----------| +| Commit/file linkage | **Include MR→file links** | Enables "what MRs touched this file?" without full commit history | +| Labels | **Index as filters** | Labels are well-used; `document_labels` table enables fast `--label=X` filtering | +| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available | +| Sync method | **Polling for MVP** | Decide on webhooks after using the system | +| Notes sync | **Dependent resource** | Notes API is per-parent, not global; refetch on parent update | +| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed | +| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts | +| file-history CLI | **Post-MVP (CP6)** | Sharpens MVP checkpoint focus | + +--- + +## Next Steps + +1. User approves this spec +2. Generate Checkpoint 0 PRD for project setup +3. Implement Checkpoint 0 +4. Human validates → proceed to Checkpoint 1 +5. Repeat for each checkpoint
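
For reference while implementing Checkpoint 4, the Reciprocal Rank Fusion step described there (`rrf_score = Σ 1 / (k + rank)` with k=60) can be sketched as follows. This is a minimal illustration, not the project's implementation: `rrf_merge` is a hypothetical name, the inputs are assumed to be lists of document ids already ordered best-first by each retriever, and the filter step (type, author, date, label) is omitted.

```python
from collections import defaultdict


def rrf_merge(ranked_lists, k=60, top_k=10):
    """Fuse several best-first lists of document ids with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Returns (doc_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by doc_id for stable output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


if __name__ == "__main__":
    vector_hits = [1, 2, 3]  # assumed ids from the vector index, best first
    fts_hits = [3, 1, 4]     # assumed ids from FTS5, best first
    for doc_id, score in rrf_merge([vector_hits, fts_hits]):
        print(doc_id, round(score, 6))
```

Because only ranks are used, neither BM25 scores nor vector distances need to be normalized before fusion; documents found by both retrievers naturally rise above those found by only one.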