teernisse
2026-01-20 13:11:36 -05:00
commit 7702d2a493
2 changed files with 1204 additions and 0 deletions

563
PRD.md Normal file

@@ -0,0 +1,563 @@
# GitLab Inbox - Product Requirements Document
## Overview
**Product Name**: GitLab Inbox
**Version**: 1.0
**Author**: Taylor Eernisse
**Date**: January 16, 2026
### Problem Statement
Managing GitLab activity with ADHD is overwhelming. The native GitLab interface creates cognitive overload through:
- **Information scatter**: Issues, MRs, and activity are spread across multiple pages
- **Missing reply awareness**: Hard to know when someone has responded to your question (not fully covered by /todos alone)
- **Context loss**: Difficult to find the right tab or remember which conversation you were tracking
- **No unified "what's next"**: Multiple clicks required to understand what needs attention
### Solution
A local, always-open "inbox" application that presents GitLab notifications in an ADHD-friendly interface with explicit "handled" tracking, snooze capabilities, watchlist for awaiting replies, and progress visibility.
---
## Target User
**Primary Persona**: Software developer with ADHD working on 1-2 GitLab projects who needs to track conversations and respond to mentions, reviews, and assignments without cognitive overload.
**Key Characteristics**:
- Needs clear "what's next" visibility
- Benefits from external accountability (seeing who's waiting)
- Motivated by progress tracking (watching a list shrink)
- Prefers always-open tools over on-demand checks
- Struggles with context switching and finding the right place
- Needs a "not now but not forgotten" path that doesn't require willpower
---
## Goals
### User Goals
1. Know immediately when someone has replied or needs my attention
2. Quickly navigate to the right place in GitLab to respond
3. Track what I've handled today for satisfaction and progress awareness
4. Reduce cognitive load of manually tracking conversations
5. Defer items temporarily without losing accountability (snooze)
6. Know when someone has replied to something I'm waiting on
### Product Goals
1. Reduce time-to-awareness for GitLab notifications
2. Eliminate the need to manually poll GitLab for updates
3. Provide ADHD-friendly UX patterns (clear actions, progress visibility, minimal decisions)
4. Enable keyboard-first operation to reduce friction
### Non-Goals (v1.0)
- Replacing GitLab for any write operations (commenting, reviewing, merging)
- Supporting multiple GitLab instances
- Team/shared usage
- Mobile support
---
## Core Features
### 1. Inbox View (Primary)
**Description**: Display all GitLab todos (notifications) that need attention.
**Data Source**: GitLab `/todos` API endpoint
**Display Elements** (per item):
| Element | Description |
|---------|-------------|
| Action Badge | Type indicator: mentioned, assigned, review_requested, build_failed, etc. |
| Target Title | MR or Issue title |
| Author | Who triggered this todo (name + avatar) |
| Time | Relative time since created ("2h ago", "3 days") |
| Project | Project name for context |
**Interactions**:
- **Click item / Enter** → Opens target URL in browser (GitLab)
- **Mark Handled** → Moves item to Done Today (local state only)
- **Snooze** → Hides item until a chosen time (local state only)
- **Dismiss** → `POST /todos/:id/mark_as_done` (marks as done in GitLab)
**Filtering**: Items marked as "handled" or "snoozed" locally are hidden from Inbox.
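A minimal sketch of this filtering rule, assuming simplified, hypothetical shapes (`Todo`, `TriageState`, `InboxItem`) rather than the full local state. It also treats a snoozed item whose wake time has passed as visible again and flags it, which is how the Snoozed view describes the "Woke up" indicator.

```typescript
// Hypothetical minimal shapes for illustration only.
interface Todo {
  id: number;
  created_at: string;
}

interface TriageState {
  handledIds: Set<number>;
  snoozedUntil: Map<number, string>; // todo id -> ISO wake time
}

interface InboxItem {
  todo: Todo;
  wokeUp: boolean; // true when the item just returned from a snooze
}

function visibleInbox(todos: Todo[], state: TriageState, now: Date): InboxItem[] {
  const items: InboxItem[] = [];
  for (const todo of todos) {
    // Handled items never show in the Inbox.
    if (state.handledIds.has(todo.id)) continue;

    const wakeAt = state.snoozedUntil.get(todo.id);
    if (wakeAt !== undefined) {
      // Still snoozed: keep hidden until the wake time passes.
      if (Date.parse(wakeAt) > now.getTime()) continue;
      items.push({ todo, wokeUp: true });
      continue;
    }
    items.push({ todo, wokeUp: false });
  }
  return items;
}
```

Passing `now` in explicitly keeps the function pure, so wake behavior can be unit tested without faking the clock.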
### 2. Snoozed View
**Description**: Items temporarily deferred until their wake time.
**Purpose**:
- "Not now but not forgotten" path
- Reduces inbox dread by shrinking the visible list
- Enables focus sessions: clear the deck, then pull from Snoozed intentionally
**Snooze Options**:
- Later today (3 hours)
- Tomorrow morning (9am local)
- Next weekday (Mon-Fri, 9am)
- Custom date/time
**Behavior**:
- Snoozed items are hidden from Inbox
- When wake time passes, item returns to Inbox with a "Woke up" indicator
- Snoozed view shows all snoozed items with their wake times
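The preset snooze options can be sketched as a pure function. This is one possible reading, assuming "Next weekday" means the next calendar day that falls on Mon-Fri; `SnoozeOption` and `computeWakeAt` are hypothetical names, and custom date/time is left to the picker.

```typescript
type SnoozeOption = "laterToday" | "tomorrowMorning" | "nextWeekday";

function computeWakeAt(option: SnoozeOption, now: Date): Date {
  if (option === "laterToday") {
    // Fixed offset: three hours from now.
    return new Date(now.getTime() + 3 * 60 * 60 * 1000);
  }

  // Both remaining options wake at 9am local time on a later calendar day.
  // Working with local date parts (not millisecond offsets) keeps 9am stable
  // across DST changes.
  const wake = new Date(now.getTime());
  wake.setDate(wake.getDate() + 1);
  wake.setHours(9, 0, 0, 0);

  if (option === "nextWeekday") {
    // getDay(): 0 = Sunday, 6 = Saturday.
    while (wake.getDay() === 0 || wake.getDay() === 6) {
      wake.setDate(wake.getDate() + 1);
    }
  }
  return wake;
}
```

For example, snoozing on a Friday afternoon with `"nextWeekday"` yields the following Monday at 9am local time.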
### 3. Watchlist (Awaiting Reply)
**Description**: Targets you're explicitly waiting on (MRs/Issues/etc.). Alerts when there is new activity since you last checked.
**Purpose**:
- GitLab todos don't guarantee "someone replied" notifications
- Explicit watch semantics for "I'm waiting on Bob" tracking
- Adds symmetry to "external accountability": see who you are waiting on, not only who is waiting on you
**Data Sources**:
- Primary: `/todos` (fast path for items that generate new todos)
- Secondary: per-target `updated_at`/notes polling for watched items (small set)
**Interactions**:
- Mark Handled → optionally "Add to Watchlist" toggle
- Watch item shows "Last seen" timestamp and "New activity" indicator
- Click to open target in GitLab
- Remove from watchlist when no longer waiting
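The "New activity" indicator could be derived by comparing the target's current `updated_at` with the last value the user has seen. A minimal sketch, assuming a hypothetical `WatchEntry` shape and assuming that the first poll only establishes a baseline:

```typescript
interface WatchEntry {
  lastSeenUpdatedAt?: string; // ISO timestamp of the last observed update
}

// True when the target changed after the user last looked at it.
function hasNewActivity(entry: WatchEntry, targetUpdatedAt: string): boolean {
  // Assumption: with no baseline yet, nothing counts as "new".
  if (entry.lastSeenUpdatedAt === undefined) return false;
  return Date.parse(targetUpdatedAt) > Date.parse(entry.lastSeenUpdatedAt);
}

// Called when the user opens the target: the current state becomes the baseline.
function markSeen(entry: WatchEntry, targetUpdatedAt: string): WatchEntry {
  return { ...entry, lastSeenUpdatedAt: targetUpdatedAt };
}
```

Comparing parsed timestamps rather than raw strings avoids false positives when the same instant is written with different UTC offsets.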
### 4. Done Today View
**Description**: Items marked as handled during the current day.
**Purpose**:
- ADHD-friendly progress visibility
- Satisfaction from watching list shrink
- Review of daily accomplishments
**Behavior**:
- Stored as date-bucketed ledger keyed by local date (YYYY-MM-DD)
- "Done Today" shows bucket for current local date
- Option to clear today's bucket only
- Historical buckets retained for potential "Done Yesterday" or weekly views
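The bucket key needs to come from local date parts. A short sketch (the helper name `localDateKey` is hypothetical):

```typescript
// Builds the YYYY-MM-DD bucket key from the *local* calendar date.
// Date.prototype.toISOString() would use UTC instead, which can file
// late-evening items under the wrong day.
function localDateKey(date: Date): string {
  const year = date.getFullYear();
  const month = String(date.getMonth() + 1).padStart(2, "0");
  const day = String(date.getDate()).padStart(2, "0");
  return `${year}-${month}-${day}`;
}
```

"Done Today" would then read the bucket at `localDateKey(new Date())`.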
### 5. Manual Refresh
**Description**: Button to fetch latest todos on demand.
**Purpose**: Immediate update when user knows something changed.
### 6. Background Polling (v1.1)
**Description**: Automatic periodic refresh of todos.
**Configuration**:
- Base interval (default: 60s)
- Backoff on failure (exponential, capped at 15m) with jitter
- 429 handling (respect `Retry-After` header; otherwise back off)
**Indicator**:
- Last successful refresh time
- Next scheduled refresh
- Current backoff state (if any)
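One way to compute the next polling delay from the configuration above. This is a sketch under stated assumptions: it uses "equal jitter" (half fixed, half random), which is one of several common jitter schemes, and it only handles the numeric-seconds form of `Retry-After`. `nextDelayMs` and its parameters are hypothetical names.

```typescript
const BASE_INTERVAL_MS = 60_000; // default base interval: 60s
const MAX_BACKOFF_MS = 15 * 60_000; // backoff cap: 15m

function nextDelayMs(
  consecutiveFailures: number,
  retryAfterHeader: string | null,
  random: () => number = Math.random,
): number {
  // A server-provided Retry-After (in seconds) takes priority.
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }

  // Healthy state: poll at the base interval.
  if (consecutiveFailures <= 0) return BASE_INTERVAL_MS;

  // Exponential growth, capped, then equal jitter so the result lands
  // somewhere between half the capped value and the capped value.
  const capped = Math.min(MAX_BACKOFF_MS, BASE_INTERVAL_MS * 2 ** consecutiveFailures);
  return capped / 2 + random() * (capped / 2);
}
```

Injecting `random` keeps the backoff calculation deterministic in unit tests.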
### 7. Keyboard Shortcuts (v1.0)
**Description**: Keyboard-first operation for reduced friction.
| Key | Action |
|-----|--------|
| `j` / `k` | Navigate down / up |
| `Enter` | Open selected item in GitLab |
| `h` | Mark handled |
| `s` | Snooze (opens snooze picker) |
| `d` | Dismiss (mark as done in GitLab) |
| `w` | Add to / remove from watchlist |
| `/` | Focus search/filter |
### 8. Focus Mode (Optional)
**Description**: Show only the next N items (default 3) to reduce decision load.
**Purpose**:
- Convert "overwhelm" into "sequence"
- Reduce choices, increase throughput
- ADHD-optimized: work the queue, don't manage the list
**Behavior**:
- Primary action emphasized ("Open", then "Handled"/"Snooze")
- Toggle: Focus / All Items
- Focus queue is top N unhandled items by creation date
---
## Technical Architecture
### Tech Stack
| Component | Technology |
|-----------|------------|
| Framework | TanStack Start |
| Styling | Tailwind CSS |
| Runtime | Node.js (local) |
| State Persistence | JSON file with atomic writes (local) |
| Secret Storage | OS keychain (preferred) or encrypted local store |
| GitLab Integration | REST API with Personal Access Token |
### Deployment
- **Local only**: Runs on localhost
- **No external hosting**: No cloud deployment, no auth flows
- **Single user**: No multi-tenancy
### GitLab API
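**Authentication**: Personal Access Token (PAT). The `read_api` scope is enough to read todos; the Dismiss action (`POST /todos/:id/mark_as_done`) is a write and needs the full `api` scope.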
**Primary Endpoints**:
```
GET /api/v4/todos?state=pending&per_page=100
POST /api/v4/todos/:id/mark_as_done
```
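A hedged sketch of fetching all pending todos with these endpoints. It assumes the PAT is sent in GitLab's `PRIVATE-TOKEN` header and that offset pagination is followed through the `x-next-page` response header. `FetchLike` and `fetchPendingTodos` are hypothetical names; the fetch function is injected so the loop can be tested against a mock.

```typescript
// Minimal structural type for the parts of fetch() this sketch uses.
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{
  ok: boolean;
  status: number;
  headers: { get(name: string): string | null };
  json(): Promise<unknown>;
}>;

async function fetchPendingTodos(
  baseUrl: string,
  token: string,
  fetchImpl: FetchLike,
): Promise<unknown[]> {
  const todos: unknown[] = [];
  let page: string | null = "1";

  while (page) {
    const url = `${baseUrl}/api/v4/todos?state=pending&per_page=100&page=${page}`;
    const res = await fetchImpl(url, { headers: { "PRIVATE-TOKEN": token } });
    if (!res.ok) {
      throw new Error(`GitLab responded with status ${res.status}`);
    }
    todos.push(...((await res.json()) as unknown[]));
    // An empty or missing x-next-page header means this was the last page.
    page = res.headers.get("x-next-page") || null;
  }
  return todos;
}
```

In the app, the global `fetch` available in recent Node.js versions should fit the `FetchLike` shape, and the `unknown[]` result would be validated before being treated as todos.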
**Response Structure** (relevant fields):
```typescript
interface GitLabTodo {
  id: number;
  action_name:
    | 'assigned'
    | 'mentioned'
    | 'build_failed'
    | 'marked'
    | 'approval_required'
    | 'unmergeable'
    | 'directly_addressed'
    | 'merge_train_removed'
    | 'review_requested'
    | 'member_access_requested'
    | string; // forward-compatible for new action types
  target_type: 'MergeRequest' | 'Issue' | 'Commit' | 'Epic' | 'DesignManagement::Design' | string;
  target: {
    id: number;
    iid: number;
    title: string;
    web_url?: string; // optional; may not be present for all target types
  };
  target_url: string; // canonical "Open" URL - use this for navigation
  author: {
    id: number;
    name: string;
    avatar_url: string;
  };
  project: {
    id: number;
    name: string;
    path_with_namespace: string;
  };
  created_at: string;
}
```
### Local State
```typescript
interface LocalState {
  schemaVersion: number; // for migrations
  handledByDate: {
    [localDate: string]: { // YYYY-MM-DD in local time
      [todoId: number]: {
        handledAt: string; // ISO timestamp
        todo: GitLabTodo; // Snapshot for Done Today display
      }
    }
  };
  snoozedTodos: {
    [todoId: number]: {
      wakeAt: string; // ISO timestamp
      snoozedAt: string; // ISO timestamp
      todo: GitLabTodo; // snapshot
    }
  };
  watchlist: {
    [watchKey: string]: { // e.g., "MergeRequest:123" or "Issue:456"
      targetType: string;
      projectId?: number;
      targetId: number;
      targetIid?: number;
      targetUrl: string;
      lastSeenUpdatedAt?: string; // ISO - when we last observed the target
      lastCheckedAt?: string; // ISO - when we last polled
      addedAt: string; // ISO
      muted?: boolean;
    }
  };
}
```
**Storage**: `~/.config/gitlab-inbox/state.json`
**Persistence Strategy**:
- Atomic writes: write to `state.json.tmp`, then rename to `state.json`
- Keep `state.json.bak` as last-known-good before each write
- Validate JSON schema on load; if invalid, fall back to backup and surface warning
- Schema version for forward migrations
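The persistence strategy above might look roughly like this with Node's `fs` module. It is a sketch, not the final storage layer: validation is reduced to checking that `schemaVersion` is a number, and `saveState`, `parseState`, and `loadState` are hypothetical names.

```typescript
import * as fs from "node:fs";

type VersionedState = { schemaVersion: number };

// Writes via a temp file + rename, keeping the previous file as `<file>.bak`.
function saveState(file: string, state: VersionedState): void {
  if (fs.existsSync(file)) {
    fs.copyFileSync(file, `${file}.bak`); // last-known-good copy
  }
  const tmp = `${file}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(state, null, 2), "utf8");
  // rename replaces the target in one step when tmp and file share a filesystem.
  fs.renameSync(tmp, file);
}

// Placeholder validation: a real implementation would check the full schema.
function parseState(text: string): VersionedState {
  const parsed: unknown = JSON.parse(text);
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { schemaVersion?: unknown }).schemaVersion !== "number"
  ) {
    throw new Error("state file failed validation");
  }
  return parsed as VersionedState;
}

// Falls back to the backup when the main file is unreadable or invalid.
function loadState(file: string): { state: VersionedState; usedBackup: boolean } {
  try {
    return { state: parseState(fs.readFileSync(file, "utf8")), usedBackup: false };
  } catch {
    return { state: parseState(fs.readFileSync(`${file}.bak`, "utf8")), usedBackup: true };
  }
}
```

The `usedBackup` flag gives the UI what it needs to surface the warning. One caveat of this simple version: it copies the current file to `.bak` without validating it first, so a stricter variant would only rotate the backup after a successful load.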
---
## User Interface
### Layout
```
+--------------------------------------------------+
| GitLab Inbox [Focus] [Refresh] 🟢 2m ago |
| [Inbox] [Snoozed] [Watchlist] [Done Today] |
+--------------------------------------------------+
| |
| > [mentioned] Fix login bug |
| Alice Smith · infra-frontend · 2h ago |
| [Snooze] [Handle] [Open] |
| |
| [review_requested] Add caching layer |
| Bob Jones · api-service · 1d ago |
| [Snooze] [Handle] [Open] |
| |
| [assigned] Update documentation |
| Carol White · docs · 3d ago |
| [Snooze] [Handle] [Open] |
| |
+--------------------------------------------------+
| j/k: navigate Enter: open h: handle s: snooze |
+--------------------------------------------------+
```
### Action Badge Colors
| Action | Color | Meaning |
|--------|-------|---------|
| mentioned | Blue | Someone mentioned you |
| assigned | Purple | Assigned to you |
| approval_required | Yellow | Needs your approval |
| build_failed | Red | Pipeline failure |
| directly_addressed | Cyan | Direct @ mention |
| unmergeable | Orange | MR has conflicts |
| marked | Gray | Marked as todo |
| merge_train_removed | Red | Removed from merge train |
| member_access_requested | Teal | Access request |
| (unknown) | Gray | Forward-compatible fallback |
### States
- **Loading**: Skeleton cards while fetching
- **Empty**: "All clear! No pending items." message
- **Error**: Connection error with retry button
- **Stale**: Visual indicator if data is old (> 5 min since last *successful* refresh)
- **Backoff**: Indicator showing retry status when experiencing errors
---
## User Flows
### Flow 1: Morning Check-in
1. Open GitLab Inbox (already running in background tab)
2. See list of todos sorted by newest first
3. Press `Enter` to open in GitLab
4. Handle the item (reply, review, etc.)
5. Return to Inbox, press `h` to mark handled
6. Item moves to Done Today
7. Repeat until Inbox is empty
### Flow 2: Triage with Snooze
1. See inbox with 12 items
2. Quickly triage: handle 3, snooze 5 until tomorrow, dismiss 2 already-resolved
3. Inbox now shows 2 items to focus on
4. Tomorrow: snoozed items wake up and return to inbox
### Flow 3: Awaiting Reply
1. Handle a todo (you replied to someone's question)
2. Toggle "Add to Watchlist" when marking handled
3. Item appears in Watchlist view
4. Later: see "New activity" indicator when they respond
5. Open, read response, remove from watchlist
### Flow 4: Focus Session
1. Enable Focus Mode
2. See only top 3 items
3. Work through them sequentially
4. As items complete, next ones appear
5. Reduced decision fatigue
### Flow 5: End-of-Day Review
1. Navigate to Done Today view
2. See all items handled today
3. Satisfaction from visible progress
---
## Success Metrics
| Metric | Target | Measurement |
|--------|--------|-------------|
| Time to awareness | < 2 min | Time from GitLab event to user seeing it |
| Daily items handled | Increased | Compare to baseline (manual tracking) |
| Context switches | Reduced | Fewer GitLab tabs open simultaneously |
| Snooze usage | Regular | Items snoozed vs dismissed (healthy ratio = snooze used) |
| Reply awareness | High | Watchlist items caught before manual check |
| User satisfaction | Qualitative | Does this reduce ADHD-related friction? |
---
## Risks and Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| GitLab API rate limits | Polling blocked | Configurable interval, backoff + jitter, respect 429/Retry-After |
| Token expiration/rotation | App stops working | Clear error state + setup flow; surface expiry guidance and re-auth path |
| State file corruption | Lose handled/snoozed/watch state | Atomic writes (tmp+rename), schema validation on load, keep last-known-good backup |
| GitLab API changes | App breaks | Pin to known API version, monitor deprecations, forward-compatible types |
| Token leakage | Security incident | Store in OS keychain, not in repo-adjacent files |
---
## Future Considerations (Post v1.0)
- **Grouping**: By project, by action type
- **Stale highlighting**: Visual alert for items waiting > X days
- **Desktop notifications**: OS-level alerts for new high-priority items
- **Quick actions**: Approve MR, close issue directly from app
- **Multiple GitLab instances**: Connect to both gitlab.com and self-hosted
- **Done history**: View handled items from yesterday, this week
---
## Implementation Phases
### Phase 0: Setup & Auth
- First-run setup wizard (GitLab URL + token)
- Token storage implementation (keychain/encrypted local)
- Connectivity check (`/todos`, auth failure UX)
- Clear error states for invalid/expired tokens
### Phase 1: Foundation
- Initialize TanStack Start project
- Set up Tailwind CSS
- Create GitLab API client with PAT auth
- Fetch and display todos in basic list (using `target_url` for navigation)
- Implement click-to-open
### Phase 2: Core Workflow
- Add local storage with atomic writes + backup
- Implement date-bucketed handled state
- Implement "Mark Handled" action
- Create Done Today view
- Add keyboard shortcuts (minimal set: j/k/Enter/h/s/d)
- Add Snooze + Snoozed view
- Filter handled/snoozed todos from Inbox
### Phase 3: Reliability & Awareness
- Background polling with configurable interval
- Backoff/jitter + 429 handling
- Last successful refresh tracking
- Watchlist ("Awaiting Reply") implementation
- Per-target polling for watched items (small set)
- Add manual refresh button
- Relative time display
- Action type badges with colors
- Loading and error states
- Connection status indicator
### Phase 4: Polish
- Focus Mode implementation
- Snooze time picker refinement
- Keyboard shortcut help overlay
- State migration handling (schemaVersion)
- Edge case handling (DST, timezone changes)
---
## Appendix
### Environment Configuration
**Primary configuration**:
- URL + settings in: `~/.config/gitlab-inbox/config.json`
- Token stored in OS keychain (preferred)
**Optional (dev-only) `.env.local` support**:
```env
GITLAB_URL=https://gitlab.yourcompany.com
GITLAB_TOKEN=glpat-xxxxxxxxxxxx
```
**Config file structure**:
```json
{
  "gitlabUrl": "https://gitlab.yourcompany.com",
  "pollingInterval": 60,
  "focusModeCount": 3
}
```
### Creating a GitLab PAT
1. Go to GitLab → User Settings → Access Tokens
2. Create token with `read_api` scope (use `api` instead if Dismiss should mark todos as done in GitLab, since that call is a write)
3. Set expiration (note: tokens expire at midnight UTC on expiry date)
4. Save token via setup wizard (stored in keychain)
5. Token never leaves local machine
### Project Structure
```
gitlab-inbox/
├── app/
│ ├── routes/
│ │ ├── __root.tsx
│ │ ├── index.tsx # Inbox view
│ │ ├── snoozed.tsx # Snoozed view
│ │ ├── watchlist.tsx # Watchlist view
│ │ ├── done.tsx # Done Today view
│ │ └── setup.tsx # First-run setup
│ ├── components/
│ │ ├── TodoCard.tsx
│ │ ├── TodoList.tsx
│ │ ├── ActionBadge.tsx
│ │ ├── Header.tsx
│ │ ├── SnoozePicker.tsx
│ │ ├── FocusMode.tsx
│ │ └── KeyboardHelp.tsx
│ ├── lib/
│ │ ├── gitlab.ts # API client
│ │ ├── storage.ts # Atomic state persistence
│ │ ├── keychain.ts # Token storage
│ │ ├── polling.ts # Polling state machine
│ │ ├── snooze.ts # Snooze logic + wake checking
│ │ ├── watchlist.ts # Watchlist polling
│ │ └── types.ts
│ └── app.tsx
├── package.json
├── tailwind.config.ts
└── vite.config.ts
```
### Test Strategy
**Unit Tests**:
- State normalization and migration
- Snooze wake time calculations
- Date bucketing logic (timezone handling)
- Polling backoff calculations
**Integration Tests** (mocked GitLab API):
- `/todos` response parsing
- `mark_as_done` endpoint calls
- Error handling (401, 429, network errors)
- State persistence round-trip (write + read)
- Backup recovery on corruption
**Manual Testing**:
- First-run setup flow
- Keyboard navigation
- Snooze + wake cycle
- Watchlist activity detection

641
SPEC.md Normal file

@@ -0,0 +1,641 @@
# GitLab Knowledge Engine - Spec Document
## Executive Summary
A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, comments/notes, and MR file-change links) from 2 main repositories (~10K items). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Commit-level indexing is explicitly post-MVP.
---
## Discovery Summary
### Pain Points Identified
1. **Knowledge discovery** - Tribal knowledge buried in old MRs/issues that nobody can find
2. **Decision traceability** - Hard to find *why* decisions were made; context scattered across issue comments and MR discussions
### Constraints
| Constraint | Detail |
|------------|--------|
| Hosting | Self-hosted only, no external APIs |
| Compute | Local dev machine (M-series Mac assumed) |
| GitLab Access | Self-hosted instance, PAT access, no webhooks (could request) |
| Build Method | AI agents will implement; user is TypeScript expert for review |
### Target Use Cases (Priority Order)
1. **MVP: Semantic Search** - "Find discussions about authentication redesign"
2. **Future: File/Feature History** - "What decisions were made about src/auth/login.ts?"
3. **Future: Personal Tracking** - "What am I assigned to or mentioned in?"
4. **Future: Person Context** - "What's @johndoe's background in this project?"
---
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│ GitLab API │
│ (Issues, MRs, Notes) │
└─────────────────────────────────────────────────────────────────┘
(Commit-level indexing explicitly post-MVP)
┌─────────────────────────────────────────────────────────────────┐
│ Data Ingestion Layer │
│ - Incremental sync (PAT-based polling) │
│ - Rate limiting / backoff │
│ - Raw JSON storage for replay │
│ - Dependent resource fetching (notes, MR changes) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Data Processing Layer │
│ - Normalize artifacts to unified schema │
│ - Extract searchable documents (canonical text + metadata) │
│ - Content hashing for change detection │
│ - Build relationship graph (issue↔MR↔note↔file) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ - SQLite + sqlite-vss + FTS5 (hybrid search) │
│ - Structured metadata in relational tables │
│ - Vector embeddings for semantic search │
│ - Full-text index for lexical search fallback │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Query Interface │
│ - CLI for human testing │
│ - JSON API for AI agent testing │
│ - Semantic search with filters (author, date, type, label) │
└─────────────────────────────────────────────────────────────────┘
```
### Technology Choices
| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Language | TypeScript/Node.js | User expertise, good GitLab libs, AI agent friendly |
| Database | SQLite + sqlite-vss | Zero-config, portable, vector search via the sqlite-vss extension |
| Embeddings | Ollama + nomic-embed-text | Self-hosted, runs well on Apple Silicon, 768-dim vectors |
| CLI Framework | Commander.js or oclif | Standard, well-documented |
### Alternative Considered: Postgres + pgvector
- Pros: More scalable, better for production multi-user
- Cons: Requires running Postgres, heavier setup
- Decision: Start with SQLite for simplicity; migration path exists if needed
---
## Checkpoint Structure
Each checkpoint is a **testable milestone** where a human can validate the system works before proceeding.
### Checkpoint 0: Project Setup
**Deliverable:** Scaffolded project with GitLab API connection verified
**Tests:**
1. Run `gitlab-engine auth-test` → returns authenticated user info
2. Run `gitlab-engine doctor` → verifies:
- Can reach GitLab baseUrl
- PAT is present and can read configured projects
- SQLite opens DB and migrations apply
- Ollama reachable OR embedding disabled with clear warning
**Scope:**
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
- Projects table and initial sync
**Configuration (MVP)** in `gitlab-engine.config.json`:
```json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434"
  }
}
```
**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
- `PRAGMA journal_mode=WAL;`
- `PRAGMA foreign_keys=ON;`
**Schema (Checkpoint 0):**
```sql
-- Projects table (configured targets)
CREATE TABLE projects (
id INTEGER PRIMARY KEY,
gitlab_project_id INTEGER UNIQUE NOT NULL,
path_with_namespace TEXT NOT NULL,
default_branch TEXT,
web_url TEXT,
created_at INTEGER,
updated_at INTEGER,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_projects_path ON projects(path_with_namespace);
-- Sync tracking for reliability
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
started_at INTEGER NOT NULL,
finished_at INTEGER,
status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc.
error TEXT
);
-- Sync cursors for primary resources only
-- Notes and MR changes are dependent resources (fetched via parent updates)
CREATE TABLE sync_cursors (
project_id INTEGER NOT NULL REFERENCES projects(id),
resource_type TEXT NOT NULL, -- 'issues' | 'merge_requests'
updated_at_cursor INTEGER, -- last fully processed updated_at (ms epoch)
tie_breaker_id INTEGER, -- last fully processed gitlab_id (for stable ordering)
PRIMARY KEY(project_id, resource_type)
);
-- Raw payload storage (decoupled from entity tables)
CREATE TABLE raw_payloads (
id INTEGER PRIMARY KEY,
source TEXT NOT NULL, -- 'gitlab'
resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note'
gitlab_id INTEGER NOT NULL,
fetched_at INTEGER NOT NULL,
json TEXT NOT NULL
);
CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
```
---
### Checkpoint 1: Issue Ingestion
**Deliverable:** All issues from target repos stored locally
**Test:** Run `gitlab-engine ingest --type=issues` → count matches GitLab; run `gitlab-engine list issues --limit=10` → displays issues correctly
**Scope:**
- Issue fetcher with pagination handling
- Raw JSON storage in raw_payloads table
- Normalized issue schema in SQLite
- Labels ingestion derived from issue payload:
- Always persist label names from `labels: string[]`
- Optionally request `with_labels_details=true` to capture color/description when available
- Incremental sync support (run tracking + per-project cursor)
- Basic list/count CLI commands
**Reliability/Idempotency Rules:**
- Every ingest/sync creates a `sync_runs` row
- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
- Cursor advances only after successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
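The ordering and cursor rules amount to a tuple comparison on (`updated_at`, `gitlab_id`). A minimal sketch with hypothetical `SyncCursor` and `SyncRow` shapes; in the real flow `advanceCursor` would run only after the page's transaction commits.

```typescript
interface SyncCursor {
  updatedAt: number; // ms epoch of the last fully processed item
  tieBreakerId: number; // gitlab_id of that item
}

interface SyncRow {
  updatedAt: number;
  gitlabId: number;
}

// True when the row sorts strictly after the cursor in
// (updated_at ASC, gitlab_id ASC) order, i.e. it still needs processing.
function isAfterCursor(row: SyncRow, cursor: SyncCursor | null): boolean {
  if (cursor === null) return true; // first sync: everything is new
  if (row.updatedAt !== cursor.updatedAt) return row.updatedAt > cursor.updatedAt;
  return row.gitlabId > cursor.tieBreakerId;
}

// Cursor after a committed page: the last row of the sorted page,
// or the previous cursor when the page was empty.
function advanceCursor(cursor: SyncCursor | null, sortedPage: SyncRow[]): SyncCursor | null {
  if (sortedPage.length === 0) return cursor;
  const last = sortedPage[sortedPage.length - 1];
  return { updatedAt: last.updatedAt, tieBreakerId: last.gitlabId };
}
```

The tie-breaker matters when several items share the same `updated_at`: without it, a sync interrupted mid-timestamp would either skip or reprocess the remaining items.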
**Schema Preview:**
```sql
CREATE TABLE issues (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
created_at INTEGER,
updated_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
-- Labels are derived from issue payloads (string array)
-- Uniqueness is (project_id, name) since gitlab_id isn't always available
CREATE TABLE labels (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER, -- optional (only if available)
project_id INTEGER NOT NULL REFERENCES projects(id),
name TEXT NOT NULL,
color TEXT,
description TEXT
);
CREATE UNIQUE INDEX uq_labels_project_name ON labels(project_id, name);
CREATE INDEX idx_labels_name ON labels(name);
CREATE TABLE issue_labels (
issue_id INTEGER REFERENCES issues(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(issue_id, label_id)
);
CREATE INDEX idx_issue_labels_label ON issue_labels(label_id);
```
---
### Checkpoint 2: MR + Comments + File Links Ingestion
**Deliverable:** All MRs, discussion threads, and file-change links stored locally
**Test:** Run `gitlab-engine ingest --type=merge_requests` → count matches; run `gitlab-engine show mr 1234` → displays MR with comments and files changed
**Scope:**
- MR fetcher with pagination
- Notes fetcher (issue notes + MR notes) as a dependent resource:
- During initial ingest: fetch notes for every issue/MR
- During sync: refetch notes only for issues/MRs updated since cursor
- MR changes/diffs fetcher as a dependent resource:
- During initial ingest: fetch changes for every MR
- During sync: refetch changes only for MRs updated since cursor
- Relationship linking (note → parent issue/MR via foreign keys, MR → files)
- Extended CLI commands for MR display
**Schema Additions:**
```sql
CREATE TABLE merge_requests (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
iid INTEGER NOT NULL,
title TEXT,
description TEXT,
state TEXT,
author_username TEXT,
source_branch TEXT,
target_branch TEXT,
created_at INTEGER,
updated_at INTEGER,
merged_at INTEGER,
web_url TEXT,
raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
-- Notes with explicit parent foreign keys for referential integrity
CREATE TABLE notes (
id INTEGER PRIMARY KEY,
gitlab_id INTEGER UNIQUE NOT NULL,
project_id INTEGER NOT NULL REFERENCES projects(id),
issue_id INTEGER REFERENCES issues(id),
merge_request_id INTEGER REFERENCES merge_requests(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL, -- parent IID (from API path)
author_username TEXT,
body TEXT,
created_at INTEGER,
updated_at INTEGER,
system BOOLEAN,
raw_payload_id INTEGER REFERENCES raw_payloads(id),
-- Exactly one parent FK must be set
CHECK (
(noteable_type='Issue' AND issue_id IS NOT NULL AND merge_request_id IS NULL) OR
(noteable_type='MergeRequest' AND merge_request_id IS NOT NULL AND issue_id IS NULL)
)
);
CREATE INDEX idx_notes_issue ON notes(issue_id);
CREATE INDEX idx_notes_mr ON notes(merge_request_id);
CREATE INDEX idx_notes_author ON notes(author_username);
-- File linkage for "what MRs touched this file?" queries (with rename support)
CREATE TABLE mr_files (
id INTEGER PRIMARY KEY,
merge_request_id INTEGER REFERENCES merge_requests(id),
old_path TEXT,
new_path TEXT,
new_file BOOLEAN,
deleted_file BOOLEAN,
renamed_file BOOLEAN,
UNIQUE(merge_request_id, old_path, new_path)
);
CREATE INDEX idx_mr_files_old_path ON mr_files(old_path);
CREATE INDEX idx_mr_files_new_path ON mr_files(new_path);
-- MR labels (reuse same labels table)
CREATE TABLE mr_labels (
merge_request_id INTEGER REFERENCES merge_requests(id),
label_id INTEGER REFERENCES labels(id),
PRIMARY KEY(merge_request_id, label_id)
);
CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
```
---
### Checkpoint 3: Embedding Generation
**Deliverable:** Vector embeddings generated for all text content
**Test:** Run `gitlab-engine embed --all` → progress indicator; run `gitlab-engine stats` → shows embedding coverage percentage
**Scope:**
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline (batch processing)
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- Document extraction layer:
- Canonical "search documents" derived from issues/MRs/notes
- Stable content hashing for change detection (SHA-256 of content_text)
- Single embedding per document (chunking deferred to post-MVP)
- Denormalized metadata for fast filtering (author, labels, dates)
- Fast label filtering via `document_labels` join table
**Schema Additions:**
```sql
-- Unified searchable documents (derived from issues/MRs/notes)
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'note'
source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT,
label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER,
updated_at INTEGER,
url TEXT,
title TEXT, -- null for notes
content_text TEXT NOT NULL, -- canonical text for embedding/snippets
content_hash TEXT NOT NULL, -- SHA-256 for change detection
UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
-- Fast label filtering for documents (indexed exact-match)
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id),
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
);
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
-- sqlite-vss virtual table
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
embedding(768)
);
-- Embedding provenance + change detection
-- document_id is PRIMARY KEY and equals embeddings.rowid
CREATE TABLE embedding_metadata (
document_id INTEGER PRIMARY KEY REFERENCES documents(id),
model TEXT NOT NULL, -- 'nomic-embed-text'
dims INTEGER NOT NULL, -- 768
content_hash TEXT NOT NULL, -- copied from documents.content_hash
created_at INTEGER NOT NULL
);
```
**Storage Rule (MVP):**
- Insert embedding with `rowid = documents.id`
- Upsert `embedding_metadata` by `document_id`
- This alignment simplifies joins and eliminates rowid mapping fragility
**Document Extraction Rules:**
- Issue → title + "\n\n" + description
- MR → title + "\n\n" + description
- Note → body (skip system notes unless they contain meaningful content)
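These extraction rules, together with the SHA-256 content hash from the schema, can be sketched as follows. Two simplifications are assumptions of this sketch: a missing description is treated as an empty string, and all system notes are skipped (the "meaningful content" exception is not modeled). `Source`, `contentText`, and `contentHash` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

type Source =
  | { kind: "issue" | "merge_request"; title: string; description: string | null }
  | { kind: "note"; body: string; system: boolean };

// Returns the canonical text for a document, or null when the source
// should not produce a document at all.
function contentText(src: Source): string | null {
  if (src.kind === "note") {
    return src.system ? null : src.body;
  }
  return `${src.title}\n\n${src.description ?? ""}`;
}

// Hex-encoded SHA-256 of the canonical text, used for change detection.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}
```

Re-embedding is then only needed when `contentHash(contentText(...))` differs from the hash stored in `embedding_metadata`.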
---
### Checkpoint 4: Semantic Search
**Deliverable:** Working semantic search across all indexed content
**Tests:**
1. Run `gitlab-engine search "authentication redesign"` → returns ranked results with snippets
2. Golden queries: curated list of 10 queries with expected result *containment* (e.g., "at least one of these 3 known URLs appears in top 10")
3. `gitlab-engine search "..." --json` validates against JSON schema (stable fields present)
**Scope:**
- Hybrid retrieval:
- Vector recall (sqlite-vss) + FTS lexical recall (FTS5)
- Merge + rerank results using Reciprocal Rank Fusion (RRF)
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|note`, `--author=username`, `--after=date`, `--label=name`
- Label filtering operates on `document_labels` (indexed, exact-match)
- Output formatting: ranked list with title, snippet, score, URL
- JSON output mode for AI agent consumption
**Schema Additions:**
```sql
-- Full-text search for hybrid retrieval
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
```
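With the triggers in place, the lexical side of retrieval can be queried as in this Python sketch. `lexical_search` is a hypothetical helper; it assumes the SQLite build includes FTS5 and that `query` is already valid FTS5 `MATCH` syntax (raw user input may need quoting first).

```python
import sqlite3


def lexical_search(conn, query, limit=50):
    """Return document ids matching an FTS5 query, best match first.

    bm25() returns lower values for better matches, so ascending order
    puts the most relevant rows at the top.
    """
    rows = conn.execute(
        """
        SELECT rowid, bm25(documents_fts) AS score
        FROM documents_fts
        WHERE documents_fts MATCH ?
        ORDER BY score
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```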
**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**
1. Query both vector index (top 50) and FTS5 (top 50)
2. Merge results by document_id
3. Combine with Reciprocal Rank Fusion (RRF):
- For each retriever list, assign ranks (1..N)
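- `rrf_score(d) = Σ 1 / (k + rank_r(d))`, summed over each retriever `r` that returned document `d`, with k=60 (tunable); a retriever that did not return `d` contributes nothing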
- RRF is simpler than weighted sums and doesn't require score normalization
4. Apply filters (type, author, date, label)
5. Return top K
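Steps 2 and 3 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: `reciprocal_rank_fusion` is a hypothetical name, each input list is assumed to hold unique document ids ordered best first, and ties are broken by document id only to make the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids into (doc_id, score) pairs.

    Each list contributes 1 / (k + rank) for every document it contains,
    with ranks starting at 1. Documents missing from a list get nothing
    from that list.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc_id as a deterministic tie-breaker.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, fusing a vector list `[1, 2, 3]` with an FTS list `[3, 1, 4]` ranks document 1 first, because it sits near the top of both lists, followed by 3, 2, and 4.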
**Why RRF over Weighted Sums:**
- FTS5 BM25 scores and vector distances use different scales
- Weighted sums (`0.7 * vector + 0.3 * fts`) require careful normalization
- RRF operates on ranks, not scores, making it robust to scale differences
- Well-established in information retrieval literature
**CLI Interface:**
```bash
# Basic semantic search
gitlab-engine search "why did we choose Redis"
# Pure FTS search (fallback if embeddings unavailable)
gitlab-engine search "redis" --mode=lexical
# Filtered search
gitlab-engine search "authentication" --type=mr --after=2024-01-01
# Filter by label
gitlab-engine search "performance" --label=bug --label=critical
# JSON output for programmatic use
gitlab-engine search "payment processing" --json
```
---
### Checkpoint 5: Incremental Sync
**Deliverable:** Efficient ongoing synchronization with GitLab
**Test:** Make a change in GitLab; run `gitlab-engine sync` → only fetches changed items; verify change appears in search
**Scope:**
- Delta sync based on stable cursor (updated_at + tie-breaker id)
- Dependent resources sync strategy (notes, MR changes)
- Webhook handler (optional, if webhook access granted)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
**Correctness Rules (MVP):**
1. Fetch pages ordered by `updated_at ASC`; within identical timestamps, advance by `gitlab_id ASC`
2. Cursor advances only after successful DB commit for that page
3. Dependent resources:
- For each updated issue/MR, refetch its notes (sorted by `updated_at`)
- For each updated MR, refetch its file changes
4. A document is queued for embedding iff it has no `embedding_metadata` row or `documents.content_hash != embedding_metadata.content_hash`
5. A sync run is marked `failed` with an error message if any page fails; the next run can resume from the last committed cursor
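Rules 1, 2, and 5 can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `Cursor` type, the `sync_pages` function, the `commit_page` callback (which is expected to persist the page and the new cursor in one transaction), and the use of integer timestamps and an `id` field on each item.

```python
from typing import NamedTuple


class Cursor(NamedTuple):
    updated_at: int  # assumed integer timestamp of the last committed item
    gitlab_id: int   # tie-breaker within identical timestamps


def sync_pages(pages, cursor, commit_page):
    """Process pages in order, advancing the cursor only after a commit.

    Returns (cursor, error). On failure the returned cursor is the last
    one that was committed, so a later run can resume from it.
    """
    for page in pages:
        # Keep only items strictly after the cursor, in (updated_at, id) order.
        fresh = sorted(
            (it for it in page
             if cursor is None or (it["updated_at"], it["id"]) > cursor),
            key=lambda it: (it["updated_at"], it["id"]),
        )
        if not fresh:
            continue
        next_cursor = Cursor(fresh[-1]["updated_at"], fresh[-1]["id"])
        try:
            commit_page(fresh, next_cursor)
        except Exception as exc:
            return cursor, str(exc)  # cursor is not advanced past the failure
        cursor = next_cursor
    return cursor, None
```

Comparing `(updated_at, id)` tuples is what lets a resumed run skip items that share a timestamp with the cursor but were already committed.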
**Why Dependent Resource Model:**
- GitLab Notes API doesn't provide a clean global `updated_after` stream
- Notes are listed per-issue or per-MR, not as a top-level resource
- Treating notes as dependent resources (refetch when parent updates) is simpler and more correct
- Same applies to MR changes/diffs
**CLI Commands:**
```bash
# Incremental sync (respects cursors, only fetches new/updated)
gitlab-engine sync
# Force full re-sync (resets cursors)
gitlab-engine sync --full
# Override stale 'running' run after operator review
gitlab-engine sync --force
# Show sync status
gitlab-engine sync-status
```
---
## Future Checkpoints (Post-MVP)
### Checkpoint 6: File/Feature History View
- Map commits to MRs to discussions
- Query: "Show decision history for src/auth/login.ts"
- Ship `gitlab-engine file-history <path>` as a first-class feature here
- This command is deferred from MVP to sharpen checkpoint focus
### Checkpoint 7: Personal Dashboard
- Filter by assigned/mentioned
- Integrate with existing gitlab-inbox tool
### Checkpoint 8: Person Context
- Aggregate contributions by author
- Expertise inference from activity
### Checkpoint 9: Decision Graph
- Extract decisions from discussions (LLM-assisted)
- Visualize decision relationships
---
## Verification Strategy
Each checkpoint includes:
1. **Automated tests** - Unit tests for data transformations, integration tests for API calls
2. **CLI smoke tests** - Manual commands with expected outputs documented
3. **Data integrity checks** - Count verification against GitLab, schema validation
4. **Search quality tests** - Known queries with expected results (for Checkpoint 4+)
---
## Risk Mitigation
| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) |
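As one illustration of the rate-limiting mitigation, the delay calculation might look like the Python sketch below. `retry_delay_seconds` and its default `base` and `cap` values are hypothetical; the sketch handles only the delay-in-seconds form of `Retry-After` and falls back to exponential backoff for anything else, including the HTTP-date form.

```python
def retry_delay_seconds(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After header value wins; otherwise the delay doubles
    per attempt and is capped at `cap`.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # non-numeric value (e.g. an HTTP date): use backoff instead
    return min(cap, base * (2 ** attempt))
```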
**SQLite Performance Defaults (MVP):**
- Enable `PRAGMA journal_mode=WAL;` on every connection (WAL mode persists in the database file, so re-issuing it is harmless)
- Enable `PRAGMA foreign_keys=ON;` on every connection
- Use explicit transactions for page/batch inserts
- Targeted indexes on `(project_id, updated_at)` for primary resources
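A minimal Python sketch of the first three defaults, using the standard `sqlite3` module. The helper names `open_db` and `insert_page` are hypothetical, and Python is used only for illustration.

```python
import sqlite3


def open_db(path):
    """Open a connection with the MVP pragmas applied."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")
    # foreign_keys is per-connection and off by default, so set it every time.
    conn.execute("PRAGMA foreign_keys=ON;")
    return conn


def insert_page(conn, sql, rows):
    """Insert one page of rows in a single transaction.

    The connection context manager commits on success and rolls the
    whole page back if any row fails.
    """
    with conn:
        conn.executemany(sql, rows)
```

Wrapping each page in one transaction means a failed page leaves no partial rows behind, which keeps cursor-based resumption simple.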
---
## Schema Summary
| Table | Checkpoint | Purpose |
|-------|------------|---------|
| projects | 0 | Configured GitLab projects |
| sync_runs | 0 | Audit trail of sync operations |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage |
| issues | 1 | Normalized issues |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| merge_requests | 2 | Normalized MRs |
| notes | 2 | Issue and MR comments (with parent FKs) |
| mr_files | 2 | MR file changes (with rename tracking) |
| mr_labels | 2 | MR-label junction |
| documents | 3 | Unified searchable documents |
| document_labels | 3 | Document-label junction for fast filtering |
| embeddings | 3 | Vector embeddings (sqlite-vss, rowid=document_id) |
| embedding_metadata | 3 | Embedding provenance + change detection |
| documents_fts | 4 | Full-text search index (fts5) |
---
## Resolved Decisions
| Question | Decision | Rationale |
|----------|----------|-----------|
| Commit/file linkage | **Include MR→file links** | Enables "what MRs touched this file?" without full commit history |
| Labels | **Index as filters** | Labels are widely used; `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings; gitlab_id isn't always available |
| Sync method | **Polling for MVP** | Decide on webhooks after using the system |
| Notes sync | **Dependent resource** | Notes API is per-parent, not global; refetch on parent update |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping during upserts |
| file-history CLI | **Post-MVP (CP6)** | Sharpens MVP checkpoint focus |
---
## Next Steps
1. User approves this spec
2. Generate Checkpoint 0 PRD for project setup
3. Implement Checkpoint 0
4. Human validates → proceed to Checkpoint 1
5. Repeat for each checkpoint