**Commit:** more planning
**New file:** SPEC-REVISIONS-2.md (373 lines)

# SPEC.md Revision Document - Round 2

This document provides git-diff-style changes for the second round of improvements from ChatGPT's review. These are primarily correctness fixes and optimizations.

---

## Change 1: Fix Tuple Cursor Correctness Gap (Cursor Rewind + Local Filtering)

**Why this is critical:** The spec specifies tuple cursor semantics `(updated_at, gitlab_id)`, but GitLab's API only supports `updated_after`, which is strictly "after": it cannot express `WHERE updated_at = X AND id > Y` server-side. This creates a real risk of missed items on crash/resume and in dense timestamp buckets.

**Fix:** Cursor rewind plus local filtering. Call GitLab with `updated_after = cursor_updated_at - rewindSeconds`, then locally discard items we've already processed.

```diff
@@ Correctness Rules (MVP): @@
 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
 2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
-  - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)`
+  - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
+  - Use **cursor rewind + local filtering**:
+    - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
+    - Locally discard items where:
+      - `updated_at < cursor_updated_at`, OR
+      - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
+    - This makes the tuple cursor rule true in practice while keeping API calls simple.
 - Cursor advances only after successful DB commit for that page
 - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
```

```diff
@@ Configuration (MVP): @@
 "sync": {
   "backfillDays": 14,
   "staleLockMinutes": 10,
-  "heartbeatIntervalSeconds": 30
+  "heartbeatIntervalSeconds": 30,
+  "cursorRewindSeconds": 2
 },
```
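
The discard rule can be sketched as a small predicate applied to each item of a rewound fetch (type and field names are illustrative, not from the spec):

```typescript
// Local filter that restores tuple-cursor semantics after a rewound
// `updated_after` fetch: keep only items strictly after the cursor.
interface Cursor { updatedAt: number; gitlabId: number; }
interface Item { updatedAt: number; gitlabId: number; }

function isAfterCursor(item: Item, cursor: Cursor): boolean {
  if (item.updatedAt > cursor.updatedAt) return true;
  if (item.updatedAt < cursor.updatedAt) return false;
  // Identical timestamp: break the tie by gitlab_id, discarding <= cursor.
  return item.gitlabId > cursor.gitlabId;
}
```

Items the rewind re-fetches (older timestamp, or same timestamp with an already-processed id) are dropped before upserting, so the DB sees each item at most once per pass.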

---

## Change 2: Make App Lock Actually Safe (BEGIN IMMEDIATE CAS)

**Why this is critical:** `INSERT OR REPLACE` can overwrite an active lock if two processes start close together (both perform the stale check outside a write transaction, then both `INSERT OR REPLACE`). SQLite's `BEGIN IMMEDIATE` provides a proper compare-and-swap.

```diff
@@ Reliability/Idempotency Rules: @@
 - Every ingest/sync creates a `sync_runs` row
 - Single-flight via DB-enforced app lock:
-  - On start: INSERT OR REPLACE lock row with new owner token
+  - On start: acquire lock via transactional compare-and-swap:
+    - `BEGIN IMMEDIATE` (acquires write lock immediately)
+    - If no row exists → INSERT new lock
+    - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
+    - Else if `owner` matches current run → UPDATE heartbeat (re-entrant)
+    - Else → ROLLBACK and fail fast (another run is active)
+    - `COMMIT`
 - During run: update `heartbeat_at` every 30 seconds
 - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire
 - `--force` remains as operator override for edge cases, but should rarely be needed
```
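
The branch logic performed inside the `BEGIN IMMEDIATE` transaction can be sketched as a pure decision function (names such as `LockRow` and `decideLock` are assumptions for illustration; the real implementation runs the corresponding INSERT/UPDATE/ROLLBACK against SQLite):

```typescript
type LockAction = "insert" | "steal" | "refresh" | "reject";

interface LockRow {
  owner: string;       // token of the run holding the lock
  heartbeatAt: number; // epoch millis of last heartbeat
}

function decideLock(
  row: LockRow | undefined, // current lock row, if any
  self: string,             // this run's owner token
  now: number,
  staleMs: number,          // staleLockMinutes converted to millis
): LockAction {
  if (!row) return "insert";                            // no row → INSERT
  if (now - row.heartbeatAt > staleMs) return "steal";  // stale → take over
  if (row.owner === self) return "refresh";             // re-entrant heartbeat
  return "reject";                                      // active lock → fail fast
}
```

Because the decision and the write happen inside one immediate transaction, two near-simultaneous starts serialize: the second one observes the first one's freshly written row and takes the `reject` branch.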

---

## Change 3: Dependent Resource Pagination + Bounded Concurrency

**Why this is important:** Discussions endpoints are paginated on many GitLab instances. Without pagination we silently lose data; without bounded concurrency, initial sync can become unstable (429s, long-tail retries).

```diff
@@ Dependent Resources (Per-Parent Fetch): @@
-GET /projects/:id/issues/:iid/discussions
-GET /projects/:id/merge_requests/:iid/discussions
+GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
+GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
+
+**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent.
```

```diff
@@ Rate Limiting: @@
 - Default: 10 requests/second with exponential backoff
 - Respect `Retry-After` headers on 429 responses
 - Add jitter to avoid thundering herd on retry
+- **Separate concurrency limits:**
+  - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4)
+  - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s)
+  - Bound concurrency per-project to avoid one repo starving the other
 - Initial sync estimate: 10-20 minutes depending on rate limits
```

```diff
@@ Configuration (MVP): @@
 "sync": {
   "backfillDays": 14,
   "staleLockMinutes": 10,
   "heartbeatIntervalSeconds": 30,
-  "cursorRewindSeconds": 2
+  "cursorRewindSeconds": 2,
+  "primaryConcurrency": 4,
+  "dependentConcurrency": 2
 },
```
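
A minimal sketch of the per-parent pagination loop, with the page fetcher injected (the `fetchPage` parameter is an assumption; a real client could instead follow GitLab's `x-next-page` response header):

```typescript
// Keep requesting pages until a short page signals the end.
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<T[]>,
  perPage = 100,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; ; page++) {
    const items = await fetchPage(page);
    all.push(...items);
    if (items.length < perPage) break; // short (or empty) page => last page
  }
  return all;
}
```

Running one such loop per parent under a semaphore of size `dependentConcurrency` gives the bounded-concurrency behavior described above.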

---

## Change 4: Track last_seen_at for Eventual Consistency Debugging

**Why this is valuable:** Even without implementing deletions, you want to know (a) whether a record is actively refreshed under backfill/sync, (b) whether a sync run is "covering" the dataset, and (c) whether a particular item hasn't been seen in months (helps diagnose missed updates).

```diff
@@ Schema Preview - issues: @@
 CREATE TABLE issues (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   title TEXT,
   description TEXT,
   state TEXT,
   author_username TEXT,
   created_at INTEGER,
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   web_url TEXT,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
```

```diff
@@ Schema Additions - merge_requests: @@
 CREATE TABLE merge_requests (
 @@
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   merged_at INTEGER,
 @@
 );
```

```diff
@@ Schema Additions - discussions: @@
 CREATE TABLE discussions (
 @@
   last_note_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   resolvable BOOLEAN,
 @@
 );
```

```diff
@@ Schema Additions - notes: @@
 CREATE TABLE notes (
 @@
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   position INTEGER,
 @@
 );
```

---

## Change 5: Raw Payload Compression

**Why this is valuable:** At 50-100K documents plus threaded discussions, raw JSON is likely the largest storage consumer. Supporting gzip compression reduces DB size while preserving replay capability.

```diff
@@ Schema (Checkpoint 0) - raw_payloads: @@
 CREATE TABLE raw_payloads (
   id INTEGER PRIMARY KEY,
   source TEXT NOT NULL, -- 'gitlab'
   project_id INTEGER REFERENCES projects(id),
   resource_type TEXT NOT NULL,
   gitlab_id INTEGER NOT NULL,
   fetched_at INTEGER NOT NULL,
-  json TEXT NOT NULL
+  content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
+  payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON
 );
```

```diff
@@ Configuration (MVP): @@
+"storage": {
+  "compressRawPayloads": true // gzip raw payloads to reduce DB size
+},
```
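
The encode/decode pair implied by the schema change can be sketched with Node's built-in `zlib` (the `StoredPayload` shape mirrors the `content_encoding` + `payload` columns; function names are illustrative):

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Store a BLOB plus a content_encoding tag so replay can transparently
// decompress regardless of how the row was written.
interface StoredPayload {
  contentEncoding: "identity" | "gzip";
  payload: Buffer;
}

function encodePayload(json: string, compress: boolean): StoredPayload {
  const raw = Buffer.from(json, "utf8");
  return compress
    ? { contentEncoding: "gzip", payload: gzipSync(raw) }
    : { contentEncoding: "identity", payload: raw };
}

function decodePayload(row: StoredPayload): string {
  const raw =
    row.contentEncoding === "gzip" ? gunzipSync(row.payload) : row.payload;
  return raw.toString("utf8");
}
```

Keeping the tag per row (rather than a global flag) lets old uncompressed rows and new compressed rows coexist after the config changes.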

---

## Change 6: Scope Discussions Unique by Project

**Why this is important:** `gitlab_discussion_id TEXT UNIQUE` assumes global uniqueness across all projects. While likely true for GitLab, it is safer to scope by `project_id`. This avoids rare but painful collisions and makes it easier to support more repos later.

```diff
@@ Schema Additions - discussions: @@
 CREATE TABLE discussions (
   id INTEGER PRIMARY KEY,
-  gitlab_discussion_id TEXT UNIQUE NOT NULL,
+  gitlab_discussion_id TEXT NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
 @@
 );
+CREATE UNIQUE INDEX uq_discussions_project_discussion_id
+  ON discussions(project_id, gitlab_discussion_id);
```

---

## Change 7: Dirty Queue for Document Regeneration

**Why this is valuable:** The orchestration says "Regenerate documents for changed entities" but does not define how "changed" is computed without scanning large tables. A dirty queue populated during ingestion makes doc regen deterministic and fast.

```diff
@@ Schema Additions (Checkpoint 3): @@
+-- Track sources that require document regeneration (populated during ingestion)
+CREATE TABLE dirty_sources (
+  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
+  source_id INTEGER NOT NULL, -- local DB id
+  queued_at INTEGER NOT NULL,
+  PRIMARY KEY(source_type, source_id)
+);
```

```diff
@@ Orchestration steps (in order): @@
 1. Acquire app lock with heartbeat
 2. Ingest delta (issues, MRs, discussions) based on cursors
+   - During ingestion, INSERT into dirty_sources for each upserted entity
 3. Apply rolling backfill window
-4. Regenerate documents for changed entities
+4. Regenerate documents for entities in dirty_sources (process + delete from queue)
 5. Embed documents with changed content_hash
 6. FTS triggers auto-sync (no explicit step needed)
 7. Release lock, record sync_run as succeeded
```
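
The queue's semantics — idempotent marking (the composite PRIMARY KEY collapses repeated upserts) and drain-on-regen — can be sketched in memory (names are illustrative; the spec stores this table in SQLite):

```typescript
type SourceType = "issue" | "merge_request" | "discussion";

class DirtyQueue {
  private keys = new Set<string>();

  mark(type: SourceType, id: number): void {
    this.keys.add(`${type}:${id}`); // duplicate upserts collapse to one entry
  }

  drain(): Array<{ type: SourceType; id: number }> {
    const out = [...this.keys].map((k) => {
      const [type, id] = k.split(":");
      return { type: type as SourceType, id: Number(id) };
    });
    this.keys.clear(); // processed entries leave the queue
    return out;
  }
}
```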

---

## Change 8: document_paths + --path Filter

**Why this is high value:** We are already capturing DiffNote file paths in CP2. Adding a `--path` filter now makes the MVP dramatically more compelling for engineers who search by file path constantly.

```diff
@@ Checkpoint 4 Scope: @@
-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`
+  - `--path` filters documents by referenced file paths (from DiffNote positions)
+  - MVP: substring/exact match; glob patterns deferred
```

```diff
@@ Schema Additions (Checkpoint 3): @@
+-- Fast path filtering for documents (extracted from DiffNote positions)
+CREATE TABLE document_paths (
+  document_id INTEGER NOT NULL REFERENCES documents(id),
+  path TEXT NOT NULL,
+  PRIMARY KEY(document_id, path)
+);
+CREATE INDEX idx_document_paths_path ON document_paths(path);
```

```diff
@@ CLI Interface: @@
 # Search within specific project
 gi search "authentication" --project=group/project-one

+# Search by file path (finds discussions/MRs touching this file)
+gi search "rate limit" --path=src/client.ts
+
 # Pure FTS search (fallback if embeddings unavailable)
 gi search "redis" --mode=lexical
```

```diff
@@ Manual CLI Smoke Tests: @@
+| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
```
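
Populating `document_paths` from DiffNote positions can be sketched as follows. GitLab diff positions carry `old_path`/`new_path`; collecting both covers renames. The surrounding types and function name are assumptions for illustration:

```typescript
interface DiffNotePosition {
  old_path?: string | null;
  new_path?: string | null;
}

// Deduplicated, sorted path list for one document's DiffNote positions,
// ready to insert into document_paths.
function extractPaths(positions: DiffNotePosition[]): string[] {
  const paths = new Set<string>();
  for (const p of positions) {
    if (p.old_path) paths.add(p.old_path);
    if (p.new_path) paths.add(p.new_path);
  }
  return [...paths].sort();
}
```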

---

## Change 9: Character-Based Truncation (Not Exact Tokens)

**Why this is practical:** "8000 tokens" sounds precise, but tokenizers vary. Exact token counting adds dependency complexity. A conservative character budget is simpler and avoids false precision.

```diff
@@ Document Extraction Rules: @@
 - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
+  - **Implementation:** Use character budget, not exact token count
+    - `maxChars = 32000` (conservative 4 chars/token estimate)
+    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
+    - This avoids tokenizer dependency while preventing embedding failures
```

```diff
@@ Truncation: @@
 If content exceeds 8000 tokens:
+**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`.
+
 1. Truncate from the middle (preserve first + last notes for context)
 2. Set `documents.is_truncated = 1`
 3. Set `documents.truncated_reason = 'token_limit_middle_drop'`
 4. Log a warning with document ID and original token count
```
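
Middle-drop truncation under a character budget can be sketched as below. The marker string is an assumption; the spec only fixes `maxChars = 32000` and the keep-head-and-tail rule:

```typescript
function truncateMiddle(
  text: string,
  maxChars = 32000,
  marker = "\n[... truncated ...]\n",
): { text: string; truncated: boolean } {
  if (text.length <= maxChars) return { text, truncated: false };
  const keep = maxChars - marker.length; // chars left for head + tail
  const head = Math.ceil(keep / 2);      // preserve the beginning...
  const tail = keep - head;              // ...and the end for context
  return {
    text: text.slice(0, head) + marker + text.slice(text.length - tail),
    truncated: true,
  };
}
```

The result is always exactly `maxChars` long when truncation fires, so the downstream embedding call never sees an over-budget document.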

---

## Change 10: Move Lexical Search to CP3 (Reorder, Not New Scope)

**Why this is better:** Nothing "search-like" exists until CP4, yet FTS5 is already a dependency for graceful degradation. Moving FTS setup to CP3 (when documents exist) gives an earlier usable artifact and better validation. CP4 becomes the "hybrid ranking upgrade."

```diff
@@ Checkpoint 3: Embedding Generation @@
-### Checkpoint 3: Embedding Generation
-**Deliverable:** Vector embeddings generated for all text content
+### Checkpoint 3: Document + Embedding Generation with Lexical Search
+**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end
```

```diff
@@ Checkpoint 3 Scope: @@
 - Ollama integration (nomic-embed-text model)
 - Embedding generation pipeline:
 @@
 - Fast label filtering via `document_labels` join table
+- FTS5 index for lexical search (moved from CP4)
+- `gi search --mode=lexical` CLI command (works without Ollama)
```

```diff
@@ Checkpoint 3 Manual CLI Smoke Tests: @@
+| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required |
```

```diff
@@ Checkpoint 4: Semantic Search @@
-### Checkpoint 4: Semantic Search
-**Deliverable:** Working semantic search across all indexed content
+### Checkpoint 4: Hybrid Search (Semantic + Lexical)
+**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content
```

```diff
@@ Checkpoint 4 Scope: @@
 **Scope:**
 - Hybrid retrieval:
   - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
   - Merge + rerank results using Reciprocal Rank Fusion (RRF)
+- Query embedding generation (same Ollama pipeline as documents)
 - Result ranking and scoring (document-level)
-- Search filters: ...
+- Filters work identically in hybrid and lexical modes
```
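
The RRF merge step for CP4 can be sketched as follows, fusing the per-mode rankings (one list from vector recall, one from FTS5); `k = 60` is the conventional RRF constant:

```typescript
// Each document's fused score is the sum of 1/(k + rank) over every
// ranking it appears in, so documents found by both modes rise to the top.
function rrfMerge(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      // rank is 0-based here, so rank + 1 gives the 1-based position
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```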

---

## Summary of All Changes (Round 2)

| # | Change | Impact |
|---|--------|--------|
| 1 | **Cursor rewind + local filtering** | Fixes a real correctness gap in the tuple cursor implementation |
| 2 | **BEGIN IMMEDIATE CAS for lock** | Prevents a race condition in lock acquisition |
| 3 | **Discussions pagination + concurrency** | Prevents silent data loss on large discussion threads |
| 4 | **last_seen_at columns** | Enables debugging of sync coverage without deletions |
| 5 | **Raw payload compression** | Reduces DB size significantly at scale |
| 6 | **Scope discussions unique by project** | Defensive uniqueness for multi-project safety |
| 7 | **Dirty queue for doc regen** | Makes document regeneration deterministic and fast |
| 8 | **document_paths + --path filter** | High-value file search with minimal scope |
| 9 | **Character-based truncation** | Practical implementation without a tokenizer dependency |
| 10 | **Lexical search in CP3** | Earlier usable artifact; better checkpoint validation |

**Net effect:** These changes fix several correctness gaps (cursor, lock, pagination) while adding high-value features (the `--path` filter) and operational improvements (compression, dirty queue, last_seen_at).

**New file:** SPEC-REVISIONS-3.md (427 lines)

# SPEC.md Revisions - First-Time User Experience

**Date:** 2026-01-21
**Purpose:** Document all changes adding installation, setup, and user-flow documentation to SPEC.md

---

## Summary of Changes

| Change | Location | Description |
|--------|----------|-------------|
| 1. Quick Start | After Executive Summary | Prerequisites, installation, first-run walkthrough |
| 2. `gi init` Command | Checkpoint 0 | Interactive setup wizard with GitLab validation |
| 3. CLI Command Reference | Before Future Work | Unified table of all commands |
| 4. Error Handling | After CLI Reference | Common errors with recovery guidance |
| 5. Database Management | After Error Handling | Location, backup, reset, migrations |
| 6. Empty State Handling | Checkpoint 4 scope | Behavior when no data indexed |
| 7. Resolved Decisions | Resolved Decisions table | New decisions from this revision |

---

## Change 1: Quick Start Section

**Location:** Insert after line 6 (after Executive Summary), before Discovery Summary

````diff
 A self-hosted tool to extract, index, and semantically search 2+ years of GitLab data (issues, MRs, and discussion threads) from 2 main repositories (~50-100K documents including threaded discussions). The MVP delivers semantic search as a foundational capability that enables future specialized views (file history, personal tracking, person context). Discussion threads are preserved as first-class entities to maintain conversational context essential for decision traceability.

 ---

+## Quick Start
+
+### Prerequisites
+
+| Requirement | Version | Notes |
+|-------------|---------|-------|
+| Node.js | 20+ | LTS recommended |
+| npm | 10+ | Comes with Node.js |
+| Ollama | Latest | Optional for semantic search; lexical search works without it |
+
+### Installation
+
+```bash
+# Clone and install
+git clone https://github.com/your-org/gitlab-inbox.git
+cd gitlab-inbox
+npm install
+npm run build
+npm link  # Makes `gi` available globally
+```
+
+### First Run
+
+1. **Set your GitLab token** (create at GitLab > Settings > Access Tokens with `read_api` scope):
+   ```bash
+   export GITLAB_TOKEN="glpat-xxxxxxxxxxxxxxxxxxxx"
+   ```
+
+2. **Run the setup wizard:**
+   ```bash
+   gi init
+   ```
+   This creates `gi.config.json` with your GitLab URL and project paths.
+
+3. **Verify your environment:**
+   ```bash
+   gi doctor
+   ```
+   All checks should pass (an Ollama warning is OK if you only need lexical search).
+
+4. **Sync your data:**
+   ```bash
+   gi sync
+   ```
+   Initial sync takes 10-20 minutes depending on repo size and rate limits.
+
+5. **Search:**
+   ```bash
+   gi search "authentication redesign"
+   ```
+
+### Troubleshooting First Run
+
+| Symptom | Solution |
+|---------|----------|
+| `Config file not found` | Run `gi init` first |
+| `GITLAB_TOKEN not set` | Export the environment variable |
+| `401 Unauthorized` | Check the token has `read_api` scope |
+| `Project not found: group/project` | Verify the project path in the GitLab URL |
+| `Ollama connection refused` | Start Ollama or use `--mode=lexical` for search |
+
+---
+
 ## Discovery Summary
````

---

## Change 2: `gi init` Command in Checkpoint 0

**Location:** Insert in Checkpoint 0 Manual CLI Smoke Tests table and Scope section

### 2a: Add to Manual CLI Smoke Tests table (after line 193)

```diff
 | `GITLAB_TOKEN=invalid gi auth-test` | Error message | Non-zero exit code, clear error about auth failure |
+| `gi init` | Interactive prompts | Creates valid gi.config.json |
+| `gi init` (config exists) | Confirmation prompt | Warns before overwriting |
+| `gi --help` | Command list | Shows all available commands |
+| `gi version` | Version number | Shows installed version |
```

### 2b: Add Automated Tests for init (after line 185)

```diff
 tests/integration/app-lock.test.ts
   ✓ acquires lock successfully
   ✓ updates heartbeat during operation
   ✓ detects stale lock and recovers
   ✓ refuses concurrent acquisition
+
+tests/integration/init.test.ts
+  ✓ creates config file with valid structure
+  ✓ validates GitLab URL format
+  ✓ validates GitLab connection before writing config
+  ✓ validates each project path exists in GitLab
+  ✓ fails if token not set
+  ✓ fails if GitLab auth fails
+  ✓ fails if any project path not found
+  ✓ prompts before overwriting existing config
+  ✓ respects --force to skip confirmation
```

### 2c: Add to Checkpoint 0 Scope (after line 209)

```diff
 - Rate limit handling with exponential backoff + jitter
+- `gi init` command for guided setup:
+  - Prompts for GitLab base URL
+  - Prompts for project paths (comma-separated or multiple prompts)
+  - Prompts for token environment variable name (default: GITLAB_TOKEN)
+  - **Validates before writing config:**
+    - Token must be set in environment
+    - Tests auth with `GET /user` endpoint
+    - Validates each project path with `GET /projects/:path`
+    - Only writes config after all validations pass
+  - Generates `gi.config.json` with sensible defaults
+- `gi --help` shows all available commands
+- `gi <command> --help` shows command-specific help
+- `gi version` shows installed version
+- First-run detection: if no config exists, suggest `gi init`
```
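
One detail worth noting for the `GET /projects/:path` validation: GitLab's REST API accepts a URL-encoded full path wherever a numeric project id is expected, so `group/project` must be percent-encoded before it goes into the URL path. A minimal sketch (the function name is illustrative):

```typescript
// Build the project-lookup URL for validation; the `/` in the project
// path must be encoded as %2F or GitLab will 404 the request.
function projectLookupUrl(baseUrl: string, projectPath: string): string {
  return `${baseUrl}/api/v4/projects/${encodeURIComponent(projectPath)}`;
}
```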

---

## Change 3: CLI Command Reference Section

**Location:** Insert before "## Future Work (Post-MVP)" (before line 1174)

```diff
+## CLI Command Reference
+
+All commands support `--help` for detailed usage information.
+
+### Setup & Diagnostics
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi init` | 0 | Interactive setup wizard; creates gi.config.json |
+| `gi auth-test` | 0 | Verify GitLab authentication |
+| `gi doctor` | 0 | Check environment (GitLab, Ollama, DB) |
+| `gi doctor --json` | 0 | JSON output for scripting |
+| `gi version` | 0 | Show installed version |
+
+### Data Ingestion
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
+| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
+| `gi embed --all` | 3 | Generate embeddings for all documents |
+| `gi embed --retry-failed` | 3 | Retry failed embeddings |
+| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
+| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
+| `gi sync --force` | 5 | Override stale lock after operator review |
+| `gi sync --no-embed` | 5 | Sync without embedding (faster) |
+
+### Data Inspection
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
+| `gi list mrs [--limit=N]` | 2 | List merge requests |
+| `gi count issues` | 1 | Count issues |
+| `gi count mrs` | 2 | Count merge requests |
+| `gi count discussions` | 2 | Count discussions |
+| `gi count notes` | 2 | Count notes |
+| `gi show issue <iid>` | 1 | Show issue details |
+| `gi show mr <iid>` | 2 | Show MR details with discussions |
+| `gi stats` | 3 | Embedding coverage statistics |
+| `gi stats --json` | 3 | JSON stats for scripting |
+| `gi sync-status` | 1 | Show cursor positions and last sync |
+
+### Search
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi search "query"` | 4 | Hybrid semantic + lexical search |
+| `gi search "query" --mode=lexical` | 3 | Lexical-only search (no Ollama required) |
+| `gi search "query" --type=issue\|mr\|discussion` | 4 | Filter by document type |
+| `gi search "query" --author=USERNAME` | 4 | Filter by author |
+| `gi search "query" --after=YYYY-MM-DD` | 4 | Filter by date |
+| `gi search "query" --label=NAME` | 4 | Filter by label (repeatable) |
+| `gi search "query" --project=PATH` | 4 | Filter by project |
+| `gi search "query" --path=FILE` | 4 | Filter by file path |
+| `gi search "query" --json` | 4 | JSON output for scripting |
+| `gi search "query" --explain` | 4 | Show ranking breakdown |
+
+### Database Management
+
+| Command | CP | Description |
+|---------|-----|-------------|
+| `gi backup` | 0 | Create timestamped database backup |
+| `gi reset --confirm` | 0 | Delete database and reset cursors |
+
+---
+
 ## Future Work (Post-MVP)
```

---

## Change 4: Error Handling Section

**Location:** Insert after CLI Command Reference, before Future Work

```diff
+## Error Handling
+
+Common errors and their resolutions:
+
+### Configuration Errors
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| `Config file not found` | No gi.config.json | Run `gi init` to create configuration |
+| `Invalid config: missing baseUrl` | Malformed config | Re-run `gi init` or fix gi.config.json manually |
+| `Invalid config: no projects defined` | Empty projects array | Add at least one project path to config |
+
+### Authentication Errors
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| `GITLAB_TOKEN environment variable not set` | Token not exported | `export GITLAB_TOKEN="glpat-xxx"` |
+| `401 Unauthorized` | Invalid or expired token | Generate new token with `read_api` scope |
+| `403 Forbidden` | Token lacks permissions | Ensure token has `read_api` scope |
+
+### GitLab API Errors
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| `Project not found: group/project` | Invalid project path | Verify path matches GitLab URL (case-sensitive) |
+| `429 Too Many Requests` | Rate limited | Wait for Retry-After period; sync will auto-retry |
+| `Connection refused` | GitLab unreachable | Check GitLab URL and network connectivity |
+
+### Data Errors
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| `No documents indexed` | Sync not run | Run `gi sync` first |
+| `No results found` | Query too specific | Try broader search terms |
+| `Database locked` | Concurrent access | Wait for other process; use `gi sync --force` if stale |
+
+### Embedding Errors
+
+| Error | Cause | Resolution |
+|-------|-------|------------|
+| `Ollama connection refused` | Ollama not running | Start Ollama or use `--mode=lexical` |
+| `Model not found: nomic-embed-text` | Model not pulled | Run `ollama pull nomic-embed-text` |
+| `Embedding failed for N documents` | Transient failures | Run `gi embed --retry-failed` |
+
+### Operational Behavior
+
+| Scenario | Behavior |
+|----------|----------|
+| **Ctrl+C during sync** | Graceful shutdown: finishes current page, commits cursor, exits cleanly. Resume with `gi sync`. |
+| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. |
+| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. |
+| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. |
+
+---
+
 ## Future Work (Post-MVP)
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Change 5: Database Management Section

**Location:** Insert after Error Handling, before Future Work

```diff
+## Database Management
+
+### Database Location
+
+The SQLite database is stored at an XDG-compliant location:
+
+```
+~/.local/share/gi/data.db
+```
+
+This can be overridden in `gi.config.json`:
+
+```json
+{
+  "storage": {
+    "dbPath": "/custom/path/to/data.db"
+  }
+}
+```
+
+### Backup
+
+Create a timestamped backup of the database:
+
+```bash
+gi backup
+# Creates: ~/.local/share/gi/backups/data-2026-01-21T14-30-00.db
+```
+
+Backups are SQLite `.backup` command copies (safe even during active writes due to WAL mode).
+
+### Reset
+
+To completely reset the database and all sync cursors:
+
+```bash
+gi reset --confirm
+```
+
+This deletes:
+- The database file
+- All sync cursors
+- All embeddings
+
+You'll need to run `gi sync` again to repopulate.
+
+### Schema Migrations
+
+The database schema is version-tracked and migrations auto-apply on startup:
+
+1. On first run, the schema is created at the latest version
+2. On subsequent runs, pending migrations are applied automatically
+3. The migration version is stored in the `schema_version` table
+4. Migrations are idempotent and reversible where possible
+
+**Manual migration check:**
+```bash
+gi doctor --json | jq '.checks.database'
+# Shows: { "status": "ok", "schemaVersion": 5, "pendingMigrations": 0 }
+```
+
+---
+
## Future Work (Post-MVP)
```

---

## Change 6: Empty State Handling in Checkpoint 4

**Location:** Add to Checkpoint 4 scope section (around line 885, after "Graceful degradation")

```diff
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
+- Empty state handling:
+  - No documents indexed: `No data indexed. Run 'gi sync' first.`
+  - Query returns no results: `No results found for "query".`
+  - Filters exclude all results: `No results match the specified filters.`
+  - Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
```

**Location:** Add to Manual CLI Smoke Tests table (after `gi search "xyznonexistent123"` row)

```diff
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
+| `gi search "auth"` (no data synced) | No data message | Shows "Run gi sync first" |
```

---

## Change 7: Update Resolved Decisions Table

**Location:** Add new rows to Resolved Decisions table (around line 1280)

```diff
| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
+| Database location | **XDG compliant: `~/.local/share/gi/`** | Standard location, user-configurable |
+| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
+| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
+| Empty state UX | **Actionable messages** | Guide user to next step |
```

---

## Files Modified

| File | Action |
|------|--------|
| `SPEC.md` | 7 changes applied |
| `SPEC-REVISIONS-2.md` | Created (this file) |

---

## Verification Checklist

After applying changes:

- [ ] Quick Start section provides clear 5-step onboarding
- [ ] `gi init` fully specified with validation behavior
- [ ] All CLI commands documented in reference table
- [ ] Error scenarios have recovery guidance
- [ ] Database location and management documented
- [ ] Empty states have helpful messages
- [ ] Resolved Decisions updated with new choices
- [ ] No orphaned command references

716
SPEC-REVISIONS.md
Normal file
@@ -0,0 +1,716 @@
# SPEC.md Revision Document

This document provides git-diff style changes to integrate improvements from ChatGPT's review into the original SPEC.md. The goal is a "best of all worlds" hybrid that maintains the original architecture while adding production-grade hardening.

---

## Change 1: Crash-safe Single-flight with Heartbeat Lock

**Why this is better:** The original plan's single-flight protection is policy-based, not DB-enforced. A race condition exists where two processes could both start before either writes to `sync_runs`. The heartbeat approach provides DB-enforced atomicity and automatic crash recovery, and reduces manual intervention.

```diff
@@ Schema (Checkpoint 0): @@
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
+  heartbeat_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL, -- 'ingest issues' | 'sync' | etc.
  error TEXT
);

+-- Crash-safe single-flight lock (DB-enforced)
+CREATE TABLE app_locks (
+  name TEXT PRIMARY KEY, -- 'sync'
+  owner TEXT NOT NULL, -- random run token (UUIDv4)
+  acquired_at INTEGER NOT NULL,
+  heartbeat_at INTEGER NOT NULL
+);
```

```diff
@@ Checkpoint 0: Project Setup - Scope @@
**Scope:**
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
-- Projects table and initial sync
+- Projects table and initial project resolution (no issue/MR ingestion yet)
+- DB migrations + WAL + FK + app lock primitives
+- Crash-safe single-flight lock with heartbeat
```

```diff
@@ Reliability/Idempotency Rules: @@
- Every ingest/sync creates a `sync_runs` row
-- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
+- Single-flight: acquire `app_locks('sync')` before starting
+  - On start: INSERT OR REPLACE lock row with new owner token
+  - During run: update `heartbeat_at` every 30 seconds
+  - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire
+  - `--force` remains as operator override for edge cases, but should rarely be needed
- Cursor advances only after successful transaction commit per page/batch
- Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
- Use explicit transactions for batch inserts
```

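The stale-takeover rule above can be sketched as a pure predicate. This is an illustrative sketch, not SPEC.md code: `LockRow`, `canAcquireSyncLock`, and the millisecond timestamps are assumed names/units, and the real implementation would perform the check and takeover atomically in one SQLite statement.

```typescript
// Sketch of the stale-lock takeover decision (illustrative names).
interface LockRow {
  owner: string;       // run token (UUIDv4)
  heartbeatAt: number; // epoch ms of last heartbeat
}

const STALE_LOCK_MS = 10 * 60 * 1000; // mirrors the staleLockMinutes default

// A new run may take the 'sync' lock iff no row exists, or the current
// holder's heartbeat is older than the staleness threshold.
function canAcquireSyncLock(existing: LockRow | null, nowMs: number): boolean {
  if (existing === null) return true;
  return nowMs - existing.heartbeatAt > STALE_LOCK_MS;
}
```

In SQL this would be a single upsert guarded by `WHERE app_locks.heartbeat_at < ?`, so two racing processes cannot both win.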
```diff
@@ Configuration (MVP): @@
// gi.config.json
{
  "gitlab": {
    "baseUrl": "https://gitlab.example.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project-one" },
    { "path": "group/project-two" }
  ],
+  "sync": {
+    "backfillDays": 14,
+    "staleLockMinutes": 10,
+    "heartbeatIntervalSeconds": 30
+  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
-    "baseUrl": "http://localhost:11434"
+    "baseUrl": "http://localhost:11434",
+    "concurrency": 4
  }
}
```

---

## Change 2: Harden Cursor Semantics + Rolling Backfill Window

**Why this is better:** The original plan's "critical assumption" that comments update parent `updated_at` is mostly true but the failure mode is catastrophic (silently missing new discussion content). The rolling backfill provides a safety net without requiring weekly full resyncs.

```diff
@@ GitLab API Strategy - Critical Assumption @@
-### Critical Assumption
-
-**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.
-
-Mitigation: Periodic full re-sync (weekly) as a safety net.
+### Critical Assumption (Softened)
+
+We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively.
+
+**Mitigations (MVP):**
+1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor.
+2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
+3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`).
+
+The backfill window provides 80% of the safety of full resync at <5% of the API cost.
```

```diff
@@ Checkpoint 5: Incremental Sync - Scope @@
**Scope:**
-- Delta sync based on stable cursor (updated_at + tie-breaker id)
+- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
+- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
- Dependent resources sync strategy (discussions refetched when parent updates)
- Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
- Sync status reporting
- Recommended: run via cron every 10 minutes
```

```diff
@@ Correctness Rules (MVP): @@
-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
-2. Cursor advances only after successful DB commit for that page
+1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
+2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
+   - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)`
+   - Cursor advances only after successful DB commit for that page
+   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
3. Dependent resources:
   - For each updated issue/MR, refetch ALL its discussions
   - Discussion documents are regenerated and re-embedded if content_hash changes
-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
-5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
+4. Rolling backfill window:
+   - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
+   - This catches any items whose timestamps were updated without triggering our cursor
+5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
+6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
```

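The tuple-cursor comparison and advancement rules can be sketched as two small helpers. This is an illustrative sketch only; `Cursor`, `Item`, and the epoch-second timestamps are assumed names, and the same predicate also serves to discard already-processed items locally when the API cannot express the tuple condition server-side.

```typescript
// Sketch of tuple-cursor filtering and advancement (illustrative names).
interface Cursor { updatedAt: number; gitlabId: number }
interface Item { updatedAt: number; gitlabId: number }

// True iff the item sorts strictly after the cursor in
// (updated_at ASC, gitlab_id ASC) order.
function isAfterCursor(item: Item, cursor: Cursor): boolean {
  if (item.updatedAt !== cursor.updatedAt) return item.updatedAt > cursor.updatedAt;
  return item.gitlabId > cursor.gitlabId;
}

// After a page commits, the cursor becomes the last processed item's tuple.
// `page` is assumed non-empty and already sorted in fetch order.
function advanceCursor(page: Item[]): Cursor {
  const last = page[page.length - 1];
  return { updatedAt: last.updatedAt, gitlabId: last.gitlabId };
}
```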
---

## Change 3: Raw Payload Scoping + project_id

**Why this is better:** The original `raw_payloads(resource_type, gitlab_id)` index could have collisions in edge cases (especially if later adding more projects or resource types). Adding `project_id` is defensive and enables project-scoped lookups.

```diff
@@ Schema (Checkpoint 0) - raw_payloads @@
CREATE TABLE raw_payloads (
  id INTEGER PRIMARY KEY,
  source TEXT NOT NULL, -- 'gitlab'
+  project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources
  resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
  gitlab_id INTEGER NOT NULL,
  fetched_at INTEGER NOT NULL,
  json TEXT NOT NULL
);
-CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
+CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
+CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);
```

---

## Change 4: Tighten Uniqueness Constraints (project_id + iid)

**Why this is better:** Users think in terms of "issue 123 in project X," not global IDs. This enables O(1) `gi show issue 123 --project=X` and prevents subtle ingestion bugs from creating duplicate rows.

```diff
@@ Schema Preview - issues @@
CREATE TABLE issues (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  title TEXT,
  description TEXT,
  state TEXT,
  author_username TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  web_url TEXT,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
CREATE INDEX idx_issues_author ON issues(author_username);
+CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);
```

```diff
@@ Schema Additions - merge_requests @@
CREATE TABLE merge_requests (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  project_id INTEGER NOT NULL REFERENCES projects(id),
  iid INTEGER NOT NULL,
  ...
);
CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
CREATE INDEX idx_mrs_author ON merge_requests(author_username);
+CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);
```

---

## Change 5: Store System Notes (Flagged) + Capture DiffNote Paths

**Why this is better:** Two problems with dropping system notes entirely: (1) Some system notes carry decision trace context ("marked as resolved", "changed milestone"). (2) File/path search is disproportionately valuable for engineers. DiffNote positions already contain path metadata - capturing it now enables immediate filename search.

```diff
@@ Checkpoint 2 Scope @@
- Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
  - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
  - During initial ingest: fetch discussions for every issue/MR
  - During sync: refetch discussions only for issues/MRs updated since cursor
-  - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
+  - Preserve system notes but flag them with `is_system=1`; exclude from embeddings by default
+  - Capture DiffNote file path/line metadata from `position` field for immediate filename search value
```

```diff
@@ Schema Additions - notes @@
CREATE TABLE notes (
  id INTEGER PRIMARY KEY,
  gitlab_id INTEGER UNIQUE NOT NULL,
  discussion_id INTEGER NOT NULL REFERENCES discussions(id),
  project_id INTEGER NOT NULL REFERENCES projects(id),
  type TEXT, -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
+  is_system BOOLEAN NOT NULL DEFAULT 0, -- system notes (assignments, label changes, etc.)
  author_username TEXT,
  body TEXT,
  created_at INTEGER,
  updated_at INTEGER,
  position INTEGER, -- derived from array order in API response (0-indexed)
  resolvable BOOLEAN,
  resolved BOOLEAN,
  resolved_by TEXT,
  resolved_at INTEGER,
+  -- DiffNote position metadata (nullable, from GitLab API position object)
+  position_old_path TEXT,
+  position_new_path TEXT,
+  position_old_line INTEGER,
+  position_new_line INTEGER,
  raw_payload_id INTEGER REFERENCES raw_payloads(id)
);
CREATE INDEX idx_notes_discussion ON notes(discussion_id);
CREATE INDEX idx_notes_author ON notes(author_username);
CREATE INDEX idx_notes_type ON notes(type);
+CREATE INDEX idx_notes_system ON notes(is_system);
+CREATE INDEX idx_notes_new_path ON notes(position_new_path);
```

```diff
@@ Discussion Processing Rules @@
-- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
+- System notes (`system: true`) are ingested with `notes.is_system=1`
+  - Excluded from document extraction/embeddings by default (reduces noise in semantic search)
+  - Preserved for audit trail, timeline views, and potential future decision-tracing features
+  - Can be toggled via `--include-system-notes` flag if needed
+- DiffNote position data is extracted and stored:
+  - `position.old_path`, `position.new_path` for file-level search
+  - `position.old_line`, `position.new_line` for line-level context
- Each discussion from the API becomes one row in `discussions` table
- All notes within a discussion are stored with their `discussion_id` foreign key
- `individual_note: true` discussions have exactly one note (standalone comment)
- `individual_note: false` discussions have multiple notes (threaded conversation)
```

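The note-flattening rules above can be sketched as a small transformer. This is an illustrative sketch: the `position` object shape (`old_path`/`new_path`/`old_line`/`new_line`) follows the GitLab discussions API payload, but `extractNoteFields` and its return shape are assumed names, not SPEC.md code.

```typescript
// Sketch of extracting is_system and DiffNote position fields from a
// GitLab note payload (illustrative function/field names).
interface GitLabNotePayload {
  system?: boolean;
  type?: string | null;
  position?: {
    old_path?: string | null;
    new_path?: string | null;
    old_line?: number | null;
    new_line?: number | null;
  } | null;
}

function extractNoteFields(note: GitLabNotePayload) {
  return {
    isSystem: note.system === true ? 1 : 0,          // flagged, not dropped
    positionOldPath: note.position?.old_path ?? null, // nullable for non-DiffNotes
    positionNewPath: note.position?.new_path ?? null,
    positionOldLine: note.position?.old_line ?? null,
    positionNewLine: note.position?.new_line ?? null,
  };
}
```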
```diff
@@ Checkpoint 2 Automated Tests @@
tests/unit/discussion-transformer.test.ts
- transforms discussion payload to normalized schema
- extracts notes array from discussion
- sets individual_note flag correctly
-- filters out system notes (system: true)
+- flags system notes with is_system=1
+- extracts DiffNote position metadata (paths and lines)
- preserves note order via position field

tests/integration/discussion-ingestion.test.ts
- fetches discussions for each issue
- fetches discussions for each MR
- creates discussion rows with correct parent FK
- creates note rows linked to discussions
-- excludes system notes from storage
+- stores system notes with is_system=1 flag
+- extracts position_new_path from DiffNotes
- captures note-level resolution status
- captures note type (DiscussionNote, DiffNote)
```

```diff
@@ Checkpoint 2 Data Integrity Checks @@
- [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
- [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
- [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
-- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
+- [ ] System notes have `is_system=1` flag set correctly
+- [ ] DiffNotes have `position_new_path` populated when available
- [ ] Every discussion has at least one note
- [ ] `individual_note = true` discussions have exactly one note
- [ ] Discussion `first_note_at` <= `last_note_at` for all rows
```

---

## Change 6: Document Extraction Structured Header + Truncation Metadata

**Why this is better:** Adding a deterministic header improves search snippets (more informative), embeddings (model gets stable context), and debuggability (see if/why truncation happened).

```diff
@@ Schema Additions - documents @@
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
  source_id INTEGER NOT NULL, -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT, -- for discussions: first note author
  label_names TEXT, -- JSON array (display/debug only)
  created_at INTEGER,
  updated_at INTEGER,
  url TEXT,
  title TEXT, -- null for discussions
  content_text TEXT NOT NULL, -- canonical text for embedding/snippets
  content_hash TEXT NOT NULL, -- SHA-256 for change detection
+  is_truncated BOOLEAN NOT NULL DEFAULT 0,
+  truncated_reason TEXT, -- 'token_limit_middle_drop' | null
  UNIQUE(source_type, source_id)
);
```

```diff
@@ Discussion Document Format @@
-[Issue #234: Authentication redesign] Discussion
+[[Discussion]] Issue #234: Authentication redesign
+Project: group/project-one
+URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
+Labels: ["bug", "auth"]
+Files: ["src/auth/login.ts"] -- present if any DiffNotes exist in thread
+
+--- Thread ---

@johndoe (2024-03-15):
I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

@johndoe (2024-03-16):
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```

```diff
@@ Document Extraction Rules @@
| Source | content_text Construction |
|--------|--------------------------|
-| Issue | `title + "\n\n" + description` |
-| MR | `title + "\n\n" + description` |
+| Issue | Structured header + `title + "\n\n" + description` |
+| MR | Structured header + `title + "\n\n" + description` |
| Discussion | Full thread with context (see below) |

+**Structured Header Format (all document types):**
+```
+[[{SourceType}]] {Title}
+Project: {path_with_namespace}
+URL: {web_url}
+Labels: {JSON array of label names}
+Files: {JSON array of paths from DiffNotes, if any}
+
+--- Content ---
+```
+
+This format provides:
+- Stable, parseable context for embeddings
+- Consistent snippet formatting in search results
+- File path context without full file-history feature
```

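The header format and SHA-256 change detection can be sketched together. This is an illustrative sketch, not SPEC.md code: `buildContentText`, `contentHash`, and the `HeaderInput` shape are assumed names; only the header layout and the use of SHA-256 over `content_text` come from the spec above.

```typescript
// Sketch of structured-header construction + content_hash (illustrative names).
import { createHash } from "node:crypto";

interface HeaderInput {
  sourceType: "Issue" | "MergeRequest" | "Discussion";
  title: string;
  projectPath: string; // path_with_namespace
  url: string;
  labels: string[];
  files: string[];     // paths from DiffNotes, may be empty
}

function buildContentText(h: HeaderInput, body: string): string {
  const lines = [
    `[[${h.sourceType}]] ${h.title}`,
    `Project: ${h.projectPath}`,
    `URL: ${h.url}`,
    `Labels: ${JSON.stringify(h.labels)}`,
  ];
  if (h.files.length > 0) lines.push(`Files: ${JSON.stringify(h.files)}`);
  return lines.join("\n") + "\n\n--- Content ---\n\n" + body;
}

// SHA-256 hex digest of the canonical text; compared against
// embedding_metadata.content_hash to decide re-embedding.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}
```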
```diff
@@ Truncation @@
-**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning.
+**Truncation:**
+If content exceeds 8000 tokens:
+1. Truncate from the middle (preserve first + last notes for context)
+2. Set `documents.is_truncated = 1`
+3. Set `documents.truncated_reason = 'token_limit_middle_drop'`
+4. Log a warning with document ID and original token count
+
+This metadata enables:
+- Monitoring truncation frequency in production
+- Future investigation of high-value truncated documents
+- Debugging when search misses expected content
```

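The middle-drop step can be sketched over a list of note bodies. This is an illustrative sketch only: `truncateMiddle` is an assumed name, and the rough 4-characters-per-token estimator stands in for whatever tokenizer the implementation actually uses.

```typescript
// Sketch of middle-drop truncation (illustrative names; crude token estimate).
const TOKEN_LIMIT = 8000;

const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function truncateMiddle(notes: string[]): { kept: string[]; truncated: boolean } {
  let total = notes.reduce((n, s) => n + estimateTokens(s), 0);
  if (total <= TOKEN_LIMIT) return { kept: notes, truncated: false };
  const kept = [...notes];
  // Drop notes from the middle until under budget, always preserving
  // the first and last notes for context.
  while (total > TOKEN_LIMIT && kept.length > 2) {
    const mid = Math.floor(kept.length / 2);
    total -= estimateTokens(kept[mid]);
    kept.splice(mid, 1);
  }
  return { kept, truncated: true }; // caller sets is_truncated/truncated_reason
}
```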
---

## Change 7: Embedding Pipeline Concurrency + Per-Document Error Tracking

**Why this is better:** For 50-100K documents, embedding is the longest pole. Controlled concurrency (4-8 workers) saturates local inference without OOM. Per-document error tracking prevents single bad payloads from stalling "100% coverage" and enables targeted re-runs.

```diff
@@ Checkpoint 3: Embedding Generation - Scope @@
**Scope:**
- Ollama integration (nomic-embed-text model)
-- Embedding generation pipeline (batch processing, 32 documents per batch)
+- Embedding generation pipeline:
+  - Batch size: 32 documents per batch
+  - Concurrency: configurable (default 4 workers)
+  - Retry with exponential backoff for transient failures (max 3 attempts)
+  - Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- Document extraction layer:
```

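The retry bookkeeping above can be sketched as two pure helpers. This is an illustrative sketch: `EmbedState`, `shouldAttempt`, and `backoffMs` are assumed names, and the 30-second backoff cap is an assumption beyond the spec's "max 3 attempts" rule.

```typescript
// Sketch of per-document retry selection + exponential backoff
// (illustrative names; state mirrors embedding_metadata columns).
interface EmbedState {
  attemptCount: number;
  lastError: string | null; // null once embedding succeeded
}

const MAX_ATTEMPTS = 3;

function shouldAttempt(state: EmbedState | null): boolean {
  if (state === null) return true;            // never attempted
  if (state.lastError === null) return false; // already embedded successfully
  return state.attemptCount < MAX_ATTEMPTS;   // failed, retries remaining
}

// Exponential backoff: 1s, 2s, 4s, ... capped at 30s (cap is an assumption).
function backoffMs(attemptCount: number): number {
  return Math.min(1000 * 2 ** attemptCount, 30_000);
}
```

`gi embed --retry-failed` would then select exactly the documents where `shouldAttempt` returns true with a non-null `lastError`.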
```diff
|
||||||
|
@@ Schema Additions - embedding_metadata @@
|
||||||
|
CREATE TABLE embedding_metadata (
|
||||||
|
document_id INTEGER PRIMARY KEY REFERENCES documents(id),
|
||||||
|
model TEXT NOT NULL, -- 'nomic-embed-text'
|
||||||
|
dims INTEGER NOT NULL, -- 768
|
||||||
|
content_hash TEXT NOT NULL, -- copied from documents.content_hash
|
||||||
|
- created_at INTEGER NOT NULL
|
||||||
|
+ created_at INTEGER NOT NULL,
|
||||||
|
+ -- Error tracking for resumable embedding
|
||||||
|
+ last_error TEXT, -- error message from last failed attempt
|
||||||
|
+ attempt_count INTEGER NOT NULL DEFAULT 0,
|
||||||
|
+ last_attempt_at INTEGER -- when last attempt occurred
|
||||||
|
);
|
||||||
|
+
|
||||||
|
+-- Index for finding failed embeddings to retry
|
||||||
|
+CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
|
||||||
|
```
|
||||||
|
|
||||||
|
```diff
|
||||||
|
@@ Checkpoint 3 Automated Tests @@
|
||||||
|
tests/integration/embedding-storage.test.ts
|
||||||
|
- stores embedding in sqlite-vss
|
||||||
|
- embedding rowid matches document id
|
||||||
|
- creates embedding_metadata record
|
||||||
|
- skips re-embedding when content_hash unchanged
|
||||||
|
- re-embeds when content_hash changes
|
||||||
|
+ - records error in embedding_metadata on failure
|
||||||
|
+ - increments attempt_count on each retry
|
||||||
|
+ - clears last_error on successful embedding
|
||||||
|
+ - respects concurrency limit
|
||||||
|
```
|
||||||
|
|
||||||
|
```diff
|
||||||
|
@@ Checkpoint 3 Manual CLI Smoke Tests @@
|
||||||
|
| Command | Expected Output | Pass Criteria |
|
||||||
|
|---------|-----------------|---------------|
|
||||||
|
| `gi embed --all` | Progress bar with ETA | Completes without error |
|
||||||
|
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
|
||||||
|
+| `gi embed --retry-failed` | Progress on failed docs | Re-attempts previously failed embeddings |
|
||||||
|
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
|
||||||
|
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
|
||||||
|
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
|
||||||
|
+
|
||||||
|
+**Stats output should include:**
|
||||||
|
+- Total documents
|
||||||
|
+- Successfully embedded
|
||||||
|
+- Failed (with error breakdown)
|
||||||
|
+- Pending (never attempted)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Change 8: Search UX Improvements (--project, --explain, Stable JSON Schema)
|
||||||
|
|
||||||
|
**Why this is better:** For day-to-day use, "search across everything" is less useful than "search within repo X." The `--explain` flag helps validate ranking during MVP. Stable JSON schema prevents accidental breaking changes for agent/MCP consumption.
|
||||||
|
|
||||||
|
```diff
|
||||||
|
@@ Checkpoint 4 Scope @@
|
||||||
|
-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
|
||||||
|
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`
|
||||||
|
+- Debug: `--explain` returns rank contributions from vector + FTS + RRF
|
||||||
|
- Label filtering operates on `document_labels` (indexed, exact-match)
|
||||||
|
- Output formatting: ranked list with title, snippet, score, URL
|
||||||
|
-- JSON output mode for AI agent consumption
|
||||||
|
+- JSON output mode for AI/agent consumption (stable schema, documented)
|
||||||
|
- Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
|
||||||
|
```
|
||||||
|
|
||||||
|
```diff
|
||||||
|
@@ CLI Interface @@
|
||||||
|
# Basic semantic search
|
||||||
|
gi search "why did we choose Redis"
|
||||||
|
|
||||||
|
+# Search within specific project
|
||||||
|
+gi search "authentication" --project=group/project-one
|
||||||
|
+
|
||||||
|
# Pure FTS search (fallback if embeddings unavailable)
|
||||||
|
gi search "redis" --mode=lexical
|
||||||
|
|
||||||
|
# Filtered search
|
||||||
|
gi search "authentication" --type=mr --after=2024-01-01
|
||||||
|
|
||||||
|
# Filter by label
|
||||||
|
gi search "performance" --label=bug --label=critical
|
||||||
|
|
||||||
|
# JSON output for programmatic use
|
||||||
|
gi search "payment processing" --json
|
||||||
|
+
|
||||||
|
+# Debug ranking (shows how each retriever contributed)
|
||||||
|
+gi search "authentication" --explain
|
||||||
|
```

```diff
@@ JSON Output Schema (NEW SECTION) @@
+**JSON Output Schema (Stable)**
+
+For AI/agent consumption, `--json` output follows this stable schema:
+
+```typescript
+interface SearchResult {
+  documentId: number;
+  sourceType: "issue" | "merge_request" | "discussion";
+  title: string | null;
+  url: string;
+  projectPath: string;
+  author: string | null;
+  createdAt: string;   // ISO 8601
+  updatedAt: string;   // ISO 8601
+  score: number;       // 0-1 normalized RRF score
+  snippet: string;     // truncated content_text
+  labels: string[];
+  // Only present with --explain flag
+  explain?: {
+    vectorRank?: number; // omitted if not in vector results
+    ftsRank?: number;    // omitted if not in FTS results
+    rrfScore: number;
+  };
+}
+
+interface SearchResponse {
+  query: string;
+  mode: "hybrid" | "lexical" | "semantic";
+  totalResults: number;
+  results: SearchResult[];
+  warnings?: string[]; // e.g., "Embedding service unavailable"
+}
+```
+
+**Schema versioning:** Breaking changes require a major version bump in the CLI. Non-breaking additions (new optional fields) are allowed.
```
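
The `explain` fields above combine via standard Reciprocal Rank Fusion. A minimal sketch of how `rrfScore` could be derived from the two per-retriever ranks (the constant `k = 60` and the helper names are illustrative assumptions, not spec requirements):

```typescript
// Reciprocal Rank Fusion: score(d) = sum over retrievers of 1 / (k + rank_d).
// k = 60 is the conventional default; a document absent from a retriever
// simply contributes nothing from that retriever.
const K = 60;

interface Explain {
  vectorRank?: number; // 1-based rank in vector results, if present
  ftsRank?: number;    // 1-based rank in FTS5 results, if present
  rrfScore: number;
}

function rrfScore(vectorRank?: number, ftsRank?: number): number {
  let score = 0;
  if (vectorRank !== undefined) score += 1 / (K + vectorRank);
  if (ftsRank !== undefined) score += 1 / (K + ftsRank);
  return score;
}

function explain(vectorRank?: number, ftsRank?: number): Explain {
  return { vectorRank, ftsRank, rrfScore: rrfScore(vectorRank, ftsRank) };
}
```

A document ranked first by both retrievers scores 2/61; one found only by FTS at rank 3 scores 1/63. Unbounded raw RRF scores would then be normalized to the 0-1 `score` field.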

```diff
@@ Checkpoint 4 Manual CLI Smoke Tests @@
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
+| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
| `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
| `gi search "authentication" --author=johndoe` | Filtered by author | All results authored by @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
| `gi search "authentication" --json` | JSON output | Valid JSON matching schema |
+| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
```

---

## Change 9: Make `gi sync` an Orchestrator

**Why this is better:** Once CP3+ exist, operators want one command that does the right thing. The most common MVP failure is "I ingested but forgot to regenerate docs / embed / update FTS."

```diff
@@ Checkpoint 5 CLI Commands @@
```bash
-# Full sync (respects cursors, only fetches new/updated)
-gi sync
+# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
+gi sync              # orchestrates all steps
+gi sync --no-embed   # skip embedding step (fast ingest/debug)
+gi sync --no-docs    # skip document regeneration (debug)

# Force full re-sync (resets cursors)
gi sync --full

# Override stale 'running' run after operator review
gi sync --force

# Show sync status
gi sync-status
```
+
+**Orchestration steps (in order):**
+1. Acquire app lock with heartbeat
+2. Ingest delta (issues, MRs, discussions) based on cursors
+3. Apply rolling backfill window
+4. Regenerate documents for changed entities
+5. Embed documents with changed content_hash
+6. FTS triggers auto-sync (no explicit step needed)
+7. Release lock, record sync_run as succeeded
+
+Individual commands remain available for checkpoint testing and debugging:
+- `gi ingest --type=issues`
+- `gi ingest --type=merge_requests`
+- `gi embed --all`
+- `gi embed --retry-failed`
```
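
The seven orchestration steps can be sketched as a single driver that releases the lock in `finally`, so a mid-run failure never leaves the lock held without a heartbeat. Every function name below is an illustrative stub, not an API defined by the spec:

```typescript
interface SyncOptions {
  embed: boolean; // --no-embed skips step 5
  docs: boolean;  // --no-docs skips step 4
}

// Each step is stubbed to record its name; real implementations live elsewhere.
type Step = (log: string[]) => Promise<void>;
const step = (name: string): Step => async (log) => { log.push(name); };

const acquireLockWithHeartbeat = step("lock");
const ingestDelta = step("ingest");
const applyBackfillWindow = step("backfill");
const regenerateDocuments = step("docs");
const embedChanged = step("embed");
const recordRunSucceeded = step("record");
const releaseLock = step("unlock");

async function syncAll(opts: SyncOptions): Promise<string[]> {
  const log: string[] = [];
  await acquireLockWithHeartbeat(log);              // 1. single-flight lock with heartbeat
  try {
    await ingestDelta(log);                         // 2. issues, MRs, discussions via cursors
    await applyBackfillWindow(log);                 // 3. rolling backfill re-scan
    if (opts.docs) await regenerateDocuments(log);  // 4. only changed entities
    if (opts.embed) await embedChanged(log);        // 5. only changed content_hash
    // 6. FTS5 triggers keep the index in sync; no explicit step needed
    await recordRunSucceeded(log);                  // 7. sync_runs audit trail
  } finally {
    await releaseLock(log);                         // lock released even if a step throws
  }
  return log;
}
```

The `try`/`finally` shape is the important part: it is what makes `gi sync --no-embed` and a crashed embed step leave the lock table in a recoverable state.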

---

## Change 10: Checkpoint Focus Sharpening

**Why this is better:** Makes each checkpoint's exit criteria crisper and reduces overlap.

```diff
@@ Checkpoint 0: Project Setup @@
-**Deliverable:** Scaffolded project with GitLab API connection verified
+**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working

**Scope:**
- Project structure (TypeScript, ESLint, Vitest)
- GitLab API client with PAT authentication
- Environment and project configuration
- Basic CLI scaffold with `auth-test` command
- `doctor` command for environment verification
-- Projects table and initial sync
-- Sync tracking for reliability
+- Projects table and initial project resolution (no issue/MR ingestion yet)
+- DB migrations + WAL + FK enforcement
+- Sync tracking with crash-safe single-flight lock
+- Rate limit handling with exponential backoff + jitter
```
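
The backoff behavior added to CP0 scope can be sketched as follows; the base delay, cap, and "full jitter" strategy are illustrative defaults rather than spec-mandated values:

```typescript
// Exponential backoff with full jitter for GitLab 429/5xx responses.
// exp doubles per attempt (1s, 2s, 4s, ...) up to a cap; the actual delay
// is drawn uniformly from [0, exp) so concurrent clients do not retry in lockstep.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * exp);
}

// A Retry-After header, when GitLab sends one, takes precedence over the
// computed backoff.
function retryDelayMs(attempt: number, retryAfterSeconds?: number): number {
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000;
  return backoffDelayMs(attempt);
}
```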

```diff
@@ Checkpoint 1 Deliverable @@
-**Deliverable:** All issues from target repos stored locally
+**Deliverable:** All issues + labels from target repos stored locally with resumable cursor-based sync
```

```diff
@@ Checkpoint 2 Deliverable @@
-**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context
+**Deliverable:** All MRs + discussions + notes (including flagged system notes) stored locally with full thread context and DiffNote file paths captured
```

---

## Change 11: Risk Mitigation Updates

```diff
@@ Risk Mitigation @@
| Risk | Mitigation |
|------|------------|
| GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
| Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
| SQLite scale limits | Monitor performance; Postgres migration path documented |
| Stale data | Incremental sync with change detection |
-| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
+| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
+| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
-| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) |
+| Concurrent sync corruption | DB-enforced app lock with heartbeat, automatic stale lock recovery |
+| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
```
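
The heartbeat-based stale-lock recovery in the table above boils down to one predicate: a `running` row only blocks a new sync while its heartbeat is fresh. The 90-second threshold and field names here are illustrative assumptions:

```typescript
// A live sync process refreshes heartbeatAt periodically; if the process
// crashes, the heartbeat goes stale and the next run may take over the lock.
const STALE_AFTER_MS = 90_000;

interface SyncRun {
  status: "running" | "succeeded" | "failed";
  heartbeatAt: number; // epoch ms of last heartbeat update
}

function canAcquireLock(existing: SyncRun | null, nowMs: number): boolean {
  if (existing === null || existing.status !== "running") return true;
  // Running, but heartbeat expired => previous process crashed; safe to recover.
  return nowMs - existing.heartbeatAt > STALE_AFTER_MS;
}
```

`gi sync --force` would simply bypass this predicate after operator review.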

---

## Change 12: Resolved Decisions Updates

```diff
@@ Resolved Decisions @@
| Question | Decision | Rationale |
|----------|----------|-----------|
| Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability |
-| System notes | **Exclude during ingestion** | System notes add noise without semantic value |
+| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise |
+| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature |
| MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature |
| Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering |
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient |
+| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
| Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
| Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context |
-| Embedding batching | **32 documents per batch** | Balance throughput and memory |
+| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation |
| FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
+| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
```
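
The "truncate middle" decision above keeps the head and tail of an over-long document so the earliest and latest notes both survive. A sketch, counting characters instead of model tokens for simplicity (the marker string is an illustrative assumption):

```typescript
// Keep the first and last portions of an over-budget document, dropping the
// middle and leaving an explicit marker so truncation is visible/debuggable.
function truncateMiddle(
  text: string,
  maxLen: number,
  marker = "\n[...truncated...]\n",
): string {
  if (text.length <= maxLen) return text;
  const keep = maxLen - marker.length; // budget left for real content
  const head = Math.ceil(keep / 2);    // first notes
  const tail = keep - head;            // last notes
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}
```

The real implementation would measure the 8000-token budget with the embedding model's tokenizer, but the head/marker/tail split is the same.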

---

## Summary of All Changes

| # | Change | Impact |
|---|--------|--------|
| 1 | Crash-safe heartbeat lock | Prevents race conditions, auto-recovers from crashes |
| 2 | Tuple cursor + rolling backfill | Dramatically reduces the risk of missed updates |
| 3 | project_id on raw_payloads | Defensive scoping for multi-project scenarios |
| 4 | Uniqueness on (project_id, iid) | Enables O(1) `gi show issue 123 --project=X` |
| 5 | Store system notes flagged + DiffNote paths | Preserves audit trail, enables immediate file search |
| 6 | Structured document header + truncation metadata | Better embeddings, debuggability |
| 7 | Embedding concurrency + per-doc errors | Makes 50-100K docs manageable |
| 8 | --project, --explain, stable JSON | Day-to-day UX and trust-building |
| 9 | `gi sync` orchestrator | Reduces human error |
| 10 | Checkpoint focus sharpening | Clearer exit criteria |
| 11-12 | Risk/Decisions updates | Documentation alignment |

**Net effect:** Same MVP product (semantic search over issues/MRs/discussions), but with production-grade hardening that prevents the class of bugs that typically kills MVPs in real-world use.