# SPEC.md Revision Document - Round 2

This document provides git-diff style changes for the second round of improvements from ChatGPT's review. These are primarily correctness fixes and optimizations.

---

## Change 1: Fix Tuple Cursor Correctness Gap (Cursor Rewind + Local Filtering)

**Why this is critical:** The spec specifies tuple cursor semantics `(updated_at, gitlab_id)`, but GitLab's API only supports `updated_after`, which is strictly "after" - it cannot express `WHERE updated_at = X AND id > Y` server-side. This creates a real risk of missed items on crash/resume and on dense timestamp buckets.

**Fix:** Cursor rewind + local filtering. Call GitLab with `updated_after = cursor_updated_at - rewindSeconds`, then locally discard items we've already processed.

```diff
@@ Correctness Rules (MVP): @@
 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
 2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
-   - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)`
+   - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
+   - Use **cursor rewind + local filtering**:
+     - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
+     - Locally discard items where:
+       - `updated_at < cursor_updated_at`, OR
+       - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
+   - This makes the tuple cursor rule true in practice while keeping API calls simple.
   - Cursor advances only after successful DB commit for that page
   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
```

```diff
@@ Configuration (MVP): @@
 "sync": {
   "backfillDays": 14,
   "staleLockMinutes": 10,
-  "heartbeatIntervalSeconds": 30
+  "heartbeatIntervalSeconds": 30,
+  "cursorRewindSeconds": 2
 },
```

---

## Change 2: Make App Lock Actually Safe (BEGIN IMMEDIATE CAS)

**Why this is critical:** `INSERT OR REPLACE` can overwrite an active lock if two processes start close together (both do the "stale check" outside a write transaction, then both `INSERT OR REPLACE`). SQLite's `BEGIN IMMEDIATE` provides a proper compare-and-swap.

```diff
@@ Reliability/Idempotency Rules: @@
 - Every ingest/sync creates a `sync_runs` row
 - Single-flight via DB-enforced app lock:
-  - On start: INSERT OR REPLACE lock row with new owner token
+  - On start: acquire lock via transactional compare-and-swap:
+    - `BEGIN IMMEDIATE` (acquires write lock immediately)
+    - If no row exists → INSERT new lock
+    - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
+    - Else if `owner` matches current run → UPDATE heartbeat (re-entrant)
+    - Else → ROLLBACK and fail fast (another run is active)
+    - `COMMIT`
   - During run: update `heartbeat_at` every 30 seconds
   - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire
   - `--force` remains as operator override for edge cases, but should rarely be needed
```

---

## Change 3: Dependent Resource Pagination + Bounded Concurrency

**Why this is important:** Discussions endpoints are paginated on many GitLab instances. Without pagination, we silently lose data. Without bounded concurrency, initial sync can become unstable (429s, long-tail retries).
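The bounded-concurrency requirement above can be sketched as a small worker pool. This is a minimal illustration, not part of the spec; `mapWithConcurrency` is a hypothetical helper, and the `limit` argument would come from the `primaryConcurrency` / `dependentConcurrency` config keys defined in the diffs that follow.

```typescript
// Minimal bounded-concurrency pool (hypothetical helper, illustration only).
// Runs at most `limit` invocations of `fn` at a time, preserving result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each worker repeatedly claims the next index until all items are taken.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

For dependent resources, the per-parent discussion fetches would be wrapped as something like `mapWithConcurrency(parents, 2, fetchAllDiscussionPages)`, keeping the discussions fan-out below the rate limiter's threshold.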
```diff
@@ Dependent Resources (Per-Parent Fetch): @@
-GET /projects/:id/issues/:iid/discussions
-GET /projects/:id/merge_requests/:iid/discussions
+GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
+GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
+
+**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent.
```

```diff
@@ Rate Limiting: @@
 - Default: 10 requests/second with exponential backoff
 - Respect `Retry-After` headers on 429 responses
 - Add jitter to avoid thundering herd on retry
+- **Separate concurrency limits:**
+  - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4)
+  - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s)
+  - Bound concurrency per-project to avoid one repo starving the other
 - Initial sync estimate: 10-20 minutes depending on rate limits
```

```diff
@@ Configuration (MVP): @@
 "sync": {
   "backfillDays": 14,
   "staleLockMinutes": 10,
   "heartbeatIntervalSeconds": 30,
-  "cursorRewindSeconds": 2
+  "cursorRewindSeconds": 2,
+  "primaryConcurrency": 4,
+  "dependentConcurrency": 2
 },
```

---

## Change 4: Track last_seen_at for Eventual Consistency Debugging

**Why this is valuable:** Even without implementing deletions, you want to know: (a) whether a record is actively refreshed under backfill/sync, (b) whether a sync run is "covering" the dataset, and (c) whether a particular item hasn't been seen in months (helps diagnose missed updates).
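The key property of `last_seen_at` is that it is refreshed on *every* upsert, even when no other column changes. The toy in-memory model below illustrates that semantics; it is not the real implementation (which would be a SQLite `INSERT ... ON CONFLICT DO UPDATE`), and the field names follow the schema diffs in this change.

```typescript
// Toy in-memory model of the upsert semantics for last_seen_at (illustration only;
// the real thing is SQLite: INSERT ... ON CONFLICT(gitlab_id) DO UPDATE SET ..., last_seen_at = ?).
interface IssueRow {
  gitlab_id: number;
  title: string;
  updated_at: number;
  last_seen_at: number; // touched on EVERY upsert, even when nothing else changed
}

function upsertIssue(
  table: Map<number, IssueRow>,
  incoming: { gitlab_id: number; title: string; updated_at: number },
  now: number,
): void {
  const existing = table.get(incoming.gitlab_id);
  table.set(incoming.gitlab_id, {
    ...(existing ?? {}),
    ...incoming,
    last_seen_at: now, // refreshed unconditionally so sync coverage is observable
  });
}
```

With this rule, `MAX(last_seen_at)` per project shows whether a sync run covered the dataset, and a stale `last_seen_at` on one row flags an item that updates are no longer reaching.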
```diff
@@ Schema Preview - issues: @@
 CREATE TABLE issues (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   title TEXT,
   description TEXT,
   state TEXT,
   author_username TEXT,
   created_at INTEGER,
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   web_url TEXT,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
```

```diff
@@ Schema Additions - merge_requests: @@
 CREATE TABLE merge_requests (
@@
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   merged_at INTEGER,
@@
 );
```

```diff
@@ Schema Additions - discussions: @@
 CREATE TABLE discussions (
@@
   last_note_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   resolvable BOOLEAN,
@@
 );
```

```diff
@@ Schema Additions - notes: @@
 CREATE TABLE notes (
@@
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   position INTEGER,
@@
 );
```

---

## Change 5: Raw Payload Compression

**Why this is valuable:** At 50-100K documents plus threaded discussions, raw JSON is likely the largest storage consumer. Supporting gzip compression reduces DB size while preserving replay capability.

```diff
@@ Schema (Checkpoint 0) - raw_payloads: @@
 CREATE TABLE raw_payloads (
   id INTEGER PRIMARY KEY,
   source TEXT NOT NULL, -- 'gitlab'
   project_id INTEGER REFERENCES projects(id),
   resource_type TEXT NOT NULL,
   gitlab_id INTEGER NOT NULL,
   fetched_at INTEGER NOT NULL,
-  json TEXT NOT NULL
+  content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
+  payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON
 );
```

```diff
@@ Configuration (MVP): @@
+  "storage": {
+    "compressRawPayloads": true -- gzip raw payloads to reduce DB size
+  },
```

---

## Change 6: Scope Discussions Unique by Project

**Why this is important:** `gitlab_discussion_id TEXT UNIQUE` assumes global uniqueness across all projects. While likely true for GitLab, it's safer to scope by `project_id`. This avoids rare but painful collisions and makes it easier to support more repos later.

```diff
@@ Schema Additions - discussions: @@
 CREATE TABLE discussions (
   id INTEGER PRIMARY KEY,
-  gitlab_discussion_id TEXT UNIQUE NOT NULL,
+  gitlab_discussion_id TEXT NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
@@
 );
+CREATE UNIQUE INDEX uq_discussions_project_discussion_id
+  ON discussions(project_id, gitlab_discussion_id);
```

---

## Change 7: Dirty Queue for Document Regeneration

**Why this is valuable:** The orchestration says "Regenerate documents for changed entities" but doesn't define how "changed" is computed without scanning large tables. A dirty queue populated during ingestion makes doc regen deterministic and fast.

```diff
@@ Schema Additions (Checkpoint 3): @@
+-- Track sources that require document regeneration (populated during ingestion)
+CREATE TABLE dirty_sources (
+  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
+  source_id INTEGER NOT NULL, -- local DB id
+  queued_at INTEGER NOT NULL,
+  PRIMARY KEY(source_type, source_id)
+);
```

```diff
@@ Orchestration steps (in order): @@
 1. Acquire app lock with heartbeat
 2. Ingest delta (issues, MRs, discussions) based on cursors
+   - During ingestion, INSERT into dirty_sources for each upserted entity
 3. Apply rolling backfill window
-4. Regenerate documents for changed entities
+4. Regenerate documents for entities in dirty_sources (process + delete from queue)
 5. Embed documents with changed content_hash
 6. FTS triggers auto-sync (no explicit step needed)
 7. Release lock, record sync_run as succeeded
```

---

## Change 8: `document_paths` + `--path` Filter

**Why this is high value:** We're already capturing DiffNote file paths in CP2. Adding a `--path` filter now makes the MVP dramatically more compelling for engineers who search by file path constantly.
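The MVP matching rule for `--path` (substring/exact, globs deferred) can be sketched as a pure predicate. This is an illustration only; `matchesPathFilter` is a hypothetical helper, and `docPaths` stands in for the rows joined from the `document_paths` table this change introduces.

```typescript
// MVP --path semantics: a document matches when any of its referenced
// file paths contains the filter as a substring (exact match included).
function matchesPathFilter(docPaths: string[], filter: string): boolean {
  // includes() covers exact match as a special case; glob patterns are deferred.
  return docPaths.some((p) => p.includes(filter));
}
```

Substring matching is what makes `--path=src/auth/` behave like a directory filter without any glob machinery.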
```diff
@@ Checkpoint 4 Scope: @@
-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`
+  - `--path` filters documents by referenced file paths (from DiffNote positions)
+  - MVP: substring/exact match; glob patterns deferred
```

```diff
@@ Schema Additions (Checkpoint 3): @@
+-- Fast path filtering for documents (extracted from DiffNote positions)
+CREATE TABLE document_paths (
+  document_id INTEGER NOT NULL REFERENCES documents(id),
+  path TEXT NOT NULL,
+  PRIMARY KEY(document_id, path)
+);
+CREATE INDEX idx_document_paths_path ON document_paths(path);
```

```diff
@@ CLI Interface: @@
 # Search within specific project
 gi search "authentication" --project=group/project-one

+# Search by file path (finds discussions/MRs touching this file)
+gi search "rate limit" --path=src/client.ts
+
 # Pure FTS search (fallback if embeddings unavailable)
 gi search "redis" --mode=lexical
```

```diff
@@ Manual CLI Smoke Tests: @@
+| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
```

---

## Change 9: Character-Based Truncation (Not Exact Tokens)

**Why this is practical:** "8000 tokens" sounds precise, but tokenizers vary. Exact token counting adds dependency complexity. A conservative character budget is simpler and avoids false precision.
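The character-budget rule combines with the middle-drop strategy specified in the diffs that follow. A minimal sketch, assuming the 4 chars/token estimate and an illustrative truncation marker (names like `truncateMiddle` and `MAX_CHARS` are hypothetical, not from the spec):

```typescript
// Character-budget truncation with middle drop (illustrative names).
const MAX_CHARS = 32000; // ~8000 tokens at a conservative 4 chars/token

interface TruncationResult {
  text: string;
  isTruncated: boolean;
  approxTokens: number; // ceil(charCount / 4), for logging only, never enforced
}

function truncateMiddle(text: string, maxChars: number = MAX_CHARS): TruncationResult {
  const approxTokens = Math.ceil(text.length / 4);
  if (text.length <= maxChars) {
    return { text, isTruncated: false, approxTokens };
  }
  // Keep the head and tail (first + last notes carry the most context);
  // drop the middle and mark the gap explicitly.
  const marker = "\n[...truncated...]\n";
  const keep = maxChars - marker.length;
  const head = text.slice(0, Math.ceil(keep / 2));
  const tail = text.slice(text.length - Math.floor(keep / 2));
  return { text: head + marker + tail, isTruncated: true, approxTokens };
}
```

The `approxTokens` value is what would be logged alongside `truncated_reason = 'token_limit_middle_drop'`; only `maxChars` is ever enforced.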
```diff
@@ Document Extraction Rules: @@
 - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
+  - **Implementation:** Use character budget, not exact token count
+    - `maxChars = 32000` (conservative 4 chars/token estimate)
+    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
+    - This avoids tokenizer dependency while preventing embedding failures
```

```diff
@@ Truncation: @@
 If content exceeds 8000 tokens:
+**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`.
+
 1. Truncate from the middle (preserve first + last notes for context)
 2. Set `documents.is_truncated = 1`
 3. Set `documents.truncated_reason = 'token_limit_middle_drop'`
 4. Log a warning with document ID and original token count
```

---

## Change 10: Move Lexical Search to CP3 (Reorder, Not New Scope)

**Why this is better:** Nothing "search-like" exists until CP4, but FTS5 is already a dependency for graceful degradation. Moving FTS setup to CP3 (when documents exist) gives an earlier usable artifact and better validation. CP4 becomes a "hybrid ranking upgrade."
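Since CP4 becomes the hybrid ranking upgrade, the Reciprocal Rank Fusion merge it relies on is worth pinning down. A minimal sketch over two ranked lists of document ids (`rrfMerge` is an illustrative name; `k = 60` is the constant commonly used with RRF):

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
// and documents are reordered by their summed score.
function rrfMerge(vectorRanked: number[], lexicalRanked: number[], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const list of [vectorRanked, lexicalRanked]) {
    list.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

Because RRF only consumes ranks, the vector and FTS5 recall lists never need score normalization against each other, which is why it suits this hybrid setup.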
```diff
@@ Checkpoint 3: Embedding Generation @@
-### Checkpoint 3: Embedding Generation
-**Deliverable:** Vector embeddings generated for all text content
+### Checkpoint 3: Document + Embedding Generation with Lexical Search
+**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end
```

```diff
@@ Checkpoint 3 Scope: @@
 - Ollama integration (nomic-embed-text model)
 - Embedding generation pipeline:
@@
 - Fast label filtering via `document_labels` join table
+- FTS5 index for lexical search (moved from CP4)
+- `gi search --mode=lexical` CLI command (works without Ollama)
```

```diff
@@ Checkpoint 3 Manual CLI Smoke Tests: @@
+| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required |
```

```diff
@@ Checkpoint 4: Semantic Search @@
-### Checkpoint 4: Semantic Search
-**Deliverable:** Working semantic search across all indexed content
+### Checkpoint 4: Hybrid Search (Semantic + Lexical)
+**Deliverable:** Working hybrid semantic search (vector + FTS5 + RRF) across all indexed content
```

```diff
@@ Checkpoint 4 Scope: @@
 **Scope:**
 - Hybrid retrieval:
   - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
   - Merge + rerank results using Reciprocal Rank Fusion (RRF)
+- Query embedding generation (same Ollama pipeline as documents)
 - Result ranking and scoring (document-level)
-- Search filters: ...
+- Filters work identically in hybrid and lexical modes
```

---

## Summary of All Changes (Round 2)

| # | Change | Impact |
|---|--------|--------|
| 1 | **Cursor rewind + local filtering** | Fixes real correctness gap in tuple cursor implementation |
| 2 | **BEGIN IMMEDIATE CAS for lock** | Prevents race condition in lock acquisition |
| 3 | **Discussions pagination + concurrency** | Prevents silent data loss on large discussion threads |
| 4 | **last_seen_at columns** | Enables debugging of sync coverage without deletions |
| 5 | **Raw payload compression** | Reduces DB size significantly at scale |
| 6 | **Scope discussions unique by project** | Defensive uniqueness for multi-project safety |
| 7 | **Dirty queue for doc regen** | Makes document regeneration deterministic and fast |
| 8 | **document_paths + --path filter** | High-value file search with minimal scope |
| 9 | **Character-based truncation** | Practical implementation without tokenizer dependency |
| 10 | **Lexical search in CP3** | Earlier usable artifact; better checkpoint validation |

**Net effect:** These changes fix several correctness gaps (cursor, lock, pagination) while adding high-value features (--path filter) and operational improvements (compression, dirty queue, last_seen_at).
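As a closing illustration of the highest-impact fix (Change 1's cursor rewind + local filtering), the discard rule reduces to a pure predicate applied to every fetched item. Field names here are illustrative, not from the spec:

```typescript
// Applied after fetching with updated_after = cursorUpdatedAt - rewindSeconds.
// Returns true when the item was already processed and must be discarded.
interface Cursor {
  updatedAt: number; // epoch seconds
  gitlabId: number;
}

function alreadyProcessed(
  item: { updatedAt: number; gitlabId: number },
  cursor: Cursor,
): boolean {
  if (item.updatedAt < cursor.updatedAt) return true; // strictly before the cursor
  if (item.updatedAt === cursor.updatedAt && item.gitlabId <= cursor.gitlabId) return true; // same bucket, already seen
  return false;
}
```

Everything the rewind window re-fetches but `alreadyProcessed` rejects is exactly the set the API's `updated_after` parameter could not exclude server-side.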