# SPEC.md Revision Document

This document provides git-diff style changes to integrate improvements from ChatGPT's review into the original SPEC.md. The goal is a "best of all worlds" hybrid that keeps the original architecture while adding production-grade hardening.

---

## Change 1: Crash-safe Single-flight with Heartbeat Lock

**Why this is better:** The original plan's single-flight protection is policy-based, not DB-enforced. A race condition exists where two processes could both start before either writes to `sync_runs`. The heartbeat approach provides DB-enforced atomicity, automatic crash recovery, and less manual intervention.

```diff
@@ Schema (Checkpoint 0): @@
 CREATE TABLE sync_runs (
   id INTEGER PRIMARY KEY,
   started_at INTEGER NOT NULL,
+  heartbeat_at INTEGER NOT NULL,
   finished_at INTEGER,
   status TEXT NOT NULL,   -- 'running' | 'succeeded' | 'failed'
   command TEXT NOT NULL,  -- 'ingest issues' | 'sync' | etc.
   error TEXT
 );

+-- Crash-safe single-flight lock (DB-enforced)
+CREATE TABLE app_locks (
+  name TEXT PRIMARY KEY,          -- 'sync'
+  owner TEXT NOT NULL,            -- random run token (UUIDv4)
+  acquired_at INTEGER NOT NULL,
+  heartbeat_at INTEGER NOT NULL
+);
```

```diff
@@ Checkpoint 0: Project Setup - Scope @@
 **Scope:**
 - Project structure (TypeScript, ESLint, Vitest)
 - GitLab API client with PAT authentication
 - Environment and project configuration
 - Basic CLI scaffold with `auth-test` command
 - `doctor` command for environment verification
-- Projects table and initial sync
+- Projects table and initial project resolution (no issue/MR ingestion yet)
+- DB migrations + WAL + FK + app lock primitives
+- Crash-safe single-flight lock with heartbeat
```

```diff
@@ Reliability/Idempotency Rules: @@
 - Every ingest/sync creates a `sync_runs` row
-- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
+- Single-flight: acquire `app_locks('sync')` before starting
+  - On start: attempt to INSERT the lock row with a new owner token; if a row
+    already exists, take it over (UPDATE with the new owner) only when it is stale
+  - During run: update `heartbeat_at` every 30 seconds
+  - If the existing lock's `heartbeat_at` is stale (> 10 minutes), treat it as abandoned and acquire
+  - `--force` remains as an operator override for edge cases, but should rarely be needed
 - Cursor advances only after successful transaction commit per page/batch
 - Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
 - Use explicit transactions for batch inserts
```

```diff
@@ Configuration (MVP): @@
 // gi.config.json
 {
   "gitlab": {
     "baseUrl": "https://gitlab.example.com",
     "tokenEnvVar": "GITLAB_TOKEN"
   },
   "projects": [
     { "path": "group/project-one" },
     { "path": "group/project-two" }
   ],
+  "sync": {
+    "backfillDays": 14,
+    "staleLockMinutes": 10,
+    "heartbeatIntervalSeconds": 30
+  },
   "embedding": {
     "provider": "ollama",
     "model": "nomic-embed-text",
-    "baseUrl": "http://localhost:11434"
+    "baseUrl": "http://localhost:11434",
+    "concurrency": 4
   }
 }
```

---

## Change 2: Harden Cursor Semantics + Rolling Backfill Window

**Why this is better:** The original plan's "critical assumption" that comments update the parent's `updated_at` is mostly true, but the failure mode is catastrophic (silently missing new discussion content). The rolling backfill provides a safety net without requiring weekly full resyncs.

```diff
@@ GitLab API Strategy - Critical Assumption @@
-### Critical Assumption
-
-**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.
-
-Mitigation: Periodic full re-sync (weekly) as a safety net.
+### Critical Assumption (Softened)
+
+We *expect* that adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively.
+
+**Mitigations (MVP):**
+1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing the cursor.
+2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
+3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`).
+
+The backfill window provides 80% of the safety of a full resync at <5% of the API cost.
```

```diff
@@ Checkpoint 5: Incremental Sync - Scope @@
 **Scope:**
-- Delta sync based on stable cursor (updated_at + tie-breaker id)
+- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
+- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
 - Dependent resources sync strategy (discussions refetched when parent updates)
 - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
 - Sync status reporting
 - Recommended: run via cron every 10 minutes
```

```diff
@@ Correctness Rules (MVP): @@
-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
-2. Cursor advances only after successful DB commit for that page
+1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
+2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
+   - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)`
+   - Cursor advances only after successful DB commit for that page
+   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
 3. Dependent resources:
    - For each updated issue/MR, refetch ALL its discussions
    - Discussion documents are regenerated and re-embedded if content_hash changes
-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
-5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
+4. Rolling backfill window:
+   - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
+   - This catches any items whose timestamps were updated without triggering our cursor
+5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
+6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
```

---

## Change 3: Raw Payload Scoping + project_id

**Why this is better:** The original `raw_payloads(resource_type, gitlab_id)` index could have collisions in edge cases (especially if later adding more projects or resource types). Adding `project_id` is defensive and enables project-scoped lookups.

```diff
@@ Schema (Checkpoint 0) - raw_payloads @@
 CREATE TABLE raw_payloads (
   id INTEGER PRIMARY KEY,
   source TEXT NOT NULL,        -- 'gitlab'
+  project_id INTEGER REFERENCES projects(id),  -- nullable for instance-level resources
   resource_type TEXT NOT NULL, -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
   gitlab_id INTEGER NOT NULL,
   fetched_at INTEGER NOT NULL,
   json TEXT NOT NULL
 );

-CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
+CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
+CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);
```

---

## Change 4: Tighten Uniqueness Constraints (project_id + iid)

**Why this is better:** Users think in terms of "issue 123 in project X," not global IDs. This enables O(1) `gi show issue 123 --project=X` and prevents subtle ingestion bugs from creating duplicate rows.
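To make the lookup concrete: user-facing references resolve to the `(project path, iid)` pair that the new unique index serves directly. A minimal sketch, assuming a hypothetical `parseIssueRef` helper and a `group/project#123` ref format (neither is part of the spec):

```typescript
// Hypothetical helper: parse a "group/project#123"-style reference into the
// (project path, iid) pair that uq_issues_project_iid makes an O(1) lookup key.
interface IssueRef {
  project: string; // maps to projects.path
  iid: number;     // maps to issues.iid
}

function parseIssueRef(ref: string): IssueRef | null {
  const match = /^(.+)#(\d+)$/.exec(ref.trim());
  if (!match) return null;
  return { project: match[1], iid: Number(match[2]) };
}
```

With the unique index in place, the resolved key binds straight to `WHERE project_id = ? AND iid = ?` with no scan and no reliance on global `gitlab_id`.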
```diff
@@ Schema Preview - issues @@
 CREATE TABLE issues (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   title TEXT,
   description TEXT,
   state TEXT,
   author_username TEXT,
   created_at INTEGER,
   updated_at INTEGER,
   web_url TEXT,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );

 CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
 CREATE INDEX idx_issues_author ON issues(author_username);
+CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);
```

```diff
@@ Schema Additions - merge_requests @@
 CREATE TABLE merge_requests (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   ...
 );

 CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
 CREATE INDEX idx_mrs_author ON merge_requests(author_username);
+CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);
```

---

## Change 5: Store System Notes (Flagged) + Capture DiffNote Paths

**Why this is better:** Two problems with dropping system notes entirely: (1) Some system notes carry decision-trace context ("marked as resolved", "changed milestone"). (2) File/path search is disproportionately valuable for engineers. DiffNote positions already contain path metadata - capturing it now enables immediate filename search.
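A minimal sketch of the note transform this change implies, assuming the note shape documented by the GitLab discussions API (`system`, plus a `position` object with `old_path`/`new_path`/`old_line`/`new_line` on DiffNotes); the `toNoteRow` name and row shape are illustrative, not part of the spec:

```typescript
// Subset of a GitLab note payload relevant to this change (assumed shape).
interface GitLabNote {
  id: number;
  system: boolean;
  body: string;
  position?: {
    old_path?: string | null;
    new_path?: string | null;
    old_line?: number | null;
    new_line?: number | null;
  };
}

// Illustrative normalized row: system notes are kept and flagged,
// DiffNote path/line metadata lands in queryable columns.
interface NoteRow {
  gitlab_id: number;
  is_system: 0 | 1;
  body: string;
  position_old_path: string | null;
  position_new_path: string | null;
  position_old_line: number | null;
  position_new_line: number | null;
}

function toNoteRow(n: GitLabNote): NoteRow {
  return {
    gitlab_id: n.id,
    is_system: n.system ? 1 : 0,
    body: n.body,
    position_old_path: n.position?.old_path ?? null,
    position_new_path: n.position?.new_path ?? null,
    position_old_line: n.position?.old_line ?? null,
    position_new_line: n.position?.new_line ?? null,
  };
}
```

The key property: nothing is dropped at ingest time; noise control happens later, at document extraction, where `is_system = 1` rows are skipped by default.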
```diff
@@ Checkpoint 2 Scope @@
 - Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
   - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
   - During initial ingest: fetch discussions for every issue/MR
   - During sync: refetch discussions only for issues/MRs updated since cursor
-  - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
+  - Preserve system notes but flag them with `is_system=1`; exclude from embeddings by default
+  - Capture DiffNote file path/line metadata from `position` field for immediate filename search value
```

```diff
@@ Schema Additions - notes @@
 CREATE TABLE notes (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   discussion_id INTEGER NOT NULL REFERENCES discussions(id),
   project_id INTEGER NOT NULL REFERENCES projects(id),
   type TEXT,  -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
+  is_system BOOLEAN NOT NULL DEFAULT 0,  -- system notes (assignments, label changes, etc.)
   author_username TEXT,
   body TEXT,
   created_at INTEGER,
   updated_at INTEGER,
   position INTEGER,  -- derived from array order in API response (0-indexed)
   resolvable BOOLEAN,
   resolved BOOLEAN,
   resolved_by TEXT,
   resolved_at INTEGER,
+  -- DiffNote position metadata (nullable, from GitLab API position object)
+  position_old_path TEXT,
+  position_new_path TEXT,
+  position_old_line INTEGER,
+  position_new_line INTEGER,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );

 CREATE INDEX idx_notes_discussion ON notes(discussion_id);
 CREATE INDEX idx_notes_author ON notes(author_username);
 CREATE INDEX idx_notes_type ON notes(type);
+CREATE INDEX idx_notes_system ON notes(is_system);
+CREATE INDEX idx_notes_new_path ON notes(position_new_path);
```

```diff
@@ Discussion Processing Rules @@
-- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
+- System notes (`system: true`) are ingested with `notes.is_system=1`
+  - Excluded from document extraction/embeddings by default (reduces noise in semantic search)
+  - Preserved for audit trail, timeline views, and potential future decision-tracing features
+  - Can be toggled via `--include-system-notes` flag if needed
+- DiffNote position data is extracted and stored:
+  - `position.old_path`, `position.new_path` for file-level search
+  - `position.old_line`, `position.new_line` for line-level context
 - Each discussion from the API becomes one row in `discussions` table
 - All notes within a discussion are stored with their `discussion_id` foreign key
 - `individual_note: true` discussions have exactly one note (standalone comment)
 - `individual_note: false` discussions have multiple notes (threaded conversation)
```

```diff
@@ Checkpoint 2 Automated Tests @@
 tests/unit/discussion-transformer.test.ts
   - transforms discussion payload to normalized schema
   - extracts notes array from discussion
   - sets individual_note flag correctly
-  - filters out system notes (system: true)
+  - flags system notes with is_system=1
+  - extracts DiffNote position metadata (paths and lines)
   - preserves note order via position field

 tests/integration/discussion-ingestion.test.ts
   - fetches discussions for each issue
   - fetches discussions for each MR
   - creates discussion rows with correct parent FK
   - creates note rows linked to discussions
-  - excludes system notes from storage
+  - stores system notes with is_system=1 flag
+  - extracts position_new_path from DiffNotes
   - captures note-level resolution status
   - captures note type (DiscussionNote, DiffNote)
```

```diff
@@ Checkpoint 2 Data Integrity Checks @@
 - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
 - [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
 - [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
-- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
+- [ ] System notes have `is_system=1` flag set correctly
+- [ ] DiffNotes have `position_new_path` populated when available
 - [ ] Every discussion has at least one note
 - [ ] `individual_note = true` discussions have exactly one note
 - [ ] Discussion `first_note_at` <= `last_note_at` for all rows
```

---

## Change 6: Document Extraction Structured Header + Truncation Metadata

**Why this is better:** Adding a deterministic header improves search snippets (more informative), embeddings (the model gets stable context), and debuggability (see if/why truncation happened).

```diff
@@ Schema Additions - documents @@
 CREATE TABLE documents (
   id INTEGER PRIMARY KEY,
   source_type TEXT NOT NULL,   -- 'issue' | 'merge_request' | 'discussion'
   source_id INTEGER NOT NULL,  -- local DB id in the source table
   project_id INTEGER NOT NULL REFERENCES projects(id),
   author_username TEXT,        -- for discussions: first note author
   label_names TEXT,            -- JSON array (display/debug only)
   created_at INTEGER,
   updated_at INTEGER,
   url TEXT,
   title TEXT,                  -- null for discussions
   content_text TEXT NOT NULL,  -- canonical text for embedding/snippets
   content_hash TEXT NOT NULL,  -- SHA-256 for change detection
+  is_truncated BOOLEAN NOT NULL DEFAULT 0,
+  truncated_reason TEXT,       -- 'token_limit_middle_drop' | null
   UNIQUE(source_type, source_id)
 );
```

```diff
@@ Discussion Document Format @@
-[Issue #234: Authentication redesign] Discussion
+[[Discussion]] Issue #234: Authentication redesign
+Project: group/project-one
+URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
+Labels: ["bug", "auth"]
+Files: ["src/auth/login.ts"]  -- present if any DiffNotes exist in thread
+
+--- Thread ---

 @johndoe (2024-03-15): I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...

 @janedoe (2024-03-15): Agreed. What about refresh token strategy?

 @johndoe (2024-03-16): Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```

````diff
@@ Document Extraction Rules @@
 | Source | content_text Construction |
 |--------|--------------------------|
-| Issue | `title + "\n\n" + description` |
-| MR | `title + "\n\n" + description` |
+| Issue | Structured header + `title + "\n\n" + description` |
+| MR | Structured header + `title + "\n\n" + description` |
 | Discussion | Full thread with context (see below) |

+**Structured Header Format (all document types):**
+```
+[[{SourceType}]] {Title}
+Project: {path_with_namespace}
+URL: {web_url}
+Labels: {JSON array of label names}
+Files: {JSON array of paths from DiffNotes, if any}
+
+--- Content ---
+```
+
+This format provides:
+- Stable, parseable context for embeddings
+- Consistent snippet formatting in search results
+- File path context without full file-history feature
````

```diff
@@ Truncation @@
-**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning.
+**Truncation:**
+If content exceeds 8000 tokens:
+1. Truncate from the middle (preserve first + last notes for context)
+2. Set `documents.is_truncated = 1`
+3. Set `documents.truncated_reason = 'token_limit_middle_drop'`
+4. Log a warning with document ID and original token count
+
+This metadata enables:
+- Monitoring truncation frequency in production
+- Future investigation of high-value truncated documents
+- Debugging when search misses expected content
```

---

## Change 7: Embedding Pipeline Concurrency + Per-Document Error Tracking

**Why this is better:** For 50-100K documents, embedding is the longest pole. Controlled concurrency (4-8 workers) saturates local inference without OOM. Per-document error tracking prevents single bad payloads from stalling "100% coverage" and enables targeted re-runs.
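A minimal sketch of the worker-pool behaviour described here: fixed concurrency, per-item retry with exponential backoff, and failures recorded instead of thrown so one bad document cannot stall the run. All names (`mapWithConcurrency`, the outcome shape) are illustrative, not the spec's API:

```typescript
type EmbedOutcome<R> = { ok: true; value: R } | { ok: false; error: string };

// Run fn over items with at most `limit` concurrent workers; retry each item
// up to 3 times with exponential backoff, recording the last error on failure.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
  baseDelayMs = 10,
): Promise<EmbedOutcome<R>[]> {
  const results: EmbedOutcome<R>[] = new Array(items.length);
  let next = 0; // shared work counter; safe because JS is single-threaded

  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      let lastError = "";
      for (let attempt = 1; attempt <= 3; attempt++) {
        try {
          results[i] = { ok: true, value: await fn(items[i]) };
          lastError = "";
          break;
        } catch (err) {
          lastError = String(err);
          if (attempt < 3) {
            // exponential backoff before the next attempt
            await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
          }
        }
      }
      if (lastError !== "") results[i] = { ok: false, error: lastError };
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Failed outcomes map naturally onto the `last_error`/`attempt_count` columns added to `embedding_metadata` below, so a `--retry-failed` pass can target exactly the failed rows.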
```diff
@@ Checkpoint 3: Embedding Generation - Scope @@
 **Scope:**
 - Ollama integration (nomic-embed-text model)
-- Embedding generation pipeline (batch processing, 32 documents per batch)
+- Embedding generation pipeline:
+  - Batch size: 32 documents per batch
+  - Concurrency: configurable (default 4 workers)
+  - Retry with exponential backoff for transient failures (max 3 attempts)
+  - Per-document failure recording to enable targeted re-runs
 - Vector storage in SQLite (sqlite-vss extension)
 - Progress tracking and resumability
 - Document extraction layer:
```

```diff
@@ Schema Additions - embedding_metadata @@
 CREATE TABLE embedding_metadata (
   document_id INTEGER PRIMARY KEY REFERENCES documents(id),
   model TEXT NOT NULL,        -- 'nomic-embed-text'
   dims INTEGER NOT NULL,      -- 768
   content_hash TEXT NOT NULL, -- copied from documents.content_hash
-  created_at INTEGER NOT NULL
+  created_at INTEGER NOT NULL,
+  -- Error tracking for resumable embedding
+  last_error TEXT,                   -- error message from last failed attempt
+  attempt_count INTEGER NOT NULL DEFAULT 0,
+  last_attempt_at INTEGER            -- when last attempt occurred
 );

+-- Index for finding failed embeddings to retry
+CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
```

```diff
@@ Checkpoint 3 Automated Tests @@
 tests/integration/embedding-storage.test.ts
   - stores embedding in sqlite-vss
   - embedding rowid matches document id
   - creates embedding_metadata record
   - skips re-embedding when content_hash unchanged
   - re-embeds when content_hash changes
+  - records error in embedding_metadata on failure
+  - increments attempt_count on each retry
+  - clears last_error on successful embedding
+  - respects concurrency limit
```

```diff
@@ Checkpoint 3 Manual CLI Smoke Tests @@
 | Command | Expected Output | Pass Criteria |
 |---------|-----------------|---------------|
 | `gi embed --all` | Progress bar with ETA | Completes without error |
 | `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
+| `gi embed --retry-failed` | Progress on failed docs | Re-attempts previously failed embeddings |
 | `gi stats` | Embedding coverage stats | Shows 100% coverage |
 | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
 | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |

+**Stats output should include:**
+- Total documents
+- Successfully embedded
+- Failed (with error breakdown)
+- Pending (never attempted)
```

---

## Change 8: Search UX Improvements (--project, --explain, Stable JSON Schema)

**Why this is better:** For day-to-day use, "search across everything" is less useful than "search within repo X." The `--explain` flag helps validate ranking during the MVP. A stable JSON schema prevents accidental breaking changes for agent/MCP consumption.

```diff
@@ Checkpoint 4 Scope @@
-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`
+- Debug: `--explain` returns rank contributions from vector + FTS + RRF
 - Label filtering operates on `document_labels` (indexed, exact-match)
 - Output formatting: ranked list with title, snippet, score, URL
-- JSON output mode for AI agent consumption
+- JSON output mode for AI/agent consumption (stable schema, documented)
 - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning
```

```diff
@@ CLI Interface @@
 # Basic semantic search
 gi search "why did we choose Redis"

+# Search within specific project
+gi search "authentication" --project=group/project-one
+
 # Pure FTS search (fallback if embeddings unavailable)
 gi search "redis" --mode=lexical

 # Filtered search
 gi search "authentication" --type=mr --after=2024-01-01

 # Filter by label
 gi search "performance" --label=bug --label=critical

 # JSON output for programmatic use
 gi search "payment processing" --json
+
+# Debug ranking (shows how each retriever contributed)
+gi search "authentication" --explain
```

````diff
@@ JSON Output Schema (NEW SECTION) @@
+**JSON Output Schema (Stable)**
+
+For AI/agent consumption, `--json` output follows this stable schema:
+
+```typescript
+interface SearchResult {
+  documentId: number;
+  sourceType: "issue" | "merge_request" | "discussion";
+  title: string | null;
+  url: string;
+  projectPath: string;
+  author: string | null;
+  createdAt: string;   // ISO 8601
+  updatedAt: string;   // ISO 8601
+  score: number;       // 0-1 normalized RRF score
+  snippet: string;     // truncated content_text
+  labels: string[];
+  // Only present with --explain flag
+  explain?: {
+    vectorRank?: number;  // absent if not in vector results
+    ftsRank?: number;     // absent if not in FTS results
+    rrfScore: number;
+  };
+}
+
+interface SearchResponse {
+  query: string;
+  mode: "hybrid" | "lexical" | "semantic";
+  totalResults: number;
+  results: SearchResult[];
+  warnings?: string[];  // e.g., "Embedding service unavailable"
+}
+```
+
+**Schema versioning:** Breaking changes require a major version bump in the CLI. Non-breaking additions (new optional fields) are allowed.
````

```diff
@@ Checkpoint 4 Manual CLI Smoke Tests @@
 | Command | Expected Output | Pass Criteria |
 |---------|-----------------|---------------|
 | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
+| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
 | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
 | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
 | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
 | `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
 | `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
 | `gi search "authentication" --json` | JSON output | Valid JSON matching schema |
+| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
 | `gi search "xyznonexistent123"` | No results message | Graceful empty state |
 | `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
```

---

## Change 9: Make `gi sync` an Orchestrator

**Why this is better:** Once CP3+ exist, operators want one command that does the right thing. The most common MVP failure is "I ingested but forgot to regenerate docs / embed / update FTS."

````diff
@@ Checkpoint 5 CLI Commands @@
 ```bash
-# Full sync (respects cursors, only fetches new/updated)
-gi sync
+# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
+gi sync            # orchestrates all steps
+gi sync --no-embed # skip embedding step (fast ingest/debug)
+gi sync --no-docs  # skip document regeneration (debug)

 # Force full re-sync (resets cursors)
 gi sync --full

 # Override stale 'running' run after operator review
 gi sync --force

 # Show sync status
 gi sync-status
 ```
+
+**Orchestration steps (in order):**
+1. Acquire app lock with heartbeat
+2. Ingest delta (issues, MRs, discussions) based on cursors
+3. Apply rolling backfill window
+4. Regenerate documents for changed entities
+5. Embed documents with changed content_hash
+6. FTS triggers auto-sync (no explicit step needed)
+7. Release lock, record sync_run as succeeded
+
+Individual commands remain available for checkpoint testing and debugging:
+- `gi ingest --type=issues`
+- `gi ingest --type=merge_requests`
+- `gi embed --all`
+- `gi embed --retry-failed`
````

---

## Change 10: Checkpoint Focus Sharpening

**Why this is better:** Makes each checkpoint's exit criteria crisper and reduces overlap.

```diff
@@ Checkpoint 0: Project Setup @@
-**Deliverable:** Scaffolded project with GitLab API connection verified
+**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working

 **Scope:**
 - Project structure (TypeScript, ESLint, Vitest)
 - GitLab API client with PAT authentication
 - Environment and project configuration
 - Basic CLI scaffold with `auth-test` command
 - `doctor` command for environment verification
-- Projects table and initial sync
-- Sync tracking for reliability
+- Projects table and initial project resolution (no issue/MR ingestion yet)
+- DB migrations + WAL + FK enforcement
+- Sync tracking with crash-safe single-flight lock
+- Rate limit handling with exponential backoff + jitter
```

```diff
@@ Checkpoint 1 Deliverable @@
-**Deliverable:** All issues from target repos stored locally
+**Deliverable:** All issues + labels from target repos stored locally with resumable cursor-based sync
```

```diff
@@ Checkpoint 2 Deliverable @@
-**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context
+**Deliverable:** All MRs + discussions + notes (including flagged system notes) stored locally with full thread context and DiffNote file paths captured
```

---

## Change 11: Risk Mitigation Updates

```diff
@@ Risk Mitigation @@
 | Risk | Mitigation |
 |------|------------|
 | GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
 | Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
 | SQLite scale limits | Monitor performance; Postgres migration path documented |
 | Stale data | Incremental sync with change detection |
-| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
+| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
+| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
 | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
-| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) |
+| Concurrent sync corruption | DB-enforced app lock with heartbeat, automatic stale lock recovery |
+| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
```

---

## Change 12: Resolved Decisions Updates

```diff
@@ Resolved Decisions @@
 | Question | Decision | Rationale |
 |----------|----------|-----------|
 | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability |
-| System notes | **Exclude during ingestion** | System notes add noise without semantic value |
+| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise |
+| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature |
 | MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature |
 | Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering |
 | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
 | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient |
+| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
 | Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates |
 | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
 | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
 | Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context |
-| Embedding batching | **32 documents per batch** | Balance throughput and memory |
+| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation |
 | FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
 | Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
+| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |
```

---

## Summary of All Changes

| # | Change | Impact |
|---|--------|--------|
| 1 | Crash-safe heartbeat lock | Prevents race conditions, auto-recovers from crashes |
| 2 | Tuple cursor + rolling backfill | Reduces risk of missed updates dramatically |
| 3 | project_id on raw_payloads | Defensive scoping for multi-project scenarios |
| 4 | Uniqueness on (project_id, iid) | Enables O(1) `gi show issue 123 --project=X` |
| 5 | Store system notes flagged + DiffNote paths | Preserves audit trail, enables immediate file search |
| 6 | Structured document header + truncation metadata | Better embeddings, debuggability |
| 7 | Embedding concurrency + per-doc errors | Makes 50-100K docs manageable |
| 8 | --project, --explain, stable JSON | Day-to-day UX and trust-building |
| 9 | `gi sync` orchestrator | Reduces human error |
| 10 | Checkpoint focus sharpening | Clearer exit criteria |
| 11-12 | Risk/Decisions updates | Documentation alignment |

**Net effect:** Same MVP product (semantic search over issues/MRs/discussions), but with production-grade hardening that prevents the class of bugs that typically kill MVPs in real-world use.
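For reference, the stale-lock decision at the heart of Change 1 reduces to a single timestamp comparison. A minimal sketch with an illustrative `canAcquire` helper (a real implementation would run this check and the takeover inside one DB transaction so they are atomic):

```typescript
// Illustrative mirror of the app_locks row (epoch milliseconds for timestamps).
interface LockRow {
  owner: string;        // run token of the current holder
  heartbeat_at: number; // last heartbeat written by the holder
}

// A lock is acquirable when no row exists, or when the holder's heartbeat is
// older than staleLockMinutes (the holder is presumed to have crashed).
function canAcquire(
  existing: LockRow | null,
  nowMs: number,
  staleLockMinutes = 10,
): boolean {
  if (existing === null) return true;
  return nowMs - existing.heartbeat_at > staleLockMinutes * 60_000;
}
```

Because a live holder refreshes `heartbeat_at` every 30 seconds, the 10-minute staleness threshold leaves a wide margin before a lock can be taken over.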