Files

teernisse 55b895a2eb Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

30 KiB

Raw Blame History

SPEC.md Revision Document

Note: The project was renamed from "gitlab-inbox" to "gitlore" and the CLI from "gi" to "lore". References to "gi" in this document should be read as "lore".

This document provides git-diff style changes to integrate improvements from ChatGPT's review into the original SPEC.md. The goal is a "best of all worlds" hybrid that maintains the original architecture while adding production-grade hardening.

Change 1: Crash-safe Single-flight with Heartbeat Lock

Why this is better: The original plan's single-flight protection is policy-based, not DB-enforced. A race condition exists where two processes could both start before either writes to sync_runs. The heartbeat approach provides DB-enforced atomicity, automatic crash recovery, and less manual intervention.

@@ Schema (Checkpoint 0): @@
 CREATE TABLE sync_runs (
   id INTEGER PRIMARY KEY,
   started_at INTEGER NOT NULL,
+  heartbeat_at INTEGER NOT NULL,
   finished_at INTEGER,
   status TEXT NOT NULL,          -- 'running' | 'succeeded' | 'failed'
   command TEXT NOT NULL,         -- 'ingest issues' | 'sync' | etc.
   error TEXT
 );

+-- Crash-safe single-flight lock (DB-enforced)
+CREATE TABLE app_locks (
+  name TEXT PRIMARY KEY,         -- 'sync'
+  owner TEXT NOT NULL,           -- random run token (UUIDv4)
+  acquired_at INTEGER NOT NULL,
+  heartbeat_at INTEGER NOT NULL
+);

@@ Checkpoint 0: Project Setup - Scope @@
 **Scope:**
 - Project structure (TypeScript, ESLint, Vitest)
 - GitLab API client with PAT authentication
 - Environment and project configuration
 - Basic CLI scaffold with `auth-test` command
 - `doctor` command for environment verification
-- Projects table and initial sync
+- Projects table and initial project resolution (no issue/MR ingestion yet)
+- DB migrations + WAL + FK + app lock primitives
+- Crash-safe single-flight lock with heartbeat

@@ Reliability/Idempotency Rules: @@
 - Every ingest/sync creates a `sync_runs` row
-- Single-flight: refuse to start if an existing run is `running` (unless `--force`)
+- Single-flight: acquire `app_locks('sync')` before starting
+  - On start: INSERT OR REPLACE lock row with new owner token
+  - During run: update `heartbeat_at` every 30 seconds
+  - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire
+  - `--force` remains as operator override for edge cases, but should rarely be needed
 - Cursor advances only after successful transaction commit per page/batch
 - Ordering: `updated_at ASC`, tie-breaker `gitlab_id ASC`
 - Use explicit transactions for batch inserts

@@ Configuration (MVP): @@
 // gi.config.json
 {
   "gitlab": {
     "baseUrl": "https://gitlab.example.com",
     "tokenEnvVar": "GITLAB_TOKEN"
   },
   "projects": [
     { "path": "group/project-one" },
     { "path": "group/project-two" }
   ],
+  "sync": {
+    "backfillDays": 14,
+    "staleLockMinutes": 10,
+    "heartbeatIntervalSeconds": 30
+  },
   "embedding": {
     "provider": "ollama",
     "model": "nomic-embed-text",
-    "baseUrl": "http://localhost:11434"
+    "baseUrl": "http://localhost:11434",
+    "concurrency": 4
   }
 }

Change 2: Harden Cursor Semantics + Rolling Backfill Window

Why this is better: The original plan's "critical assumption" that comments update parent updated_at is mostly true but the failure mode is catastrophic (silently missing new discussion content). The rolling backfill provides a safety net without requiring weekly full resyncs.

@@ GitLab API Strategy - Critical Assumption @@
-### Critical Assumption
-
-**Adding a comment/discussion updates the parent's `updated_at` timestamp.** This assumption is necessary for incremental sync to detect new discussions. If incorrect, new comments on stale items would be missed.
-
-Mitigation: Periodic full re-sync (weekly) as a safety net.
+### Critical Assumption (Softened)
+
+We *expect* adding a note/discussion updates the parent's `updated_at`, but we do not rely on it exclusively.
+
+**Mitigations (MVP):**
+1. **Tuple cursor semantics:** Cursor is a stable tuple `(updated_at, gitlab_id)`. Ties are handled explicitly - process all items with equal `updated_at` before advancing cursor.
+2. **Rolling backfill window:** Each sync also re-fetches items updated within the last N days (default 14, configurable). This ensures "late" updates are eventually captured even if parent timestamps behave unexpectedly.
+3. **Periodic full re-sync:** Remains optional as an extra safety net (`gi sync --full`).
+
+The backfill window provides 80% of the safety of full resync at <5% of the API cost.

@@ Checkpoint 5: Incremental Sync - Scope @@
 **Scope:**
-- Delta sync based on stable cursor (updated_at + tie-breaker id)
+- Delta sync based on stable tuple cursor `(updated_at, gitlab_id)`
+- Rolling backfill window (configurable, default 14 days) to reduce risk of missed updates
 - Dependent resources sync strategy (discussions refetched when parent updates)
 - Re-embedding based on content_hash change (documents.content_hash != embedding_metadata.content_hash)
 - Sync status reporting
 - Recommended: run via cron every 10 minutes

@@ Correctness Rules (MVP): @@
-1. Fetch pages ordered by `updated_at ASC`, within identical timestamps advance by `gitlab_id ASC`
-2. Cursor advances only after successful DB commit for that page
+1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
+2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
+   - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)`
+   - Cursor advances only after successful DB commit for that page
+   - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
 3. Dependent resources:
    - For each updated issue/MR, refetch ALL its discussions
    - Discussion documents are regenerated and re-embedded if content_hash changes
-4. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
-5. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)
+4. Rolling backfill window:
+   - After cursor-based delta sync, also fetch items where `updated_at > NOW() - backfillDays`
+   - This catches any items whose timestamps were updated without triggering our cursor
+5. A document is queued for embedding iff `documents.content_hash != embedding_metadata.content_hash`
+6. Sync run is marked 'failed' with error message if any page fails (can resume from cursor)

Change 3: Raw Payload Scoping + project_id

Why this is better: The original raw_payloads(resource_type, gitlab_id) index could have collisions in edge cases (especially if later adding more projects or resource types). Adding project_id is defensive and enables project-scoped lookups.

@@ Schema (Checkpoint 0) - raw_payloads @@
 CREATE TABLE raw_payloads (
   id INTEGER PRIMARY KEY,
   source TEXT NOT NULL,          -- 'gitlab'
+  project_id INTEGER REFERENCES projects(id),  -- nullable for instance-level resources
   resource_type TEXT NOT NULL,   -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
   gitlab_id INTEGER NOT NULL,
   fetched_at INTEGER NOT NULL,
   json TEXT NOT NULL
 );
-CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(resource_type, gitlab_id);
+CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
+CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);

Change 4: Tighten Uniqueness Constraints (project_id + iid)

Why this is better: Users think in terms of "issue 123 in project X," not global IDs. This enables O(1) gi show issue 123 --project=X and prevents subtle ingestion bugs from creating duplicate rows.

@@ Schema Preview - issues @@
 CREATE TABLE issues (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   title TEXT,
   description TEXT,
   state TEXT,
   author_username TEXT,
   created_at INTEGER,
   updated_at INTEGER,
   web_url TEXT,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
 CREATE INDEX idx_issues_project_updated ON issues(project_id, updated_at);
 CREATE INDEX idx_issues_author ON issues(author_username);
+CREATE UNIQUE INDEX uq_issues_project_iid ON issues(project_id, iid);

@@ Schema Additions - merge_requests @@
 CREATE TABLE merge_requests (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   ...
 );
 CREATE INDEX idx_mrs_project_updated ON merge_requests(project_id, updated_at);
 CREATE INDEX idx_mrs_author ON merge_requests(author_username);
+CREATE UNIQUE INDEX uq_mrs_project_iid ON merge_requests(project_id, iid);

Change 5: Store System Notes (Flagged) + Capture DiffNote Paths

Why this is better: Two problems with dropping system notes entirely: (1) Some system notes carry decision trace context ("marked as resolved", "changed milestone"). (2) File/path search is disproportionately valuable for engineers. DiffNote positions already contain path metadata - capturing it now enables immediate filename search.

@@ Checkpoint 2 Scope @@
 - Discussions fetcher (issue discussions + MR discussions) as a dependent resource:
   - Uses `GET /projects/:id/issues/:iid/discussions` and `GET /projects/:id/merge_requests/:iid/discussions`
   - During initial ingest: fetch discussions for every issue/MR
   - During sync: refetch discussions only for issues/MRs updated since cursor
-  - Filter out system notes (`system: true`) - these are automated messages (assignments, label changes) that add noise
+  - Preserve system notes but flag them with `is_system=1`; exclude from embeddings by default
+  - Capture DiffNote file path/line metadata from `position` field for immediate filename search value

@@ Schema Additions - notes @@
 CREATE TABLE notes (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   discussion_id INTEGER NOT NULL REFERENCES discussions(id),
   project_id INTEGER NOT NULL REFERENCES projects(id),
   type TEXT,                                  -- 'DiscussionNote' | 'DiffNote' | null (from GitLab API)
+  is_system BOOLEAN NOT NULL DEFAULT 0,       -- system notes (assignments, label changes, etc.)
   author_username TEXT,
   body TEXT,
   created_at INTEGER,
   updated_at INTEGER,
   position INTEGER,                           -- derived from array order in API response (0-indexed)
   resolvable BOOLEAN,
   resolved BOOLEAN,
   resolved_by TEXT,
   resolved_at INTEGER,
+  -- DiffNote position metadata (nullable, from GitLab API position object)
+  position_old_path TEXT,
+  position_new_path TEXT,
+  position_old_line INTEGER,
+  position_new_line INTEGER,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
 CREATE INDEX idx_notes_discussion ON notes(discussion_id);
 CREATE INDEX idx_notes_author ON notes(author_username);
 CREATE INDEX idx_notes_type ON notes(type);
+CREATE INDEX idx_notes_system ON notes(is_system);
+CREATE INDEX idx_notes_new_path ON notes(position_new_path);

@@ Discussion Processing Rules @@
-- System notes (`system: true`) are excluded during ingestion - they're noise (assignment changes, label updates, etc.)
+- System notes (`system: true`) are ingested with `notes.is_system=1`
+  - Excluded from document extraction/embeddings by default (reduces noise in semantic search)
+  - Preserved for audit trail, timeline views, and potential future decision-tracing features
+  - Can be toggled via `--include-system-notes` flag if needed
+- DiffNote position data is extracted and stored:
+  - `position.old_path`, `position.new_path` for file-level search
+  - `position.old_line`, `position.new_line` for line-level context
 - Each discussion from the API becomes one row in `discussions` table
 - All notes within a discussion are stored with their `discussion_id` foreign key
 - `individual_note: true` discussions have exactly one note (standalone comment)
 - `individual_note: false` discussions have multiple notes (threaded conversation)

@@ Checkpoint 2 Automated Tests @@
 tests/unit/discussion-transformer.test.ts
   - transforms discussion payload to normalized schema
   - extracts notes array from discussion
   - sets individual_note flag correctly
-  - filters out system notes (system: true)
+  - flags system notes with is_system=1
+  - extracts DiffNote position metadata (paths and lines)
   - preserves note order via position field

 tests/integration/discussion-ingestion.test.ts
   - fetches discussions for each issue
   - fetches discussions for each MR
   - creates discussion rows with correct parent FK
   - creates note rows linked to discussions
-  - excludes system notes from storage
+  - stores system notes with is_system=1 flag
+  - extracts position_new_path from DiffNotes
   - captures note-level resolution status
   - captures note type (DiscussionNote, DiffNote)

@@ Checkpoint 2 Data Integrity Checks @@
 - [ ] `SELECT COUNT(*) FROM merge_requests` matches GitLab MR count
 - [ ] `SELECT COUNT(*) FROM discussions` is non-zero for projects with comments
 - [ ] `SELECT COUNT(*) FROM notes WHERE discussion_id IS NULL` = 0 (all notes linked)
-- [ ] `SELECT COUNT(*) FROM notes n JOIN raw_payloads r ON ... WHERE json_extract(r.json, '$.system') = true` = 0 (no system notes)
+- [ ] System notes have `is_system=1` flag set correctly
+- [ ] DiffNotes have `position_new_path` populated when available
 - [ ] Every discussion has at least one note
 - [ ] `individual_note = true` discussions have exactly one note
 - [ ] Discussion `first_note_at` <= `last_note_at` for all rows

Change 6: Document Extraction Structured Header + Truncation Metadata

Why this is better: Adding a deterministic header improves search snippets (more informative), embeddings (model gets stable context), and debuggability (see if/why truncation happened).

@@ Schema Additions - documents @@
 CREATE TABLE documents (
   id INTEGER PRIMARY KEY,
   source_type TEXT NOT NULL,     -- 'issue' | 'merge_request' | 'discussion'
   source_id INTEGER NOT NULL,    -- local DB id in the source table
   project_id INTEGER NOT NULL REFERENCES projects(id),
   author_username TEXT,          -- for discussions: first note author
   label_names TEXT,              -- JSON array (display/debug only)
   created_at INTEGER,
   updated_at INTEGER,
   url TEXT,
   title TEXT,                    -- null for discussions
   content_text TEXT NOT NULL,    -- canonical text for embedding/snippets
   content_hash TEXT NOT NULL,    -- SHA-256 for change detection
+  is_truncated BOOLEAN NOT NULL DEFAULT 0,
+  truncated_reason TEXT,         -- 'token_limit_middle_drop' | null
   UNIQUE(source_type, source_id)
 );

@@ Discussion Document Format @@
-[Issue #234: Authentication redesign] Discussion
+[[Discussion]] Issue #234: Authentication redesign
+Project: group/project-one
+URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
+Labels: ["bug", "auth"]
+Files: ["src/auth/login.ts"]     -- present if any DiffNotes exist in thread
+
+--- Thread ---

 @johndoe (2024-03-15):
 I think we should move to JWT-based auth because the session cookies are causing issues with our mobile clients...

 @janedoe (2024-03-15):
 Agreed. What about refresh token strategy?

 @johndoe (2024-03-16):
 Short-lived access tokens (15min), longer refresh (7 days). Here's why...

@@ Document Extraction Rules @@
 | Source | content_text Construction |
 |--------|--------------------------|
-| Issue | `title + "\n\n" + description` |
-| MR | `title + "\n\n" + description` |
+| Issue | Structured header + `title + "\n\n" + description` |
+| MR | Structured header + `title + "\n\n" + description` |
 | Discussion | Full thread with context (see below) |

+**Structured Header Format (all document types):**
+```
+[[{SourceType}]] {Title}
+Project: {path_with_namespace}
+URL: {web_url}
+Labels: {JSON array of label names}
+Files: {JSON array of paths from DiffNotes, if any}
+
+--- Content ---
+```
+
+This format provides:
+- Stable, parseable context for embeddings
+- Consistent snippet formatting in search results
+- File path context without full file-history feature

@@ Truncation @@
-**Truncation:** If concatenated discussion exceeds 8000 tokens, truncate from the middle (preserve first and last notes for context) and log a warning.
+**Truncation:**
+If content exceeds 8000 tokens:
+1. Truncate from the middle (preserve first + last notes for context)
+2. Set `documents.is_truncated = 1`
+3. Set `documents.truncated_reason = 'token_limit_middle_drop'`
+4. Log a warning with document ID and original token count
+
+This metadata enables:
+- Monitoring truncation frequency in production
+- Future investigation of high-value truncated documents
+- Debugging when search misses expected content

Change 7: Embedding Pipeline Concurrency + Per-Document Error Tracking

Why this is better: For 50-100K documents, embedding is the longest pole. Controlled concurrency (4-8 workers) saturates local inference without OOM. Per-document error tracking prevents single bad payloads from stalling "100% coverage" and enables targeted re-runs.

@@ Checkpoint 3: Embedding Generation - Scope @@
 **Scope:**
 - Ollama integration (nomic-embed-text model)
-- Embedding generation pipeline (batch processing, 32 documents per batch)
+- Embedding generation pipeline:
+  - Batch size: 32 documents per batch
+  - Concurrency: configurable (default 4 workers)
+  - Retry with exponential backoff for transient failures (max 3 attempts)
+  - Per-document failure recording to enable targeted re-runs
 - Vector storage in SQLite (sqlite-vss extension)
 - Progress tracking and resumability
 - Document extraction layer:

@@ Schema Additions - embedding_metadata @@
 CREATE TABLE embedding_metadata (
   document_id INTEGER PRIMARY KEY REFERENCES documents(id),
   model TEXT NOT NULL,           -- 'nomic-embed-text'
   dims INTEGER NOT NULL,         -- 768
   content_hash TEXT NOT NULL,    -- copied from documents.content_hash
-  created_at INTEGER NOT NULL
+  created_at INTEGER NOT NULL,
+  -- Error tracking for resumable embedding
+  last_error TEXT,               -- error message from last failed attempt
+  attempt_count INTEGER NOT NULL DEFAULT 0,
+  last_attempt_at INTEGER        -- when last attempt occurred
 );
+
+-- Index for finding failed embeddings to retry
+CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;

@@ Checkpoint 3 Automated Tests @@
 tests/integration/embedding-storage.test.ts
   - stores embedding in sqlite-vss
   - embedding rowid matches document id
   - creates embedding_metadata record
   - skips re-embedding when content_hash unchanged
   - re-embeds when content_hash changes
+  - records error in embedding_metadata on failure
+  - increments attempt_count on each retry
+  - clears last_error on successful embedding
+  - respects concurrency limit

@@ Checkpoint 3 Manual CLI Smoke Tests @@
 | Command | Expected Output | Pass Criteria |
 |---------|-----------------|---------------|
 | `gi embed --all` | Progress bar with ETA | Completes without error |
 | `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
+| `gi embed --retry-failed` | Progress on failed docs | Re-attempts previously failed embeddings |
 | `gi stats` | Embedding coverage stats | Shows 100% coverage |
 | `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
 | `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
+
+**Stats output should include:**
+- Total documents
+- Successfully embedded
+- Failed (with error breakdown)
+- Pending (never attempted)

Change 8: Search UX Improvements (--project, --explain, Stable JSON Schema)

Why this is better: For day-to-day use, "search across everything" is less useful than "search within repo X." The --explain flag helps validate ranking during MVP. Stable JSON schema prevents accidental breaking changes for agent/MCP consumption.

@@ Checkpoint 4 Scope @@
-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`
+- Debug: `--explain` returns rank contributions from vector + FTS + RRF
   - Label filtering operates on `document_labels` (indexed, exact-match)
 - Output formatting: ranked list with title, snippet, score, URL
-- JSON output mode for AI agent consumption
+- JSON output mode for AI/agent consumption (stable schema, documented)
 - Graceful degradation: if Ollama is unreachable, fall back to FTS5-only search with warning

@@ CLI Interface @@
 # Basic semantic search
 gi search "why did we choose Redis"

+# Search within specific project
+gi search "authentication" --project=group/project-one
+
 # Pure FTS search (fallback if embeddings unavailable)
 gi search "redis" --mode=lexical

 # Filtered search
 gi search "authentication" --type=mr --after=2024-01-01

 # Filter by label
 gi search "performance" --label=bug --label=critical

 # JSON output for programmatic use
 gi search "payment processing" --json
+
+# Debug ranking (shows how each retriever contributed)
+gi search "authentication" --explain

@@ JSON Output Schema (NEW SECTION) @@
+**JSON Output Schema (Stable)**
+
+For AI/agent consumption, `--json` output follows this stable schema:
+
+```typescript
+interface SearchResult {
+  documentId: number;
+  sourceType: "issue" | "merge_request" | "discussion";
+  title: string | null;
+  url: string;
+  projectPath: string;
+  author: string | null;
+  createdAt: string;       // ISO 8601
+  updatedAt: string;       // ISO 8601
+  score: number;           // 0-1 normalized RRF score
+  snippet: string;         // truncated content_text
+  labels: string[];
+  // Only present with --explain flag
+  explain?: {
+    vectorRank?: number;   // null if not in vector results
+    ftsRank?: number;      // null if not in FTS results
+    rrfScore: number;
+  };
+}
+
+interface SearchResponse {
+  query: string;
+  mode: "hybrid" | "lexical" | "semantic";
+  totalResults: number;
+  results: SearchResult[];
+  warnings?: string[];     // e.g., "Embedding service unavailable"
+}
+```
+
+**Schema versioning:** Breaking changes require major version bump in CLI. Non-breaking additions (new optional fields) are allowed.

@@ Checkpoint 4 Manual CLI Smoke Tests @@
 | Command | Expected Output | Pass Criteria |
 |---------|-----------------|---------------|
 | `gi search "authentication"` | Ranked results with snippets | Returns relevant items, shows score |
+| `gi search "authentication" --project=group/project-one` | Project-scoped results | Only results from that project |
 | `gi search "authentication" --type=mr` | Only MR results | No issues or discussions in output |
 | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
 | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
 | `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
 | `gi search "redis" --mode=lexical` | FTS-only results | Works without Ollama |
 | `gi search "authentication" --json` | JSON output | Valid JSON matching schema |
+| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
 | `gi search "xyznonexistent123"` | No results message | Graceful empty state |
 | `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |

Change 9: Make `gi sync` an Orchestrator

Why this is better: Once CP3+ exist, operators want one command that does the right thing. The most common MVP failure is "I ingested but forgot to regenerate docs / embed / update FTS."

@@ Checkpoint 5 CLI Commands @@
 ```bash
-# Full sync (respects cursors, only fetches new/updated)
-gi sync
+# Full sync orchestration (ingest -> docs -> embed -> ensure FTS synced)
+gi sync                    # orchestrates all steps
+gi sync --no-embed         # skip embedding step (fast ingest/debug)
+gi sync --no-docs          # skip document regeneration (debug)

 # Force full re-sync (resets cursors)
 gi sync --full

 # Override stale 'running' run after operator review
 gi sync --force

 # Show sync status
 gi sync-status

+Orchestration steps (in order): +1. Acquire app lock with heartbeat +2. Ingest delta (issues, MRs, discussions) based on cursors +3. Apply rolling backfill window +4. Regenerate documents for changed entities +5. Embed documents with changed content_hash +6. FTS triggers auto-sync (no explicit step needed) +7. Release lock, record sync_run as succeeded + +Individual commands remain available for checkpoint testing and debugging: +- gi ingest --type=issues +- gi ingest --type=merge_requests +- gi embed --all +- gi embed --retry-failed


---

## Change 10: Checkpoint Focus Sharpening

**Why this is better:** Makes each checkpoint's exit criteria crisper and reduces overlap.

```diff
@@ Checkpoint 0: Project Setup @@
-**Deliverable:** Scaffolded project with GitLab API connection verified
+**Deliverable:** Scaffolded project with GitLab API connection verified and project resolution working

 **Scope:**
 - Project structure (TypeScript, ESLint, Vitest)
 - GitLab API client with PAT authentication
 - Environment and project configuration
 - Basic CLI scaffold with `auth-test` command
 - `doctor` command for environment verification
-- Projects table and initial sync
-- Sync tracking for reliability
+- Projects table and initial project resolution (no issue/MR ingestion yet)
+- DB migrations + WAL + FK enforcement
+- Sync tracking with crash-safe single-flight lock
+- Rate limit handling with exponential backoff + jitter

@@ Checkpoint 1 Deliverable @@
-**Deliverable:** All issues from target repos stored locally
+**Deliverable:** All issues + labels from target repos stored locally with resumable cursor-based sync

@@ Checkpoint 2 Deliverable @@
-**Deliverable:** All MRs and discussion threads (for both issues and MRs) stored locally with full thread context
+**Deliverable:** All MRs + discussions + notes (including flagged system notes) stored locally with full thread context and DiffNote file paths captured

Change 11: Risk Mitigation Updates

@@ Risk Mitigation @@
 | Risk | Mitigation |
 |------|------------|
 | GitLab rate limiting | Exponential backoff, respect Retry-After headers, incremental sync |
 | Embedding model quality | Start with nomic-embed-text; architecture allows model swap |
 | SQLite scale limits | Monitor performance; Postgres migration path documented |
 | Stale data | Incremental sync with change detection |
-| Mid-sync failures | Cursor-based resumption, sync_runs audit trail |
+| Mid-sync failures | Cursor-based resumption, sync_runs audit trail, heartbeat-based lock recovery |
+| Missed updates | Rolling backfill window (14 days), tuple cursor semantics |
 | Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
-| Concurrent sync corruption | Single-flight protection (refuse if existing run is `running`) |
+| Concurrent sync corruption | DB-enforced app lock with heartbeat, automatic stale lock recovery |
+| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |

Change 12: Resolved Decisions Updates

@@ Resolved Decisions @@
 | Question | Decision | Rationale |
 |----------|----------|-----------|
 | Comments structure | **Discussions as first-class entities** | Thread context is essential for decision traceability |
-| System notes | **Exclude during ingestion** | System notes add noise without semantic value |
+| System notes | **Store flagged, exclude from embeddings** | Preserves audit trail while avoiding semantic noise |
+| DiffNote paths | **Capture now** | Enables immediate file/path search without full file-history feature |
 | MR file linkage | **Deferred to post-MVP (CP6)** | Only needed for file-history feature |
 | Labels | **Index as filters** | `document_labels` table enables fast `--label=X` filtering |
 | Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
 | Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10min is sufficient |
+| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
 | Discussions sync | **Dependent resource model** | Discussions API is per-parent; refetch all when parent updates |
 | Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
 | Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
 | Embedding truncation | **8000 tokens, truncate middle** | Preserve first/last notes for context |
-| Embedding batching | **32 documents per batch** | Balance throughput and memory |
+| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation |
 | FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
 | Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
+| JSON output | **Stable documented schema** | Enables reliable agent/MCP consumption |

Summary of All Changes

#	Change	Impact
1	Crash-safe heartbeat lock	Prevents race conditions, auto-recovers from crashes
2	Tuple cursor + rolling backfill	Reduces risk of missed updates dramatically
3	project_id on raw_payloads	Defensive scoping for multi-project scenarios
4	Uniqueness on (project_id, iid)	Enables O(1) `gi show issue 123 --project=X`
5	Store system notes flagged + DiffNote paths	Preserves audit trail, enables immediate file search
6	Structured document header + truncation metadata	Better embeddings, debuggability
7	Embedding concurrency + per-doc errors	50-100K docs becomes manageable
8	--project, --explain, stable JSON	Day-to-day UX and trust-building
9	`gi sync` orchestrator	Reduces human error
10	Checkpoint focus sharpening	Clearer exit criteria
11-12	Risk/Decisions updates	Documentation alignment

Net effect: Same MVP product (semantic search over issues/MRs/discussions), but with production-grade hardening that prevents the class of bugs that typically kill MVPs in real-world use.

30 KiB Raw Blame History