More planning

teernisse
2026-01-23 10:03:40 -05:00
parent 1f36fe6a21
commit e846a39ce6

SPEC.md

@@ -115,7 +115,8 @@ npm link # Makes `gi` available globally
│ - Normalize artifacts to unified schema                         │
│ - Extract searchable documents (canonical text + metadata)      │
│ - Content hashing for change detection                          │
│ - MVP relationships: parent-child FKs + label/path associations │
│   (full cross-entity "decision graph" is post-MVP scope)        │
└─────────────────────────────────────────────────────────────────┘
@@ -159,10 +160,16 @@ npm link # Makes `gi` available globally
Issues and MRs support efficient bulk fetching with incremental sync:
```
GET /projects/:id/issues?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
GET /projects/:id/merge_requests?scope=all&state=all&updated_after=X&order_by=updated_at&sort=asc&per_page=100
```
**Required query params for completeness:**
- `scope=all` - include all issues/MRs, not just authored by current user
- `state=all` - include closed items (GitLab defaults may exclude them)
Without these params, the 2+ years of historical data would be incomplete.
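A minimal sketch of how a sync client might bake these completeness params into every bulk request (the helper name and `baseUrl` shape are illustrative, not part of the spec):

```typescript
// Sketch: build a bulk-fetch URL with the completeness params always present.
function buildIssuesUrl(baseUrl: string, projectId: number, updatedAfter?: string): string {
  const params = new URLSearchParams({
    scope: "all",           // not just items authored by the token's user
    state: "all",           // include closed items, which defaults may exclude
    order_by: "updated_at",
    sort: "asc",
    per_page: "100",
  });
  if (updatedAfter) params.set("updated_after", updatedAfter); // incremental cursor
  return `${baseUrl}/projects/${projectId}/issues?${params}`;
}

const url = buildIssuesUrl("https://gitlab.example.com/api/v4", 42, "2024-01-01T00:00:00Z");
console.log(url);
```

Centralizing the params in one builder makes it hard for a future endpoint call to silently drop `scope=all` or `state=all`.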
### Dependent Resources (Per-Parent Fetch)

Discussions must be fetched per-issue and per-MR. There is no bulk endpoint:
@@ -178,10 +185,25 @@ GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
**Initial sync:**
1. Fetch all issues (paginated, ~60 calls for 6K issues at 100/page)
2. For EACH issue → fetch all discussions (≥ issues_count calls + pagination overhead)
3. Fetch all MRs (paginated, ~60 calls)
4. For EACH MR → fetch all discussions (≥ mrs_count calls + pagination overhead)
5. Total: thousands of API calls for initial sync
**API Call Estimation Formula:**
```
total_calls ≈ ceil(issues/100) + issues × avg_discussion_pages_per_issue
+ ceil(mrs/100) + mrs × avg_discussion_pages_per_mr
```
Example: 3K issues, 3K MRs, average 1.2 discussion pages per parent:
- Issue list: 30 calls
- Issue discussions: 3,000 × 1.2 = 3,600 calls
- MR list: 30 calls
- MR discussions: 3,000 × 1.2 = 3,600 calls
- **Total: ~7,260 calls**
This matters for rate limit planning and setting realistic "10-20 minutes" expectations.
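The formula above can be sketched directly (the function name is illustrative; the result is rounded since it is an estimate):

```typescript
// Estimate total API calls for an initial sync, per the formula above.
function estimateInitialSyncCalls(
  issues: number,
  mrs: number,
  avgDiscussionPagesPerIssue: number,
  avgDiscussionPagesPerMr: number,
  perPage = 100,
): number {
  return Math.round(
    Math.ceil(issues / perPage) + issues * avgDiscussionPagesPerIssue +
    Math.ceil(mrs / perPage) + mrs * avgDiscussionPagesPerMr,
  );
}

// Worked example from the spec: 3K issues, 3K MRs, 1.2 discussion pages per parent
console.log(estimateInitialSyncCalls(3000, 3000, 1.2, 1.2)); // → 7260
```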
**Incremental sync:**
1. Fetch issues where `updated_after=cursor` (bulk)
@@ -235,10 +257,15 @@ tests/unit/db.test.ts
✓ enables foreign keys

tests/integration/gitlab-client.test.ts
✓ authenticates with valid PAT
✓ returns 401 for invalid PAT
✓ fetches project by path
✓ handles rate limiting (429) with retry
tests/live/gitlab-client.live.test.ts (optional, gated by GITLAB_LIVE_TESTS=1, not in CI)
✓ authenticates with real PAT against configured baseUrl
✓ fetches real project by path
✓ handles actual rate limiting behavior
tests/integration/app-lock.test.ts
✓ acquires lock successfully
@@ -256,6 +283,7 @@ tests/integration/init.test.ts
✓ fails if any project path not found
✓ prompts before overwriting existing config
✓ respects --force to skip confirmation
✓ generates gi.config.json with sensible defaults
```

**Manual CLI Smoke Tests:**
@@ -269,6 +297,7 @@ tests/integration/init.test.ts
| `gi init` (config exists) | Confirmation prompt | Warns before overwriting |
| `gi --help` | Command list | Shows all available commands |
| `gi version` | Version number | Shows installed version |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
**Data Integrity Checks:**
- [ ] `projects` table contains rows for each configured project path
@@ -332,6 +361,13 @@ tests/integration/init.test.ts
}
```
**Raw Payload Compression:**
- When `storage.compressRawPayloads: true` (default), raw JSON payloads are gzip-compressed before storage
- `raw_payloads.content_encoding` indicates `'identity'` (uncompressed) or `'gzip'` (compressed)
- Compression typically reduces storage by 70-80% for JSON payloads
- Decompression is handled transparently when reading payloads
- Tradeoff: Slightly higher CPU on write/read, significantly lower disk usage
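A sketch of the write/read path using Node's built-in zlib (helper names are illustrative; the real storage layer is not specified here):

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Sketch of transparent payload compression for raw_payloads.
// `encoding` mirrors the content_encoding column described above.
function encodePayload(json: string, compress: boolean): { encoding: "identity" | "gzip"; payload: Buffer } {
  const raw = Buffer.from(json, "utf8");
  return compress
    ? { encoding: "gzip", payload: gzipSync(raw) }
    : { encoding: "identity", payload: raw };
}

function decodePayload(encoding: "identity" | "gzip", payload: Buffer): string {
  const raw = encoding === "gzip" ? gunzipSync(payload) : payload;
  return raw.toString("utf8");
}

const doc = JSON.stringify({ iid: 123, title: "Example issue", description: "x".repeat(2000) });
const stored = encodePayload(doc, true);
console.log(stored.payload.length < doc.length);                 // → true (repetitive JSON compresses well)
console.log(decodePayload(stored.encoding, stored.payload) === doc); // → true (lossless round-trip)
```

Because the encoding travels with each row, old `identity` rows remain readable after the default flips to compression.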
**DB Runtime Defaults (Checkpoint 0):**
- On every connection:
  - `PRAGMA journal_mode=WAL;`
@@ -387,13 +423,20 @@ CREATE TABLE raw_payloads (
  source TEXT NOT NULL,                       -- 'gitlab'
  project_id INTEGER REFERENCES projects(id), -- nullable for instance-level resources
  resource_type TEXT NOT NULL,                -- 'project' | 'issue' | 'mr' | 'note' | 'discussion'
  gitlab_id TEXT NOT NULL,                    -- TEXT because discussion IDs are strings; numeric IDs stored as strings
  fetched_at INTEGER NOT NULL,
  content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
  payload BLOB NOT NULL                       -- raw JSON or gzip-compressed JSON
);

CREATE INDEX idx_raw_payloads_lookup ON raw_payloads(project_id, resource_type, gitlab_id);
CREATE INDEX idx_raw_payloads_history ON raw_payloads(project_id, resource_type, gitlab_id, fetched_at);
-- Schema version tracking for migrations
CREATE TABLE schema_version (
version INTEGER PRIMARY KEY,
applied_at INTEGER NOT NULL,
description TEXT
);
```

---
@@ -411,7 +454,8 @@ tests/unit/issue-transformer.test.ts
tests/unit/pagination.test.ts
✓ fetches all pages when multiple exist
✓ respects per_page parameter
✓ follows X-Next-Page header until empty/absent
✓ falls back to empty-page stop if headers missing (robustness)
tests/unit/discussion-transformer.test.ts
✓ transforms discussion payload to normalized schema
@@ -450,7 +494,7 @@ tests/integration/sync-runs.test.ts
| `gi list issues --limit=10` | Table of 10 issues | Shows iid, title, state, author |
| `gi list issues --project=group/project-one` | Filtered list | Only shows issues from that project |
| `gi count issues` | `Issues: 1,234` (example) | Count matches GitLab UI |
| `gi show issue 123` | Issue detail view | Shows title, description, labels, discussions, URL. If multiple projects have issue #123, prompts for clarification or use `--project=PATH` |
| `gi count discussions --type=issue` | `Issue Discussions: 5,678` | Non-zero count |
| `gi count notes --type=issue` | `Issue Notes: 12,345 (excluding 2,345 system)` | Non-zero count |
| `gi sync-status` | Last sync time, cursor positions | Shows successful last run |
@@ -543,7 +587,7 @@ CREATE TABLE discussions (
  gitlab_discussion_id TEXT NOT NULL,    -- GitLab's string ID (e.g. "6a9c1750b37d...")
  project_id INTEGER NOT NULL REFERENCES projects(id),
  issue_id INTEGER REFERENCES issues(id),
  merge_request_id INTEGER,              -- FK added in CP2 (via table recreation; see CP2 migration notes)
  noteable_type TEXT NOT NULL,           -- 'Issue' | 'MergeRequest'
  individual_note BOOLEAN NOT NULL,      -- standalone comment vs threaded discussion
  first_note_at INTEGER,                 -- for ordering discussions
@@ -686,6 +730,15 @@ CREATE INDEX idx_mr_labels_label ON mr_labels(label_id);
-- Additional indexes for DiffNote queries (tables created in CP1)
CREATE INDEX idx_notes_type ON notes(type);
CREATE INDEX idx_notes_new_path ON notes(position_new_path);
-- Migration: Add FK constraint to discussions table (was deferred from CP1)
-- SQLite doesn't support ADD CONSTRAINT, so we recreate the table with FK
-- This is handled by the migration system; pseudocode for clarity:
-- 1. CREATE TABLE discussions_new with REFERENCES merge_requests(id)
-- 2. INSERT INTO discussions_new SELECT * FROM discussions
-- 3. DROP TABLE discussions
-- 4. ALTER TABLE discussions_new RENAME TO discussions
-- 5. Recreate indexes
```

**MR Discussion Processing Rules:**
@@ -696,8 +749,8 @@ CREATE INDEX idx_notes_new_path ON notes(position_new_path);
---

### Checkpoint 3A: Document Generation + FTS (Lexical Search)

**Deliverable:** Documents generated + FTS5 index; `gi search --mode=lexical` works end-to-end (no Ollama required)

**Automated Tests (Vitest):**
```
@@ -707,76 +760,67 @@ tests/unit/document-extractor.test.ts
✓ extracts discussion document with full thread context
✓ includes parent issue/MR title in discussion header
✓ formats notes with author and timestamp
✓ excludes system notes from discussion documents by default
✓ includes system notes only when --include-system-notes enabled (debug)
✓ truncates content exceeding 8000 tokens at note boundaries
✓ preserves first and last notes when truncating middle
✓ computes SHA-256 content hash consistently
tests/integration/document-creation.test.ts
✓ creates document for each issue
✓ creates document for each MR
✓ creates document for each discussion
✓ populates document_labels junction table
✓ computes content_hash for each document
✓ excludes system notes from discussion content

tests/integration/fts-index.test.ts
✓ documents_fts row count matches documents
✓ FTS triggers fire on insert/update/delete
✓ updates propagate via triggers

tests/integration/fts-search.test.ts
✓ returns exact keyword matches
✓ porter stemming works (search/searching)
✓ returns empty for non-matching query
```
**Manual CLI Smoke Tests:**

| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi generate-docs` | Progress bar, final count | Completes without error |
| `gi generate-docs` (re-run) | `0 documents to regenerate` | Skips unchanged docs |
| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, works without Ollama |
| `gi stats` | Document count stats | Shows document coverage |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM documents` = issues + MRs + discussions
- [ ] `SELECT COUNT(*) FROM documents_fts` = `SELECT COUNT(*) FROM documents` (via FTS triggers)
- [ ] `SELECT COUNT(*) FROM documents WHERE LENGTH(content_text) > 32000` logs truncation warnings
- [ ] Discussion documents include parent title in content_text
- [ ] Discussion documents exclude system notes
**Scope:**
- Document extraction layer:
  - Canonical "search documents" derived from issues/MRs/discussions
  - Stable content hashing for change detection (SHA-256 of content_text)
  - Truncation: content_text capped at 8000 tokens at NOTE boundaries
    - **Implementation:** Use character budget, not exact token count
    - `maxChars = 32000` (conservative 4 chars/token estimate)
    - Drop whole notes from middle, never cut mid-note
    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
  - System notes excluded from discussion documents (stored in DB for audit, but not in embeddings/search)
  - Denormalized metadata for fast filtering (author, labels, dates)
  - Fast label filtering via `document_labels` join table
- FTS5 index for lexical search
- `gi search --mode=lexical` CLI command (works without Ollama)
This checkpoint delivers a working search experience before introducing embedding infrastructure risk.
**Schema Additions (CP3A):**
```sql
-- Unified searchable documents (derived from issues/MRs/discussions)
-- Note: Full documents table schema is in CP3B section for continuity with embeddings
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL,          -- 'issue' | 'merge_request' | 'discussion'
@@ -806,7 +850,122 @@ CREATE TABLE document_labels (
);

CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- Fast path filtering for documents (extracted from DiffNote positions)
CREATE TABLE document_paths (
document_id INTEGER NOT NULL REFERENCES documents(id),
path TEXT NOT NULL,
PRIMARY KEY(document_id, path)
);
CREATE INDEX idx_document_paths_path ON document_paths(path);
-- Track sources that require document regeneration (populated during ingestion)
CREATE TABLE dirty_sources (
source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
source_id INTEGER NOT NULL, -- local DB id
queued_at INTEGER NOT NULL,
PRIMARY KEY(source_type, source_id)
);
-- Resumable dependent fetches (discussions are per-parent resources)
CREATE TABLE pending_discussion_fetches (
project_id INTEGER NOT NULL REFERENCES projects(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL, -- parent iid (stable human identifier)
queued_at INTEGER NOT NULL,
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
ON pending_discussion_fetches(attempt_count, last_attempt_at)
WHERE last_error IS NOT NULL;
-- Full-text search for lexical retrieval
-- Using porter stemmer for better matching of word variants
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id',
tokenize='porter unicode61'
);
-- Triggers to keep FTS in sync
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
CREATE TRIGGER documents_au AFTER UPDATE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
```
**FTS5 Tokenizer Notes:**
- `porter` enables stemming (searching "authentication" matches "authenticating", "authenticated")
- `unicode61` handles Unicode properly
- Code identifiers (snake_case, camelCase, file paths) may not tokenize ideally; post-MVP consideration for custom tokenizer
---
### Checkpoint 3B: Embedding Generation (Semantic Search)
**Deliverable:** Embeddings generated + `gi search --mode=semantic` works; graceful fallback if Ollama unavailable
**Automated Tests (Vitest):**
```
tests/unit/embedding-client.test.ts
✓ connects to Ollama API
✓ generates embedding for text input
✓ returns 768-dimension vector
✓ handles Ollama connection failure gracefully
✓ batches requests (32 documents per batch)
tests/integration/embedding-storage.test.ts
✓ stores embedding in sqlite-vss
✓ embedding rowid matches document id
✓ creates embedding_metadata record
✓ skips re-embedding when content_hash unchanged
✓ re-embeds when content_hash changes
```
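The content-hash skip check behind "skips re-embedding when content_hash unchanged" can be sketched as follows (`storedHash` stands in for `embedding_metadata.content_hash`; helper names are illustrative):

```typescript
import { createHash } from "node:crypto";

// SHA-256 of the document's canonical content_text, hex-encoded.
function contentHash(contentText: string): string {
  return createHash("sha256").update(contentText, "utf8").digest("hex");
}

// Re-embed only when never embedded, or when the content actually changed.
function needsReembedding(contentText: string, storedHash: string | null): boolean {
  return storedHash === null || storedHash !== contentHash(contentText);
}

const text = "Issue #42: login fails with expired token";
const hash = contentHash(text);
console.log(needsReembedding(text, null));               // → true  (never embedded)
console.log(needsReembedding(text, hash));               // → false (unchanged)
console.log(needsReembedding(text + " (edited)", hash)); // → true  (content changed)
```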
**Manual CLI Smoke Tests:**
| Command | Expected Output | Pass Criteria |
|---------|-----------------|---------------|
| `gi embed --all` | Progress bar with ETA | Completes without error |
| `gi embed --all` (re-run) | `0 documents to embed` | Skips already-embedded docs |
| `gi stats` | Embedding coverage stats | Shows 100% coverage |
| `gi stats --json` | JSON stats object | Valid JSON with document/embedding counts |
| `gi embed --all` (Ollama stopped) | Clear error message | Non-zero exit, actionable error |
| `gi search "authentication" --mode=semantic` | Vector results | Returns semantically similar documents |
**Data Integrity Checks:**
- [ ] `SELECT COUNT(*) FROM embeddings` = `SELECT COUNT(*) FROM documents`
- [ ] `SELECT COUNT(*) FROM embedding_metadata` = `SELECT COUNT(*) FROM documents`
- [ ] All `embedding_metadata.content_hash` matches corresponding `documents.content_hash`
**Scope:**
- Ollama integration (nomic-embed-text model)
- Embedding generation pipeline:
- Batch size: 32 documents per batch
- Concurrency: configurable (default 4 workers)
- Retry with exponential backoff for transient failures (max 3 attempts)
- Per-document failure recording to enable targeted re-runs
- Vector storage in SQLite (sqlite-vss extension)
- Progress tracking and resumability
- `gi search --mode=semantic` CLI command
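The batching/retry pipeline above can be sketched with an injectable embed function and delay, so the retry path is testable without Ollama (`embedBatch`, `embedAll`, and the zero-millisecond base delay here are illustrative, not the spec's real defaults):

```typescript
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split documents into fixed-size batches (spec default: 32 per batch).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embed all texts batch by batch, retrying transient failures with
// exponential backoff (max 3 attempts, per the scope above).
async function embedAll(
  texts: string[],
  embedBatch: EmbedFn,
  opts = { batchSize: 32, maxAttempts: 3, baseDelayMs: 0 }, // real pipeline would use a nonzero base delay
  sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<number[][]> {
  const results: number[][] = [];
  for (const batch of chunk(texts, opts.batchSize)) {
    for (let attempt = 1; ; attempt++) {
      try {
        results.push(...(await embedBatch(batch)));
        break;
      } catch (err) {
        if (attempt >= opts.maxAttempts) throw err; // real pipeline records per-document failure here
        await sleep(opts.baseDelayMs * 2 ** (attempt - 1)); // exponential backoff
      }
    }
  }
  return results;
}

// Fake embedder that fails twice, then succeeds — exercises the retry path.
let calls = 0;
const flaky: EmbedFn = async (texts) => {
  if (++calls <= 2) throw new Error("transient");
  return texts.map(() => [0.1, 0.2]);
};
const vectors = await embedAll(["a", "b"], flaky);
console.log(vectors.length); // → 2
```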
**Schema Additions (CP3B):**
```sql
-- sqlite-vss virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vss0(
  embedding(768)
@@ -828,22 +987,6 @@ CREATE TABLE embedding_metadata (
-- Index for finding failed embeddings to retry
CREATE INDEX idx_embedding_metadata_errors ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
```

**Storage Rule (MVP):**
@@ -879,6 +1022,12 @@ Agreed. What about refresh token strategy?
Short-lived access tokens (15min), longer refresh (7 days). Here's why...
```
**System Notes Exclusion Rule:**
- System notes (is_system=1) are stored in the DB for audit purposes
- System notes are EXCLUDED from discussion documents by default
- This prevents semantic noise ("changed assignee", "added label", "mentioned in") from polluting embeddings
- Debug flag `--include-system-notes` available for troubleshooting
This format preserves:
- Parent context (issue/MR title and number)
- Project path for scoped search
@@ -889,14 +1038,28 @@ This format preserves:
- Temporal ordering of the conversation
- Full thread semantics for decision traceability
**Truncation (Note-Boundary Aware):**

If content exceeds 8000 tokens (~32000 chars):

**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`.

**Algorithm:**
1. Count non-system notes in the discussion
2. If total chars ≤ maxChars, no truncation needed
3. Otherwise, drop whole notes from the MIDDLE:
- Preserve first N notes and last M notes
- Never cut mid-note (produces unreadable snippets and worse embeddings)
- Continue dropping middle notes until under maxChars
4. Insert marker: `\n\n[... N notes omitted for length ...]\n\n`
5. Set `documents.is_truncated = 1`
6. Set `documents.truncated_reason = 'token_limit_middle_drop'`
7. Log a warning with document ID and original/truncated token count
**Why note-boundary truncation:**
- Cutting mid-note produces unreadable snippets ("...the authentication flow because--")
- Keeping whole notes preserves semantic coherence for embeddings
- First notes contain context/problem statement; last notes contain conclusions
- Middle notes are often back-and-forth that's less critical
**Token estimation:** `approxTokens = ceil(charCount / 4)`. No tokenizer dependency.
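The middle-drop loop can be sketched over a list of pre-rendered note strings (the function name and the small two-note guard are illustrative; the marker format follows the algorithm above):

```typescript
// Drop whole notes from the middle until the joined text fits maxChars.
// First and last notes are always preserved; a two-note thread has no
// middle to drop, so it is returned as-is even if over budget.
function truncateAtNoteBoundaries(
  notes: string[],
  maxChars: number,
): { text: string; truncated: boolean } {
  const full = notes.join("\n\n");
  if (full.length <= maxChars || notes.length <= 2) return { text: full, truncated: false };

  const last = notes[notes.length - 1];
  for (let keepHead = notes.length - 2; keepHead >= 1; keepHead--) {
    const omitted = notes.length - 1 - keepHead;
    const text = [
      ...notes.slice(0, keepHead),
      `[... ${omitted} notes omitted for length ...]`,
      last,
    ].join("\n\n");
    // Stop once we fit, or once only the first + last notes remain.
    if (text.length <= maxChars || keepHead === 1) return { text, truncated: true };
  }
  return { text: full, truncated: false }; // unreachable; satisfies the type checker
}

const notes = Array.from({ length: 10 }, (_, i) => `note ${i} ` + "x".repeat(40));
const r = truncateAtNoteBoundaries(notes, 200);
console.log(r.truncated, r.text.includes("note 0"), r.text.includes("note 9")); // → true true true
```

A caller would then set `is_truncated` and `truncated_reason` from the returned flag.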
This metadata enables:
- Monitoring truncation frequency in production
@@ -954,14 +1117,14 @@ tests/e2e/golden-queries.test.ts
| `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe | | `gi search "authentication" --author=johndoe` | Filtered by author | All results have @johndoe |
| `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date | | `gi search "authentication" --after=2024-01-01` | Date filtered | All results after date |
| `gi search "authentication" --label=bug` | Label filtered | All results have bug label | | `gi search "authentication" --label=bug` | Label filtered | All results have bug label |
| `gi search "redis" --mode=lexical` | FTS results only | Works without Ollama | | `gi search "redis" --mode=lexical` | FTS results only | Shows FTS results, no embeddings |
| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ | | `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
| `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema | | `gi search "authentication" --json` | JSON output | Valid JSON matching stable schema |
| `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions | | `gi search "authentication" --explain` | Rank breakdown | Shows vector/FTS/RRF contributions |
| `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results | | `gi search "authentication" --limit=5` | 5 results max | Returns at most 5 results |
| `gi search "xyznonexistent123"` | No results message | Graceful empty state | | `gi search "xyznonexistent123"` | No results message | Graceful empty state |
| `gi search "auth"` (no data synced) | No data message | Shows "Run gi sync first" | | `gi search "authentication"` (no data synced) | No data message | Shows "Run gi sync first" |
| `gi search "auth"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results | | `gi search "authentication"` (Ollama stopped) | FTS results + warning | Shows warning, still returns results |
**Golden Query Test Suite:**

Create `tests/fixtures/golden-queries.json` with 10 queries and expected URLs:
@@ -991,8 +1154,12 @@ Each query must have at least one expected URL appear in top 10 results.
- Result ranking and scoring (document-level)
- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`, `--limit=N`
- `--limit=N` controls result count (default: 20, max: 100)
- `--path` filters documents by referenced file paths (from DiffNote positions):
- If `--path` ends with `/`: prefix match (`path LIKE 'src/auth/%'`)
- Otherwise: exact match OR prefix on directory boundary
- Examples: `--path=src/auth/` matches `src/auth/login.ts`, `src/auth/utils/helpers.ts`
- Examples: `--path=src/auth/login.ts` matches only that exact file
- Glob patterns deferred to post-MVP
- Label filtering operates on `document_labels` (indexed, exact-match)
- Filters work identically in hybrid and lexical modes
- Debug: `--explain` returns rank contributions from vector + FTS + RRF
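The `--path` matching rule above can be sketched as a small predicate (the function name is illustrative; in practice this compiles to SQL over `document_paths`):

```typescript
// Match a document's file path against a --path filter:
// trailing "/" means directory prefix; otherwise exact file,
// or directory-boundary prefix when a directory is given without "/".
function matchesPathFilter(docPath: string, filter: string): boolean {
  if (filter.endsWith("/")) return docPath.startsWith(filter); // directory prefix
  if (docPath === filter) return true;                         // exact file
  return docPath.startsWith(filter + "/");                     // directory given without trailing slash
}

console.log(matchesPathFilter("src/auth/login.ts", "src/auth/"));         // → true
console.log(matchesPathFilter("src/auth/utils/helpers.ts", "src/auth/")); // → true
console.log(matchesPathFilter("src/auth/login.ts", "src/auth/login.ts")); // → true
console.log(matchesPathFilter("src/authz/token.ts", "src/auth"));         // → false (boundary respected)
```

The directory-boundary check is what keeps `--path=src/auth` from matching `src/authz/`.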
@@ -1005,51 +1172,25 @@ Each query must have at least one expected URL appear in top 10 results.
- Filters exclude all results: `No results match the specified filters.`
- Helpful hints shown in non-JSON mode (e.g., "Try broadening your search")
**Hybrid Search Algorithm (MVP) - Reciprocal Rank Fusion:**

1. Determine recall size (adaptive based on filters):
   - `baseTopK = 50`
   - If any filters present (`--project`, `--type`, `--author`, `--label`, `--path`, `--after`): `topK = 200`
   - This prevents "no results" when relevant docs exist outside top-50 unfiltered recall
2. Query both vector index (top topK) and FTS5 (top topK)
   - Apply SQL-expressible filters during retrieval when possible (project_id, author_username, source_type)
3. Merge results by document_id
4. Combine with Reciprocal Rank Fusion (RRF):
   - For each retriever list, assign ranks (1..N)
   - `rrfScore = Σ 1 / (k + rank)` with k=60 (tunable)
   - RRF is simpler than weighted sums and doesn't require score normalization
5. Apply remaining filters (date ranges, labels, paths that weren't applied in SQL)
6. Return top K results

**Why Adaptive Recall:**
- Fixed top-50 + filter can easily return 0 results even when relevant docs exist
- Increasing recall when filters are present catches more candidates before filtering
- SQL-level filtering is preferred (faster, uses indexes) but not always possible

**Why RRF over Weighted Sums:**
- FTS5 BM25 scores and vector distances use different scales
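A minimal sketch of steps 1 and 4 above (names `recallSize`, `fuseRrf`, and `RankedHit` are illustrative, not from the codebase). RRF only needs each retriever's rank ordering, which is why no score normalization is required:

```typescript
// Hypothetical types/names; the fusion formula matches the spec: Σ 1 / (k + rank), k=60.
interface RankedHit { documentId: number }

// Step 1: widen recall when filters will prune candidates after retrieval.
function recallSize(hasFilters: boolean, baseTopK = 50): number {
  return hasFilters ? 200 : baseTopK;
}

// Step 4: fuse ranked lists from the vector index and FTS5 via RRF.
function fuseRrf(lists: RankedHit[][], k = 60): Map<number, number> {
  const scores = new Map<number, number>();
  for (const list of lists) {
    list.forEach((hit, i) => {
      const rank = i + 1; // ranks are 1..N within each retriever list
      scores.set(hit.documentId, (scores.get(hit.documentId) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}
```

A document appearing in both lists accumulates two `1/(k+rank)` terms, so it outranks a document of equal rank that appears in only one list.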
interface SearchResult {
  author: string | null;
  createdAt: string;      // ISO 8601
  updatedAt: string;      // ISO 8601
  score: number;          // normalized 0-1 (rrfScore / maxRrfScore in this result set)
  snippet: string;        // truncated content_text
  labels: string[];
  // Only present with --explain flag
  explain?: {
    vectorRank?: number;  // null if not in vector results
    ftsRank?: number;     // null if not in FTS results
    rrfScore: number;     // raw RRF score (rank-based, comparable within a query)
  };
}
// Note on score normalization:
// - `score` is normalized 0-1 for UI display convenience
// - Normalization is per-query (score = rrfScore / max(rrfScore) in this result set)
// - Use `explain.rrfScore` for raw scores when comparing across queries
// - Scores are NOT comparable across different queries
interface SearchResponse {
  query: string;
  mode: "hybrid" | "lexical" | "semantic";
tests/integration/sync-recovery.test.ts
| `gi sync` (no changes) | `0 issues, 0 MRs updated` | Fast completion, no API calls beyond cursor check |
| `gi sync` (after GitLab change) | `1 issue updated, 3 discussions refetched` | Detects and syncs the change |
| `gi sync --full` | Full sync progress | Resets cursors, fetches everything |
| `gi sync-status` | Last sync time, cursor positions | Shows current state |
| `gi sync` (with rate limit) | Backoff messages | Respects rate limits, completes eventually |
| `gi search "new content"` (after sync) | Returns new content | New content is searchable |
gi sync-status
**Orchestration steps (in order):**

1. Acquire app lock with heartbeat
2. Ingest delta (issues, MRs) based on cursors
   - For each upserted issue/MR, enqueue into `pending_discussion_fetches`
   - INSERT into `dirty_sources` for each upserted issue/MR
3. Process `pending_discussion_fetches` queue (bounded per run, retryable):
   - Fetch discussions for each queued parent
   - On success: upsert discussions/notes, INSERT into `dirty_sources`, DELETE from queue
   - On failure: increment `attempt_count`, record `last_error`, leave in queue for retry
   - Bound processing: max N parents per sync run to avoid unbounded API calls
4. Apply rolling backfill window
5. Regenerate documents for entities in `dirty_sources` (process + delete from queue)
6. Embed documents with changed content_hash
7. FTS triggers auto-sync (no explicit step needed)
8. Release lock, record sync_run as succeeded
**Why queue-based discussion fetching:**
- One pathological MR thread (huge pagination, 5xx errors, permission issues) shouldn't block the entire sync
- Primary resource cursors can advance independently
- Discussions can be retried without re-fetching all issues/MRs
- Bounded processing prevents unbounded API calls per sync run
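Step 3 of the orchestration can be sketched as follows. This is a hypothetical outline (the `PendingFetch` shape and `processPendingFetches` name are illustrative; the real queue lives in SQLite, not an in-memory array), showing the bounded, retry-on-failure semantics described above:

```typescript
// Illustrative sketch of draining pending_discussion_fetches with a per-run bound,
// so one pathological thread cannot stall the whole sync. Not the actual implementation.
interface PendingFetch {
  parentType: "issue" | "mr";
  parentId: number;
  attemptCount: number;
  lastError?: string;
}

async function processPendingFetches(
  queue: PendingFetch[],
  fetchDiscussions: (item: PendingFetch) => Promise<void>, // upserts discussions/notes, marks dirty_sources
  maxPerRun = 100, // bound API calls per sync run
): Promise<{ succeeded: number; failed: number }> {
  let succeeded = 0;
  let failed = 0;
  for (const item of queue.slice(0, maxPerRun)) {
    try {
      await fetchDiscussions(item);
      queue.splice(queue.indexOf(item), 1); // DELETE from queue on success
      succeeded++;
    } catch (err) {
      item.attemptCount++;         // leave in queue for a later retry
      item.lastError = String(err);
      failed++;
    }
  }
  return { succeeded, failed };
}
```

Because failures stay queued with their `attempt_count` and `last_error`, the next sync run retries them without re-fetching any issues or MRs.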
Individual commands remain available for checkpoint testing and debugging:

- `gi ingest --type=issues`
All commands support `--help` for detailed usage information.
|---------|-----|-------------|
| `gi ingest --type=issues` | 1 | Fetch issues from GitLab |
| `gi ingest --type=merge_requests` | 2 | Fetch MRs and discussions |
| `gi generate-docs` | 3A | Extract documents from issues/MRs/discussions |
| `gi embed --all` | 3B | Generate embeddings for all documents |
| `gi embed --retry-failed` | 3B | Retry failed embeddings |
| `gi sync` | 5 | Full sync orchestration (ingest + docs + embed) |
| `gi sync --full` | 5 | Force complete re-sync (reset cursors) |
| Command | CP | Description |
|---------|-----|-------------|
| `gi list issues [--limit=N] [--project=PATH]` | 1 | List issues |
| `gi list mrs [--limit=N]` | 2 | List merge requests |
| `gi count issues` | 1 | Count issues |
| `gi count mrs` | 2 | Count merge requests |
| `gi count discussions --type=issue` | 1 | Count issue discussions |
| `gi count discussions --type=mr` | 2 | Count MR discussions |
| `gi count notes --type=issue` | 1 | Count issue notes (excluding system) |
| `gi count notes` | 2 | Count all notes (excluding system) |
| `gi show issue <iid> [--project=PATH]` | 1 | Show issue details (prompts if iid is ambiguous across projects) |
| `gi show mr <iid> [--project=PATH]` | 2 | Show MR details with discussions |
| `gi stats` | 3 | Embedding coverage statistics |
| `gi stats --json` | 3 | JSON stats for scripting |
| `gi sync-status` | 1 | Show cursor positions and last sync |
Common errors and their resolutions:
| **Disk full during write** | Fails with clear error. Cursor preserved at last successful commit. Free space and resume. |
| **Stale lock detected** | Lock held > 10 minutes without heartbeat is considered stale. Next sync auto-recovers. |
| **Network interruption** | Retries with exponential backoff. After max retries, sync fails but cursor is preserved. |
| **Embedding permanent failure** | After 3 retries, document stays in `embedding_metadata` with `last_error` populated. Use `gi embed --retry-failed` to retry later, or `gi stats` to see failed count. Documents with failed embeddings are excluded from vector search but included in FTS. |
| **Orphaned records** | MVP: No automatic cleanup. `last_seen_at` field enables future detection of items deleted in GitLab. Post-MVP: `gi gc --dry-run` to identify orphans, `gi gc --confirm` to remove. |
---
CREATE TABLE note_positions (
  new_line INTEGER,
  position_type TEXT -- 'text' | 'image' | etc.
);
CREATE INDEX idx_note_positions_new_path ON note_positions(new_path);
```
---
Each checkpoint includes:
| Search quality | Hybrid (vector + FTS5) retrieval with RRF, golden query test suite |
| Concurrent sync corruption | DB lock + heartbeat + rolling backfill, automatic stale lock recovery |
| Embedding failures | Per-document error tracking, retry with backoff, targeted re-runs |
| Pathological discussions | Queue-based discussion fetching; one bad thread doesn't block the entire sync |
| Empty search results with filters | Adaptive recall (topK 50→200 when filtered) |

**SQLite Performance Defaults (MVP):**
- Enable `PRAGMA journal_mode=WAL;` on every connection
| sync_runs | 0 | Audit trail of sync operations (with heartbeat) |
| app_locks | 0 | Crash-safe single-flight lock |
| sync_cursors | 0 | Resumable sync state per primary resource |
| raw_payloads | 0 | Decoupled raw JSON storage (gitlab_id as TEXT) |
| schema_version | 0 | Database migration version tracking |
| issues | 1 | Normalized issues (unique by project+iid) |
| labels | 1 | Label definitions (unique by project + name) |
| issue_labels | 1 | Issue-label junction |
| discussions | 1 | Discussion threads (issue discussions in CP1, MR discussions in CP2) |
| notes | 1 | Individual comments with is_system flag (DiffNote paths added in CP2) |
| merge_requests | 2 | Normalized MRs (unique by project+iid) |
| mr_labels | 2 | MR-label junction |
| documents | 3A | Unified searchable documents with truncation metadata |
| document_labels | 3A | Document-label junction for fast filtering |
| document_paths | 3A | Fast path filtering for documents (DiffNote file paths) |
| dirty_sources | 3A | Queue for incremental document regeneration |
| pending_discussion_fetches | 3A | Resumable queue for dependent discussion fetching |
| documents_fts | 3A | Full-text search index (fts5 with porter stemmer) |
| embeddings | 3B | Vector embeddings (sqlite-vss, rowid=document_id) |
| embedding_metadata | 3B | Embedding provenance + error tracking |
| mr_files | 6 | MR file changes (deferred to post-MVP) |

---
| Labels uniqueness | **By (project_id, name)** | GitLab API returns labels as strings |
| Sync method | **Polling only for MVP** | Webhooks add complexity; polling every 10 min is sufficient |
| Sync safety | **DB lock + heartbeat + rolling backfill** | Prevents race conditions and missed updates |
| Discussions sync | **Resumable queue model** | One pathological thread can't block the entire sync |
| Hybrid ranking | **RRF over weighted sums** | Simpler, no score normalization needed |
| Embedding rowid | **rowid = documents.id** | Eliminates fragile rowid mapping |
| Embedding truncation | **Note-boundary aware middle drop** | Never cut mid-note; preserves semantic coherence |
| Embedding batching | **32 docs/batch, 4 concurrent workers** | Balance throughput, memory, and error isolation |
| FTS5 tokenizer | **porter unicode61** | Stemming improves recall |
| Ollama unavailable | **Graceful degradation to FTS5** | Search still works without semantic matching |
| `gi init` validation | **Validate GitLab before writing config** | Fail fast, better UX |
| Ctrl+C handling | **Graceful shutdown** | Finish page, commit cursor, exit cleanly |
| Empty state UX | **Actionable messages** | Guide user to next step |
| raw_payloads.gitlab_id | **TEXT not INTEGER** | Discussion IDs are strings; numeric IDs stored as strings |
| GitLab list params | **Always scope=all&state=all** | Ensures all historical data including closed items |
| Pagination | **X-Next-Page headers with empty-page fallback** | Headers are more robust than empty-page detection |
| Integration tests | **Mocked by default, live tests optional** | Deterministic CI; live tests gated by GITLAB_LIVE_TESTS=1 |
| Search recall with filters | **Adaptive topK (50→200 when filtered)** | Prevents "no results" when relevant docs exist outside top-50 |
| RRF score normalization | **Per-query normalized 0-1** | score = rrfScore / max(rrfScore); raw score in explain |
| --path semantics | **Trailing / = prefix match** | `--path=src/auth/` does prefix; otherwise exact match |
| CP3 structure | **Split into 3A (FTS) and 3B (embeddings)** | Lexical search works before embedding infra risk |
---