# SPEC.md Revision Document - Round 2

This document provides git-diff style changes for the second round of improvements from ChatGPT's review. These are primarily correctness fixes and optimizations.

---

## Change 1: Fix Tuple Cursor Correctness Gap (Cursor Rewind + Local Filtering)

**Why this is critical:** The spec specifies tuple cursor semantics `(updated_at, gitlab_id)`, but GitLab's API only supports `updated_after`, which is strictly "after": it cannot express `WHERE updated_at = X AND id > Y` server-side. This creates a real risk of missed items on crash/resume and in dense timestamp buckets.

**Fix:** Cursor rewind + local filtering. Call GitLab with `updated_after = cursor_updated_at - rewindSeconds`, then locally discard items we've already processed.

```diff
@@ Correctness Rules (MVP): @@
 1. Fetch pages ordered by `updated_at ASC`, within identical timestamps by `gitlab_id ASC`
 2. Cursor is a stable tuple `(updated_at, gitlab_id)`:
-   - Fetch `WHERE updated_at > cursor_updated_at OR (updated_at = cursor_updated_at AND gitlab_id > cursor_gitlab_id)`
+   - **GitLab API cannot express `(updated_at = X AND id > Y)` server-side.**
+   - Use **cursor rewind + local filtering**:
+     - Call GitLab with `updated_after = cursor_updated_at - rewindSeconds` (default 2s, configurable)
+     - Locally discard items where:
+       - `updated_at < cursor_updated_at`, OR
+       - `updated_at = cursor_updated_at AND gitlab_id <= cursor_gitlab_id`
+     - This makes the tuple cursor rule hold in practice while keeping API calls simple.
 - Cursor advances only after successful DB commit for that page
 - When advancing, set cursor to the last processed item's `(updated_at, gitlab_id)`
```

```diff
@@ Configuration (MVP): @@
 "sync": {
   "backfillDays": 14,
   "staleLockMinutes": 10,
-  "heartbeatIntervalSeconds": 30
+  "heartbeatIntervalSeconds": 30,
+  "cursorRewindSeconds": 2
 },
```
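The rewind-plus-filter rule can be sketched as two pure functions. This is a minimal sketch: the `SyncItem` and `Cursor` shapes are illustrative assumptions, not from the spec; timestamps are epoch seconds as elsewhere in the schema.

```typescript
// Illustrative shapes (assumptions): the spec only fixes the tuple
// (updated_at, gitlab_id), not a concrete type.
interface SyncItem {
  updated_at: number; // epoch seconds
  gitlab_id: number;
}

interface Cursor {
  updatedAt: number;
  gitlabId: number;
}

// Value to send as GitLab's `updated_after`: rewound so that items sharing
// the cursor timestamp are re-fetched rather than skipped.
function rewoundUpdatedAfter(cursor: Cursor, rewindSeconds = 2): number {
  return cursor.updatedAt - rewindSeconds;
}

// Locally discard items already covered by the tuple cursor: keep only
// items strictly after (cursor.updatedAt, cursor.gitlabId).
function filterAfterCursor(items: SyncItem[], cursor: Cursor): SyncItem[] {
  return items.filter(
    (it) =>
      it.updated_at > cursor.updatedAt ||
      (it.updated_at === cursor.updatedAt && it.gitlab_id > cursor.gitlabId)
  );
}
```

Because the filter is pure, the re-fetched overlap window costs only a few duplicate API rows per sync, never duplicate processing.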

---

## Change 2: Make App Lock Actually Safe (BEGIN IMMEDIATE CAS)

**Why this is critical:** `INSERT OR REPLACE` can overwrite an active lock if two processes start close together (both do the "stale check" outside a write transaction, then both `INSERT OR REPLACE`). SQLite's `BEGIN IMMEDIATE` provides a proper compare-and-swap.

```diff
@@ Reliability/Idempotency Rules: @@
 - Every ingest/sync creates a `sync_runs` row
 - Single-flight via DB-enforced app lock:
-  - On start: INSERT OR REPLACE lock row with new owner token
+  - On start: acquire lock via transactional compare-and-swap:
+    - `BEGIN IMMEDIATE` (acquires write lock immediately)
+    - If no row exists → INSERT new lock
+    - Else if `heartbeat_at` is stale (> staleLockMinutes) → UPDATE owner + timestamps
+    - Else if `owner` matches current run → UPDATE heartbeat (re-entrant)
+    - Else → ROLLBACK and fail fast (another run is active)
+    - `COMMIT`
 - During run: update `heartbeat_at` every 30 seconds
 - If existing lock's `heartbeat_at` is stale (> 10 minutes), treat as abandoned and acquire
 - `--force` remains as operator override for edge cases, but should rarely be needed
```
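The branch logic of the CAS can be isolated as a pure function, which keeps the decision testable. This is a sketch: the row shape and action names are assumptions, and in the real implementation this decision must execute inside the single `BEGIN IMMEDIATE` transaction so no other writer can interleave.

```typescript
// Illustrative lock-row shape (assumption; the spec only names the columns).
interface LockRow {
  owner: string;
  heartbeatAt: number; // epoch seconds
}

type LockAction = 'insert' | 'steal_stale' | 'refresh_own' | 'fail';

// Decide what the CAS transaction should do, mirroring the four branches
// in the revised rules above.
function lockAction(
  row: LockRow | undefined,
  me: string,
  now: number,
  staleLockMinutes = 10
): LockAction {
  if (!row) return 'insert'; // no lock held
  if (now - row.heartbeatAt > staleLockMinutes * 60) return 'steal_stale';
  if (row.owner === me) return 'refresh_own'; // re-entrant heartbeat update
  return 'fail'; // another run is active → ROLLBACK
}
```

The key property is that the read (`row`) and the resulting write happen under one immediate transaction, so two near-simultaneous starters serialize and exactly one wins.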

---

## Change 3: Dependent Resource Pagination + Bounded Concurrency

**Why this is important:** Discussions endpoints are paginated on many GitLab instances. Without pagination, we silently lose data. Without bounded concurrency, initial sync can become unstable (429s, long-tail retries).

```diff
@@ Dependent Resources (Per-Parent Fetch): @@
-GET /projects/:id/issues/:iid/discussions
-GET /projects/:id/merge_requests/:iid/discussions
+GET /projects/:id/issues/:iid/discussions?per_page=100&page=N
+GET /projects/:id/merge_requests/:iid/discussions?per_page=100&page=N
+
+**Pagination:** Discussions endpoints return paginated results. Fetch all pages per parent.
```
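The "fetch all pages per parent" loop can be sketched generically. Here `fetchPage` is a placeholder for the real GitLab client call, and the stop condition assumes a short page signals the last page.

```typescript
// Fetch every page of a paginated endpoint; stops when a page comes back
// with fewer than perPage items (i.e., the last page).
async function fetchAllPages<T>(
  fetchPage: (page: number) => Promise<T[]>,
  perPage = 100
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; ; page++) {
    const items = await fetchPage(page);
    all.push(...items);
    if (items.length < perPage) break; // short page => done
  }
  return all;
}
```

Production code could instead follow GitLab's `x-next-page` response header; the short-page heuristic above keeps the sketch client-agnostic.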

```diff
@@ Rate Limiting: @@
 - Default: 10 requests/second with exponential backoff
 - Respect `Retry-After` headers on 429 responses
 - Add jitter to avoid thundering herd on retry
+- **Separate concurrency limits:**
+  - `sync.primaryConcurrency`: concurrent requests for issues/MRs list endpoints (default 4)
+  - `sync.dependentConcurrency`: concurrent requests for discussions endpoints (default 2, lower to avoid 429s)
+  - Bound concurrency per project to avoid one repo starving the other
 - Initial sync estimate: 10-20 minutes depending on rate limits
```
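One way to enforce `primaryConcurrency` / `dependentConcurrency` is a small worker-pool helper. A sketch, independent of any HTTP client:

```typescript
// Run async tasks with at most `limit` in flight at once.
// Results are returned in task order.
async function runWithConcurrency<T>(
  tasks: (() => Promise<T>)[],
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // shared task index; safe in JS's single-threaded event loop
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim a task before awaiting
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

One pool per endpoint class (and per project, per the rule above) gives the isolation the spec asks for without a scheduler dependency.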

```diff
@@ Configuration (MVP): @@
 "sync": {
   "backfillDays": 14,
   "staleLockMinutes": 10,
   "heartbeatIntervalSeconds": 30,
-  "cursorRewindSeconds": 2
+  "cursorRewindSeconds": 2,
+  "primaryConcurrency": 4,
+  "dependentConcurrency": 2
 },
```

---

## Change 4: Track last_seen_at for Eventual Consistency Debugging

**Why this is valuable:** Even without implementing deletions, you want to know: (a) whether a record is actively refreshed under backfill/sync, (b) whether a sync run is "covering" the dataset, and (c) whether a particular item hasn't been seen in months (which helps diagnose missed updates).

```diff
@@ Schema Preview - issues: @@
 CREATE TABLE issues (
   id INTEGER PRIMARY KEY,
   gitlab_id INTEGER UNIQUE NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
   iid INTEGER NOT NULL,
   title TEXT,
   description TEXT,
   state TEXT,
   author_username TEXT,
   created_at INTEGER,
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   web_url TEXT,
   raw_payload_id INTEGER REFERENCES raw_payloads(id)
 );
```

```diff
@@ Schema Additions - merge_requests: @@
 CREATE TABLE merge_requests (
 @@
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   merged_at INTEGER,
 @@
 );
```

```diff
@@ Schema Additions - discussions: @@
 CREATE TABLE discussions (
 @@
   last_note_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   resolvable BOOLEAN,
 @@
 );
```

```diff
@@ Schema Additions - notes: @@
 CREATE TABLE notes (
 @@
   updated_at INTEGER,
+  last_seen_at INTEGER NOT NULL, -- updated on every upsert during sync
   position INTEGER,
 @@
 );
```

---

## Change 5: Raw Payload Compression

**Why this is valuable:** At 50-100K documents plus threaded discussions, raw JSON is likely the largest storage consumer. Supporting gzip compression reduces DB size while preserving replay capability.

```diff
@@ Schema (Checkpoint 0) - raw_payloads: @@
 CREATE TABLE raw_payloads (
   id INTEGER PRIMARY KEY,
   source TEXT NOT NULL, -- 'gitlab'
   project_id INTEGER REFERENCES projects(id),
   resource_type TEXT NOT NULL,
   gitlab_id INTEGER NOT NULL,
   fetched_at INTEGER NOT NULL,
-  json TEXT NOT NULL
+  content_encoding TEXT NOT NULL DEFAULT 'identity', -- 'identity' | 'gzip'
+  payload BLOB NOT NULL -- raw JSON or gzip-compressed JSON
 );
```

```diff
@@ Configuration (MVP): @@
+"storage": {
+  "compressRawPayloads": true // gzip raw payloads to reduce DB size
+},
```
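Encoding and decoding against `content_encoding` can be sketched with Node's built-in `zlib` (assuming the implementation runs on Node.js; the function names here are illustrative):

```typescript
import { gzipSync, gunzipSync } from 'zlib';

// Encode a raw JSON payload per the revised raw_payloads schema:
// content_encoding is 'identity' (plain bytes) or 'gzip'.
function encodePayload(
  json: string,
  compress: boolean
): { contentEncoding: 'identity' | 'gzip'; payload: Buffer } {
  const raw = Buffer.from(json, 'utf8');
  return compress
    ? { contentEncoding: 'gzip', payload: gzipSync(raw) }
    : { contentEncoding: 'identity', payload: raw };
}

// Decode back to the original JSON string for replay.
function decodePayload(contentEncoding: string, payload: Buffer): string {
  const raw = contentEncoding === 'gzip' ? gunzipSync(payload) : payload;
  return raw.toString('utf8');
}
```

Storing the encoding per row (rather than a global flag) lets the `compressRawPayloads` setting change over time without migrating old rows.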

---

## Change 6: Scope Discussions Unique by Project

**Why this is important:** `gitlab_discussion_id TEXT UNIQUE` assumes global uniqueness across all projects. While that is likely true for GitLab, it is safer to scope uniqueness by `project_id`. This avoids rare but painful collisions and makes it easier to support more repos later.

```diff
@@ Schema Additions - discussions: @@
 CREATE TABLE discussions (
   id INTEGER PRIMARY KEY,
-  gitlab_discussion_id TEXT UNIQUE NOT NULL,
+  gitlab_discussion_id TEXT NOT NULL,
   project_id INTEGER NOT NULL REFERENCES projects(id),
 @@
 );
+CREATE UNIQUE INDEX uq_discussions_project_discussion_id
+  ON discussions(project_id, gitlab_discussion_id);
```

---

## Change 7: Dirty Queue for Document Regeneration

**Why this is valuable:** The orchestration says "Regenerate documents for changed entities" but doesn't define how "changed" is computed without scanning large tables. A dirty queue populated during ingestion makes doc regen deterministic and fast.

```diff
@@ Schema Additions (Checkpoint 3): @@
+-- Track sources that require document regeneration (populated during ingestion)
+CREATE TABLE dirty_sources (
+  source_type TEXT NOT NULL, -- 'issue' | 'merge_request' | 'discussion'
+  source_id INTEGER NOT NULL, -- local DB id
+  queued_at INTEGER NOT NULL,
+  PRIMARY KEY(source_type, source_id)
+);
```

```diff
@@ Orchestration steps (in order): @@
 1. Acquire app lock with heartbeat
 2. Ingest delta (issues, MRs, discussions) based on cursors
+   - During ingestion, INSERT into dirty_sources for each upserted entity
 3. Apply rolling backfill window
-4. Regenerate documents for changed entities
+4. Regenerate documents for entities in dirty_sources (process + delete from queue)
 5. Embed documents with changed content_hash
 6. FTS triggers auto-sync (no explicit step needed)
 7. Release lock, record sync_run as succeeded
```
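The queue contract — idempotent enqueue per `(source_type, source_id)`, drain-on-regen — can be sketched in memory. Illustrative only; the real queue is the `dirty_sources` table, where the `PRIMARY KEY` provides the same dedup.

```typescript
type SourceType = 'issue' | 'merge_request' | 'discussion';

// In-memory model of dirty_sources: one entry per (source_type, source_id),
// mirroring the composite PRIMARY KEY.
class DirtyQueue {
  private entries = new Map<string, number>(); // key -> queued_at

  // Idempotent: a second upsert of the same entity does not add a row.
  enqueue(type: SourceType, id: number, queuedAt: number): void {
    const key = `${type}:${id}`;
    if (!this.entries.has(key)) this.entries.set(key, queuedAt);
  }

  // "Process + delete from queue": return all entries and clear.
  drain(): { type: SourceType; id: number; queuedAt: number }[] {
    const out = [...this.entries.entries()].map(([key, queuedAt]) => {
      const [type, id] = key.split(':');
      return { type: type as SourceType, id: Number(id), queuedAt };
    });
    this.entries.clear();
    return out;
  }
}
```

In SQL the same behavior falls out of `INSERT OR IGNORE INTO dirty_sources ...` followed by a `DELETE` of the processed keys.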

---

## Change 8: document_paths + --path Filter

**Why this is high value:** We're already capturing DiffNote file paths in CP2. Adding a `--path` filter now makes the MVP dramatically more compelling for engineers who search by file path constantly.

```diff
@@ Checkpoint 4 Scope: @@
-- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`
+- Search filters: `--type=issue|mr|discussion`, `--author=username`, `--after=date`, `--label=name`, `--project=path`, `--path=file`
+  - `--path` filters documents by referenced file paths (from DiffNote positions)
+  - MVP: substring/exact match; glob patterns deferred
```

```diff
@@ Schema Additions (Checkpoint 3): @@
+-- Fast path filtering for documents (extracted from DiffNote positions)
+CREATE TABLE document_paths (
+  document_id INTEGER NOT NULL REFERENCES documents(id),
+  path TEXT NOT NULL,
+  PRIMARY KEY(document_id, path)
+);
+CREATE INDEX idx_document_paths_path ON document_paths(path);
```

```diff
@@ CLI Interface: @@
 # Search within specific project
 gi search "authentication" --project=group/project-one

+# Search by file path (finds discussions/MRs touching this file)
+gi search "rate limit" --path=src/client.ts
+
 # Pure FTS search (fallback if embeddings unavailable)
 gi search "redis" --mode=lexical
```

```diff
@@ Manual CLI Smoke Tests: @@
+| `gi search "auth" --path=src/auth/` | Path-filtered results | Only results referencing files in src/auth/ |
```
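The MVP matching rule (substring/exact, globs deferred) reduces to a one-liner over a document's path rows. A sketch with an illustrative in-memory shape; in the real query this is a `LIKE` join against `document_paths`:

```typescript
// Illustrative shape: the paths collected in document_paths for one document.
interface DocPaths {
  documentId: number;
  paths: string[];
}

// A document matches --path=<filter> if any referenced path contains the
// filter as a substring (exact match is a special case of substring).
function filterByPath(docs: DocPaths[], pathFilter: string): DocPaths[] {
  return docs.filter((d) => d.paths.some((p) => p.includes(pathFilter)));
}
```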

---

## Change 9: Character-Based Truncation (Not Exact Tokens)

**Why this is practical:** "8000 tokens" sounds precise, but tokenizers vary, and exact token counting adds dependency complexity. A conservative character budget is simpler and avoids false precision.

```diff
@@ Document Extraction Rules: @@
 - Truncation: content_text capped at 8000 tokens (nomic-embed-text limit is 8192)
+  - **Implementation:** Use a character budget, not an exact token count
+    - `maxChars = 32000` (conservative 4 chars/token estimate)
+    - `approxTokens = ceil(charCount / 4)` for reporting/logging only
+    - This avoids a tokenizer dependency while preventing embedding failures
```

```diff
@@ Truncation: @@
 If content exceeds 8000 tokens:
+**Note:** Token count is approximate (`ceil(charCount / 4)`). Enforce `maxChars = 32000`.
+
 1. Truncate from the middle (preserve first + last notes for context)
 2. Set `documents.is_truncated = 1`
 3. Set `documents.truncated_reason = 'token_limit_middle_drop'`
 4. Log a warning with document ID and original token count
```
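The middle-drop rule under a character budget can be sketched as follows. The marker string is an illustrative choice; the real implementation would cut at note boundaries to "preserve first + last notes," which this character-level sketch approximates.

```typescript
// Truncate from the middle under a character budget: keep head and tail,
// drop the middle, and report an approximate token count (ceil(chars / 4)).
function truncateMiddle(
  text: string,
  maxChars = 32000
): { text: string; isTruncated: boolean; approxTokens: number } {
  const approxTokens = Math.ceil(text.length / 4); // for logging only
  if (text.length <= maxChars) return { text, isTruncated: false, approxTokens };
  const marker = '\n[... truncated ...]\n';
  const keep = maxChars - marker.length; // budget left for real content
  const head = Math.ceil(keep / 2);
  const tail = keep - head;
  return {
    text: text.slice(0, head) + marker + text.slice(text.length - tail),
    isTruncated: true,
    approxTokens,
  };
}
```

The output is always exactly `maxChars` characters when truncation fires, so the downstream embedding call can never exceed the budget.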

---

## Change 10: Move Lexical Search to CP3 (Reorder, Not New Scope)

**Why this is better:** Nothing "search-like" exists until CP4, but FTS5 is already a dependency for graceful degradation. Moving FTS setup to CP3 (when documents exist) gives an earlier usable artifact and better validation. CP4 becomes the "hybrid ranking upgrade."

```diff
@@ Checkpoint 3: Embedding Generation @@
-### Checkpoint 3: Embedding Generation
-**Deliverable:** Vector embeddings generated for all text content
+### Checkpoint 3: Document + Embedding Generation with Lexical Search
+**Deliverable:** Documents and embeddings generated; `gi search --mode=lexical` works end-to-end
```

```diff
@@ Checkpoint 3 Scope: @@
 - Ollama integration (nomic-embed-text model)
 - Embedding generation pipeline:
 @@
 - Fast label filtering via `document_labels` join table
+- FTS5 index for lexical search (moved from CP4)
+- `gi search --mode=lexical` CLI command (works without Ollama)
```

```diff
@@ Checkpoint 3 Manual CLI Smoke Tests: @@
+| `gi search "authentication" --mode=lexical` | FTS results | Returns matching documents, no embeddings required |
```

```diff
@@ Checkpoint 4: Semantic Search @@
-### Checkpoint 4: Semantic Search
-**Deliverable:** Working semantic search across all indexed content
+### Checkpoint 4: Hybrid Search (Semantic + Lexical)
+**Deliverable:** Working hybrid search (vector + FTS5 + RRF) across all indexed content
```

```diff
@@ Checkpoint 4 Scope: @@
 **Scope:**
 - Hybrid retrieval:
   - Vector recall (sqlite-vss) + FTS lexical recall (fts5)
   - Merge + rerank results using Reciprocal Rank Fusion (RRF)
+- Query embedding generation (same Ollama pipeline as documents)
 - Result ranking and scoring (document-level)
-- Search filters: ...
+- Filters work identically in hybrid and lexical modes
```
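The RRF merge step is small enough to sketch here. `k = 60` is the conventional RRF constant, an assumption rather than a value fixed by this spec; document ids stand in for the vector/FTS result rows.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per
// document; summed scores determine the fused order.
function rrfMerge(rankings: number[][], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, idx) => {
      // idx is 0-based, so rank = idx + 1
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + idx + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

Documents that appear high in both the vector and FTS lists accumulate two large terms and rise to the top, which is exactly the behavior the hybrid mode needs.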

---

## Summary of All Changes (Round 2)

| # | Change | Impact |
|---|--------|--------|
| 1 | **Cursor rewind + local filtering** | Fixes a real correctness gap in the tuple cursor implementation |
| 2 | **BEGIN IMMEDIATE CAS for lock** | Prevents a race condition in lock acquisition |
| 3 | **Discussions pagination + concurrency** | Prevents silent data loss on large discussion threads |
| 4 | **last_seen_at columns** | Enables debugging of sync coverage without deletions |
| 5 | **Raw payload compression** | Reduces DB size significantly at scale |
| 6 | **Scope discussions unique by project** | Defensive uniqueness for multi-project safety |
| 7 | **Dirty queue for doc regen** | Makes document regeneration deterministic and fast |
| 8 | **document_paths + --path filter** | High-value file search with minimal scope |
| 9 | **Character-based truncation** | Practical implementation without a tokenizer dependency |
| 10 | **Lexical search in CP3** | Earlier usable artifact; better checkpoint validation |

**Net effect:** These changes fix several correctness gaps (cursor, lock, pagination) while adding high-value features (`--path` filter) and operational improvements (compression, dirty queue, last_seen_at).