Full Sync Overview

The full sync runs in four stages:

Stage 1: Ingest Issues
Fetch issues + discussions + resource events from the GitLab API.
- Cursor-based incremental sync.
- Sequential discussion fetch.
- Queue-based resource events.

Stage 2: Ingest MRs
Fetch merge requests + discussions + resource events.
- Page-based incremental sync.
- Parallel prefetch of discussions.
- Queue-based resource events.

Stage 3: Generate Docs
Regenerate searchable documents for changed entities.
- Driven by the dirty_sources table.
- Triple-hash skip optimization.
- FTS5 index auto-updated.

Stage 4: Embed
Generate vector embeddings via Ollama for semantic search.
- Hash-based change detection.
- Chunked, batched API calls.
- Non-fatal — degrades gracefully if Ollama is down.
Concurrency Model
- Stages 1 & 2 process projects concurrently via buffer_unordered(primary_concurrency)
- Each project gets its own SQLite connection; the rate limiter is shared
- Discussions: sequential (issues) or batched parallel prefetch (MRs)
- Resource events use a persistent job queue with atomic claim + exponential backoff
Sync Flags
- --full — Resets all cursors & watermarks, forces a complete re-fetch
- --no-docs — Skips Stage 3 (document generation)
- --no-embed — Skips Stage 4 (embedding generation)
- --force — Overrides a stale single-flight lock
- --project <path> — Syncs only one project (fuzzy matching)
Single-Flight Lock
- Table-based lock (AppLock) prevents concurrent syncs
- A heartbeat keeps the lock alive; stale locks are auto-detected
- Use --force to override a stale lock
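The stale-lock check reduces to a timestamp comparison. A sketch under assumed names (the field and TTL here are illustrative, not the actual AppLock schema):

```rust
// A lock row is considered stale once its heartbeat is older than the TTL.
struct AppLock {
    last_heartbeat: u64, // Unix seconds of the last heartbeat write
}

fn is_stale(lock: &AppLock, now: u64, heartbeat_ttl: u64) -> bool {
    now.saturating_sub(lock.last_heartbeat) > heartbeat_ttl
}

fn main() {
    let lock = AppLock { last_heartbeat: 1_000 };
    // Fresh: 30s since the last heartbeat, TTL 60s → sync still running.
    assert!(!is_stale(&lock, 1_030, 60));
    // Stale: 120s with no heartbeat → a crashed sync; --force may override.
    assert!(is_stale(&lock, 1_120, 60));
}
```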
1. Fetch Issues: Cursor-Based Incremental Sync

GitLab API call: paginate_issues() with updated_after = cursor - rewind
→ Cursor filter: updated_at > cursor_ts OR tie-breaker check
→ transform_issue(): GitLab API shape → local DB row shape
→ Transaction: store_payload → upsert → mark_dirty → relink
→ Update cursor: every 100 issues + once at the end, in the sync_cursors table
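The cursor filter (timestamp plus tie-breaker) can be sketched as a pure predicate; the names here are assumptions, not the real implementation:

```rust
// Keep an issue if it is strictly newer than the cursor, or — at exactly the
// cursor timestamp — if its id is past the tie-breaker id. The rewind window
// deliberately re-fetches rows near the cursor; this filter drops the ones
// already stored.
fn passes_cursor(updated_at: i64, id: i64, cursor_ts: i64, tie_breaker_id: i64) -> bool {
    updated_at > cursor_ts || (updated_at == cursor_ts && id > tie_breaker_id)
}

fn main() {
    // Newer timestamp: always kept.
    assert!(passes_cursor(200, 1, 100, 50));
    // Same timestamp, higher id: kept via the tie-breaker.
    assert!(passes_cursor(100, 51, 100, 50));
    // Same timestamp, id already seen: filtered out.
    assert!(!passes_cursor(100, 50, 100, 50));
}
```

The tie-breaker matters because many issues can share one `updated_at` second; without it, a cursor equal to that timestamp would either re-process or silently skip the ties.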
2. Discussion Sync: Sequential, Watermark-Based

Query stale issues: updated_at > COALESCE(discussions_synced_for_updated_at, 0)
→ Paginate discussions: sequential per issue via paginate_issue_discussions()
→ Transform: transform_discussion() + transform_notes()
→ Write discussion: store_payload → upsert; DELETE notes → INSERT notes

✓ On success (all pages fetched): remove stale — DELETE discussions not seen in this fetch → advance the watermark: discussions_synced_for_updated_at = updated_at
✗ On pagination error: skip stale removal — watermark NOT advanced; will retry next sync
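The stale-issue query and the watermark advance reduce to one comparison; a sketch where SQL's `COALESCE(..., 0)` becomes `unwrap_or(0)`:

```rust
// An issue needs a discussion re-sync when it changed after the last
// successful discussion sync; a NULL watermark (never synced) counts as 0.
fn needs_discussion_sync(updated_at: i64, watermark: Option<i64>) -> bool {
    updated_at > watermark.unwrap_or(0)
}

fn main() {
    // Never synced: always stale.
    assert!(needs_discussion_sync(500, None));
    // Changed since the last successful sync: stale.
    assert!(needs_discussion_sync(600, Some(500)));
    // Watermark caught up (set to updated_at on success): clean.
    assert!(!needs_discussion_sync(600, Some(600)));
}
```

Because the watermark only advances on full success, a pagination error leaves `needs_discussion_sync` true and the next run retries the same issue.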
3. Resource Events: Queue-Based, Concurrent Fetch

Cleanup obsolete: DELETE jobs whose entity watermark is already current
→ Enqueue jobs: INSERT for entities where updated_at > watermark
→ Claim jobs: atomic UPDATE...RETURNING with lock acquisition
→ Fetch events: 3 concurrent calls — state + label + milestone

✓ On success: store events (one transaction: upsert all 3 event types) → complete + watermark (DELETE the job row, advance the watermark)
✗ Permanent error (404 / 403): skip permanently — complete_job + advance watermark (coalesced to empty)
↻ Transient error: backoff retry — fail_job: 30s × 2^(n-1), capped at 480s
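The transient-error backoff schedule (30s × 2^(n-1), capped at 480s) as a pure function:

```rust
// Delay before retry attempt n (1-based): 30s, 60s, 120s, 240s, 480s, 480s, ...
// The shift is clamped so very large attempt counts cannot overflow u64.
fn backoff_secs(attempt: u32) -> u64 {
    let base = 30u64.saturating_mul(1u64 << (attempt.saturating_sub(1)).min(10));
    base.min(480)
}

fn main() {
    assert_eq!(backoff_secs(1), 30);
    assert_eq!(backoff_secs(2), 60);
    assert_eq!(backoff_secs(5), 480);
    assert_eq!(backoff_secs(9), 480); // capped
}
```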
1. Fetch MRs: Page-Based Incremental Sync

GitLab API call: fetch_merge_requests_page() with cursor rewind (page-based, not streaming)
→ Cursor filter: same logic as issues — timestamp + tie-breaker
→ transform_merge_request(): maps API shape → local DB row
→ Transaction: store → upsert → dirty → labels + assignees + reviewers (3 junction tables, not 2)
→ Update cursor: per page boundary (not every 100)
2. MR Discussion Sync: Parallel Prefetch + Serial Write

Key differences from issue discussions:
- Parallel prefetch — fetches all discussions for a batch concurrently via join_all()
- Upsert pattern — notes use INSERT...ON CONFLICT (not delete-all + re-insert)
- Stale sweep — uses a last_seen_at timestamp comparison (not set difference)
- Sync health tracking — records discussions_sync_attempts and last_error

Query stale MRs: updated_at > COALESCE(discussions_synced_for_updated_at, 0) — same watermark logic as issues
→ Batch by concurrency: dependent_concurrency MRs per batch
→ Parallel prefetch: join_all() fetches all discussions for the batch
→ Transform in-memory: transform_mr_discussion() + diff-position notes
→ Serial write: upsert discussion, upsert notes (ON CONFLICT, not delete+insert)

✓ On full success: sweep stale — DELETE WHERE last_seen_at < run_seen_at (discussions + notes) → advance the watermark: discussions_synced_for_updated_at = updated_at
✗ On failure: record sync health — watermark NOT advanced; attempts + last_error tracked
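The last_seen_at sweep keeps any row touched during this run and deletes the rest; sketched as a retention predicate (names assumed):

```rust
// During the batch, every upserted discussion/note row gets last_seen_at set
// to the run's start timestamp. Afterwards, anything older was not returned
// by this fetch and is swept — no set difference needs to be materialized.
fn survives_sweep(last_seen_at: i64, run_seen_at: i64) -> bool {
    last_seen_at >= run_seen_at
}

fn main() {
    let run_seen_at = 1_700_000_000;
    assert!(survives_sweep(run_seen_at, run_seen_at));      // touched this run
    assert!(!survives_sweep(run_seen_at - 5, run_seen_at)); // stale → DELETE
}
```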
3. Resource Events: Same as Issues

Identical to issue resource events:
- Same queue-based approach: cleanup → enqueue → claim → fetch → store/fail
- Same watermark column: resource_events_synced_for_updated_at
- Same error handling: 404/403 coalesced to empty, transient errors get backoff
- entity_type = "merge_request" instead of "issue"
1. Dirty Source Queue: Populated During Ingestion

mark_dirty_tx(): called during every issue / MR / discussion upsert
→ dirty_sources table: INSERT (source_type, source_id); ON CONFLICT resets the backoff

2. Drain Loop: Batch 500, Respects Backoff

Get dirty sources: batch of 500, ORDER BY attempt_count, queued_at
→ Dispatch by type: issue / mr / discussion → extract function
→ Source exists? If deleted: remove the doc row (cascade cleans FTS + embeddings)
→ Extract content: structured text — header + metadata + body
→ Triple-hash check: do content_hash + labels_hash + paths_hash all match?
→ SAVEPOINT write: atomic — document row + labels + paths

✓ On success: clear_dirty() — remove from dirty_sources
✗ On error: record_dirty_error() — increment attempt_count, exponential backoff
≡ Triple-hash match: skip the write — all 3 hashes match, so no WAL churn; just clear dirty
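The triple-hash decision — rewrite only when any of the three hashes changed — can be sketched like this (field names are illustrative):

```rust
// Hashes of a document's extracted content, its label set, and its file-path
// set. If all three match what is already stored, the write (and the WAL
// churn it would cause) is skipped and the dirty flag is simply cleared.
#[derive(Debug, PartialEq)]
struct DocHashes {
    content_hash: u64,
    labels_hash: u64,
    paths_hash: u64,
}

fn needs_rewrite(stored: Option<&DocHashes>, fresh: &DocHashes) -> bool {
    stored != Some(fresh)
}

fn main() {
    let old = DocHashes { content_hash: 1, labels_hash: 2, paths_hash: 3 };
    let same = DocHashes { content_hash: 1, labels_hash: 2, paths_hash: 3 };
    let relabeled = DocHashes { content_hash: 1, labels_hash: 9, paths_hash: 3 };

    assert!(!needs_rewrite(Some(&old), &same));     // all 3 match → skip
    assert!(needs_rewrite(Some(&old), &relabeled)); // labels changed → write
    assert!(needs_rewrite(None, &same));            // no stored doc → write
}
```

Splitting the hash three ways means a label-only change still triggers a rewrite even though the body text (and thus a single content hash) would be unchanged.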
Full Mode (--full)
- Seeds ALL entities into dirty_sources via keyset pagination
- Triple-hash optimization prevents redundant writes even in full mode
- Runs FTS OPTIMIZE after the drain completes
1. Change Detection: Hash + Config Drift

find_pending_documents(): no metadata row? OR document_hash mismatch? OR config drift?
→ Keyset pagination: 500 documents per page, ordered by doc ID

2. Chunking: Split + Overflow Guard

split_into_chunks(): split at paragraph boundaries with configurable overlap
→ Overflow guard: too many chunks? Skip the document to prevent rowid collision
→ Build ChunkWork: assign encoded chunk IDs per document
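Paragraph-boundary chunking with overlap can be sketched as below. This is a simplified stand-in for split_into_chunks(): it uses a character budget and a fixed one-paragraph overlap, where the real splitter's limits and overlap are configurable:

```rust
// Greedy paragraph packing: each chunk holds as many paragraphs as fit in
// max_chars (byte lengths used as an approximation), and the last paragraph
// of a chunk is repeated at the start of the next one as overlap context.
fn split_into_chunks(text: &str, max_chars: usize) -> Vec<String> {
    let paras: Vec<&str> = text
        .split("\n\n")
        .filter(|p| !p.trim().is_empty())
        .collect();

    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for (i, para) in paras.iter().enumerate() {
        if !current.is_empty() && current.len() + 2 + para.len() > max_chars {
            chunks.push(current.clone());
            // Overlap: seed the next chunk with the previous paragraph.
            current = paras[i - 1].to_string();
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let chunks = split_into_chunks("aaaa\n\nbbbb\n\ncccc\n\ndddd", 12);
    // "aaaa\n\nbbbb" fills the first chunk; "cccc" starts the second with
    // "bbbb" carried over as overlap, and so on.
    assert_eq!(chunks[0], "aaaa\n\nbbbb");
    assert_eq!(chunks[1], "bbbb\n\ncccc");
    assert_eq!(chunks.len(), 3);
}
```

The overlap is what keeps a sentence's context retrievable when a semantic match lands right at a chunk boundary.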
3. Ollama Embedding: Batched API Calls

Batch embed: 32 chunks per Ollama API call
→ Store vectors: sqlite-vec embeddings table + embedding_metadata

✓ On success: SAVEPOINT commit — atomic per page: clear old + write new
↻ Context-length error: retry individually — re-embed each chunk solo to isolate the oversized one
✗ Other error: record it in embedding_metadata for retry on the next run
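The batch-then-individual fallback can be sketched as follows. Everything here is illustrative — `EmbedError` and the `embed_batch` callback stand in for the real Ollama client:

```rust
#[derive(Debug, Clone, PartialEq)]
enum EmbedError {
    ContextLength,
    Other(String),
}

// Try the whole batch first; on a context-length error, re-embed each chunk
// alone so one oversized chunk cannot poison the rest of the batch.
fn embed_with_fallback<F>(
    chunks: &[&str],
    mut embed_batch: F,
) -> Vec<Result<Vec<f32>, EmbedError>>
where
    F: FnMut(&[&str]) -> Result<Vec<Vec<f32>>, EmbedError>,
{
    match embed_batch(chunks) {
        Ok(vectors) => vectors.into_iter().map(Ok).collect(),
        Err(EmbedError::ContextLength) => chunks
            .iter()
            .map(|&c| embed_batch(&[c]).map(|mut v| v.remove(0)))
            .collect(),
        // Other errors (Ollama down, model missing) fail the whole batch;
        // the caller records them for retry on the next run.
        Err(e) => chunks.iter().map(|_| Err(e.clone())).collect(),
    }
}

fn main() {
    // Fake embedder: fails with ContextLength if any chunk exceeds 5 bytes,
    // otherwise returns a 1-dim "embedding" holding each chunk's length.
    let fake = |chunks: &[&str]| {
        if chunks.iter().any(|c| c.len() > 5) {
            Err(EmbedError::ContextLength)
        } else {
            Ok(chunks.iter().map(|c| vec![c.len() as f32]).collect())
        }
    };

    let results = embed_with_fallback(&["ok", "toolongchunk", "fine"], fake);
    assert_eq!(results[0], Ok(vec![2.0])); // embedded on individual retry
    assert!(results[1].is_err());          // the oversized chunk, isolated
    assert_eq!(results[2], Ok(vec![4.0]));
}
```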
Full Mode (--full)
- DELETEs all embedding_metadata and embeddings rows first
- Every document is re-processed from scratch

Non-Fatal in Sync
- Stage 4 failures (Ollama down, model missing) are handled gracefully
- The sync completes successfully; embeddings just won't be updated
- Semantic search degrades to FTS-only mode
Watermark & Cursor Reference
| Table | Column(s) | Purpose |
|---|---|---|
| sync_cursors | updated_at_cursor + tie_breaker_id | Incremental fetch: "last entity we saw" per project+type |
| issues | discussions_synced_for_updated_at | Per-issue discussion watermark |
| issues | resource_events_synced_for_updated_at | Per-issue resource event watermark |
| merge_requests | discussions_synced_for_updated_at | Per-MR discussion watermark |
| merge_requests | resource_events_synced_for_updated_at | Per-MR resource event watermark |
| dirty_sources | queued_at + next_attempt_at | Document regeneration queue with backoff |
| embedding_metadata | document_hash + chunk_max_bytes + model + dims | Embedding staleness detection |
| pending_dependent_fetches | locked_at + next_retry_at + attempts | Resource event job queue with backoff |