Full Sync Overview

Four stages:

Stage 1: Ingest Issues
Fetch issues + discussions + resource events from the GitLab API.
  • Cursor-based incremental sync
  • Sequential discussion fetch
  • Queue-based resource events

Stage 2: Ingest MRs
Fetch merge requests + discussions + resource events.
  • Page-based incremental sync
  • Parallel prefetch of discussions
  • Queue-based resource events

Stage 3: Generate Docs
Regenerate searchable documents for changed entities.
  • Driven by the dirty_sources table
  • Triple-hash skip optimization
  • FTS5 index auto-updated

Stage 4: Embed
Generate vector embeddings via Ollama for semantic search.
  • Hash-based change detection
  • Chunked, batched API calls
  • Non-fatal: degrades gracefully if Ollama is down
Concurrency Model
  • Stages 1 & 2 process projects concurrently via buffer_unordered(primary_concurrency)
  • Each project gets its own SQLite connection; rate limiter is shared
  • Discussions: sequential (issues) or batched parallel prefetch (MRs)
  • Resource events use a persistent job queue with atomic claim + exponential backoff
Sync Flags
  • --full — Resets all cursors & watermarks, forces complete re-fetch
  • --no-docs — Skips Stage 3 (document generation)
  • --no-embed — Skips Stage 4 (embedding generation)
  • --force — Overrides stale single-flight lock
  • --project <path> — Sync only one project (fuzzy matching)
Single-Flight Lock
  • Table-based lock (AppLock) prevents concurrent syncs
  • Heartbeat keeps the lock alive; stale locks auto-detected
  • Use --force to override a stale lock
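The lock rules above can be sketched as follows. The real AppLock schema is not shown in this document, so the field names, the 60-second TTL, and the `may_acquire` helper are illustrative assumptions, not the tool's actual API:

```rust
// Illustrative sketch of single-flight lock staleness (assumed names and
// TTL). A holder refreshes heartbeat_at periodically; a lock whose
// heartbeat is older than the TTL is considered stale and can only be
// taken over with --force.
struct AppLock {
    heartbeat_at: u64, // unix seconds of the last heartbeat
}

const HEARTBEAT_TTL_SECS: u64 = 60; // assumed TTL

fn is_stale(lock: &AppLock, now: u64) -> bool {
    now.saturating_sub(lock.heartbeat_at) > HEARTBEAT_TTL_SECS
}

fn may_acquire(existing: Option<&AppLock>, now: u64, force: bool) -> bool {
    match existing {
        None => true,                         // no lock row: acquire freely
        Some(l) if is_stale(l, now) => force, // stale lock: --force required
        Some(_) => false,                     // live lock: refuse to run
    }
}

fn main() {
    let now = 1_000_000;
    let live = AppLock { heartbeat_at: now - 10 };
    let stale = AppLock { heartbeat_at: now - 300 };
    assert!(may_acquire(None, now, false));
    assert!(!may_acquire(Some(&live), now, false));
    assert!(!may_acquire(Some(&stale), now, false));
    assert!(may_acquire(Some(&stale), now, true));
    println!("lock rules ok");
}
```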
Stage 1 Pipeline: Ingest Issues
Node legend: API Call · Transform · Database · Decision · Error Path · Queue
1. Fetch Issues: Cursor-Based Incremental Sync
  • GitLab API call — paginate_issues() with updated_after = cursor - rewind
  • Cursor filter — keep rows where updated_at > cursor_ts, OR the tie-breaker check passes
  • Transform — transform_issue() maps the GitLab API shape to the local DB row shape
  • Transaction — store_payload → upsert → mark_dirty → relink
  • Update cursor — every 100 issues plus a final update, persisted in the sync_cursors table
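One plausible reading of the cursor filter, sketched below. The diagram does not spell out the tie-breaker semantics, so the `id > tie_breaker_id` comparison is an assumption:

```rust
// Sketch of the cursor filter: because the fetch rewinds the cursor,
// already-seen rows can come back. A row passes only if it is strictly
// newer than the cursor timestamp, or ties on the timestamp but has a
// larger ID than the stored tie-breaker (assumed semantics).
struct SyncCursor {
    updated_at_cursor: i64, // cursor_ts
    tie_breaker_id: i64,
}

fn passes_cursor(updated_at: i64, id: i64, cur: &SyncCursor) -> bool {
    updated_at > cur.updated_at_cursor
        || (updated_at == cur.updated_at_cursor && id > cur.tie_breaker_id)
}

fn main() {
    let cur = SyncCursor { updated_at_cursor: 100, tie_breaker_id: 7 };
    assert!(passes_cursor(101, 1, &cur));  // strictly newer: keep
    assert!(passes_cursor(100, 8, &cur));  // tie, larger id: keep
    assert!(!passes_cursor(100, 7, &cur)); // tie, same id: already seen
    assert!(!passes_cursor(99, 9, &cur));  // older row from the rewind: skip
    println!("cursor filter ok");
}
```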
2. Discussion Sync: Sequential, Watermark-Based
  • Query stale issues — updated_at > COALESCE(discussions_synced_for_updated_at, 0)
  • Paginate discussions — sequential per issue via paginate_issue_discussions()
  • Transform — transform_discussion() + transform_notes()
  • Write discussion — store_payload → upsert; DELETE notes → INSERT notes
  ✓ On success (all pages fetched):
  • Remove stale — DELETE discussions not seen in this fetch
  • Advance watermark — discussions_synced_for_updated_at = updated_at
  ✗ On pagination error:
  • Skip stale removal — watermark NOT advanced; the issue is retried on the next sync
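The all-or-nothing watermark rule above can be sketched like this; the page representation and function signature are illustrative, not the tool's real types:

```rust
// Sketch of the all-or-nothing watermark rule for issue discussions:
// the per-issue watermark only advances after every discussion page was
// fetched, so a mid-pagination failure leaves the issue "stale" and it
// is retried on the next sync.
fn sync_issue_discussions(
    pages: &[Result<Vec<u32>, String>], // fetched discussion IDs per page
    watermark: &mut i64,
    issue_updated_at: i64,
) -> bool {
    let mut seen: Vec<u32> = Vec::new();
    for page in pages {
        match page {
            Ok(ids) => seen.extend_from_slice(ids),
            // Pagination error: skip stale removal, do NOT advance.
            Err(_) => return false,
        }
    }
    // All pages fetched: the real code deletes discussions not in `seen`
    // here, then advances the watermark.
    *watermark = issue_updated_at;
    true
}

fn main() {
    let mut wm = 0i64;
    let ok_pages: Vec<Result<Vec<u32>, String>> = vec![Ok(vec![1, 2]), Ok(vec![3])];
    assert!(sync_issue_discussions(&ok_pages, &mut wm, 42));
    assert_eq!(wm, 42);

    let mut wm2 = 0i64;
    let bad_pages: Vec<Result<Vec<u32>, String>> =
        vec![Ok(vec![1]), Err("timeout".into())];
    assert!(!sync_issue_discussions(&bad_pages, &mut wm2, 42));
    assert_eq!(wm2, 0); // watermark untouched; retried next sync
    println!("watermark rule ok");
}
```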
3. Resource Events: Queue-Based, Concurrent Fetch
  • Cleanup obsolete — DELETE jobs whose entity watermark is already current
  • Enqueue jobs — INSERT for entities where updated_at > watermark
  • Claim jobs — atomic UPDATE...RETURNING with lock acquisition
  • Fetch events — 3 concurrent fetches: state + label + milestone
  ✓ On success:
  • Store events — transaction: upsert all 3 event types
  • Complete + watermark — DELETE the job row, then advance the watermark
  ✗ Permanent error (404 / 403):
  • Skip permanently — complete_job + advance watermark (coalesced to empty)
  ↻ Transient error:
  • Backoff retry — fail_job: 30s × 2^(n-1), capped at 480s
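The retry delay formula from the diagram, sketched as a small function (the function name is illustrative; the formula and cap are taken from the diagram):

```rust
// fail_job's retry delay: 30s × 2^(n-1), capped at 480s, where n is the
// attempt count for the job.
fn backoff_secs(attempt: u32) -> u64 {
    let exp = attempt.saturating_sub(1).min(10); // clamp to avoid overflow
    (30u64 << exp).min(480)
}

fn main() {
    assert_eq!(backoff_secs(1), 30);
    assert_eq!(backoff_secs(2), 60);
    assert_eq!(backoff_secs(4), 240);
    assert_eq!(backoff_secs(5), 480);
    assert_eq!(backoff_secs(9), 480); // capped
    println!("backoff ok");
}
```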
Stage 2 Pipeline: Ingest MRs
Node legend: API Call · Transform · Database · Diff from Issues · Error Path · Queue
1. Fetch MRs: Page-Based Incremental Sync
  • GitLab API call — fetch_merge_requests_page() with cursor rewind (page-based, not streaming)
  • Cursor filter — same logic as issues: timestamp + tie-breaker
  • Transform — transform_merge_request() maps the API shape to the local DB row
  • Transaction — store → upsert → dirty → labels + assignees + reviewers (3 junction tables, not 2)
  • Update cursor — at each page boundary (not every 100 entities)
2. MR Discussion Sync: Parallel Prefetch + Serial Write
Key differences from issue discussions:
  • Parallel prefetch — fetches all discussions for a batch concurrently via join_all()
  • Upsert pattern — notes use INSERT...ON CONFLICT (not delete-all + re-insert)
  • Sweep stale — uses a last_seen_at timestamp comparison (not set difference)
  • Sync health tracking — records discussions_sync_attempts and last_error
Flow:
  • Query stale MRs — updated_at > COALESCE(discussions_synced_for_updated_at, 0); same watermark logic as issues
  • Batch by concurrency — dependent_concurrency MRs per batch
  • Parallel prefetch — join_all() fetches all discussions for the batch
  • Transform in-memory — transform_mr_discussion() + diff-position notes
  • Serial write — upsert discussion; upsert notes via ON CONFLICT (not delete+insert)
  ✓ On full success:
  • Sweep stale — DELETE WHERE last_seen_at < run_seen_at (discussions + notes)
  • Advance watermark — discussions_synced_for_updated_at = updated_at
  ✗ On failure:
  • Record sync health — watermark NOT advanced; tracks attempts + last_error
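The last_seen_at sweep can be sketched as below; the in-memory rows stand in for the real `DELETE ... WHERE last_seen_at < ?` query, and the types are illustrative:

```rust
// Sketch of the last_seen_at sweep: each discussion upserted in this run
// has last_seen_at stamped with the run's timestamp; rows with an older
// stamp were not returned by the API this run and get deleted.
fn stale_ids(rows: &[(u32, i64)], run_seen_at: i64) -> Vec<u32> {
    rows.iter()
        .filter(|&&(_, last_seen)| last_seen < run_seen_at)
        .map(|&(id, _)| id)
        .collect()
}

fn main() {
    let run_seen_at = 500;
    // (discussion id, last_seen_at)
    let rows = [(1, 500), (2, 450), (3, 500), (4, 100)];
    assert_eq!(stale_ids(&rows, run_seen_at), vec![2, 4]);
    println!("sweep ok");
}
```

Compared with the issue path's set-difference approach, this needs no in-memory set of seen IDs: the stamp written during the upsert doubles as the liveness marker.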
3. Resource Events: Same as Issues
Identical to the issue resource-event pipeline:
  • Same queue-based approach: cleanup → enqueue → claim → fetch → store/fail
  • Same watermark column: resource_events_synced_for_updated_at
  • Same error handling: 404/403 coalesced to empty, transient errors get backoff
  • entity_type = "merge_request" instead of "issue"
Stage 3 Pipeline: Generate Docs
Node legend: Trigger · Extract · Database · Decision · Error
1. Dirty Source Queue: Populated During Ingestion
  • mark_dirty_tx() — called during every issue / MR / discussion upsert
  • dirty_sources table — INSERT (source_type, source_id) ON CONFLICT reset backoff

2. Drain Loop: Batch 500, Respects Backoff
  • Get dirty sources — batches of 500, ORDER BY attempt_count, queued_at
  • Dispatch by type — issue / mr / discussion → the matching extract function
  • Source exists? — if deleted: remove the document row (cascade cleans FTS + embeddings)
  • Extract content — structured text: header + metadata + body
  • Triple-hash check — do content_hash + labels_hash + paths_hash all match?
  • SAVEPOINT write — atomic: document row + labels + paths
  ✓ On success:
  • clear_dirty() — remove the row from dirty_sources
  ✗ On error:
  • record_dirty_error() — increment attempt_count; exponential backoff
  ≡ Triple-hash match (skip):
  • Skip write — all 3 hashes match → no WAL churn; clear dirty
Full Mode (--full)
  • Seeds ALL entities into dirty_sources via keyset pagination
  • Triple-hash optimization prevents redundant writes even in full mode
  • Runs FTS OPTIMIZE after drain completes
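Keyset pagination, as used above to seed dirty_sources, can be sketched with an in-memory stand-in for the SQL `WHERE id > ? ORDER BY id LIMIT ?` pattern (the function name and page size are illustrative):

```rust
// Sketch of keyset (seek) pagination: each page resumes strictly after
// the last ID of the previous page instead of using OFFSET, which stays
// fast on large tables because the seek uses the primary-key index.
fn fetch_page(sorted_ids: &[u64], after: u64, limit: usize) -> Vec<u64> {
    sorted_ids
        .iter()
        .copied()
        .filter(|&id| id > after)
        .take(limit)
        .collect()
}

fn main() {
    let ids: Vec<u64> = (1..=7).collect();
    let mut after = 0;
    let mut pages = Vec::new();
    loop {
        let page = fetch_page(&ids, after, 3);
        if page.is_empty() {
            break;
        }
        after = *page.last().unwrap(); // keyset: resume after the last ID
        pages.push(page);
    }
    assert_eq!(pages, vec![vec![1, 2, 3], vec![4, 5, 6], vec![7]]);
    println!("keyset ok");
}
```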
Stage 4 Pipeline: Embed
Node legend: API (Ollama) · Processing · Database · Decision · Error
1. Change Detection: Hash + Config Drift
  • find_pending_documents() — no metadata row? OR document_hash mismatch? OR config drift?
  • Keyset pagination — 500 documents per page, ordered by doc ID

2. Chunking: Split + Overflow Guard
  • split_into_chunks() — split on paragraph boundaries with configurable overlap
  • Overflow guard — too many chunks? skip the document to prevent rowid collision
  • Build ChunkWork — assign encoded chunk IDs per document
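The real split_into_chunks() is not shown here, so the following is a rough sketch of paragraph-boundary packing; the byte budget and the one-paragraph overlap policy are assumptions:

```rust
// Sketch of paragraph-boundary chunking with overlap. Paragraphs are
// packed greedily up to max_bytes; each new chunk starts with the
// previous chunk's last paragraph so context straddles the boundary.
fn split_into_chunks(text: &str, max_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut last_para = String::new();
    for para in text.split("\n\n").filter(|p| !p.trim().is_empty()) {
        if !current.is_empty() && current.len() + para.len() + 2 > max_bytes {
            chunks.push(current.clone());
            current = last_para.clone(); // overlap carried into next chunk
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
        last_para = para.to_string();
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let chunks = split_into_chunks("aaaa\n\nbbbb\n\ncccc", 10);
    assert_eq!(chunks, vec!["aaaa\n\nbbbb", "bbbb\n\ncccc"]);
    println!("chunking ok");
}
```

The overflow guard then simply skips the document when the chunk count exceeds what the encoded chunk-ID scheme can represent.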
3. Ollama Embedding: Batched API Calls
  • Batch embed — 32 chunks per Ollama API call
  • Store vectors — sqlite-vec embeddings table + embedding_metadata
  ✓ On success:
  • SAVEPOINT commit — atomic per page: clear old vectors + write new
  ↻ Context-length error:
  • Retry individually — re-embed each chunk solo to isolate the oversized one
  ✗ Other error:
  • Record error — stored in embedding_metadata for retry on the next run
Full Mode (--full)
  • DELETEs all embedding_metadata and embeddings rows first
  • Every document re-processed from scratch
Non-Fatal in Sync
  • Stage 4 failures (Ollama down, model missing) are handled gracefully
  • The sync still completes successfully; embeddings just won't be updated
  • Semantic search degrades to FTS-only mode
Watermark & Cursor Reference
| Table | Column(s) | Purpose |
| --- | --- | --- |
| sync_cursors | updated_at_cursor + tie_breaker_id | Incremental fetch: "last entity we saw" per project + type |
| issues | discussions_synced_for_updated_at | Per-issue discussion watermark |
| issues | resource_events_synced_for_updated_at | Per-issue resource event watermark |
| merge_requests | discussions_synced_for_updated_at | Per-MR discussion watermark |
| merge_requests | resource_events_synced_for_updated_at | Per-MR resource event watermark |
| dirty_sources | queued_at + next_attempt_at | Document regeneration queue with backoff |
| embedding_metadata | document_hash + chunk_max_bytes + model + dims | Embedding staleness detection |
| pending_dependent_fetches | locked_at + next_retry_at + attempts | Resource event job queue with backoff |
