Full Sync Overview

The full sync runs in four stages:

Stage 1: Ingest Issues
Fetch issues + discussions + resource events from the GitLab API.
- Cursor-based incremental sync.
- Sequential discussion fetch.
- Queue-based resource events.

Stage 2: Ingest MRs
Fetch merge requests + discussions + resource events.
- Page-based incremental sync.
- Parallel prefetch of discussions.
- Queue-based resource events.

Stage 3: Generate Docs
Regenerate searchable documents for changed entities.
- Driven by the dirty_sources table.
- Triple-hash skip optimization.
- FTS5 index auto-updated.

Stage 4: Embed
Generate vector embeddings via Ollama for semantic search.
- Hash-based change detection.
- Chunked, batched API calls.
- Non-fatal — degrades gracefully if Ollama is down.
Concurrency Model
- Stages 1 & 2 process projects concurrently via buffer_unordered(primary_concurrency)
- Each project gets its own SQLite connection; the rate limiter is shared
- Discussions: sequential (issues) or batched parallel prefetch (MRs)
- Resource events use a persistent job queue with atomic claim + exponential backoff
Sync Flags
- --full — Resets all cursors & watermarks, forces a complete re-fetch
- --no-docs — Skips Stage 3 (document generation)
- --no-embed — Skips Stage 4 (embedding generation)
- --force — Overrides a stale single-flight lock
- --project <path> — Syncs only one project (fuzzy matching)
Single-Flight Lock
- Table-based lock (AppLock) prevents concurrent syncs
- A heartbeat keeps the lock alive; stale locks are auto-detected
- Use --force to override a stale lock
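The stale-lock check reduces to a timestamp comparison. A sketch under assumed names (the field and TTL here are illustrative, not the actual AppLock schema):

```rust
// A lock row is considered stale once its heartbeat is older than the TTL.
struct AppLock {
    last_heartbeat: u64, // Unix seconds of the last heartbeat write
}

fn is_stale(lock: &AppLock, now: u64, heartbeat_ttl: u64) -> bool {
    now.saturating_sub(lock.last_heartbeat) > heartbeat_ttl
}

fn main() {
    let lock = AppLock { last_heartbeat: 1_000 };
    // Fresh: 30s since the last heartbeat, TTL 60s → sync still running.
    assert!(!is_stale(&lock, 1_030, 60));
    // Stale: 120s with no heartbeat → a crashed sync; --force may override.
    assert!(is_stale(&lock, 1_120, 60));
}
```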
1. Fetch Issues: Cursor-Based Incremental Sync

GitLab API call: paginate_issues() with updated_after = cursor - rewind
→ Cursor filter: updated_at > cursor_ts OR tie-breaker check
→ transform_issue(): GitLab API shape → local DB row shape
→ Transaction: store_payload → upsert → mark_dirty → relink
→ Update cursor: every 100 issues + once at the end, in the sync_cursors table
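The cursor filter (timestamp plus tie-breaker) can be sketched as a pure predicate; the names here are assumptions, not the real implementation:

```rust
// Keep an issue if it is strictly newer than the cursor, or — at exactly the
// cursor timestamp — if its id is past the tie-breaker id. The rewind window
// deliberately re-fetches rows near the cursor; this filter drops the ones
// already stored.
fn passes_cursor(updated_at: i64, id: i64, cursor_ts: i64, tie_breaker_id: i64) -> bool {
    updated_at > cursor_ts || (updated_at == cursor_ts && id > tie_breaker_id)
}

fn main() {
    // Newer timestamp: always kept.
    assert!(passes_cursor(200, 1, 100, 50));
    // Same timestamp, higher id: kept via the tie-breaker.
    assert!(passes_cursor(100, 51, 100, 50));
    // Same timestamp, id already seen: filtered out.
    assert!(!passes_cursor(100, 50, 100, 50));
}
```

The tie-breaker matters because many issues can share one `updated_at` second; without it, a cursor equal to that timestamp would either re-process or silently skip the ties.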
2. Discussion Sync: Sequential, Watermark-Based

Query stale issues: updated_at > COALESCE(discussions_synced_for_updated_at, 0)
→ Paginate discussions: sequential per issue via paginate_issue_discussions()
→ Transform: transform_discussion() + transform_notes()
→ Write discussion: store_payload → upsert; DELETE notes → INSERT notes

✓ On success (all pages fetched): remove stale — DELETE discussions not seen in this fetch → advance the watermark: discussions_synced_for_updated_at = updated_at
✗ On pagination error: skip stale removal — watermark NOT advanced; will retry next sync
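The stale-issue query and the watermark advance reduce to one comparison; a sketch where SQL's `COALESCE(..., 0)` becomes `unwrap_or(0)`:

```rust
// An issue needs a discussion re-sync when it changed after the last
// successful discussion sync; a NULL watermark (never synced) counts as 0.
fn needs_discussion_sync(updated_at: i64, watermark: Option<i64>) -> bool {
    updated_at > watermark.unwrap_or(0)
}

fn main() {
    // Never synced: always stale.
    assert!(needs_discussion_sync(500, None));
    // Changed since the last successful sync: stale.
    assert!(needs_discussion_sync(600, Some(500)));
    // Watermark caught up (set to updated_at on success): clean.
    assert!(!needs_discussion_sync(600, Some(600)));
}
```

Because the watermark only advances on full success, a pagination error leaves `needs_discussion_sync` true and the next run retries the same issue.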
3. Resource Events: Queue-Based, Concurrent Fetch

Cleanup obsolete: DELETE jobs whose entity watermark is already current
→ Enqueue jobs: INSERT for entities where updated_at > watermark
→ Claim jobs: atomic UPDATE...RETURNING with lock acquisition
→ Fetch events: 3 concurrent calls — state + label + milestone

✓ On success: store events (one transaction: upsert all 3 event types) → complete + watermark (DELETE the job row, advance the watermark)
✗ Permanent error (404 / 403): skip permanently — complete_job + advance watermark (coalesced to empty)
↻ Transient error: backoff retry — fail_job: 30s × 2^(n-1), capped at 480s
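The transient-error backoff schedule (30s × 2^(n-1), capped at 480s) as a pure function:

```rust
// Delay before retry attempt n (1-based): 30s, 60s, 120s, 240s, 480s, 480s, ...
// The shift is clamped so very large attempt counts cannot overflow u64.
fn backoff_secs(attempt: u32) -> u64 {
    let base = 30u64.saturating_mul(1u64 << (attempt.saturating_sub(1)).min(10));
    base.min(480)
}

fn main() {
    assert_eq!(backoff_secs(1), 30);
    assert_eq!(backoff_secs(2), 60);
    assert_eq!(backoff_secs(5), 480);
    assert_eq!(backoff_secs(9), 480); // capped
}
```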
1. Fetch MRs: Page-Based Incremental Sync

GitLab API call: fetch_merge_requests_page() with cursor rewind (page-based, not streaming)
→ Cursor filter: same logic as issues — timestamp + tie-breaker
→ transform_merge_request(): maps API shape → local DB row
→ Transaction: store → upsert → dirty → labels + assignees + reviewers (3 junction tables, not 2)
→ Update cursor: per page boundary (not every 100)
2. MR Discussion Sync: Parallel Prefetch + Serial Write

Key differences from issue discussions:
- Parallel prefetch — fetches all discussions for a batch concurrently via join_all()
- Upsert pattern — notes use INSERT...ON CONFLICT (not delete-all + re-insert)
- Stale sweep — uses a last_seen_at timestamp comparison (not set difference)
- Sync health tracking — records discussions_sync_attempts and last_error

Query stale MRs: updated_at > COALESCE(discussions_synced_for_updated_at, 0) — same watermark logic as issues
→ Batch by concurrency: dependent_concurrency MRs per batch
→ Parallel prefetch: join_all() fetches all discussions for the batch
→ Transform in-memory: transform_mr_discussion() + diff-position notes
→ Serial write: upsert discussion, upsert notes (ON CONFLICT, not delete+insert)

✓ On full success: sweep stale — DELETE WHERE last_seen_at < run_seen_at (discussions + notes) → advance the watermark: discussions_synced_for_updated_at = updated_at
✗ On failure: record sync health — watermark NOT advanced; attempts + last_error tracked
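The last_seen_at sweep keeps any row touched during this run and deletes the rest; sketched as a retention predicate (names assumed):

```rust
// During the batch, every upserted discussion/note row gets last_seen_at set
// to the run's start timestamp. Afterwards, anything older was not returned
// by this fetch and is swept — no set difference needs to be materialized.
fn survives_sweep(last_seen_at: i64, run_seen_at: i64) -> bool {
    last_seen_at >= run_seen_at
}

fn main() {
    let run_seen_at = 1_700_000_000;
    assert!(survives_sweep(run_seen_at, run_seen_at));      // touched this run
    assert!(!survives_sweep(run_seen_at - 5, run_seen_at)); // stale → DELETE
}
```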
3. Resource Events: Same as Issues

Identical to issue resource events:
- Same queue-based approach: cleanup → enqueue → claim → fetch → store/fail
- Same watermark column: resource_events_synced_for_updated_at
- Same error handling: 404/403 coalesced to empty, transient errors get backoff
- entity_type = "merge_request" instead of "issue"
1. Dirty Source Queue: Populated During Ingestion

mark_dirty_tx(): called during every issue / MR / discussion upsert
→ dirty_sources table: INSERT (source_type, source_id); ON CONFLICT resets the backoff

2. Drain Loop: Batch 500, Respects Backoff

Get dirty sources: batch of 500, ORDER BY attempt_count, queued_at
→ Dispatch by type: issue / mr / discussion → extract function
→ Source exists? If deleted: remove the doc row (cascade cleans FTS + embeddings)
→ Extract content: structured text — header + metadata + body
→ Triple-hash check: do content_hash + labels_hash + paths_hash all match?
→ SAVEPOINT write: atomic — document row + labels + paths

✓ On success: clear_dirty() — remove from dirty_sources
✗ On error: record_dirty_error() — increment attempt_count, exponential backoff
≡ Triple-hash match: skip the write — all 3 hashes match, so no WAL churn; just clear dirty
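The triple-hash decision — rewrite only when any of the three hashes changed — can be sketched like this (field names are illustrative):

```rust
// Hashes of a document's extracted content, its label set, and its file-path
// set. If all three match what is already stored, the write (and the WAL
// churn it would cause) is skipped and the dirty flag is simply cleared.
#[derive(Debug, PartialEq)]
struct DocHashes {
    content_hash: u64,
    labels_hash: u64,
    paths_hash: u64,
}

fn needs_rewrite(stored: Option<&DocHashes>, fresh: &DocHashes) -> bool {
    stored != Some(fresh)
}

fn main() {
    let old = DocHashes { content_hash: 1, labels_hash: 2, paths_hash: 3 };
    let same = DocHashes { content_hash: 1, labels_hash: 2, paths_hash: 3 };
    let relabeled = DocHashes { content_hash: 1, labels_hash: 9, paths_hash: 3 };

    assert!(!needs_rewrite(Some(&old), &same));     // all 3 match → skip
    assert!(needs_rewrite(Some(&old), &relabeled)); // labels changed → write
    assert!(needs_rewrite(None, &same));            // no stored doc → write
}
```

Splitting the hash three ways means a label-only change still triggers a rewrite even though the body text (and thus a single content hash) would be unchanged.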
Full Mode (--full)
- Seeds ALL entities into dirty_sources via keyset pagination
- Triple-hash optimization prevents redundant writes even in full mode
- Runs FTS OPTIMIZE after the drain completes
1. Change Detection: Hash + Config Drift

find_pending_documents(): no metadata row? OR document_hash mismatch? OR config drift?
→ Keyset pagination: 500 documents per page, ordered by doc ID

2. Chunking: Split + Overflow Guard

split_into_chunks(): split at paragraph boundaries with configurable overlap
→ Overflow guard: too many chunks? Skip the document to prevent rowid collision
→ Build ChunkWork: assign encoded chunk IDs per document
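Paragraph-boundary chunking with overlap can be sketched as below. This is a simplified stand-in for split_into_chunks(): it uses a character budget and a fixed one-paragraph overlap, where the real splitter's limits and overlap are configurable:

```rust
// Greedy paragraph packing: each chunk holds as many paragraphs as fit in
// max_chars (byte lengths used as an approximation), and the last paragraph
// of a chunk is repeated at the start of the next one as overlap context.
fn split_into_chunks(text: &str, max_chars: usize) -> Vec<String> {
    let paras: Vec<&str> = text
        .split("\n\n")
        .filter(|p| !p.trim().is_empty())
        .collect();

    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();

    for (i, para) in paras.iter().enumerate() {
        if !current.is_empty() && current.len() + 2 + para.len() > max_chars {
            chunks.push(current.clone());
            // Overlap: seed the next chunk with the previous paragraph.
            current = paras[i - 1].to_string();
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let chunks = split_into_chunks("aaaa\n\nbbbb\n\ncccc\n\ndddd", 12);
    // "aaaa\n\nbbbb" fills the first chunk; "cccc" starts the second with
    // "bbbb" carried over as overlap, and so on.
    assert_eq!(chunks[0], "aaaa\n\nbbbb");
    assert_eq!(chunks[1], "bbbb\n\ncccc");
    assert_eq!(chunks.len(), 3);
}
```

The overlap is what keeps a sentence's context retrievable when a semantic match lands right at a chunk boundary.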
3. Ollama Embedding: Batched API Calls

Batch embed: 32 chunks per Ollama API call
→ Store vectors: sqlite-vec embeddings table + embedding_metadata

✓ On success: SAVEPOINT commit — atomic per page: clear old + write new
↻ Context-length error: retry individually — re-embed each chunk solo to isolate the oversized one
✗ Other error: record it in embedding_metadata for retry on the next run
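The batch-then-individual fallback can be sketched as follows. Everything here is illustrative — `EmbedError` and the `embed_batch` callback stand in for the real Ollama client:

```rust
#[derive(Debug, Clone, PartialEq)]
enum EmbedError {
    ContextLength,
    Other(String),
}

// Try the whole batch first; on a context-length error, re-embed each chunk
// alone so one oversized chunk cannot poison the rest of the batch.
fn embed_with_fallback<F>(
    chunks: &[&str],
    mut embed_batch: F,
) -> Vec<Result<Vec<f32>, EmbedError>>
where
    F: FnMut(&[&str]) -> Result<Vec<Vec<f32>>, EmbedError>,
{
    match embed_batch(chunks) {
        Ok(vectors) => vectors.into_iter().map(Ok).collect(),
        Err(EmbedError::ContextLength) => chunks
            .iter()
            .map(|&c| embed_batch(&[c]).map(|mut v| v.remove(0)))
            .collect(),
        // Other errors (Ollama down, model missing) fail the whole batch;
        // the caller records them for retry on the next run.
        Err(e) => chunks.iter().map(|_| Err(e.clone())).collect(),
    }
}

fn main() {
    // Fake embedder: fails with ContextLength if any chunk exceeds 5 bytes,
    // otherwise returns a 1-dim "embedding" holding each chunk's length.
    let fake = |chunks: &[&str]| {
        if chunks.iter().any(|c| c.len() > 5) {
            Err(EmbedError::ContextLength)
        } else {
            Ok(chunks.iter().map(|c| vec![c.len() as f32]).collect())
        }
    };

    let results = embed_with_fallback(&["ok", "toolongchunk", "fine"], fake);
    assert_eq!(results[0], Ok(vec![2.0])); // embedded on individual retry
    assert!(results[1].is_err());          // the oversized chunk, isolated
    assert_eq!(results[2], Ok(vec![4.0]));
}
```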
Full Mode (--full)
- DELETEs all embedding_metadata and embeddings rows first
- Every document is re-processed from scratch

Non-Fatal in Sync
- Stage 4 failures (Ollama down, model missing) are handled gracefully
- The sync completes successfully; embeddings just won't be updated
- Semantic search degrades to FTS-only mode
Watermark & Cursor Reference
| Table | Column(s) | Purpose |
|---|---|---|
| sync_cursors | updated_at_cursor + tie_breaker_id | Incremental fetch: "last entity we saw" per project+type |
| issues | discussions_synced_for_updated_at | Per-issue discussion watermark |
| issues | resource_events_synced_for_updated_at | Per-issue resource event watermark |
| merge_requests | discussions_synced_for_updated_at | Per-MR discussion watermark |
| merge_requests | resource_events_synced_for_updated_at | Per-MR resource event watermark |
| dirty_sources | queued_at + next_attempt_at | Document regeneration queue with backoff |
| embedding_metadata | document_hash + chunk_max_bytes + model + dims | Embedding staleness detection |
| pending_dependent_fetches | locked_at + next_retry_at + attempts | Resource event job queue with backoff |