diff --git a/api-review.html b/api-review.html deleted file mode 100644 index 9ac6b9d..0000000 --- a/api-review.html +++ /dev/null @@ -1,1654 +0,0 @@ - - - - - -Gitlore ↔ GitLab API Review - - - - -
-

Gitlore CLI ↔ GitLab API Review

-

Interactive mapping of the gitlore local sync engine to the GitLab REST API v4

-
- -
-
Overview
-
CLI Architecture
-
API Endpoints Used
-
Full GitLab API
-
Mapping
-
Data Model
-
Ingestion Pipeline
-
Field Coverage
-
Efficiency Analysis
-
- -
- - -
-

What is Gitlore?

-
-

Gitlore (lore) is a read-only local sync engine for GitLab data. It fetches issues, merge requests, and discussions from the GitLab REST API v4 and stores them in a local SQLite database for fast offline querying.

-

~11,148 lines of Rust. Noun-first CLI. Robot mode for automation. Cursor-based incremental sync.

-
- -

API Surface Coverage

-

Gitlore uses a small, focused subset of the GitLab API. It is read-only — it never creates, updates, or deletes anything on GitLab.

- -
-
-
Endpoints Used: 7
-
Issues (list)GET
-
Merge Requests (list)GET
-
Issue Discussions (list)GET
-
MR Discussions (list)GET
-
Current UserGET
-
Project DetailsGET
-
GitLab VersionGET
-
-
-
Coverage by Resource
-
Issues API2 of ~20 endpoints
-
-
MR API2 of ~30 endpoints
-
-
Discussions API2 of ~30 endpoints
-
-
Users API1 of ~15 endpoints
-
-
Projects API1 of ~50 endpoints
-
-
-
- -

Data Flow

-
-
GitLab API v4
-
-
Rate Limiter
10 req/s + jitter
-
-
Async Streams
Paginated fetch
-
-
Transformers
Normalize data
-
-
SQLite (WAL)
Local DB
-
- -

Key Design Decisions

-
-
-
Read-only sync
-

Only GET requests. Never mutates GitLab state. Safe to run repeatedly.

-
-
-
Cursor-based incremental
-

Uses updated_after parameter to only fetch changed data. 2-second rewind overlap for safety.

-
-
-
Raw payload archival
-

Stores original JSON responses with SHA-256 dedup and optional gzip compression.

-
-
-
Discussion full-refresh
-

Discussions use DELETE+INSERT strategy per parent (no incremental). Parallel prefetch, serial write.

-
-
-
- - -
-

CLI Architecture

- -

Command Structure (Noun-First)

-
-
lore <noun> [verb/arg]    # Primary pattern
-lore issues              # List all issues
-lore issues 42           # Show issue #42
-lore mrs                 # List all merge requests
-lore mrs 17              # Show MR #17
-lore ingest issues       # Fetch issues from GitLab
-lore ingest mrs          # Fetch MRs from GitLab
-lore count issues        # Count local issues
-lore count discussions   # Count local discussions
-lore status              # Show sync state
-lore auth                # Verify GitLab auth
-lore doctor              # Health check
-lore init                # Initialize config + DB
-lore migrate             # Run DB migrations
-lore version             # Show version
-
- -

Global Flags

- - - - - -
FlagDescription
-c, --configPath to config file
--robotMachine-readable JSON output
-J, --jsonJSON shorthand (same as --robot)
- -

Robot Mode Detection

-
-

Three ways to activate:

-
lore --robot list issues          # Explicit flag
-lore list issues | jq .           # Auto: stdout not a TTY
-LORE_ROBOT=1 lore list issues     # Environment variable
-

Robot mode returns JSON: {"ok":true,"data":{...},"meta":{...}}

-

Errors go to stderr: {"error":{"code":"...","message":"...","suggestion":"..."}}

-
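The three activation paths resolve as a simple OR. A minimal sketch of the precedence logic, with illustrative names not taken from the gitlore source (the sketch assumes only `LORE_ROBOT=1` activates, per the example above):

```rust
use std::env;

// Robot mode is on if ANY of the three triggers fire: the explicit
// --robot/-J flag, the LORE_ROBOT env var, or stdout not being a TTY.
// (Assumption: only the value "1" activates the env-var path.)
fn is_robot_mode(flag: bool, stdout_is_tty: bool) -> bool {
    let env_set = env::var("LORE_ROBOT").map(|v| v == "1").unwrap_or(false);
    flag || env_set || !stdout_is_tty
}

fn main() {
    // Piped output (stdout not a TTY) activates robot mode on its own.
    assert!(is_robot_mode(false, false));
    // Interactive terminal with no flag stays human-readable.
    assert!(!is_robot_mode(false, true));
}
```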
- -

Exit Codes

- - - - - - - - - - - - - - - - -
CodeMeaning
0Success
1Internal error
2Config not found
3Config invalid
4Token not set
5GitLab auth failed
6Resource not found
7Rate limited
8Network error
9Database locked
10Database error
11Migration failed
12I/O error
13Transform error
- -
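The table maps naturally onto a fieldless Rust enum with explicit discriminants. This is a sketch; the variant and type names are illustrative, not the actual identifiers in the gitlore source:

```rust
// Exit codes from the table above, as explicit enum discriminants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ExitCode {
    Success = 0,
    Internal = 1,
    ConfigNotFound = 2,
    ConfigInvalid = 3,
    TokenNotSet = 4,
    GitlabAuthFailed = 5,
    ResourceNotFound = 6,
    RateLimited = 7,
    Network = 8,
    DatabaseLocked = 9,
    Database = 10,
    MigrationFailed = 11,
    Io = 12,
    Transform = 13,
}

fn main() {
    // A fieldless enum casts directly to the process exit status.
    assert_eq!(ExitCode::TokenNotSet as i32, 4);
    println!("token-not-set exits with {}", ExitCode::TokenNotSet as i32);
}
```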

Configuration

-
{
-  "gitlab": {
-    "baseUrl": "https://gitlab.com",
-    "tokenEnvVar": "GITLAB_TOKEN"
-  },
-  "projects": [
-    { "path": "group/project" }
-  ],
-  "sync": {
-    "backfillDays": 14,
-    "staleLockMinutes": 10,
-    "heartbeatIntervalSeconds": 30,
-    "cursorRewindSeconds": 2,
-    "primaryConcurrency": 4,
-    "dependentConcurrency": 2
-  },
-  "storage": {
-    "dbPath": "~/.local/share/lore/lore.db",
-    "compressRawPayloads": true
-  }
-}
- -

Deprecated Commands (Hidden)

- - - - - - -
OldNewNotes
lore listlore issues / lore mrsShows deprecation warning
lore showlore issues <iid>Shows deprecation warning
lore auth-testlore authAlias
lore sync-statuslore statusAlias
-
- - -
-

GitLab API Endpoints Used by Gitlore

-

All requests use PRIVATE-TOKEN header authentication. Rate limited at 10 req/s with 0-50ms jitter.

- - - - - - - - - - - - - - -
- - -
-

Full GitLab REST API v4 Reference

-

Complete endpoint inventory for the resources relevant to Gitlore. USED = consumed by Gitlore.

- - - -
- -

Issues API

- - - - - - - - - - - - - - - - - - - - - -
MethodEndpointDescriptionStatus
GET/issuesList all issues (global)--
GET/groups/:id/issuesList group issues--
GET/projects/:id/issuesList project issuesUSED
GET/projects/:id/issues/:iidGet single issue--
POST/projects/:id/issuesCreate issue--
PUT/projects/:id/issues/:iidUpdate issue--
DEL/projects/:id/issues/:iidDelete issue--
PUT/projects/:id/issues/:iid/reorderReorder issue--
POST/projects/:id/issues/:iid/moveMove issue--
POST/projects/:id/issues/:iid/cloneClone issue--
POST/projects/:id/issues/:iid/subscribeSubscribe to issue--
POST/projects/:id/issues/:iid/unsubscribeUnsubscribe--
POST/projects/:id/issues/:iid/todoCreate to-do--
POST/projects/:id/issues/:iid/time_estimateSet time estimate--
POST/projects/:id/issues/:iid/add_spent_timeAdd spent time--
GET/projects/:id/issues/:iid/time_statsGet time stats--
GET/projects/:id/issues/:iid/related_merge_requestsRelated MRs--
GET/projects/:id/issues/:iid/closed_byMRs that close issue--
GET/projects/:id/issues/:iid/participantsList participants--
- - -

Merge Requests API

- - - - - - - - - - - - - - - - - - -
MethodEndpointDescriptionStatus
GET/merge_requestsList all MRs (global)--
GET/groups/:id/merge_requestsList group MRs--
GET/projects/:id/merge_requestsList project MRsUSED
GET/projects/:id/merge_requests/:iidGet single MR--
POST/projects/:id/merge_requestsCreate MR--
PUT/projects/:id/merge_requests/:iidUpdate MR--
DEL/projects/:id/merge_requests/:iidDelete MR--
PUT/projects/:id/merge_requests/:iid/mergeMerge an MR--
POST/projects/:id/merge_requests/:iid/cancel_mergeCancel merge--
PUT/projects/:id/merge_requests/:iid/rebaseRebase MR--
GET/projects/:id/merge_requests/:iid/commitsList MR commits--
GET/projects/:id/merge_requests/:iid/changesList MR diffs--
GET/projects/:id/merge_requests/:iid/pipelinesMR pipelines--
GET/projects/:id/merge_requests/:iid/participantsMR participants--
GET/projects/:id/merge_requests/:iid/approvalsMR approvals--
POST/projects/:id/merge_requests/:iid/approveApprove MR--
- - -

Discussions API

- - - - - - - - - - - - - - - - - - -
MethodEndpointDescriptionStatus
GET/projects/:id/issues/:iid/discussionsList issue discussionsUSED
GET/projects/:id/issues/:iid/discussions/:didGet single discussion--
POST/projects/:id/issues/:iid/discussionsCreate issue thread--
POST/projects/:id/issues/:iid/discussions/:did/notesAdd note to thread--
PUT/projects/:id/issues/:iid/discussions/:did/notes/:nidModify note--
DEL/projects/:id/issues/:iid/discussions/:did/notes/:nidDelete note--
GET/projects/:id/merge_requests/:iid/discussionsList MR discussionsUSED
GET/projects/:id/merge_requests/:iid/discussions/:didGet single MR discussion--
POST/projects/:id/merge_requests/:iid/discussionsCreate MR thread--
PUT/projects/:id/merge_requests/:iid/discussions/:didResolve/unresolve thread--
POST/projects/:id/merge_requests/:iid/discussions/:did/notesAdd note to MR thread--
PUT/projects/:id/merge_requests/:iid/discussions/:did/notes/:nidModify MR note--
DEL/projects/:id/merge_requests/:iid/discussions/:did/notes/:nidDelete MR note--
GET/projects/:id/snippets/:sid/discussionsList snippet discussions--
GET/groups/:id/epics/:eid/discussionsList epic discussions--
GET/projects/:id/repository/commits/:sha/discussionsList commit discussions--
- - -

Notes API (Flat, non-threaded)

- - - - - - - -
MethodEndpointDescriptionStatus
GET/projects/:id/issues/:iid/notesList issue notes--
POST/projects/:id/issues/:iid/notesCreate issue note--
GET/projects/:id/merge_requests/:iid/notesList MR notes--
POST/projects/:id/merge_requests/:iid/notesCreate MR note--
GET/projects/:id/snippets/:sid/notesList snippet notes--
-

Gitlore uses the Discussions API (threaded) instead of the flat Notes API. Notes are extracted from discussion responses.

- - -

Other APIs Used

- - - - - -
MethodEndpointDescriptionStatus
GET/userCurrent authenticated userUSED
GET/projects/:pathGet project by pathUSED
GET/versionGitLab instance versionUSED
-
-
- - -
-

CLI Command ↔ API Endpoint Mapping

-

How each CLI command maps to GitLab API calls and local database operations.

- -
-
lore ingest issues
-
-
Phase 1: Fetch primary resources
-
-
GET /projects/:id/issues (paginated, cursor-based)
-
-
-
Phase 2: Identify stale discussions
-
-
SQL: WHERE updated_at > discussions_synced_for_updated_at
-
-
-
Phase 3: Sync discussions
-
-
GET /projects/:id/issues/:iid/discussions (parallel prefetch)
-
-
-
Storage: Write to DB
-
-
Tables: issues, labels, issue_labels, discussions, notes, raw_payloads
-
-
- -
-
lore ingest mrs
-
-
Phase 1: Fetch primary resources
-
-
GET /projects/:id/merge_requests (paginated, cursor-based)
-
-
-
Phase 2: Identify stale discussions
-
-
SQL: WHERE updated_at > discussions_synced_for_updated_at
-
-
-
Phase 3: Sync discussions
-
-
GET /projects/:id/merge_requests/:iid/discussions (parallel prefetch)
-
-
-
Storage: Write to DB
-
-
Tables: merge_requests, labels, mr_labels, mr_assignees, mr_reviewers, discussions, notes, raw_payloads
-
-
- -
-
lore issues / lore mrs
-
-
List mode: Query local DB
-
-
SQL: SELECT ... FROM issues/merge_requests with filters (no API call)
-
-
-
Show mode: Query local DB by IID
-
-
SQL: SELECT ... WHERE iid = ? + join discussions/notes (no API call)
-
-
- -
-
lore auth
-
-
Verify token works
-
-
GET /api/v4/user
-
-
- -
-
lore doctor
-
-
Check auth + GitLab version
-
-
GET /api/v4/user + GET /api/v4/version
-
-
-
Check each configured project
-
-
GET /api/v4/projects/:path
-
-
- -
-
lore count / lore status / lore init / lore migrate
-
-
Local-only operations
-
-
No API calls. Database queries only.
-
-
- -

API Capabilities NOT Used by Gitlore

-
-
-
Write Operations
-
    -
  • Create/update/delete issues
  • Create/update/delete MRs
  • Merge MRs
  • Create/reply to discussions
  • Resolve/unresolve threads
  • Approve MRs
-
-
-
Read Operations
-
    -
  • Single issue/MR fetch (uses list with filters instead)
  • MR commits, diffs, pipelines
  • Issue/MR participants
  • Time tracking stats
  • Related MRs / closed-by
  • Labels API (extracted from issue/MR responses)
  • Milestones API (extracted from issue responses)
  • Flat Notes API (uses threaded Discussions API)
  • Snippets, Epics, Commits discussions
  • Webhooks, CI/CD, Pipelines, Deployments
-
-
-
- - -
-

Database Schema

-

SQLite with WAL mode. 12 tables across 6 migrations.

- -

Entity Relationship

-
-  projects ──────────────────────────────────────────────────────────
-    │                                                                │
-    ├──< issues ──< issue_labels >── labels                          │
-    │     │                                                          │
-    │     └──< discussions ──< notes                                 │
-    │                                                                │
-    ├──< merge_requests ──< mr_labels >── labels                     │
-    │     │    ├──< mr_reviewers                                     │
-    │     │    └──< mr_assignees                                     │
-    │     │                                                          │
-    │     └──< discussions ──< notes                                 │
-    │                                                                │
-    ├──< raw_payloads                                                │
-    ├──< sync_cursors                                                │
-    └── sync_runs, app_locks, schema_version                         │
-  ───────────────────────────────────────────────────────────────────
- -

Table Details

- - - - - - - - - - -
- - -
-

Ingestion Pipeline

- -

Three-Phase Architecture

-
-
-
- Phase 1: Primary Fetch
- Paginated API fetch with cursor-based sync.
Stores raw payloads + normalized rows.
-
-
- Phase 2: Identify Stale
- SQL query: which issues/MRs need
their discussions refreshed?
-
-
- Phase 3: Discussion Sync
- Parallel prefetch + serial write.
Full-refresh per parent entity.
-
-
-
- -

Cursor-Based Incremental Sync

-
-
Cursor State: (updated_at_cursor: i64, tie_breaker_id: i64)
-
-First sync:
-  updated_after = (now - backfillDays)
-
-Subsequent syncs:
-  updated_after = cursor.updated_at - cursorRewindSeconds
-
-  For each fetched resource:
-    if (updated_at, gitlab_id) <= (cursor.updated_at, cursor.tie_breaker_id):
-      SKIP (already processed in overlap zone)
-    else:
-      UPSERT into database
-
-  After each page boundary:
-    UPDATE sync_cursors  (crash recovery safe)
-
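The overlap-zone skip compares the `(updated_at, gitlab_id)` pair against the stored cursor lexicographically — timestamp first, GitLab ID as tie-breaker. A minimal sketch (names illustrative; Rust tuple comparison is already lexicographic):

```rust
// Returns true if a fetched resource was already processed: it falls at
// or before the stored cursor (updated_at first, gitlab_id tie-breaker).
fn already_processed(cursor: (i64, i64), resource: (i64, i64)) -> bool {
    resource <= cursor // tuple comparison: timestamp, then id
}

fn main() {
    let cursor = (1_700_000_000, 42); // (updated_at_cursor, tie_breaker_id)
    assert!(already_processed(cursor, (1_699_999_999, 7))); // older timestamp
    assert!(already_processed(cursor, (1_700_000_000, 42))); // exactly the cursor
    assert!(!already_processed(cursor, (1_700_000_000, 43))); // same ts, newer id
}
```

The 2-second rewind guarantees the overlap zone exists; this comparison is what makes re-processing it idempotent.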
- -

Discussion Sync Strategy

-
-
For each issue/MR where updated_at > discussions_synced_for_updated_at:
-
-  1. PREFETCH (parallel, configurable concurrency):
-     GET /projects/:id/issues/:iid/discussions  (all pages)
-
-  2. WRITE (serial, inside transaction):
-     DELETE FROM discussions WHERE issue_id = ?
-     DELETE FROM notes WHERE discussion_id IN (...)
-     INSERT discussions + notes (fresh data)
-     UPDATE issues SET discussions_synced_for_updated_at = updated_at
-

Full-refresh avoids complexity of detecting deleted/edited notes. Trade-off: more API calls for heavily-discussed items.

-
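The "where updated_at > discussions_synced_for_updated_at" watermark test reduces to a one-line predicate; a sketch (function name is illustrative, and `None` models a parent whose discussions were never synced, matching a SQL `COALESCE(..., 0)`):

```rust
// A parent issue/MR needs its discussions re-synced only when its
// updated_at has advanced past the recorded watermark.
fn needs_discussion_sync(updated_at: i64, synced_for: Option<i64>) -> bool {
    updated_at > synced_for.unwrap_or(0)
}

fn main() {
    assert!(needs_discussion_sync(1_700_000_100, None)); // never synced
    assert!(needs_discussion_sync(1_700_000_100, Some(1_700_000_000))); // stale
    assert!(!needs_discussion_sync(1_700_000_100, Some(1_700_000_100))); // current
}
```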
- -

Rate Limiting

-
-
RateLimiter {
-  min_interval: 100ms  (= 1s / 10 req/s)
-  jitter: 0-50ms random
-
-  acquire():
-    elapsed = now - last_request
-    if elapsed < min_interval:
-      sleep(min_interval - elapsed + random_jitter)
-    last_request = now
-}
-
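The pseudocode above translates almost directly into Rust. A dependency-free sketch (a real implementation would use an async sleep and a proper RNG for jitter; here the nanosecond remainder of the clock stands in for randomness):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Minimal spacing-based rate limiter: enforces >= min_interval between
// requests, plus 0-50ms jitter.
struct RateLimiter {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateLimiter {
    fn new(reqs_per_sec: u64) -> Self {
        Self {
            min_interval: Duration::from_millis(1000 / reqs_per_sec), // 100ms at 10 req/s
            last_request: None,
        }
    }

    fn acquire(&mut self) {
        if let Some(last) = self.last_request {
            let elapsed = last.elapsed();
            if elapsed < self.min_interval {
                // Crude jitter source: clock nanoseconds modulo 51 -> 0..=50ms.
                let jitter = Duration::from_millis(last.elapsed().subsec_nanos() as u64 % 51);
                thread::sleep(self.min_interval - elapsed + jitter);
            }
        }
        self.last_request = Some(Instant::now());
    }
}

fn main() {
    let mut limiter = RateLimiter::new(10);
    let start = Instant::now();
    limiter.acquire(); // first call: no wait
    limiter.acquire(); // second call: sleeps out the remainder of the interval
    assert!(start.elapsed() >= Duration::from_millis(100));
}
```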
- -

Pagination

-
-

Async stream-based. Fallback chain for next-page detection:

-
    -
  1. Link header (RFC 8288) — parse rel="next"
  2. x-next-page header — direct page number
  3. Full-page heuristic — if response has 100 items, assume more pages
-
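The fallback chain can be sketched as a single function that tries each signal in order. This is a simplified illustration, not gitlore's actual parser: a real RFC 8288 parser must handle multiple comma-separated Link entries and return the full next-page URL rather than just a page number.

```rust
// Next-page detection with the three fallbacks, tried in order.
fn next_page(
    link_header: Option<&str>,
    x_next_page: Option<&str>,
    items_in_page: usize,
    current_page: u32,
    per_page: usize,
) -> Option<u32> {
    // 1. Link header: look for a rel="next" entry (simplified check).
    if let Some(link) = link_header {
        if link.contains("rel=\"next\"") {
            return Some(current_page + 1);
        }
        return None; // Link header present but no next page: done
    }
    // 2. x-next-page header: direct page number.
    if let Some(n) = x_next_page.and_then(|v| v.parse().ok()) {
        return Some(n);
    }
    // 3. Full-page heuristic: a full page suggests more data follows.
    (items_in_page == per_page).then(|| current_page + 1)
}

fn main() {
    // Link header wins when present.
    assert_eq!(next_page(Some("<...?page=3>; rel=\"next\""), None, 100, 2, 100), Some(3));
    // x-next-page fallback.
    assert_eq!(next_page(None, Some("5"), 100, 4, 100), Some(5));
    // Full-page heuristic; a short page means the stream is exhausted.
    assert_eq!(next_page(None, None, 100, 1, 100), Some(2));
    assert_eq!(next_page(None, None, 42, 1, 100), None);
}
```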
- -

Raw Payload Storage

-
-
-
API Response
JSON bytes
-
-
SHA-256 Hash
Dedup check
-
-
Gzip Compress
(if enabled)
-
-
raw_payloads
BLOB storage
-
-

UNIQUE constraint on (project_id, resource_type, gitlab_id, payload_hash) prevents storing identical payloads.

-
- -

Concurrency Model

-
-
-
Primary Resource Fetch
-

Single-threaded async stream. Rate-limited. Each page written in a transaction. Cursor updated at page boundaries.

-
-
-
Discussion Sync
-

Parallel prefetch (configurable, default 2 concurrent). Serial write phase to avoid DB contention. Each parent entity is one transaction.

-
-
- -

Single-Flight Lock

-
-
AppLock (database-enforced mutex):
-  name: 'sync' (PK)
-  owner: UUIDv4 (unique per process)
-  heartbeat_at: updated every 30s
-
-  Acquire:
-    INSERT OR fail if row exists
-    Check stale: if heartbeat > staleLockMinutes, force-acquire
-
-  Release:
-    DELETE WHERE owner = my_uuid
-
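The stale-check in the acquire path is the interesting part of the lock: a row whose heartbeat is older than `staleLockMinutes` can be force-acquired. A sketch of just that decision (names illustrative; the real check runs inside the database transaction that inserts the lock row):

```rust
use std::time::{Duration, SystemTime};

// Decide whether the 'sync' lock can be taken: free when no row exists,
// force-acquirable when the existing heartbeat has gone stale.
fn can_acquire(
    existing_heartbeat: Option<SystemTime>,
    now: SystemTime,
    stale_after: Duration,
) -> bool {
    match existing_heartbeat {
        None => true, // no lock row: free to acquire
        Some(hb) => now
            .duration_since(hb)
            .map(|age| age > stale_after)
            .unwrap_or(false), // heartbeat in the future: treat as live
    }
}

fn main() {
    let now = SystemTime::now();
    let stale = Duration::from_secs(10 * 60); // staleLockMinutes = 10
    assert!(can_acquire(None, now, stale));
    assert!(can_acquire(Some(now - Duration::from_secs(11 * 60)), now, stale));
    assert!(!can_acquire(Some(now - Duration::from_secs(30)), now, stale));
}
```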
- -

Progress Events

- - - - - - - - - -
EventDescription
IssuesFetchStartedBeginning primary issue fetch
IssueFetchedEach issue processed (for progress bars)
IssuesFetchCompleteAll pages consumed
DiscussionSyncStartedBeginning discussion phase
DiscussionSyncedEach parent's discussions written
DiscussionSyncCompleteAll discussions updated
Same events exist for MRs (MrsFetchStarted, etc.)
-
- - -
-

Field-Level Coverage: API Response vs Gitlore Storage

-

Every field in every GitLab API response, mapped to what Gitlore does with it. Serde silently drops fields not in the Rust structs.

- -
-
Stored in DB
-
Used transiently (logic only)
-
Deserialized but ignored
-
Never deserialized (silently dropped)
-
- - - - - - - - - - - - - - - - - - - -

Field Coverage Summary

-
-
-
Issues Response
-
Stored15 fields
-
-
Deserialized, ignored5 fields
-
-
Never deserialized~22 fields
-
-
-
-
Merge Request Response
-
Stored23 fields
-
-
Transient (fallbacks)3 fields
-
-
Never deserialized~22 fields
-
-
-
-
Discussion/Note Response
-
Stored23 fields
-
-
Never deserialized~13 fields
-
-
-
-
Project Response
-
Stored6 fields
-
-
Deserialized, ignored4 fields
-
-
Never deserialized~30 fields
-
-
-
- -
-
Key insight: Raw payloads preserve everything
-

Although many fields are dropped during transformation, the raw_payloads table stores the complete original JSON response (with SHA-256 dedup and optional gzip). This means all "dropped" data is still recoverable from the blob storage without re-fetching from GitLab. The normalized tables are optimized for query patterns, not completeness.

-
-
- - -
-

Efficiency Analysis & Opportunities

-

Observations on how gitlore could leverage the GitLab API more efficiently, and data it currently leaves on the table.

- -

Current Efficiency Wins

-
-
-
Cursor-based incremental sync
-

Uses updated_after + order_by=updated_at&sort=asc to only fetch changed records. Avoids full re-fetch on every sync. This is the single biggest efficiency feature.

-
-
-
Raw payload dedup
-

SHA-256 hashing prevents storing identical payloads. If an issue's updated_at changes but the actual content is identical, the raw blob is deduplicated.

-
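The dedup mechanism is just "hash the raw bytes, skip the insert if the hash was seen for this resource". A dependency-free sketch — gitlore uses SHA-256, but to keep this example free of external crates the standard library's `DefaultHasher` stands in (it is NOT cryptographic and not a substitute in real code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Stand-in payload hash; gitlore's real implementation uses SHA-256.
fn payload_hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn main() {
    // In gitlore this "seen" set is the UNIQUE constraint on raw_payloads.
    let mut seen: HashSet<u64> = HashSet::new();
    let a = br#"{"iid":42,"title":"bug"}"#;
    let b = br#"{"iid":42,"title":"bug"}"#; // byte-identical payload
    assert!(seen.insert(payload_hash(a))); // first copy stored
    assert!(!seen.insert(payload_hash(b))); // duplicate skipped
}
```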
-
-
Discussion watermark
-

Only re-syncs discussions for issues/MRs whose updated_at has advanced past their discussions_synced_for_updated_at watermark. Skips unchanged entities.

-
-
-
Parallel discussion prefetch
-

Fetches discussions for multiple issues/MRs concurrently (configurable, default 2). Dramatically reduces wall-clock time for discussion sync.

-
-
- -

Potential Inefficiencies

- -
-
1. Discussion full-refresh strategy
-

Every time an issue/MR is updated, ALL its discussions are re-fetched and replaced (DELETE + INSERT). For heavily-discussed items (50+ comments), this is expensive.

- - - - -
ScenarioCurrentAlternative
Issue with 100 notes gets 1 new commentRe-fetch all 100 notes (multiple pages)Could use GET .../notes?order_by=updated_at&updated_after=... for incremental note sync
MR label change (no new comments)Re-fetch all discussions anywayCould check user_notes_count delta or use Notes API with updated_after
-

Trade-off: Full-refresh is simpler and guarantees consistency (catches edits, deletes). Incremental would miss deleted notes.

-
- -
-
2. Offset pagination instead of keyset
-

Gitlore uses page=N&per_page=100 offset pagination. GitLab supports keyset pagination for some endpoints (Issues, MRs), which is more efficient for large datasets and recommended by GitLab.

-
Current:  GET /projects/:id/issues?page=5&per_page=100
-Keyset:   GET /projects/:id/issues?pagination=keyset&per_page=100
-          (uses Link header rel="next" with cursor)
-

Benefit: Keyset pagination is O(1) per page (vs O(N) for offset). GitLab recommends it for >10,000 records. Gitlore already parses Link headers, so the client-side support partially exists.

-
- -
-
3. No ETag / conditional request support
-

GitLab returns ETag headers on API responses. Sending If-None-Match on subsequent requests would return 304 Not Modified without consuming rate limit quota on some endpoints. Currently all requests are unconditional.

-

Impact: Moderate. The cursor-based sync already avoids re-fetching unchanged data, so ETag would mainly help with the discussions full-refresh scenario where nothing changed.

-
- -
-
4. Labels extracted from embedded data, not dedicated API
-

Gitlore extracts labels from the labels[] string array embedded in issue/MR responses. The dedicated GET /projects/:id/labels endpoint returns richer data:

- - - -
From issues responseFrom Labels API
Label name (string only)name, color, description, text_color, priority, is_project_label, subscribed, open_issues_count, closed_issues_count, open_merge_requests_count
-

Impact: The labels table has color and description columns but they may not be populated from the embedded string array. A single Labels API call (one request, non-paginated for most projects) would enrich the local label catalog.

-
- -

Dropped Data Worth Capturing

-

Fields currently silently dropped that could add value to local queries:

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
FieldSourceValue PropositionEffort
user_notes_countIssues, MRsCould skip discussion re-sync when count hasn't changed. Quick "activity" sort without joining notes table.Low
upvotes / downvotesIssues, MRsEngagement metrics for triage. "Most upvoted issues" is a common query.Low
confidentialIssuesSecurity-sensitive filtering. Avoid exposing confidential issues in outputs.Low
weightIssuesEffort estimation for sprint planning (Premium/Ultimate only).Low
time_statsIssues, MRsTime tracking data for project reporting. Already in the response, free to capture.Low
has_conflictsMRsIdentify MRs needing rebase. Useful for "stale MR" alerts.Low
blocking_discussions_resolvedMRsMR readiness indicator without joining discussions table.Low
merge_commit_shaMRsTrace merged MRs to specific commits. Useful for git correlation.Low
suggestions[]Discussion notesCode review suggestions with from/to content. Rich data for code review analysis.Medium
task_completion_statusIssues, MRsTrack task-list checkbox progress without parsing description markdown.Low
issue_typeIssuesDistinguish issues vs incidents vs test cases.Low
discussion_lockedIssues, MRsKnow if new comments can be added.Low
- -

Structural Optimization Opportunities

- -
-
5. User denormalization
-

Currently stores only username for authors, assignees, reviewers. The API returns name, avatar_url, web_url, and state for every user reference. A users table could deduplicate this data and provide richer displays.

-
-- Potential schema
-CREATE TABLE users (
-  username TEXT PRIMARY KEY,
-  name TEXT,
-  gitlab_id INTEGER,
-  avatar_url TEXT,
-  state TEXT,           -- "active", "blocked", etc.
-  last_seen_at INTEGER  -- auto-updated on encounter
-);
-

Cost: No additional API calls. Data is already in every issue/MR/note response. Just needs extraction during transform.

-
- -
-
6. MR milestone not captured
-

Milestones are stored for issues but the MR transformer does not extract the milestone object from MR responses, even though GitLab returns it. The merge_requests table has no milestone_id column.

-

Impact: Cannot query "which MRs are in milestone X?" locally. The data is in the raw payload but not indexed.

-
- -
-
7. Issue references not captured
-

MRs store references.short and references.full, but the issue transformer drops the references object entirely. This means issues lack the cross-project reference format (e.g., group/project#42).

-
- -

API Strategies Not Yet Used

- -
-
Webhooks (push-based sync)
-

Instead of polling, GitLab can push events via POST /projects/:id/hooks. Would enable near-real-time sync without rate-limit cost. Requires a listener endpoint.

-
- -
-
Events API (lightweight change detection)
-

GET /projects/:id/events returns a stream of all project activity. Could be used as a fast "has anything changed?" check before running expensive issue/MR sync. Much lighter than fetching full issue lists.

-
- -
-
GraphQL API (precise field selection)
-

GitLab's GraphQL API allows requesting exactly the fields needed. Would eliminate bandwidth waste from ~50% of response fields being silently dropped. Trade-off: different pagination model, potentially less stable API surface.

-
- -

Summary Verdict

-
-

Gitlore is well-optimized for its core use case (read-only local sync). The cursor-based incremental sync and raw payload archival are sophisticated. The main opportunities are:

-
    -
  1. Capture more "free" data — Fields like user_notes_count, upvotes, has_conflicts are already in API responses. Storing them costs zero API calls and enables richer queries.
  2. Discussion sync efficiency — The full-refresh strategy is the biggest source of redundant API calls. Even a simple user_notes_count comparison could skip unchanged discussions.
  3. Keyset pagination — A meaningful improvement for large projects (>10K issues), and Gitlore already has partial infrastructure for it.
  4. MR milestone parity — Low-effort gap to close with issue milestone support.
-
-
- -
- - - - - diff --git a/gitlore-sync-explorer.html b/gitlore-sync-explorer.html deleted file mode 100644 index c62551a..0000000 --- a/gitlore-sync-explorer.html +++ /dev/null @@ -1,844 +0,0 @@ - - - - - -Gitlore Sync Pipeline Explorer - - - - - - -
-
-

Full Sync Overview

- 4 stages -
- -
- - -
-
-
-
Stage 1
-
Ingest Issues
-
Fetch issues + discussions + resource events from GitLab API
-
Cursor-based incremental sync.
Sequential discussion fetch.
Queue-based resource events.
-
-
-
-
Stage 2
-
Ingest MRs
-
Fetch merge requests + discussions + resource events
-
Page-based incremental sync.
Parallel prefetch discussions.
Queue-based resource events.
-
-
-
-
Stage 3
-
Generate Docs
-
Regenerate searchable documents for changed entities
-
Driven by dirty_sources table.
Triple-hash skip optimization.
FTS5 index auto-updated.
-
-
-
-
Stage 4
-
Embed
-
Generate vector embeddings via Ollama for semantic search
-
Hash-based change detection.
Chunked, batched API calls.
Non-fatal — graceful if Ollama down.
-
-
-
-
Concurrency Model
-
    -
  • Stages 1 & 2 process projects concurrently via buffer_unordered(primary_concurrency)
  • Each project gets its own SQLite connection; rate limiter is shared
  • Discussions: sequential (issues) or batched parallel prefetch (MRs)
  • Resource events use a persistent job queue with atomic claim + exponential backoff
-
-
-
Sync Flags
-
    -
  • --full — Resets all cursors & watermarks, forces complete re-fetch
  • --no-docs — Skips Stage 3 (document generation)
  • --no-embed — Skips Stage 4 (embedding generation)
  • --force — Overrides stale single-flight lock
  • --project <path> — Sync only one project (fuzzy matching)
-
-
-
Single-Flight Lock
-
    -
  • Table-based lock (AppLock) prevents concurrent syncs
  • Heartbeat keeps the lock alive; stale locks auto-detected
  • Use --force to override a stale lock
-
-
- - -
-
-
API Call
-
Transform
-
Database
-
Decision
-
Error Path
-
Queue
-
-
-
-
1
-
Fetch Issues Cursor-Based Incremental Sync
-
-
-
GitLab API Call
paginate_issues() with
updated_after = cursor - rewind
-
-
Cursor Filter
updated_at > cursor_ts
OR tie_breaker check
-
-
transform_issue()
GitLab API shape →
local DB row shape
-
-
Transaction
store_payload → upsert →
mark_dirty → relink
-
-
-
-
Update Cursor
Every 100 issues + final
sync_cursors table
-
-
-
-
-
2
-
Discussion Sync Sequential, Watermark-Based
-
-
-
Query Stale Issues
updated_at > COALESCE(
discussions_synced_for_
updated_at, 0)
-
-
Paginate Discussions
Sequential per issue
paginate_issue_discussions()
-
-
Transform
transform_discussion()
transform_notes()
-
-
Write Discussion
store_payload → upsert
DELETE notes → INSERT notes
-
-
-
✓ On Success (all pages fetched)
-
-
Remove Stale
DELETE discussions not
seen in this fetch
-
-
Advance Watermark
discussions_synced_for_
updated_at = updated_at
-
-
✗ On Pagination Error
-
-
Skip Stale Removal
Watermark NOT advanced
Will retry next sync
-
-
-
-
-
-
3
-
Resource Events Queue-Based, Concurrent Fetch
-
-
-
Cleanup Obsolete
DELETE jobs where entity
watermark is current
-
-
Enqueue Jobs
INSERT for entities where
updated_at > watermark
-
-
Claim Jobs
Atomic UPDATE...RETURNING
with lock acquisition
-
-
Fetch Events
3 concurrent: state +
label + milestone
-
-
-
✓ On Success
-
-
Store Events
Transaction: upsert all
3 event types
-
-
Complete + Watermark
DELETE job row
Advance watermark
-
-
✗ Permanent Error (404 / 403)
-
-
Skip Permanently
complete_job + advance
watermark (coalesced)
-
-
↻ Transient Error
-
-
Backoff Retry
fail_job: 30s x 2^(n-1)
capped at 480s
-
-
-
-
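The retry delay in the diagram ("30s x 2^(n-1) capped at 480s") is a standard exponential backoff. A sketch of the formula (function name is illustrative):

```rust
// Exponential backoff: 30s doubling per attempt, capped at 480s.
fn backoff_secs(attempt: u32) -> u64 {
    // .min(10) bounds the shift so large attempt counts cannot overflow.
    let delay = 30u64.saturating_mul(1u64 << attempt.saturating_sub(1).min(10));
    delay.min(480)
}

fn main() {
    assert_eq!(backoff_secs(1), 30);
    assert_eq!(backoff_secs(2), 60);
    assert_eq!(backoff_secs(5), 480); // 30 * 2^4 = 480, exactly at the cap
    assert_eq!(backoff_secs(6), 480); // capped
}
```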
- - -
-
-
API Call
-
Transform
-
Database
-
Diff from Issues
-
Error Path
-
Queue
-
-
-
-
1
-
Fetch MRs Page-Based Incremental Sync
-
-
-
GitLab API Call
fetch_merge_requests_page()
with cursor rewind
Page-based, not streaming
-
-
Cursor Filter
Same logic as issues:
timestamp + tie-breaker
Same as issues
-
-
transform_merge_request()
Maps API shape →
local DB row
-
-
Transaction
store → upsert → dirty →
labels + assignees + reviewers
3 junction tables (not 2)
-
-
-
-
Update Cursor
Per page (not every 100)
Per page boundary
-
-
-
-
-
2
-
MR Discussion Sync Parallel Prefetch + Serial Write
-
-
-
Key Differences from Issue Discussions
-
    -
  • Parallel prefetch — fetches all discussions for a batch concurrently via join_all()
  • Upsert pattern — notes use INSERT...ON CONFLICT (not delete-all + re-insert)
  • Sweep stale — uses last_seen_at timestamp comparison (not set difference)
  • Sync health tracking — records discussions_sync_attempts and last_error
-
-
-
Query Stale MRs
updated_at > COALESCE(
discussions_synced_for_
updated_at, 0)
Same watermark logic
-
-
Batch by Concurrency
dependent_concurrency
MRs per batch
Batched processing
-
-
-
-
Parallel Prefetch
join_all() fetches all
discussions for batch
Parallel (not sequential)
-
-
Transform In-Memory
transform_mr_discussion()
+ diff position notes
-
-
Serial Write
upsert discussion
upsert notes (ON CONFLICT)
Upsert, not delete+insert
-
-
-
✓ On Full Success
-
-
Sweep Stale
DELETE WHERE last_seen_at
< run_seen_at (disc + notes)
last_seen_at sweep
-
-
Advance Watermark
discussions_synced_for_
updated_at = updated_at
-
-
✗ On Failure
-
-
Record Sync Health
Watermark NOT advanced
Tracks attempts + last_error
Health tracking
-
-
-
-
-
-
3
-
Resource Events Same as Issues
-
-
-
Identical to Issue Resource Events
-
    -
  • Same queue-based approach: cleanup → enqueue → claim → fetch → store/fail
  • Same watermark column: resource_events_synced_for_updated_at
  • Same error handling: 404/403 coalesced to empty, transient errors get backoff
  • entity_type = "merge_request" instead of "issue"
-
-
-
- - -
-
-
Trigger
-
Extract
-
Database
-
Decision
-
Error
-
-
-
-
1
-
Dirty Source Queue Populated During Ingestion
-
-
-
mark_dirty_tx()
Called during every issue/
MR/discussion upsert
-
-
dirty_sources Table
INSERT (source_type, source_id)
ON CONFLICT reset backoff
-
-
-
-
-
2
-
Drain Loop Batch 500, Respects Backoff
-
-
-
Get Dirty Sources
Batch 500, ORDER BY
attempt_count, queued_at
-
-
Dispatch by Type
issue / mr / discussion
→ extract function
-
-
Source Exists?
If deleted: remove doc row
(cascade cleans FTS + embeds)
-
-
-
-
Extract Content
Structured text:
header + metadata + body
-
-
Triple-Hash Check
content_hash + labels_hash
+ paths_hash all match?
-
-
SAVEPOINT Write
Atomic: document row +
labels + paths
-
-
-
✓ On Success
-
-
clear_dirty()
Remove from dirty_sources
-
-
✗ On Error
-
-
record_dirty_error()
Increment attempt_count
Exponential backoff
-
-
≡ Triple-Hash Match (skip)
-
-
Skip Write
All 3 hashes match →
no WAL churn, clear dirty
-
-
-
-
-
Full Mode (--full)
-
    -
  • Seeds ALL entities into dirty_sources via keyset pagination
  • Triple-hash optimization prevents redundant writes even in full mode
  • Runs FTS OPTIMIZE after drain completes
-
-
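The triple-hash check that gates the SAVEPOINT write reduces to comparing three stored hashes against three freshly computed ones; if all match, the write is skipped and the dirty flag is cleared with no WAL churn. A sketch (type and function names are illustrative):

```rust
// The three hashes guarding a document rewrite.
#[derive(PartialEq)]
struct DocHashes {
    content_hash: u64,
    labels_hash: u64,
    paths_hash: u64,
}

// Write only when there is no stored row yet or any hash differs.
fn should_write(stored: Option<&DocHashes>, fresh: &DocHashes) -> bool {
    stored.map_or(true, |s| s != fresh)
}

fn main() {
    let old = DocHashes { content_hash: 1, labels_hash: 2, paths_hash: 3 };
    let same = DocHashes { content_hash: 1, labels_hash: 2, paths_hash: 3 };
    let edited = DocHashes { content_hash: 9, labels_hash: 2, paths_hash: 3 };
    assert!(!should_write(Some(&old), &same)); // all three match: skip
    assert!(should_write(Some(&old), &edited)); // content changed: write
    assert!(should_write(None, &same)); // no stored row yet: write
}
```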
- - -
-
-
API (Ollama)
-
Processing
-
Database
-
Decision
-
Error
-
-
-
-
1
-
Change Detection Hash + Config Drift
-
-
-
find_pending_documents()
No metadata row? OR
document_hash mismatch? OR
config drift?
-
-
Keyset Pagination
500 documents per page
ordered by doc ID
-
-
-
-
-
2
-
Chunking Split + Overflow Guard
-
-
-
split_into_chunks()
Split by paragraph boundaries
with configurable overlap
-
-
Overflow Guard
Too many chunks?
Skip to prevent rowid collision
-
-
Build ChunkWork
Assign encoded chunk IDs
per document
-
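The chunking stage can be sketched end to end: split on paragraph boundaries, carry a configurable overlap into the next chunk, and bail out if the document would produce more chunks than the encoded chunk-ID space allows. The parameter names here are assumptions; only the `split_into_chunks` name comes from the review:

```rust
// Paragraph-boundary chunking with overlap and an overflow guard.
// max_chunks models the guard against encoded-chunk-ID (rowid) collisions.
fn split_into_chunks(
    text: &str,
    max_bytes: usize,
    overlap_paras: usize,
    max_chunks: usize,
) -> Option<Vec<String>> {
    let paras: Vec<&str> = text.split("\n\n").filter(|p| !p.is_empty()).collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < paras.len() {
        // Greedily pack whole paragraphs up to the byte budget.
        let mut end = start;
        let mut size = 0;
        while end < paras.len() && size + paras[end].len() <= max_bytes {
            size += paras[end].len();
            end += 1;
        }
        let end = end.max(start + 1); // oversized paragraph: take it anyway
        chunks.push(paras[start..end].join("\n\n"));
        if end >= paras.len() {
            break;
        }
        // Next chunk re-includes the last `overlap_paras` paragraphs,
        // but always makes forward progress.
        start = end.saturating_sub(overlap_paras).max(start + 1);
    }
    // Overflow guard: too many chunks would collide with the chunk-ID encoding.
    (chunks.len() <= max_chunks).then_some(chunks)
}

fn main() {
    let chunks = split_into_chunks("aa\n\nbb\n\ncc", 4, 1, 10).unwrap();
    // Second chunk re-includes the last paragraph of the first (overlap = 1).
    assert_eq!(chunks, vec!["aa\n\nbb", "bb\n\ncc"]);
    // Overflow guard: too many chunks -> skip the document.
    assert_eq!(split_into_chunks("a\n\nb\n\nc", 1, 0, 2), None);
}
```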
-
-
-
-
3
-
Ollama Embedding Batched API Calls
-
-
-
Batch Embed
32 chunks per Ollama
API call
-
-
Store Vectors
sqlite-vec embeddings table
+ embedding_metadata
-
-
-
✓ On Success
-
-
SAVEPOINT Commit
Atomic per page:
clear old + write new
-
-
↻ Context-Length Error
-
-
Retry Individually
Re-embed each chunk solo
to isolate oversized one
-
-
✗ Other Error
-
-
Record Error
Store in embedding_metadata
for retry next run
-
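The batch-then-isolate retry is the interesting control flow here: a whole 32-chunk batch fails on one oversized chunk, so each chunk is re-embedded solo and only the oversized one is recorded as failed. A sketch with a mock in place of the real Ollama client call:

```rust
// embed_batch is a stand-in for the Ollama API call; it fails the whole
// batch if any chunk exceeds the model's context length.
fn embed_batch(chunks: &[&str], max_len: usize) -> Result<Vec<Vec<f32>>, String> {
    if let Some(c) = chunks.iter().find(|c| c.len() > max_len) {
        return Err(format!("context length exceeded: {} bytes", c.len()));
    }
    // Dummy one-dimensional "vectors" for illustration.
    Ok(chunks.iter().map(|c| vec![c.len() as f32]).collect())
}

fn embed_with_isolation(chunks: &[&str], max_len: usize) -> Vec<Result<Vec<f32>, String>> {
    match embed_batch(chunks, max_len) {
        Ok(vecs) => vecs.into_iter().map(Ok).collect(),
        // Batch failed: retry each chunk individually so only the oversized
        // chunk carries an error into embedding_metadata for the next run.
        Err(_) => chunks
            .iter()
            .map(|&c| embed_batch(&[c], max_len).map(|mut v| v.remove(0)))
            .collect(),
    }
}

fn main() {
    let results = embed_with_isolation(&["ok", "way too long chunk"], 8);
    assert!(results[0].is_ok());  // healthy chunk embedded on solo retry
    assert!(results[1].is_err()); // oversized chunk isolated and recorded
}
```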
-
-
-
-
Full Mode (--full)
-
    -
  • DELETEs all embedding_metadata and embeddings rows first
  • -
  • Every document re-processed from scratch
  • -
-
-
-
Non-Fatal in Sync
-
    -
  • Stage 4 failures (Ollama down, model missing) are graceful
  • -
  • Sync completes successfully; embeddings just won't be updated
  • -
  • Semantic search degrades to FTS-only mode
  • -
-
-
- -
- - -
-
- - Watermark & Cursor Reference -
-
- - - - - - - - - - - - -
TableColumn(s)Purpose
sync_cursorsupdated_at_cursor + tie_breaker_idIncremental fetch: "last entity we saw" per project+type
issuesdiscussions_synced_for_updated_atPer-issue discussion watermark
issuesresource_events_synced_for_updated_atPer-issue resource event watermark
merge_requestsdiscussions_synced_for_updated_atPer-MR discussion watermark
merge_requestsresource_events_synced_for_updated_atPer-MR resource event watermark
dirty_sourcesqueued_at + next_attempt_atDocument regeneration queue with backoff
embedding_metadatadocument_hash + chunk_max_bytes + model + dimsEmbedding staleness detection
pending_dependent_fetcheslocked_at + next_retry_at + attemptsResource event job queue with backoff
-
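Putting the first row in concrete terms: the cursor is an (updated_at, tie_breaker_id) pair, and the next fetch rewinds the timestamp slightly (the overview cites a 2-second overlap) so same-second updates are not missed. A sketch of those semantics:

```rust
// Sketch of the sync_cursors semantics. Deriving Ord on the struct gives
// the "last entity we saw" ordering: updated_at first, tie_breaker_id second.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Cursor {
    updated_at: u64,     // epoch seconds of the entity's updated_at
    tie_breaker_id: u64, // entity id, disambiguates same-second updates
}

/// Advance the stored cursor only if the newly seen entity is later.
fn advance(stored: Option<Cursor>, seen: Cursor) -> Cursor {
    match stored {
        Some(c) if c >= seen => c,
        _ => seen,
    }
}

/// Value passed as the API's `updated_after` filter on the next sync:
/// rewound 2 seconds so boundary updates are re-fetched (and deduplicated
/// locally) rather than silently skipped.
fn updated_after(cursor: Cursor) -> u64 {
    cursor.updated_at.saturating_sub(2)
}

fn main() {
    let a = Cursor { updated_at: 100, tie_breaker_id: 7 };
    let b = Cursor { updated_at: 100, tie_breaker_id: 9 }; // same second, later id
    let cur = advance(Some(a), b);
    assert_eq!(cur, b);
    assert_eq!(updated_after(cur), 98);
}
```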
-
-
- - -
-
-

Node Details

- -
-
-
- - - - diff --git a/phase-a-review.html b/phase-a-review.html deleted file mode 100644 index 5baf68e..0000000 --- a/phase-a-review.html +++ /dev/null @@ -1,1260 +0,0 @@ - - - - - -Phase A: Complete API Field Capture — Review - - - - -
-

Phase A: Complete API Field Capture

-
Migration 007 — Mirror all GitLab API response data into the local DB
-
- No new API calls - No new tables - 1 migration - 6 files touched - Independent of CP3 -
-
- -
-
Overview
-
Issues
-
Merge Requests
-
Migration SQL
-
Serde Structs
-
Transformers
-
Files Touched
-
Decisions
-
- -
- - -
-
- Guiding Principle: Mirror everything GitLab gives us. The local DB should be a complete representation of all data returned by the API. This ensures maximum context for processing and analysis in later steps. -
- -
-
-
0
-
New columns total
-
-
-
0
-
New issue columns
-
-
-
0
-
New MR columns
-
-
-
6
-
New serde helper types
-
-
-
2
-
Fields excluded
-
-
-
6
-
Files touched
-
-
- -
-
-

Issues table

-
-
-
-
-
- 18 existing - 0 new -
-
-
-

Merge Requests table

-
-
-
-
-
- 29 existing - 0 new -
-
-
- -
-
Scope
-
-

One migration. Three categories of work:

-
    -
  1. New columns on issues and merge_requests for fields currently dropped by serde or during transform
  2. -
  3. New serde fields on GitLabIssue and GitLabMergeRequest to deserialize JSON fields that are currently dropped silently
  4. -
  5. Transformer + insert updates to pass the new fields through to the DB
  6. -
-
-
- -
-
What this does NOT include
-
-
    -
  • No new API endpoints called
  • -
  • No new tables (the existing milestones table is reused for MR milestones)
  • -
  • No CLI changes (new fields stored but not surfaced in lore issues / lore mrs)
  • -
  • No changes to discussion/note ingestion (Phase A is issues + MRs only)
  • -
  • No observability instrumentation (that's Phase B)
  • -
-
-
-
- - -
-

Issues: Field Gap Inventory

- -
-
Currently Stored
-
- id, iid, project_id, title, description, state, author_username, created_at, updated_at, web_url, due_date, milestone_id, milestone_title, raw_payload_id, last_seen_at, discussions_synced_for_updated_at, labels (junction), assignees (junction) -
-
- -
- - - - -
- -
-
- - - - - - - - - - - - -
API FieldTypeDB ColumnCategoryStatusNotes
-
-
-
- - -
-

Merge Requests: Field Gap Inventory

- -
-
Currently Stored
-
- id, iid, project_id, title, description, state, draft, author_username, source_branch, target_branch, head_sha, references_short, references_full, detailed_merge_status, merge_user_username, created_at, updated_at, merged_at, closed_at, last_seen_at, web_url, raw_payload_id, discussions_synced_for_updated_at, labels (junction), assignees (junction), reviewers (junction) -
-
- -
- - - - -
- -
-
- - - - - - - - - - - - -
API FieldTypeDB ColumnCategoryStatusNotes
-
-
-
- - -
-

Migration 007: complete_field_capture.sql

- -
-
- Issues columns - -
-
-
-
-
- -
-
- Merge Requests columns - -
-
-
-
-
-
- - -
-

Serde Struct Changes

- -
-
New Helper Types
-
-
-
-
- -
-
- GitLabIssue: new fields - -
-
-
-
-
- -
-
- GitLabMergeRequest: new fields - -
-
-
-
-
-
- - -
-

Transformer Changes

- -
-
IssueRow: transform rules
-
-

- All new fields map 1:1 from the serde struct except these special cases: -

-
-
-
- -
-
NormalizedMergeRequest: transform rules
-
-

- Same patterns as issues, plus: -

-
-
-
- -
-
Insert statement changes
-
- Both process_issue_in_transaction and process_mr_in_transaction need their INSERT and ON CONFLICT DO UPDATE statements extended with all new columns. The ON CONFLICT clause should update all new fields on re-sync. -
-
-
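The "flatten to columns" special case can be sketched concretely for epics. `GitLabEpic` and `EpicColumns` are illustrative shapes, not the real structs; only the five column names come from the decisions section below:

```rust
// Sketch of the epic flatten rule in the issue transformer: a nested
// Option becomes five nullable columns, mirroring the milestone
// denormalization already in place.
struct GitLabEpic { id: i64, iid: i64, title: String, url: String, group_id: i64 }

#[derive(Default)]
struct EpicColumns {
    epic_id: Option<i64>,
    epic_iid: Option<i64>,
    epic_title: Option<String>,
    epic_url: Option<String>,
    epic_group_id: Option<i64>,
}

fn flatten_epic(epic: Option<GitLabEpic>) -> EpicColumns {
    match epic {
        None => EpicColumns::default(), // Free tier: all five columns NULL
        Some(e) => EpicColumns {
            epic_id: Some(e.id),
            epic_iid: Some(e.iid),
            epic_title: Some(e.title),
            epic_url: Some(e.url),
            epic_group_id: Some(e.group_id),
        },
    }
}

fn main() {
    // No epic in the payload: every column NULL.
    let cols = flatten_epic(None);
    assert_eq!(cols.epic_id, None);
    // Epic present (URL here is a made-up example value).
    let cols = flatten_epic(Some(GitLabEpic {
        id: 99,
        iid: 4,
        title: "Q3 auth work".into(),
        url: "https://gitlab.example.com/groups/x/-/epics/4".into(),
        group_id: 12,
    }));
    assert_eq!((cols.epic_id, cols.epic_group_id), (Some(99), Some(12)));
}
```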
- - -
-

Files Touched

-
-
-
-
- - -
-

Resolved Decisions

- -
-
Exclusions (2 fields)
-
-
-
subscribed — Excluded
-
User-relative field. Reflects the token holder's subscription state, not a property of the entity itself. Changes meaning if the token is rotated to a different user. Not entity data.
-
-
-
_links — Excluded
-
HATEOAS API navigation metadata, not entity data. Every URL is deterministically constructable from project_id + iid + GitLab base URL.
-
Note: closed_as_duplicate_of inside _links contains a real entity reference. Extracting that is deferred to a future phase.
-
-
-
- -
-
Included with special handling
-
-
-
epic / iteration — Flatten to columns
-
Same denormalization pattern as milestones. Epic gets 5 columns (epic_id, epic_iid, epic_title, epic_url, epic_group_id). Iteration gets 6 columns (iteration_id, iteration_iid, iteration_title, iteration_state, iteration_start_date, iteration_due_date). Both nullable (null on Free tier).
-
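The column lists above translate directly into migration DDL. A sketch of the epic/iteration portion only, since those are the columns this review spells out; attaching them to the issues table and the column types are assumptions here, and the full migration (complete_field_capture.sql) covers many more fields:

```sql
-- Illustrative fragment of migration 007 (complete_field_capture.sql).
-- All columns nullable: they stay NULL on GitLab Free tier.
ALTER TABLE issues ADD COLUMN epic_id INTEGER;
ALTER TABLE issues ADD COLUMN epic_iid INTEGER;
ALTER TABLE issues ADD COLUMN epic_title TEXT;
ALTER TABLE issues ADD COLUMN epic_url TEXT;
ALTER TABLE issues ADD COLUMN epic_group_id INTEGER;
ALTER TABLE issues ADD COLUMN iteration_id INTEGER;
ALTER TABLE issues ADD COLUMN iteration_iid INTEGER;
ALTER TABLE issues ADD COLUMN iteration_title TEXT;
ALTER TABLE issues ADD COLUMN iteration_state TEXT;
ALTER TABLE issues ADD COLUMN iteration_start_date TEXT;
ALTER TABLE issues ADD COLUMN iteration_due_date TEXT;
```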
-
-
-
- -
- - - -