gitlore/docs/phase-a-spec.md
Taylor Eernisse 9b63671df9 docs: Update documentation for search pipeline and Phase A spec
- README.md: Add hybrid search and robot mode to feature list. Update
  quick start to use new noun-first CLI syntax (lore issues, lore mrs,
  lore search). Add embedding configuration section. Update command
  examples throughout.

- AGENTS.md: Update robot mode examples to new CLI syntax. Add search,
  sync, stats, and generate-docs commands to the robot mode reference.
  Update flag conventions (-n for limit, -s for state, -J for JSON).

- docs/prd/checkpoint-3.md: Major expansion with gated milestone
  structure (Gate A: lexical, Gate B: hybrid, Gate C: sync). Add
  prerequisite rename note, code sample conventions, chunking strategy
  details, and sqlite-vec rowid encoding scheme. Clarify that Gate A
  requires only SQLite + FTS5 with no sqlite-vec dependency.

- docs/phase-a-spec.md: New detailed specification for Gate A (lexical
  search MVP) covering document schema, FTS5 configuration, dirty
  queue mechanics, CLI interface, and acceptance criteria.

- docs/api-efficiency-findings.md: Analysis of GitLab API pagination
  behavior and efficiency observations from production sync runs.
  Documents the missing x-next-page header issue and heuristic fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:47:33 -05:00


Phase A: Complete API Field Capture

Status: Draft
Guiding principle: Mirror everything GitLab gives us.

  • Lossless mirror: the raw API JSON stored behind raw_payload_id. This is the true complete representation of every API response.
  • Relational projection: a stable, query-optimized subset of fields we commit to keeping current on every re-sync. This preserves maximum context for processing and analysis while avoiding unbounded schema growth.

Migration: 007_complete_field_capture.sql
Prerequisite: None (independent of CP3)

Scope

One migration. Three categories of work:

  1. New columns on issues and merge_requests for fields currently dropped by serde or dropped during transform
  2. New serde fields on GitLabIssue and GitLabMergeRequest to deserialize currently-silently-dropped JSON fields
  3. Transformer + insert updates to pass the new fields through to the DB

No new tables. No new API calls. No new endpoints. All data comes from responses we already receive.


Issues: Field Gap Inventory

Currently stored

id, iid, project_id, title, description, state, author_username, created_at, updated_at, web_url, due_date, milestone_id, milestone_title, raw_payload_id, last_seen_at, discussions_synced_for_updated_at, labels (junction), assignees (junction)

Currently deserialized but dropped during transform

| API Field | Status | Action |
| --- | --- | --- |
| closed_at | Deserialized in serde struct, but no DB column exists and transformer never populates it | Add column in migration 007, wire up in IssueRow + transform + INSERT |
| author.id | Deserialized | Store as author_id column |
| author.name | Deserialized | Store as author_name column |

Currently silently dropped by serde (not in GitLabIssue struct)

| API Field | Type | DB Column | Notes |
| --- | --- | --- | --- |
| issue_type | Option<String> | issue_type | Canonical field (lowercase, e.g. "issue"); preferred for DB storage |
| upvotes | i64 | upvotes | |
| downvotes | i64 | downvotes | |
| user_notes_count | i64 | user_notes_count | Useful for discussion sync optimization |
| merge_requests_count | i64 | merge_requests_count | Count of linked MRs |
| confidential | bool | confidential | 0/1 |
| discussion_locked | bool | discussion_locked | 0/1 |
| weight | Option<i64> | weight | Premium/Ultimate, null on Free |
| time_stats.time_estimate | i64 | time_estimate | Seconds |
| time_stats.total_time_spent | i64 | time_spent | Seconds |
| time_stats.human_time_estimate | Option<String> | human_time_estimate | e.g. "3h 30m" |
| time_stats.human_total_time_spent | Option<String> | human_time_spent | e.g. "1h 15m" |
| task_completion_status.count | i64 | task_count | Checkbox total |
| task_completion_status.completed_count | i64 | task_completed_count | Checkboxes checked |
| has_tasks | bool | has_tasks | 0/1 |
| severity | Option<String> | severity | Incident severity |
| closed_by | Option<object> | closed_by_username | Who closed it (username only, consistent with author pattern) |
| imported | bool | imported | 0/1 |
| imported_from | Option<String> | imported_from | Import source |
| moved_to_id | Option<i64> | moved_to_id | Target issue if moved |
| references.short | String | references_short | e.g. "#42" |
| references.relative | String | references_relative | e.g. "#42" or "group/proj#42" |
| references.full | String | references_full | e.g. "group/project#42" |
| health_status | Option<String> | health_status | Ultimate only |
| type | Option<String> | (transform-only) | Uppercase category (e.g. "ISSUE"); fallback for issue_type -- lowercased before storage. Not stored as separate column; raw JSON remains lossless. |
| epic.id | Option<i64> | epic_id | Premium/Ultimate, null on Free |
| epic.iid | Option<i64> | epic_iid | |
| epic.title | Option<String> | epic_title | |
| epic.url | Option<String> | epic_url | |
| epic.group_id | Option<i64> | epic_group_id | |
| iteration.id | Option<i64> | iteration_id | Premium/Ultimate, null on Free |
| iteration.iid | Option<i64> | iteration_iid | |
| iteration.title | Option<String> | iteration_title | |
| iteration.state | Option<i64> | iteration_state | Enum: 1=upcoming, 2=current, 3=closed |
| iteration.start_date | Option<String> | iteration_start_date | ISO date |
| iteration.due_date | Option<String> | iteration_due_date | ISO date |

Merge Requests: Field Gap Inventory

Currently stored

id, iid, project_id, title, description, state, draft, author_username, source_branch, target_branch, head_sha, references_short, references_full, detailed_merge_status, merge_user_username, created_at, updated_at, merged_at, closed_at, last_seen_at, web_url, raw_payload_id, discussions_synced_for_updated_at, discussions_sync_last_attempt_at, discussions_sync_attempts, discussions_sync_last_error, labels (junction), assignees (junction), reviewers (junction)

Currently deserialized but dropped during transform

| API Field | Status | Action |
| --- | --- | --- |
| author.id | Deserialized | Store as author_id column |
| author.name | Deserialized | Store as author_name column |
| work_in_progress | Used transiently for draft fallback | Already handled, no change needed |
| merge_status (legacy) | Used transiently for detailed_merge_status fallback | Already handled, no change needed |
| merged_by | Used transiently for merge_user fallback | Already handled, no change needed |

Currently silently dropped by serde (not in GitLabMergeRequest struct)

| API Field | Type | DB Column | Notes |
| --- | --- | --- | --- |
| upvotes | i64 | upvotes | |
| downvotes | i64 | downvotes | |
| user_notes_count | i64 | user_notes_count | |
| source_project_id | i64 | source_project_id | Fork source |
| target_project_id | i64 | target_project_id | Fork target |
| milestone | Option<object> | milestone_id, milestone_title | Reuse issue milestone pattern |
| merge_when_pipeline_succeeds | bool | merge_when_pipeline_succeeds | 0/1, auto-merge flag |
| merge_commit_sha | Option<String> | merge_commit_sha | Commit ref after merge |
| squash_commit_sha | Option<String> | squash_commit_sha | Commit ref after squash |
| discussion_locked | bool | discussion_locked | 0/1 |
| should_remove_source_branch | Option<bool> | should_remove_source_branch | 0/1 |
| force_remove_source_branch | Option<bool> | force_remove_source_branch | 0/1 |
| squash | bool | squash | 0/1 |
| squash_on_merge | bool | squash_on_merge | 0/1 |
| has_conflicts | bool | has_conflicts | 0/1 |
| blocking_discussions_resolved | bool | blocking_discussions_resolved | 0/1 |
| time_stats.time_estimate | i64 | time_estimate | Seconds |
| time_stats.total_time_spent | i64 | time_spent | Seconds |
| time_stats.human_time_estimate | Option<String> | human_time_estimate | |
| time_stats.human_total_time_spent | Option<String> | human_time_spent | |
| task_completion_status.count | i64 | task_count | |
| task_completion_status.completed_count | i64 | task_completed_count | |
| closed_by | Option<object> | closed_by_username | |
| prepared_at | Option<String> | prepared_at | ISO datetime in API; store as ms epoch via iso_to_ms(), nullable |
| merge_after | Option<String> | merge_after | ISO datetime in API; store as ms epoch via iso_to_ms(), nullable (scheduled merge) |
| imported | bool | imported | 0/1 |
| imported_from | Option<String> | imported_from | |
| approvals_before_merge | Option<i64> | approvals_before_merge | Deprecated, scheduled for removal in GitLab API v5; store best-effort, keep nullable |
| references.relative | String | references_relative | Currently only short + full stored |
| confidential | bool | confidential | 0/1 (MRs can be confidential too) |
| iteration.id | Option<i64> | iteration_id | Premium/Ultimate, null on Free |
| iteration.iid | Option<i64> | iteration_iid | |
| iteration.title | Option<String> | iteration_title | |
| iteration.state | Option<i64> | iteration_state | |
| iteration.start_date | Option<String> | iteration_start_date | ISO date |
| iteration.due_date | Option<String> | iteration_due_date | ISO date |

Migration 007: complete_field_capture.sql

-- Migration 007: Capture all remaining GitLab API response fields.
-- Principle: mirror everything GitLab returns. No field left behind.

-- ============================================================
-- ISSUES: new columns
-- ============================================================

-- Fields currently deserialized but not stored
ALTER TABLE issues ADD COLUMN closed_at INTEGER;             -- ms epoch, deserialized but never stored until now
ALTER TABLE issues ADD COLUMN author_id INTEGER;             -- GitLab user ID
ALTER TABLE issues ADD COLUMN author_name TEXT;              -- Display name

-- Issue metadata
ALTER TABLE issues ADD COLUMN issue_type TEXT;               -- e.g. 'issue' | 'incident' | 'test_case' | 'task'
ALTER TABLE issues ADD COLUMN confidential INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN discussion_locked INTEGER NOT NULL DEFAULT 0;

-- Engagement
ALTER TABLE issues ADD COLUMN upvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN downvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN user_notes_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN merge_requests_count INTEGER NOT NULL DEFAULT 0;

-- Time tracking
ALTER TABLE issues ADD COLUMN time_estimate INTEGER NOT NULL DEFAULT 0;       -- seconds
ALTER TABLE issues ADD COLUMN time_spent INTEGER NOT NULL DEFAULT 0;          -- seconds
ALTER TABLE issues ADD COLUMN human_time_estimate TEXT;
ALTER TABLE issues ADD COLUMN human_time_spent TEXT;

-- Task lists
ALTER TABLE issues ADD COLUMN task_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN task_completed_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN has_tasks INTEGER NOT NULL DEFAULT 0;

-- References (MRs already have short + full)
ALTER TABLE issues ADD COLUMN references_short TEXT;         -- e.g. "#42"
ALTER TABLE issues ADD COLUMN references_relative TEXT;      -- context-dependent
ALTER TABLE issues ADD COLUMN references_full TEXT;          -- e.g. "group/project#42"

-- Close/move tracking
ALTER TABLE issues ADD COLUMN closed_by_username TEXT;
ALTER TABLE issues ADD COLUMN moved_to_id INTEGER;

-- Premium/Ultimate fields (nullable, null on Free tier)
ALTER TABLE issues ADD COLUMN weight INTEGER;
ALTER TABLE issues ADD COLUMN severity TEXT;
ALTER TABLE issues ADD COLUMN health_status TEXT;

-- Import tracking
ALTER TABLE issues ADD COLUMN imported INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN imported_from TEXT;

-- Epic (Premium/Ultimate, null on Free)
ALTER TABLE issues ADD COLUMN epic_id INTEGER;
ALTER TABLE issues ADD COLUMN epic_iid INTEGER;
ALTER TABLE issues ADD COLUMN epic_title TEXT;
ALTER TABLE issues ADD COLUMN epic_url TEXT;
ALTER TABLE issues ADD COLUMN epic_group_id INTEGER;

-- Iteration (Premium/Ultimate, null on Free)
ALTER TABLE issues ADD COLUMN iteration_id INTEGER;
ALTER TABLE issues ADD COLUMN iteration_iid INTEGER;
ALTER TABLE issues ADD COLUMN iteration_title TEXT;
ALTER TABLE issues ADD COLUMN iteration_state INTEGER;
ALTER TABLE issues ADD COLUMN iteration_start_date TEXT;
ALTER TABLE issues ADD COLUMN iteration_due_date TEXT;

-- ============================================================
-- MERGE REQUESTS: new columns
-- ============================================================

-- Author enrichment
ALTER TABLE merge_requests ADD COLUMN author_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN author_name TEXT;

-- Engagement
ALTER TABLE merge_requests ADD COLUMN upvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN downvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN user_notes_count INTEGER NOT NULL DEFAULT 0;

-- Fork tracking
ALTER TABLE merge_requests ADD COLUMN source_project_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN target_project_id INTEGER;

-- Milestone (parity with issues)
ALTER TABLE merge_requests ADD COLUMN milestone_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN milestone_title TEXT;

-- Merge behavior
ALTER TABLE merge_requests ADD COLUMN merge_when_pipeline_succeeds INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN squash_on_merge INTEGER NOT NULL DEFAULT 0;

-- Merge readiness
ALTER TABLE merge_requests ADD COLUMN has_conflicts INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN blocking_discussions_resolved INTEGER NOT NULL DEFAULT 0;

-- Branch cleanup
ALTER TABLE merge_requests ADD COLUMN should_remove_source_branch INTEGER;
ALTER TABLE merge_requests ADD COLUMN force_remove_source_branch INTEGER;

-- Discussion lock
ALTER TABLE merge_requests ADD COLUMN discussion_locked INTEGER NOT NULL DEFAULT 0;

-- Time tracking
ALTER TABLE merge_requests ADD COLUMN time_estimate INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN time_spent INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN human_time_estimate TEXT;
ALTER TABLE merge_requests ADD COLUMN human_time_spent TEXT;

-- Task lists
ALTER TABLE merge_requests ADD COLUMN task_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN task_completed_count INTEGER NOT NULL DEFAULT 0;

-- Close tracking
ALTER TABLE merge_requests ADD COLUMN closed_by_username TEXT;

-- Scheduling (API returns ISO datetimes; we store ms epoch for consistency)
ALTER TABLE merge_requests ADD COLUMN prepared_at INTEGER;       -- ms epoch after iso_to_ms()
ALTER TABLE merge_requests ADD COLUMN merge_after INTEGER;       -- ms epoch after iso_to_ms()

-- References (add relative, short + full already exist)
ALTER TABLE merge_requests ADD COLUMN references_relative TEXT;

-- Import tracking
ALTER TABLE merge_requests ADD COLUMN imported INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN imported_from TEXT;

-- Approvals (deprecated API field; nullable, stored best-effort)
ALTER TABLE merge_requests ADD COLUMN approvals_before_merge INTEGER;

-- Confidentiality
ALTER TABLE merge_requests ADD COLUMN confidential INTEGER NOT NULL DEFAULT 0;

-- Iteration (Premium/Ultimate, null on Free)
ALTER TABLE merge_requests ADD COLUMN iteration_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN iteration_iid INTEGER;
ALTER TABLE merge_requests ADD COLUMN iteration_title TEXT;
ALTER TABLE merge_requests ADD COLUMN iteration_state INTEGER;
ALTER TABLE merge_requests ADD COLUMN iteration_start_date TEXT;
ALTER TABLE merge_requests ADD COLUMN iteration_due_date TEXT;

-- Record migration version
INSERT INTO schema_version (version, applied_at, description)
VALUES (7, strftime('%s', 'now') * 1000, 'Complete API field capture for issues and merge requests');

Serde Struct Changes

Existing type changes

GitLabReferences                              // Add: relative: Option<String> (with #[serde(default)])
                                              // Existing fields short + full remain unchanged
GitLabIssue                                   // Add #[derive(Default)] for test ergonomics
GitLabMergeRequest                            // Add #[derive(Default)] for test ergonomics

New helper types needed

GitLabTimeStats { time_estimate, total_time_spent, human_time_estimate, human_total_time_spent }
GitLabTaskCompletionStatus { count, completed_count }
closed_by reuses the existing GitLabAuthor type (id, username, name); no separate GitLabClosedBy type is needed
GitLabEpic { id, iid, title, url, group_id }
GitLabIteration { id, iid, title, state, start_date, due_date }

GitLabIssue: add fields

type: Option<String>                          // #[serde(rename = "type")]  -- fallback-only (uppercase category); "type" is reserved in Rust
upvotes: i64                                  // #[serde(default)]
downvotes: i64                                // #[serde(default)]
user_notes_count: i64                         // #[serde(default)]
merge_requests_count: i64                     // #[serde(default)]
confidential: bool                            // #[serde(default)]
discussion_locked: bool                       // #[serde(default)]
weight: Option<i64>
time_stats: Option<GitLabTimeStats>
task_completion_status: Option<GitLabTaskCompletionStatus>
has_tasks: bool                               // #[serde(default)]
references: Option<GitLabReferences>
closed_by: Option<GitLabAuthor>
severity: Option<String>
health_status: Option<String>
imported: bool                                // #[serde(default)]
imported_from: Option<String>
moved_to_id: Option<i64>
issue_type: Option<String>                    // canonical field (lowercase); preferred for DB storage over `type`
epic: Option<GitLabEpic>
iteration: Option<GitLabIteration>

GitLabMergeRequest: add fields

upvotes: i64                                  // #[serde(default)]
downvotes: i64                                // #[serde(default)]
user_notes_count: i64                         // #[serde(default)]
source_project_id: Option<i64>
target_project_id: Option<i64>
milestone: Option<GitLabMilestone>            // reuse existing type
merge_when_pipeline_succeeds: bool            // #[serde(default)]
merge_commit_sha: Option<String>
squash_commit_sha: Option<String>
squash: bool                                  // #[serde(default)]
squash_on_merge: bool                         // #[serde(default)]
has_conflicts: bool                           // #[serde(default)]
blocking_discussions_resolved: bool           // #[serde(default)]
should_remove_source_branch: Option<bool>
force_remove_source_branch: Option<bool>
discussion_locked: bool                       // #[serde(default)]
time_stats: Option<GitLabTimeStats>
task_completion_status: Option<GitLabTaskCompletionStatus>
closed_by: Option<GitLabAuthor>
prepared_at: Option<String>
merge_after: Option<String>
imported: bool                                // #[serde(default)]
imported_from: Option<String>
approvals_before_merge: Option<i64>
confidential: bool                            // #[serde(default)]
iteration: Option<GitLabIteration>

Transformer Changes

IssueRow: add fields

All new fields map 1:1 from the serde struct except:

  • closed_at -> iso_to_ms() conversion (already in serde struct, just not passed through)
  • time_stats -> flatten to 4 individual fields
  • task_completion_status -> flatten to 2 individual fields
  • references -> flatten to 3 individual fields
  • closed_by -> extract username only (consistent with author pattern)
  • author -> additionally extract id and name (currently only username)
  • issue_type -> store as-is (canonical, lowercase); fallback to lowercased type field if issue_type absent
  • epic -> flatten to 5 individual fields (id, iid, title, url, group_id)
  • iteration -> flatten to 6 individual fields (id, iid, title, state, start_date, due_date)
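
The flattening rules above can be sketched as follows. This is a minimal illustration, not the project's transformer: the struct definitions are simplified stand-ins for the serde types named in this spec, and FlattenedIssueFields, resolve_issue_type, and flatten_issue_fields are hypothetical names covering only a subset of the new IssueRow fields.

```rust
/// Simplified stand-in for the GitLabTimeStats helper type (assumption).
#[derive(Debug, Clone, Default)]
struct GitLabTimeStats {
    time_estimate: i64,
    total_time_spent: i64,
    human_time_estimate: Option<String>,
    human_total_time_spent: Option<String>,
}

/// Simplified stand-in for the existing GitLabAuthor type (assumption).
#[allow(dead_code)]
#[derive(Debug, Clone)]
struct GitLabAuthor {
    id: i64,
    username: String,
    name: String,
}

/// Hypothetical subset of the flattened columns that IssueRow would carry.
#[derive(Debug, PartialEq)]
struct FlattenedIssueFields {
    issue_type: Option<String>,
    time_estimate: i64,
    time_spent: i64,
    human_time_estimate: Option<String>,
    human_time_spent: Option<String>,
    closed_by_username: Option<String>,
}

/// Prefer the canonical lowercase `issue_type`; otherwise fall back to the
/// uppercase `type` field, lowercased before storage.
fn resolve_issue_type(issue_type: Option<&str>, legacy_type: Option<&str>) -> Option<String> {
    issue_type
        .map(str::to_owned)
        .or_else(|| legacy_type.map(str::to_lowercase))
}

/// Flatten nested API objects into individual column values.
fn flatten_issue_fields(
    issue_type: Option<&str>,
    legacy_type: Option<&str>,
    time_stats: Option<&GitLabTimeStats>,
    closed_by: Option<&GitLabAuthor>,
) -> FlattenedIssueFields {
    // A missing time_stats object maps to the column defaults (0 / NULL).
    let stats = time_stats.cloned().unwrap_or_default();
    FlattenedIssueFields {
        issue_type: resolve_issue_type(issue_type, legacy_type),
        time_estimate: stats.time_estimate,
        time_spent: stats.total_time_spent,
        human_time_estimate: stats.human_time_estimate,
        human_time_spent: stats.human_total_time_spent,
        // Username only, consistent with the author pattern.
        closed_by_username: closed_by.map(|user| user.username.clone()),
    }
}
```

Defaulting a missing time_stats object to zeros matches the NOT NULL DEFAULT 0 columns in migration 007, while the human-readable strings and closed_by_username stay nullable.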

NormalizedMergeRequest: add fields

Same patterns as issues, plus:

  • milestone -> reuse upsert_milestone_tx from issue pipeline, add milestone_id + milestone_title
  • prepared_at, merge_after -> iso_to_ms() conversion (API provides ISO datetimes)
  • source_project_id, target_project_id -> direct pass-through
  • iteration -> flatten to 6 individual fields (same as issues)

Insert statement changes

Both process_issue_in_transaction and process_mr_in_transaction need their INSERT and ON CONFLICT DO UPDATE statements extended with all new columns. The ON CONFLICT clause should update all new fields on re-sync.

Implementation note (reliability): Define a single authoritative list of persisted columns per entity and generate/compose both SQL fragments from it:

  • INSERT column list + VALUES placeholders
  • ON CONFLICT DO UPDATE assignments

This prevents drift where a new field is added to one clause but not the other -- the most likely bug class with roughly 40 new columns per entity.
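
One way to realize the single-list approach is sketched below. It assumes SQLite numbered placeholders and a hypothetical conflict key of id; the column names are a small illustrative subset, and ISSUE_COLUMNS and build_upsert_sql are not existing items in the codebase.

```rust
/// Hypothetical authoritative column list for one entity (illustrative subset).
const ISSUE_COLUMNS: &[&str] = &["id", "title", "upvotes", "closed_at"];

/// Build an SQLite upsert statement from a single column list, so the
/// INSERT column list, the VALUES placeholders, and the ON CONFLICT
/// assignments are always derived from the same source.
fn build_upsert_sql(table: &str, conflict_cols: &[&str], columns: &[&str]) -> String {
    let col_list = columns.join(", ");

    // Numbered placeholders: ?1, ?2, ... one per column, in list order.
    let placeholders = (1..=columns.len())
        .map(|i| format!("?{i}"))
        .collect::<Vec<_>>()
        .join(", ");

    // Every column except the conflict key is refreshed on re-sync.
    let assignments = columns
        .iter()
        .filter(|c| !conflict_cols.contains(*c))
        .map(|c| format!("{c} = excluded.{c}"))
        .collect::<Vec<_>>()
        .join(", ");

    format!(
        "INSERT INTO {table} ({col_list}) VALUES ({placeholders}) \
         ON CONFLICT({}) DO UPDATE SET {assignments}",
        conflict_cols.join(", ")
    )
}
```

Calling build_upsert_sql("issues", &["id"], ISSUE_COLUMNS) yields one statement whose three fragments all follow ISSUE_COLUMNS, so adding a field means touching only the list and the parameter binding.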


Prerequisite refactors (prep commits before main Phase A work)

1. Align issue transformer on core::time

The issue transformer (transformers/issue.rs) has a local parse_timestamp() that duplicates iso_to_ms_strict() from core::time. The MR transformer already uses the shared module. Before adding Phase A's optional timestamp fields (especially closed_at as Option<String>), migrate the issue transformer to use iso_to_ms_strict() and iso_to_ms_opt_strict() from core::time. This avoids duplicating the opt variant locally and establishes one timestamp parsing path across the codebase.

Changes: Replace parse_timestamp() calls with iso_to_ms_strict(), adapt or remove TransformError::TimestampParse (MR transformer uses String errors; align on that or on a shared error type).
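
The relationship between the strict and optional variants can be sketched like this. The functions below are stand-ins with a _sketch suffix, not the real core::time helpers, whose signatures and accepted formats may differ; this parser is an assumption that handles only UTC timestamps ending in Z and does no range validation.

```rust
/// Parse exactly three integer parts separated by `sep` (e.g. "2024-01-15").
fn parse_parts(text: &str, sep: char) -> Option<[i64; 3]> {
    let mut it = text.split(sep).map(|p| p.parse::<i64>().ok());
    let parts = [it.next()??, it.next()??, it.next()??];
    if it.next().is_some() {
        return None; // too many components
    }
    Some(parts)
}

/// Convert "YYYY-MM-DDTHH:MM:SS[.fff]Z" to milliseconds since the Unix epoch.
fn parse_utc(s: &str) -> Option<i64> {
    let (date, time) = s.strip_suffix('Z')?.split_once('T')?;
    let (hms, frac) = time.split_once('.').unwrap_or((time, "0"));
    let [y, m, d] = parse_parts(date, '-')?;
    let [hh, mm, ss] = parse_parts(hms, ':')?;
    // Pad or truncate the fraction to exactly three digits (milliseconds).
    let millis: i64 = format!("{frac:0<3}").get(..3)?.parse().ok()?;
    // Days since 1970-01-01 in the proleptic Gregorian calendar,
    // counting years from March so the leap day falls at year end.
    let y = if m <= 2 { y - 1 } else { y };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = (m + 9) % 12;
    let doy = (153 * mp + 2) / 5 + d - 1;
    let days = era * 146_097 + yoe * 365 + yoe / 4 - yoe / 100 + doy - 719_468;
    Some(((days * 24 + hh) * 60 + mm) * 60_000 + ss * 1000 + millis)
}

/// Strict variant: any unparseable input is an error.
fn iso_to_ms_strict_sketch(s: &str) -> Result<i64, String> {
    parse_utc(s).ok_or_else(|| format!("invalid ISO timestamp: {s}"))
}

/// Optional variant: None stays None, but Some(bad input) is still an error.
fn iso_to_ms_opt_strict_sketch(value: Option<&str>) -> Result<Option<i64>, String> {
    value.map(iso_to_ms_strict_sketch).transpose()
}
```

Building the optional variant on top of the strict one with transpose() keeps a single parsing path: a nullable field such as closed_at yields Ok(None) when absent, while a malformed value surfaces as an error instead of being silently stored as NULL.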

2. Extract shared ingestion helpers

upsert_milestone_tx (in ingestion/issues.rs) and upsert_label_tx (duplicated in both ingestion/issues.rs and ingestion/merge_requests.rs) should be moved to a shared module (e.g., src/ingestion/shared.rs). MR ingestion needs upsert_milestone_tx for Phase A milestone support, and the label helper is already copy-pasted between files.

Changes: Create src/ingestion/shared.rs, move upsert_milestone_tx, upsert_label_tx, and MilestoneRow there. Update imports in both issue and MR ingestion modules.


Files touched

| File | Change |
| --- | --- |
| migrations/007_complete_field_capture.sql | New file |
| src/gitlab/types.rs | Add #[derive(Default)] to GitLabIssue and GitLabMergeRequest; add relative: Option<String> to GitLabReferences; add fields to both structs; add GitLabTimeStats, GitLabTaskCompletionStatus, GitLabEpic, GitLabIteration |
| src/gitlab/transformers/issue.rs | Remove local parse_timestamp(), switch to core::time; extend IssueRow, IssueWithMetadata, transform_issue() |
| src/gitlab/transformers/merge_request.rs | Extend NormalizedMergeRequest, MergeRequestWithMetadata, transform_merge_request(); extract references_relative |
| src/ingestion/shared.rs | New file: shared upsert_milestone_tx, upsert_label_tx, MilestoneRow |
| src/ingestion/issues.rs | Extend INSERT/UPSERT SQL; import from shared module |
| src/ingestion/merge_requests.rs | Extend INSERT/UPSERT SQL; import from shared module; add milestone upsert |
| src/core/db.rs | Register migration 007 in MIGRATIONS array |

What this does NOT include

  • No new API endpoints called
  • No new tables (MRs reuse the existing milestones table)
  • No CLI changes (new fields are stored but not yet surfaced in lore issues / lore mrs output)
  • No changes to discussion/note ingestion (Phase A is issues + MRs only)
  • No observability instrumentation (that's Phase B)

Rollout / Backfill Note

After applying Migration 007 and shipping transformer + UPSERT updates, existing rows will not have the new columns populated until issues/MRs are reprocessed. Plan on a one-time full re-sync (lore ingest --type issues --full and lore ingest --type mrs --full) to backfill the new fields. Until then, queries on new columns will return NULL/default values for previously-synced entities.


Resolved decisions

| Field | Decision | Rationale |
| --- | --- | --- |
| subscribed | Excluded | User-relative field (reflects token holder's subscription state, not an entity property). Changes meaning if the token is rotated to a different user. Not entity data. |
| _links | Excluded | HATEOAS API navigation metadata, not entity data. Every URL is deterministically constructable from project_id + iid + GitLab base URL. Note: closed_as_duplicate_of inside _links contains a real entity reference -- extracting that is deferred to a future phase. |
| epic / iteration | Flatten to columns | Same denormalization pattern as milestones. Epic gets 5 columns (epic_id, epic_iid, epic_title, epic_url, epic_group_id). Iteration gets 6 columns (iteration_id, iteration_iid, iteration_title, iteration_state, iteration_start_date, iteration_due_date). Both nullable (null on Free tier). |
| approvals_before_merge | Store best-effort | Deprecated and scheduled for removal in GitLab API v5. Keep as Option<i64> / nullable column. Never depend on it for correctness -- it may disappear in a future GitLab release. |