gitlore/docs/api-efficiency-findings.md
Taylor Eernisse 9b63671df9 docs: Update documentation for search pipeline and Phase A spec
2026-01-30 15:47:33 -05:00


API Efficiency & Observability Findings

Status: Draft - working through items
Context: Audit of gitlore's GitLab API usage, data processing, and observability gaps
Interactive reference: api-review.html (root of repo, open in browser)


Checkpoint 3 Alignment

Checkpoint 3 (docs/prd/checkpoint-3.md) introduces lore sync orchestration, document generation, and search. Several findings here overlap with that work. This section maps the relationship so effort isn't duplicated and so CP3 implementation can absorb the right instrumentation as it's built.

Direct overlaps (CP3 partially addresses)

| Finding | CP3 coverage | Remaining gap |
|---|---|---|
| P0-1 sync_runs never written | lore sync step 7 says "record sync_run". SyncResult struct defined with counts. | Only covers the new lore sync command. Existing lore ingest still won't write sync_runs. Either instrument lore ingest separately or have lore sync subsume it entirely. |
| P0-2 No timing | print_sync captures wall-clock elapsed_secs / elapsed_ms in robot mode JSON meta envelope. | Wall-clock only. No per-phase, per-API-call, or per-DB-write breakdown. The SyncResult struct has counts but no duration fields. |
| P2-1 Discussion full-refresh | CP3 introduces pending_discussion_fetches queue with exponential backoff and bounded processing per sync. Structures the work better. | Same full-refresh strategy per entity. The queue adds retry resilience but doesn't reduce the number of API calls for unchanged discussions. |

Different scope (complementary, no overlap)

| Finding | Why no overlap |
|---|---|
| P0-3 metrics_json schema | CP3 doesn't reference the metrics_json column. SyncResult is printed/returned but not persisted there. |
| P0-4 Discussion sync telemetry columns | CP3's queue system (pending_discussion_fetches) is a replacement architecture. The existing per-MR telemetry columns (discussions_sync_attempts, _last_error) aren't referenced in CP3. Decide: use CP3's queue table or wire up the existing columns? |
| P0-5 Progress events lack timing | CP3 lists "Progress visible during long syncs" as acceptance criteria but doesn't spec timing in events. |
| P1-* Free data capture | CP3 doesn't touch GitLab API response field coverage at all. These are independent. |
| P2-2 Keyset pagination (GitLab API) | CP3 uses keyset pagination for local SQLite queries (document seeding, embedding pipelines). Completely different from using GitLab API keyset pagination. |
| P2-3 ETags | Not mentioned in CP3. |
| P2-4 Labels enrichment | Not mentioned in CP3. |
| P3-* Structural improvements | Not in CP3 scope. |

Recommendation

CP3's lore sync orchestrator is the natural integration point for P0 instrumentation. Rather than retrofitting lore ingest separately, the most efficient path is:

  1. Build P0 timing instrumentation as a reusable layer (e.g., a SyncMetrics struct that accumulates phase timings)
  2. Wire it into the CP3 run_sync implementation as it's built
  3. Have run_sync persist the full metrics (counts + timing) to sync_runs.metrics_json
  4. Decide whether lore ingest becomes a thin wrapper around lore sync --no-docs --no-embed or stays separate with its own sync_runs recording

This avoids building instrumentation twice and ensures the new sync pipeline is observable from day one.
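One possible shape for the reusable layer in step 1. SyncMetrics is the name proposed above; the closure-based API and method names are illustrative, not an existing design:

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Sketch of a reusable metrics accumulator. Timing a phase through a
/// closure means call sites can't forget to stop the clock.
#[derive(Default)]
pub struct SyncMetrics {
    run_start: Option<Instant>,
    phases: BTreeMap<&'static str, Duration>,
}

impl SyncMetrics {
    pub fn start_run(&mut self) {
        self.run_start = Some(Instant::now());
    }

    /// Run `f` as a named phase, accumulating its wall-clock duration.
    pub fn phase<T>(&mut self, name: &'static str, f: impl FnOnce() -> T) -> T {
        let t0 = Instant::now();
        let out = f();
        *self.phases.entry(name).or_default() += t0.elapsed();
        out
    }

    pub fn phase_ms(&self, name: &str) -> u128 {
        self.phases.get(name).map(|d| d.as_millis()).unwrap_or(0)
    }

    pub fn wall_clock_ms(&self) -> u128 {
        self.run_start.map(|t| t.elapsed().as_millis()).unwrap_or(0)
    }
}
```

run_sync would serialize the accumulated phases into metrics_json on completion (step 3).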

Decision: lore ingest goes away

lore sync becomes the single command for all data fetching. First run does a full fetch (equivalent to today's lore ingest), subsequent runs are incremental via cursors. lore ingest becomes a hidden deprecated alias.

Implications:

  • P0 instrumentation only needs to be built in one place (run_sync)
  • CP3 Gate C owns the sync_runs lifecycle end-to-end
  • The existing lore ingest issues / lore ingest mrs code becomes internal functions called by run_sync, not standalone CLI commands
  • lore sync always syncs everything: issues, MRs, discussions, documents, embeddings (with --no-embed / --no-docs to opt out of later stages)

Implementation Sequence

Phase A: Before CP3 (independent, enriches data model)

Do first. Migration + struct changes only. No architectural dependency. Gets richer source data into the DB before CP3's document generation pipeline locks in its schema.

  1. P1 batch: free data capture - All ~11 fields in a single migration. user_notes_count, upvotes, downvotes, confidential, has_conflicts, blocking_discussions_resolved, merge_commit_sha, discussion_locked, task_completion_status, issue_type, issue references.
  2. P1-10: MR milestones - Reuse existing issue milestone transformer. Slightly more work, same migration.
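A hypothetical shape for that single migration (column types and the split across tables are assumptions; column names simply mirror the GitLab API fields):

```rust
/// Sketch of the Phase A migration. Booleans stored as 0/1 integers,
/// matching common SQLite practice; a few representative columns shown.
pub const PHASE_A_MIGRATION: &str = "
ALTER TABLE issues ADD COLUMN user_notes_count INTEGER;
ALTER TABLE issues ADD COLUMN upvotes INTEGER;
ALTER TABLE issues ADD COLUMN downvotes INTEGER;
ALTER TABLE issues ADD COLUMN confidential INTEGER;
ALTER TABLE issues ADD COLUMN issue_type TEXT;
ALTER TABLE merge_requests ADD COLUMN has_conflicts INTEGER;
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
-- ...remaining P1 fields follow the same pattern
";
```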

Phase B: During CP3 Gate C (lore sync)

Build instrumentation into the sync orchestrator as it's constructed. Not a separate effort.

  1. P0-1 + P0-2 + P0-3 - SyncMetrics struct accumulating phase timings. run_sync writes to sync_runs with full metrics_json on completion.
  2. P0-4 - Decide: use CP3's pending_discussion_fetches queue or existing per-MR telemetry columns. Wire up the winner.
  3. P0-5 - Add elapsed_ms to *Complete progress event variants.
  4. Deprecate lore ingest - Hidden alias pointing to lore sync. Remove from help output.

Phase C: After CP3 ships, informed by real metrics

Only pursue items that P0 data proves matter.

  1. P2-1: Discussion optimization - Check metrics_json from real runs. If discussion phase is <10% of wall-clock, skip.
  2. P2-2: Keyset pagination - Check primary fetch timing on largest project. If fast, skip.
  3. P2-4: Labels enrichment - If label colors are needed for any UI surface.

Phase D: Future (needs a forcing function)

  1. P3-1: Users table - When a UI needs display names / avatars.
  2. P2-3: ETags - Only if P2-1 doesn't sufficiently reduce discussion overhead.
  3. P3-2/3/4: GraphQL, Events API, Webhooks - Architectural shifts. Only if pull-based sync hits a scaling wall.

Priority 0: Observability (prerequisite for everything else)

We can't evaluate any efficiency question without measurement. Gitlore has no runtime performance instrumentation. The infrastructure for it was scaffolded (sync_runs table, metrics_json column, discussion sync telemetry columns) but never wired up.

P0-1: sync_runs table is never written to

Location: Schema in migrations/001_initial.sql:25-34, read in src/cli/commands/sync_status.rs:69-72

The table exists and lore status reads from it, but no code ever INSERTs or UPDATEs rows. The entire audit trail is empty.

```sql
-- Exists in schema, never populated
CREATE TABLE sync_runs (
  id INTEGER PRIMARY KEY,
  started_at INTEGER NOT NULL,
  heartbeat_at INTEGER NOT NULL,
  finished_at INTEGER,
  status TEXT NOT NULL,        -- 'running' | 'succeeded' | 'failed'
  command TEXT NOT NULL,
  error TEXT,
  metrics_json TEXT            -- never written
);
```

What to do: Instrument the ingest orchestrator to record sync runs. Each lore ingest issues / lore ingest mrs invocation should:

  • INSERT a row with status='running' at start
  • UPDATE with status='succeeded'/'failed' + finished_at on completion
  • Populate metrics_json with the IngestProjectResult / IngestMrProjectResult counters
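The two statements above, sketched as Rust helpers. The ?N placeholders assume rusqlite-style positional binding, which is a guess about the project's DB layer:

```rust
/// INSERT at the start of a run; started_at doubles as the initial
/// heartbeat. Bind: ?1 = unix timestamp, ?2 = command string.
pub fn start_sync_run_sql() -> &'static str {
    "INSERT INTO sync_runs (started_at, heartbeat_at, status, command) \
     VALUES (?1, ?1, 'running', ?2)"
}

/// UPDATE on completion. Bind: ?1 = finished_at, ?2 = metrics_json,
/// ?3 = the row id returned by the INSERT.
pub fn finish_sync_run_sql(succeeded: bool) -> String {
    format!(
        "UPDATE sync_runs SET status = '{}', finished_at = ?1, \
         metrics_json = ?2 WHERE id = ?3",
        if succeeded { "succeeded" } else { "failed" }
    )
}
```

On failure the same UPDATE would additionally set the error column.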

P0-2: No operation timing anywhere

Location: Rate limiter in src/gitlab/client.rs:20-65, orchestrator in src/ingestion/orchestrator.rs

Instant::now() is used only for rate limiter enforcement. No operation durations are measured or logged. We don't know:

  • How long a full issue ingest takes
  • How long discussion sync takes per entity
  • How long individual API requests take (network latency)
  • How long database writes take per batch
  • How much total time rate limiter sleeps add up to per run
  • How long pagination takes across pages

What to do: Add timing instrumentation at these levels:

| Level | What to time | Where |
|---|---|---|
| Run | Total ingest wall-clock time | orchestrator entry/exit |
| Phase | Primary fetch vs discussion sync | orchestrator phase boundaries |
| API call | Individual HTTP request round-trip | client.rs request method |
| DB write | Transaction duration per batch | ingestion store functions |
| Rate limiter | Cumulative sleep time per run | client.rs acquire() |

Store phase-level and run-level timing in metrics_json. Log API-call-level timing at debug level.

P0-3: metrics_json has no defined schema

What to do: Define what goes in there. Strawman based on existing IngestProjectResult fields plus timing:

```json
{
  "wall_clock_ms": 14200,
  "phases": {
    "primary_fetch": {
      "duration_ms": 8400,
      "api_calls": 12,
      "items_fetched": 1143,
      "items_upserted": 87,
      "pages": 12,
      "rate_limit_sleep_ms": 1200
    },
    "discussion_sync": {
      "duration_ms": 5800,
      "entities_checked": 87,
      "entities_synced": 14,
      "entities_skipped": 73,
      "api_calls": 22,
      "discussions_fetched": 156,
      "notes_upserted": 412,
      "rate_limit_sleep_ms": 2200
    }
  },
  "db": {
    "labels_created": 3,
    "raw_payloads_stored": 87,
    "raw_payloads_deduped": 42
  }
}
```

P0-4: Discussion sync telemetry columns are dead code

Location: merge_requests table columns: discussions_sync_last_attempt_at, discussions_sync_attempts, discussions_sync_last_error

These exist in the schema but are never read or written. They were designed for tracking retry behavior on failed discussion syncs.

What to do: Wire these up during discussion sync. On attempt: set last_attempt_at and increment attempts. On failure: set last_error. On success: reset attempts to 0. This provides per-entity visibility into discussion sync health.
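The transitions above as a small state sketch. Field names mirror the existing columns; clearing last_error on success is an assumption, since only resetting attempts is specified:

```rust
/// Per-MR discussion sync telemetry (mirrors the three dead columns).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DiscussionSyncTelemetry {
    pub attempts: u32,
    pub last_attempt_at: Option<i64>, // unix timestamp
    pub last_error: Option<String>,
}

impl DiscussionSyncTelemetry {
    pub fn on_attempt(&mut self, now: i64) {
        self.last_attempt_at = Some(now);
        self.attempts += 1;
    }

    pub fn on_failure(&mut self, err: &str) {
        self.last_error = Some(err.to_string());
    }

    pub fn on_success(&mut self) {
        self.attempts = 0;
        self.last_error = None; // assumption: clear stale errors too
    }
}
```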

P0-5: Progress events carry no timing

Location: src/ingestion/orchestrator.rs:28-53

ProgressEvent variants (IssueFetched, DiscussionSynced, etc.) carry only counts. Adding elapsed_ms to at least *Complete variants would give callers (CLI progress bars, robot mode output) real throughput numbers.
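For example (variant and field names are illustrative, not the real enum in orchestrator.rs):

```rust
/// Sketch: counts-only variant unchanged, *Complete variant gains elapsed_ms.
pub enum ProgressEvent {
    IssueFetched { count: usize },
    IssuesComplete { total: usize, elapsed_ms: u64 },
}

/// Throughput a progress bar or robot-mode consumer could derive.
pub fn items_per_sec(total: usize, elapsed_ms: u64) -> f64 {
    if elapsed_ms == 0 {
        return 0.0;
    }
    total as f64 * 1000.0 / elapsed_ms as f64
}
```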


Priority 1: Free data capture (zero API cost)

These fields are already in the API responses gitlore receives. Storing them requires only Rust struct additions and DB column migrations. No additional API calls.

P1-1: user_notes_count (Issues + MRs)

API field: user_notes_count (integer)
Value: Could short-circuit discussion re-sync. If count hasn't changed, discussions probably haven't changed either. Also useful for "most discussed" queries.
Effort: Add field to serde struct, add DB column, store during transform.

P1-2: upvotes / downvotes (Issues + MRs)

API field: upvotes, downvotes (integers)
Value: Engagement metrics for triage. "Most upvoted open issues" is a common query.
Effort: Same pattern as above.

P1-3: confidential (Issues)

API field: confidential (boolean)
Value: Security-sensitive filtering. Important to know when exposing issue data.
Effort: Low.

P1-4: has_conflicts (MRs)

API field: has_conflicts (boolean)
Value: Identify MRs needing rebase. Useful for "stale MR" detection.
Effort: Low.

P1-5: blocking_discussions_resolved (MRs)

API field: blocking_discussions_resolved (boolean)
Value: MR readiness indicator without joining the discussions table.
Effort: Low.

P1-6: merge_commit_sha (MRs)

API field: merge_commit_sha (string, nullable)
Value: Trace merged MRs to specific commits in git history.
Effort: Low.

P1-7: discussion_locked (Issues + MRs)

API field: discussion_locked (boolean)
Value: Know if new comments can be added. Useful for robot mode consumers.
Effort: Low.

P1-8: task_completion_status (Issues + MRs)

API field: task_completion_status (object: {count, completed_count})
Value: Track task-list checkbox progress without parsing markdown.
Effort: Low. Store as two integer columns or a small JSON blob.

P1-9: issue_type (Issues)

API field: issue_type (string: "issue" | "incident" | "test_case")
Value: Distinguish issues vs incidents vs test cases for filtering.
Effort: Low.

P1-10: MR milestone (MRs)

API field: milestone (object, same structure as on issues)
Current state: Milestones are fully stored for issues but completely ignored for MRs.
Value: "Which MRs are in milestone X?" Currently impossible to query locally.
Effort: Medium - reuse existing milestone transformer from issue pipeline.

P1-11: Issue references (Issues)

API field: references (object: {short, relative, full})
Current state: Stored for MRs (references_short, references_full), dropped for issues.
Value: Cross-project issue references (e.g., group/project#42).
Effort: Low.


Priority 2: Efficiency improvements (requires measurement from P0 first)

These are potential optimizations. Do not implement until P0 instrumentation proves they matter.

P2-1: Discussion full-refresh strategy

Current behavior: When an issue/MR's updated_at advances, ALL its discussions are deleted and re-fetched from scratch.

Potential optimization: Use user_notes_count (P1-1) to detect whether discussions actually changed. Skip re-sync if count is unchanged.

Why we need P0 first: The full-refresh may be fast enough. Since we already fetch the data from GitLab, the DELETE+INSERT is just local SQLite I/O. If discussion sync for a typical entity takes <100ms locally, this isn't worth optimizing. We need the per-entity timing from P0-2 to know.

Trade-offs to consider:

  • Full-refresh catches edited and deleted notes. Incremental would miss those.
  • user_notes_count doesn't change when notes are edited, only when added/removed.
  • Full-refresh is simpler to reason about for consistency.
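The short-circuit itself is tiny. The sketch below encodes the never-synced case and inherits the edited-notes caveat from the trade-offs above:

```rust
/// Returns true when the discussion re-sync can be skipped because the
/// API-reported note count matches the stored one. Caveat from the
/// trade-offs: an unchanged count can hide *edited* notes, so a periodic
/// full refresh would still be needed for consistency.
pub fn can_skip_discussion_sync(stored_count: Option<u32>, api_count: u32) -> bool {
    match stored_count {
        Some(stored) => stored == api_count,
        None => false, // never synced before; must fetch
    }
}
```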

P2-2: Keyset pagination

Current behavior: Offset-based (page=N&per_page=100).
Alternative: Keyset pagination (pagination=keyset), O(1) per page instead of O(N).

Why we need P0 first: Only matters for large projects (>10K issues). Most projects will never hit enough pages for this to be measurable. P0 timing of pagination will show if this is a bottleneck.

Note: Gitlore already parses Link headers for next-page detection, which is the client-side mechanism keyset pagination uses. So partial support exists.
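With keyset pagination the server computes the next cursor, so the client's only job is to follow the Link header's rel="next" URL. A minimal parser sketch of that mechanism (gitlore already has its own Link parsing; this just illustrates what the switch relies on, and ignores edge cases like commas inside URLs):

```rust
/// Extract the rel="next" target from an HTTP Link header value, e.g.
/// `<https://host/api?id_after=42>; rel="next"`. Returns None when
/// there is no next page.
pub fn next_page_url(link_header: &str) -> Option<String> {
    link_header.split(',').find_map(|part| {
        let (url_part, params) = part.split_once(';')?;
        if params.contains("rel=\"next\"") {
            Some(
                url_part
                    .trim()
                    .trim_matches(|c| c == '<' || c == '>')
                    .to_string(),
            )
        } else {
            None
        }
    })
}
```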

P2-3: ETag / conditional requests

Current behavior: All requests are unconditional.
Alternative: Cache ETags, send If-None-Match, get 304s back.

Why we need P0 first: The cursor-based sync already avoids re-fetching unchanged data for primary resources. ETags would mainly help with discussion re-fetches where nothing changed. If P2-1 (user_notes_count skip) is implemented, ETags become less valuable.
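If it ever proves worthwhile, the cache side is simple. A minimal sketch (header names are standard HTTP; keying the cache by URL is an assumption about how gitlore would integrate it):

```rust
use std::collections::HashMap;

/// Minimal per-URL ETag cache.
#[derive(Default)]
pub struct EtagCache {
    by_url: HashMap<String, String>,
}

impl EtagCache {
    /// Value to send in the If-None-Match request header, if any.
    pub fn if_none_match(&self, url: &str) -> Option<&str> {
        self.by_url.get(url).map(String::as_str)
    }

    /// Record the ETag from a 200 response. A later 304 response means
    /// the cached body is still valid and no DB work is needed.
    pub fn store(&mut self, url: &str, etag: &str) {
        self.by_url.insert(url.to_string(), etag.to_string());
    }
}
```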

P2-4: Labels API enrichment

Current behavior: Labels extracted from the labels[] string array in issue/MR responses. The labels table has color and description columns that may not be populated.
Alternative: Single call to GET /projects/:id/labels per project per sync to populate label metadata.
Cost: 1 API call per project per sync run.
Value: Label colors for UI rendering, descriptions for tooltips.


Priority 3: Structural improvements (future consideration)

P3-1: Users table

Current state: Only username stored. Author name, avatar_url, web_url, state are in every API response but discarded.
Proposal: Create a users table, upsert on every encounter. Zero API cost.
Value: Richer user display, detect blocked/deactivated users.

P3-2: GraphQL API for field-precise fetching

Current state: REST API returns ~40-50 fields per entity. Gitlore uses ~15-23.
Alternative: GraphQL API allows requesting exactly the fields needed.
Trade-offs: Different pagination model, potentially less stable API, more complex client code. The bandwidth savings are real but likely minor compared to discussion re-fetch overhead.

P3-3: Events API for lightweight change detection

Endpoint: GET /projects/:id/events
Value: Lightweight "has anything changed?" check before running full issue/MR sync. Could replace or supplement the cursor-based approach for very active projects.

P3-4: Webhook-based push sync

Endpoint: POST /projects/:id/hooks (setup), then receive pushes.
Value: Near-real-time sync without polling cost. Eliminates all rate-limit concerns.
Barrier: Requires a listener endpoint, which changes the architecture from pull-only CLI to something with a daemon/server component.


Working notes

Space for recording decisions as we work through items.

Decisions made

| Item | Decision | Rationale |
|---|---|---|
| lore ingest | Remove. lore sync is the single entry point. | No reason to separate initial load from incremental updates. First run = full fetch, subsequent = cursor-based delta. |
| CP3 alignment | Build P0 instrumentation into CP3 Gate C, not separately. | Avoids building in two places. lore sync owns the full lifecycle. |
| P2 timing | Defer all efficiency optimizations until P0 metrics from real runs are available. | Can't evaluate trade-offs without measurement. |

Open questions

  • What's the typical project size (issue/MR count) for gitlore users? This determines whether keyset pagination (P2-2) matters.
  • Is there a plan for a web UI or TUI? That would increase the value of P3-1 (users table) and P2-4 (label colors).