Migration 015 adds merge_commit_sha/squash_commit_sha to merge_requests
(Gate 4/5 prerequisites), closes_issues_synced_for_updated_at watermark
for incremental sync, and the missing idx_label_events_label index.
The MR transformer and ingestion pipeline now populate commit SHAs during
sync. The orchestrator uses watermark-based filtering for closes_issues
jobs instead of re-enqueuing all MRs every sync.
The Phase B PRD is updated to match the actual codebase: corrected
migration numbering (011-015), documented nullable label/milestone
fields (migration 012), watermark patterns (013), observability
infrastructure (014), simplified source_method values, and updated
entity_references schema to match implementation.
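
The watermark filtering described above can be sketched as a small predicate. This is an illustration only, not the orchestrator's actual code: the function name, the millisecond-epoch representation, and the exact comparison are assumptions. Only the column name closes_issues_synced_for_updated_at comes from the migration.

```rust
/// Returns true when an MR should be enqueued for a closes_issues fetch.
///
/// `updated_at_ms` is the MR's current updated_at value; `synced_for_ms`
/// mirrors the closes_issues_synced_for_updated_at column (None = never synced).
/// The name and the comparison are hypothetical, chosen for illustration.
fn needs_closes_issues_sync(updated_at_ms: i64, synced_for_ms: Option<i64>) -> bool {
    match synced_for_ms {
        // Never synced: always enqueue.
        None => true,
        // Enqueue only if the MR changed after the last recorded sync.
        Some(watermark) => updated_at_ms > watermark,
    }
}
```

With a check of this shape, an MR whose updated_at has not moved past its watermark is skipped instead of being re-enqueued on every sync.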
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HTTP client initialization (embedding/ollama.rs, gitlab/client.rs):
- Replace expect/panic with unwrap_or_else fallback to default Client
- Log warning when configured client fails to build
- Prevents crash on TLS/system configuration issues
Doctor command (cli/commands/doctor.rs):
- Handle reqwest Client::builder() failure in Ollama health check
- Return Warning status with descriptive message instead of panicking
- Ensures doctor command remains operational even with HTTP issues
These changes improve resilience when running in unusual environments
(containers with limited TLS, restrictive network policies, etc.)
without affecting normal operation.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extends the GitLab client to fetch the list of issues that an MR will close
when merged, using the /projects/:id/merge_requests/:iid/closes_issues endpoint.
New type:
- GitLabIssueRef: Lightweight issue reference with id, iid, project_id, title,
state, and web_url. Used for the closes_issues response which returns a list
of issue summaries rather than full GitLabIssue objects.
New client method:
- fetch_mr_closes_issues(gitlab_project_id, iid): Returns Vec<GitLabIssueRef>
for all issues that the MR's description/commits indicate will be closed.
This enables building the entity_references table from API data in addition to
parsing system notes, providing more reliable cross-reference discovery.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three defensive improvements from peer code review:
Replace unreachable!() in GitLab client retry loops:
  Both request() and request_with_headers() had unreachable!() after
  their for loops. While the logic was sound (the final iteration always
  reaches the return/break), any refactor to the loop condition would
  turn this into a runtime panic. Restructured both to store
  last_response with explicit break, making the control flow
  self-documenting and the .expect() message useful if ever violated.

Doctor model name comparison asymmetry:
  Ollama model names were stripped of their tag (:latest, :v1.5) for
  comparison, but the configured model name was compared as-is. A config
  value like "nomic-embed-text:v1.5" would never match. Now strips the
  tag from both sides before comparing.

Regenerator savepoint cleanup and progress accuracy:
  - upsert_document's error path did ROLLBACK TO but never RELEASE,
    leaving a dangling savepoint that could nest on the next call. Added
    RELEASE after rollback so the connection is clean.
  - estimated_total for progress reporting was computed once at start but
    the dirty queue can grow during processing. Now recounts each loop
    iteration with max() so the progress fraction never goes backwards.
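
The model name fix above amounts to normalizing both sides before comparing. A minimal sketch, assuming the tag is everything after the first ':' (the helper names strip_tag and model_names_match are hypothetical, not taken from doctor.rs):

```rust
/// Strip an Ollama-style tag suffix (":latest", ":v1.5") from a model name.
/// Assumes the tag starts at the first ':' in the name.
fn strip_tag(name: &str) -> &str {
    // split() always yields at least one item, so the fallback is never hit
    // in practice; it only avoids an unwrap.
    name.split(':').next().unwrap_or(name)
}

/// Compare a configured model name against an installed one, ignoring tags
/// on both sides (the original bug stripped only the installed name).
fn model_names_match(configured: &str, installed: &str) -> bool {
    strip_tag(configured) == strip_tag(installed)
}
```

Under this sketch, a configured "nomic-embed-text:v1.5" matches an installed "nomic-embed-text:latest", which the one-sided comparison rejected.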
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Targeted fixes across multiple subsystems:
dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
(None returns global counts, Some(pid) returns per-project counts)
gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
behavior, not warnings) and add structured fields (path, status_code)
for better log filtering and aggregation
gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
to epoch 0, making data quality issues visible in logs
ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx
search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
eliminating the NaN edge case entirely (total_cmp treats NaN consistently)
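
For the search/rrf change, the difference is that f64::total_cmp defines a total order, so the comparator never needs a fallback for NaN. A minimal sketch, assuming results are (document_id, score) pairs (the tuple layout and function name are hypothetical):

```rust
/// Sort (document_id, score) pairs by descending score.
///
/// total_cmp gives every f64, including NaN, a fixed position in the order,
/// so no partial_cmp().unwrap_or() fallback is needed.
fn sort_by_score_desc(results: &mut [(i64, f64)]) {
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
}
```

Where a NaN lands depends on its sign bit, but finite scores always end up in descending order relative to each other.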
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
11 isomorphic performance fixes from deep audit (no behavior changes):
- Eliminate double serialization: store_payload now accepts pre-serialized
bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)
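
The Cow<[u8]> point in the first bullet can be illustrated as follows. This is a sketch under stated assumptions, not store_payload itself: the Compressor alias, toy_compress, and payload_bytes are hypothetical names, and the "compressor" only prepends a marker byte so the owned path can be exercised without a compression library.

```rust
use std::borrow::Cow;

/// Hypothetical compressor signature; a real one would wrap a compression library.
type Compressor = fn(&[u8]) -> Vec<u8>;

/// Toy stand-in for compression: prepends a marker byte and copies the input.
fn toy_compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len() + 1);
    out.push(0xFF);
    out.extend_from_slice(input);
    out
}

/// Return the bytes to store. Without a compressor the caller's buffer is
/// borrowed as-is (no copy); with one, a new owned buffer is produced.
fn payload_bytes<'a>(serialized: &'a [u8], compressor: Option<Compressor>) -> Cow<'a, [u8]> {
    match compressor {
        Some(compress) => Cow::Owned(compress(serialized)),
        None => Cow::Borrowed(serialized),
    }
}
```

Both branches hand back something that derefs to &[u8], so the code that writes the payload does not need to know which path was taken.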
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The sync pipeline was bottlenecked at 10 req/s (hardcoded) with
sequential project processing and no retry on rate limiting. These
changes target 3-5x throughput improvement.
Rate limit configuration:
- Add requestsPerSecond to SyncConfig (default 30.0, was hardcoded 10)
- Pass configured rate through to GitLabClient::new from ingest
- Floor rate at 0.1 rps in RateLimiter::new to prevent a panic in
Duration::from_secs_f64(1.0 / 0.0), which is now reachable via user config
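
The floor matters because 1.0 / 0.0 is infinity and Duration::from_secs_f64 panics on non-finite input. A minimal sketch of the guard, assuming the limiter derives its spacing from the configured rate (min_interval and MIN_RPS are hypothetical names; only the 0.1 value comes from the commit):

```rust
use std::time::Duration;

/// Lowest accepted rate in requests per second (0.1 per the commit text).
const MIN_RPS: f64 = 0.1;

/// Minimum spacing between requests for a configured rate.
fn min_interval(requests_per_second: f64) -> Duration {
    // f64::max ignores a NaN operand, so zero, negative, and NaN inputs
    // all fall back to MIN_RPS and the division stays finite.
    let rps = requests_per_second.max(MIN_RPS);
    Duration::from_secs_f64(1.0 / rps)
}
```

At the new default of 30.0 rps this yields roughly 33 ms between requests; a configured rate of 0 yields 10 s instead of a panic.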
429 auto-retry:
- Both request() and request_with_headers() retry up to 3 times on
HTTP 429, respecting the retry-after header (default 60s)
- Extract parse_retry_after helper, reused by handle_response fallback
- After exhausting retries, the 429 error propagates as before
- JSON decode errors now include a preview of the response body
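
A helper like parse_retry_after could look roughly like this. The signature is an assumption (the real helper may take a header map or return a Duration), and this sketch handles only the whole-seconds form of the header, not the HTTP-date form:

```rust
/// Fallback wait when retry-after is missing or unparseable (60s per the commit text).
const DEFAULT_RETRY_AFTER_SECS: u64 = 60;

/// Parse a retry-after header value given as whole seconds.
/// Hypothetical signature; anything that does not parse falls back to the default.
fn parse_retry_after(header: Option<&str>) -> u64 {
    header
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_RETRY_AFTER_SECS)
}
```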
Concurrent project ingestion:
- Derive Clone on GitLabClient (cheap: shares Arc<Mutex<RateLimiter>>
and reqwest::Client which is already Arc-backed)
- Restructure project loop to use futures::stream::buffer_unordered
with primary_concurrency (default 4) as the parallelism bound
- Each project gets its own SQLite connection (WAL mode + busy_timeout
handles concurrent writes)
- Add show_spinner field to IngestDisplay to separate the per-project
spinner from the sync-level stage spinner
- Error aggregation defers failures: all successful projects get their
summaries printed and results counted before returning the first error
- Bump dependentConcurrency default from 2 to 8 for discussion prefetch
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GitLab returns null for the label/milestone fields on resource_label_events
and resource_milestone_events when the referenced label or milestone has
been deleted. This caused deserialization failures during sync.
- Add migration 012 to recreate both event tables with nullable
label_name, milestone_title, and milestone_id columns (SQLite
requires table recreation to alter NOT NULL constraints)
- Change GitLabLabelEvent.label and GitLabMilestoneEvent.milestone
to Option<> in the Rust types
- Update upsert functions to pass through None values correctly
- Add tests for null label and null milestone deserialization
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
client.rs:
- fetch_all_resource_events() now uses tokio::try_join!() to fire all
three API requests (state, label, milestone events) concurrently
instead of awaiting each sequentially. For entities with many events,
this reduces wall-clock time by up to ~3x since the three independent
HTTP round-trips overlap.
main.rs:
- Removed async from handle_issues() and handle_mrs(). These functions
perform only synchronous database queries and formatting; they never
await anything. Removing the async annotation avoids the overhead of
an unnecessary Future state machine and makes the sync nature of
these code paths explicit.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
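
For reference, three of the clippy rewrites listed above look like this in isolation. The functions are hypothetical examples, not code from this repository; the "before" form is shown in the comments.

```rust
/// Range::contains() instead of manual comparisons.
fn is_success(status: u16) -> bool {
    // Before: status >= 200 && status < 300
    (200..300).contains(&status)
}

/// i64::from() instead of an `as i64` cast (lossless, checked at compile time).
fn widen(iid: i32) -> i64 {
    // Before: iid as i64
    i64::from(iid)
}

/// .clamp() instead of a .max().min() chain.
fn clamp_limit(requested: usize) -> usize {
    // Before: requested.max(1).min(100)
    requested.clamp(1, 100)
}
```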
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extends GitLabClient with methods for fetching resource events from
GitLab's per-entity API endpoints. Adds a new impl block containing:
- fetch_all_pages<T>: Generic paginated collector that handles
x-next-page header parsing with fallback to page-size heuristics.
Uses per_page=100 and respects the existing rate limiter via
request_with_headers. Terminates when: (a) x-next-page header is
absent/stale, (b) response is empty, or (c) page is not full.
- Six typed endpoint methods:
- fetch_issue_state_events / fetch_mr_state_events
- fetch_issue_label_events / fetch_mr_label_events
- fetch_issue_milestone_events / fetch_mr_milestone_events
- fetch_all_resource_events: Convenience method that fetches all three
event types for an entity (issue or merge_request) in sequence,
returning a tuple of (state, label, milestone) event vectors.
Routes to issue or MR endpoints based on entity_type string.
All methods follow the existing client patterns: path formatting with
gitlab_project_id and iid, error propagation via Result, and rate
limiter integration through the shared request_with_headers path.
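
One possible reading of the fetch_all_pages termination rules, written as a pure decision function. This is a sketch, not the client's code: the name next_page, the parameters, and the treatment of a "stale" header as one that does not point past the current page are all assumptions.

```rust
/// Decide which page to request next, or None to stop.
///
/// `x_next_page` is the raw x-next-page header value, if the server sent one.
fn next_page(
    current_page: u32,
    items_on_page: usize,
    per_page: usize,
    x_next_page: Option<&str>,
) -> Option<u32> {
    // An empty or partial page is treated as the last page.
    if items_on_page == 0 || items_on_page < per_page {
        return None;
    }
    match x_next_page.and_then(|v| v.trim().parse::<u32>().ok()) {
        // Header present and pointing forward: follow it.
        Some(next) if next > current_page => Some(next),
        // Header present but not advancing: treat as stale and stop.
        Some(_) => None,
        // Header missing or empty on a full page: fall back to the
        // page-size heuristic and try the next page.
        None => Some(current_page + 1),
    }
}
```

With this shape, a final page that happens to be exactly full costs one extra request, which returns an empty page and ends the loop.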
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds six new types for deserializing responses from GitLab's three
Resource Events API endpoints (state, label, milestone):
- GitLabStateEvent: State transitions with optional user, source_commit,
and source_merge_request reference
- GitLabLabelEvent: Label add/remove events with nested GitLabLabelRef
- GitLabMilestoneEvent: Milestone assignment changes with nested
GitLabMilestoneRef
- GitLabMergeRequestRef: Lightweight MR reference (iid, title, web_url)
- GitLabLabelRef: Label metadata (id, name, color, description)
- GitLabMilestoneRef: Milestone metadata (id, iid, title)
All types derive Deserialize + Serialize and use Option<T> for nullable
fields (user, source_commit, color, description) to match GitLab's API
contract where these fields may be null.
Includes 8 new test cases covering:
- State events with/without user, with/without source_merge_request
- Label events for add and remove actions, including null color handling
- Milestone event deserialization
- Standalone ref type deserialization (MR, label, milestone)
Uses r##"..."## raw string delimiters where JSON contains hex color
codes (#FF0000) that would conflict with r#"..."# delimiters.
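
The delimiter issue is easy to reproduce: inside r#"..."#, the sequence `"#` at the start of "#FF0000" would end the literal early. A minimal illustration (the JSON payload is a made-up fixture, not one of the actual test cases):

```rust
/// With r#"..."# the `"#` in `"#FF0000"` would terminate the string;
/// doubling the hashes lets the hex color appear verbatim.
const LABEL_EVENT_JSON: &str =
    r##"{"action":"add","label":{"id":1,"name":"bug","color":"#FF0000"}}"##;
```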
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two targeted fixes to the GitLab API client:
1. Pagination: When the x-next-page header is missing but the current
page returned a full page of results, heuristically advance to the
next page instead of stopping. This fixes silent data truncation
observed with certain GitLab instances that omit pagination headers
on intermediate pages. The existing early-exit on empty or partial
pages remains as the termination condition.
2. Rate limiter: Refactor the async acquire() method into a synchronous
check_delay() that computes the required sleep duration and updates
last_request time while holding the mutex, then releases the lock
before sleeping. This eliminates holding the Mutex<RateLimiter>
across an await point, which previously could block other request
tasks unnecessarily during the sleep interval.
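
The rate limiter change in item 2 can be sketched with std types only. The struct layout, the explicit `now` parameter (added here for testability), and the choice to record the reserved slot as now + delay are assumptions; the commit only states that the delay is computed and last_request updated under the lock, with the sleep happening after release.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

struct RateLimiter {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateLimiter {
    /// Synchronous: compute how long the caller must wait and reserve the slot.
    fn check_delay(&mut self, now: Instant) -> Option<Duration> {
        let delay = match self.last_request {
            Some(last) => {
                let elapsed = now.saturating_duration_since(last);
                if elapsed < self.min_interval {
                    Some(self.min_interval - elapsed)
                } else {
                    None
                }
            }
            None => None,
        };
        // Record when this request will actually go out, so the next caller
        // is spaced relative to it rather than to the current instant.
        self.last_request = Some(now + delay.unwrap_or(Duration::ZERO));
        delay
    }
}

/// Take the lock only long enough to compute the delay. The guard is dropped
/// when this function returns, so the caller sleeps without holding the mutex.
fn required_delay(limiter: &Mutex<RateLimiter>) -> Option<Duration> {
    let mut guard = limiter.lock().unwrap();
    guard.check_delay(Instant::now())
}
```

An async caller would then await a sleep for the returned duration, after the lock has already been released.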
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Duplicate ISO 8601 timestamp parsing functions existed in both
discussion.rs and merge_request.rs transformers. This extracts
iso_to_ms_strict() and iso_to_ms_opt_strict() into core::time
as the single source of truth, and updates both transformer
modules to use the shared implementations.
Also removes the private now_ms() from merge_request.rs in
favor of the existing core::time::now_ms(), and replaces the
local parse_timestamp_opt() in discussion.rs with the public
iso_to_ms() from core::time.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces NormalizedMergeRequest transformer and updates discussion
normalization to handle both issue and MR discussions polymorphically.
New transformers:
- NormalizedMergeRequest: Transforms API MergeRequest to database row,
extracting labels/assignees/reviewers into separate collections for
junction table insertion. Handles draft detection, detailed_merge_status
preference over deprecated merge_status, and merge_user over merged_by.
Discussion transformer updates:
- NormalizedDiscussion now takes noteable_type ("Issue" | "MergeRequest")
and noteable_id for polymorphic FK binding
- normalize_discussions_for_issue(): Convenience wrapper for issues
- normalize_discussions_for_mr(): Convenience wrapper for MRs
- DiffNote position fields (type, line_range, SHA triplet) now extracted
from API position object for code review context
Design decisions:
- Transformer returns (normalized_item, labels, assignees, reviewers)
tuple for efficient batch insertion without re-querying
- Timestamps converted to ms epoch for SQLite storage consistency
- Optional fields use map() chains for clean null handling
The polymorphic discussion approach allows reusing the same discussions
and notes tables for both issues and MRs, with noteable_type + FK
determining the parent relationship.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extends GitLabClient with endpoints for fetching merge requests and
their discussions, following the same patterns established for issues.
New methods:
- fetch_merge_requests(): Paginated MR listing with cursor support,
using updated_after filter for incremental sync. Uses scope=all so
results are not limited to MRs the token's user authored or is assigned.
- fetch_merge_requests_single_page(): Single page variant for callers
managing their own pagination (used by parallel prefetch)
- fetch_mr_discussions(): Paginated discussion listing for a single MR,
returns full discussion trees with notes
API design notes:
- Uses keyset pagination (order_by=updated_at, keyset=true) for
consistent results during sync operations
- MR endpoint uses /merge_requests (not /mrs) per GitLab API naming
- Discussion endpoint matches issue pattern for consistency
- per_page defaults to 100 (GitLab max) for efficiency
The fetch_merge_requests_single_page method enables the parallel
prefetch strategy used in mr_discussions.rs, where multiple MRs'
discussions are fetched concurrently during the sweep phase.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extends GitLab type definitions with comprehensive merge request support,
matching the API response structure for /projects/:id/merge_requests.
New types:
- MergeRequest: Full MR metadata including draft status, branch info,
detailed_merge_status, merge_user (modern API fields replacing
deprecated alternatives), and references for cross-project support
- MrReviewer: Reviewer user info (MR-specific, distinct from assignees)
- MrAssignee: Assignee user info with consistent structure
- MrDiscussion: MR discussion wrapper for polymorphic handling
- DiffNotePosition: Rich position data for code review comments with
line ranges and SHA triplet for commit context
Design decisions:
- Use Option<T> for all nullable API fields to handle partial responses
- Include deprecated fields (merged_by, merge_status) alongside modern
alternatives for backward compatibility with older GitLab instances
- DiffNotePosition uses Option for all fields since different position
types (text/image/file) populate different subsets
These types enable type-safe deserialization of GitLab MR API responses
with full coverage of the fields needed for CP2 ingestion.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This is a P0 fix from the CP1-CP2 alignment audit. The original
NormalizedDiscussion struct had issue_id as a non-optional i64 and
hardcoded noteable_type to "Issue", making it incompatible with merge
request discussions even though the database schema already supports
both via nullable columns and a CHECK constraint.
Changes:
- Add NoteableRef enum with Issue(i64) and MergeRequest(i64) variants
to provide compile-time safety against mixing up issue vs MR IDs
- Change NormalizedDiscussion.issue_id from i64 to Option<i64>
- Add NormalizedDiscussion.merge_request_id: Option<i64>
- Update transform_discussion() signature to take NoteableRef instead
of local_issue_id, deriving issue_id/merge_request_id/noteable_type
from the enum variant
- Update upsert_discussion() SQL to include merge_request_id column
(now 12 parameters instead of 11)
- Export NoteableRef from transformers module
- Add test for MergeRequest discussion transformation
- Update all existing tests to use NoteableRef::Issue(id)
The database schema (migration 002) was forward-thinking and already
supports both issue_id and merge_request_id as nullable columns with
a CHECK constraint. This change prepares the application layer for
CP2 merge request support without requiring any migrations.
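
The derivation of issue_id/merge_request_id/noteable_type from the enum variant can be sketched as below. The enum variants and the "Issue"/"MergeRequest" strings come from the commit text; the helper name noteable_columns and its tuple return type are hypothetical.

```rust
/// Parent of a discussion: exactly one of an issue or a merge request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NoteableRef {
    Issue(i64),
    MergeRequest(i64),
}

/// Derive (issue_id, merge_request_id, noteable_type) from the variant, so the
/// two nullable columns can never both be set or both be empty.
fn noteable_columns(noteable: NoteableRef) -> (Option<i64>, Option<i64>, &'static str) {
    match noteable {
        NoteableRef::Issue(id) => (Some(id), None, "Issue"),
        NoteableRef::MergeRequest(id) => (None, Some(id), "MergeRequest"),
    }
}
```

Because callers must pick a variant, passing an MR's local ID where an issue ID is expected becomes a type error rather than a bad row.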
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Provides a typed interface to the GitLab API with pagination support.
src/gitlab/types.rs - API response type definitions:
- GitLabIssue: Full issue payload with author, assignees, labels
- GitLabDiscussion: Discussion thread with notes array
- GitLabNote: Individual note with author, timestamps, body
- GitLabAuthor/GitLabUser: User information with avatar URLs
- GitLabProject: Project metadata from /api/v4/projects
- GitLabVersion: GitLab instance version from /api/v4/version
- GitLabNotePosition: Line-level position for diff notes
- All types derive Deserialize for JSON parsing
src/gitlab/client.rs - HTTP client with authentication:
- Bearer token authentication from config
- Base URL configuration for self-hosted instances
- Paginated iteration via keyset or offset pagination
- Automatic Link header parsing for next page URLs
- Per-page limit control (default 100)
- Methods: get_user(), get_version(), get_project()
- Async stream for issues: list_issues_paginated()
- Async stream for discussions: list_issue_discussions_paginated()
- Respects GitLab rate limiting via response headers
src/gitlab/transformers/ - API to database mapping:
transformers/issue.rs - Issue transformation:
- Maps GitLabIssue to IssueRow for database insert
- Extracts milestone ID and due date
- Normalizes author/assignee usernames
- Preserves label IDs for junction table
- Returns IssueWithMetadata including label/assignee lists
transformers/discussion.rs - Discussion transformation:
- Maps GitLabDiscussion to NormalizedDiscussion
- Extracts thread metadata (resolvable, resolved)
- Flattens notes to NormalizedNote with foreign keys
- Handles system notes vs user notes
- Preserves note position for diff discussions
transformers/mod.rs - Re-exports all transformer types
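
The Link header parsing mentioned for client.rs can be illustrated with a small std-only function. This is a simplified sketch, not the client's implementation: next_link is a hypothetical name, and splitting on ',' assumes the URLs themselves contain no commas.

```rust
/// Extract the URL marked rel="next" from a Link header value, if any.
///
/// Expects entries of the form `<url>; rel="name"` separated by commas.
fn next_link(link_header: &str) -> Option<&str> {
    link_header.split(',').find_map(|entry| {
        // Separate the <url> part from its parameters.
        let (target, params) = entry.split_once(';')?;
        let is_next = params.split(';').any(|p| p.trim() == "rel=\"next\"");
        if !is_next {
            return None;
        }
        // Strip the surrounding angle brackets.
        target.trim().strip_prefix('<')?.strip_suffix('>')
    })
}
```

When the header has no rel="next" entry (for example on the last page), the function returns None and iteration stops.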
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>