gitlore

Author	SHA1	Message	Date
Taylor Eernisse	233eb546af	feat: Add commit SHAs, closes_issues watermark, and PRD alignment Migration 015 adds merge_commit_sha/squash_commit_sha to merge_requests (Gate 4/5 prerequisites), closes_issues_synced_for_updated_at watermark for incremental sync, and the missing idx_label_events_label index. The MR transformer and ingestion pipeline now populate commit SHAs during sync. The orchestrator uses watermark-based filtering for closes_issues jobs instead of re-enqueuing all MRs every sync. The Phase B PRD is updated to match the actual codebase: corrected migration numbering (011-015), documented nullable label/milestone fields (migration 012), watermark patterns (013), observability infrastructure (014), simplified source_method values, and updated entity_references schema to match implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-05 15:29:51 -05:00
Taylor Eernisse	db750e4fc5	fix: Graceful HTTP client fallbacks and overflow protection HTTP client initialization (embedding/ollama.rs, gitlab/client.rs): - Replace expect/panic with unwrap_or_else fallback to default Client - Log warning when configured client fails to build - Prevents crash on TLS/system configuration issues Doctor command (cli/commands/doctor.rs): - Handle reqwest Client::builder() failure in Ollama health check - Return Warning status with descriptive message instead of panicking - Ensures doctor command remains operational even with HTTP issues These changes improve resilience when running in unusual environments (containers with limited TLS, restrictive network policies, etc.) without affecting normal operation. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 11:21:40 -05:00
Taylor Eernisse	65583ed5d6	refactor: Remove redundant doc comments throughout codebase Removes module-level doc comments (//! lines) and excessive inline doc comments that were duplicating information already evident from: - Function/struct names (self-documenting code) - Type signatures (the what is clear from types) - Implementation context (the how is clear from code) Affected modules: - cli/* - Removed command descriptions duplicating clap help text - core/* - Removed module headers and obvious function docs - documents/* - Removed extractor/regenerator/truncation docs - embedding/* - Removed pipeline and chunking docs - gitlab/* - Removed client and transformer docs (kept type definitions) - ingestion/* - Removed orchestrator and ingestion docs - search/* - Removed FTS and vector search docs Philosophy: Code should be self-documenting. Comments should explain "why" (business decisions, non-obvious constraints) not "what" (which the code itself shows). This change reduces noise and maintenance burden while keeping the codebase just as understandable. Retains comments for: - Non-obvious business logic - Important safety invariants - Complex algorithm explanations - Public API boundaries where generated docs matter Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:04:32 -05:00
Taylor Eernisse	26cf13248d	feat(gitlab): Add MR closes_issues API endpoint and GitLabIssueRef type Extends the GitLab client to fetch the list of issues that an MR will close when merged, using the /projects/:id/merge_requests/:iid/closes_issues endpoint. New type: - GitLabIssueRef: Lightweight issue reference with id, iid, project_id, title, state, and web_url. Used for the closes_issues response which returns a list of issue summaries rather than full GitLabIssue objects. New client method: - fetch_mr_closes_issues(gitlab_project_id, iid): Returns Vec<GitLabIssueRef> for all issues that the MR's description/commits indicate will be closed. This enables building the entity_references table from API data in addition to parsing system notes, providing more reliable cross-reference discovery. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:03:30 -05:00
Taylor Eernisse	925ec9f574	fix: Retry loop safety, doctor model matching, regenerator robustness Three defensive improvements from peer code review: Replace unreachable!() in GitLab client retry loops: Both request() and request_with_headers() had unreachable!() after their for loops. While the logic was sound (the final iteration always reaches the return/break), any refactor to the loop condition would turn this into a runtime panic. Restructured both to store last_response with explicit break, making the control flow self-documenting and the .expect() message useful if ever violated. Doctor model name comparison asymmetry: Ollama model names were stripped of their tag (:latest, :v1.5) for comparison, but the configured model name was compared as-is. A config value like "nomic-embed-text:v1.5" would never match. Now strips the tag from both sides before comparing. Regenerator savepoint cleanup and progress accuracy: - upsert_document's error path did ROLLBACK TO but never RELEASE, leaving a dangling savepoint that could nest on the next call. Added RELEASE after rollback so the connection is clean. - estimated_total for progress reporting was computed once at start but the dirty queue can grow during processing. Now recounts each loop iteration with max() so the progress fraction never goes backwards. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 14:16:54 -05:00
teernisse	86a51cddef	fix: Project-scoped job claiming, structured rate-limit logging, RRF total_cmp Targeted fixes across multiple subsystems: dependent_queue: - Add project_id parameter to claim_jobs() for project-scoped job claiming, preventing cross-project job theft during concurrent multi-project ingestion - Add project_id parameter to count_pending_jobs() with optional scoping (None returns global counts, Some(pid) returns per-project counts) gitlab/client: - Downgrade rate-limit log from warn to info (429s are expected operational behavior, not warnings) and add structured fields (path, status_code) for better log filtering and aggregation gitlab/transformers/discussion: - Add tracing::warn on invalid timestamp parse instead of silent fallback to epoch 0, making data quality issues visible in logs ingestion/merge_requests: - Remove duplicate doc comment on upsert_label_tx search/rrf: - Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting, eliminating the NaN edge case entirely (total_cmp treats NaN consistently) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 13:39:13 -05:00
Taylor Eernisse	ee5c5f9645	perf: Eliminate double serialization, add SQLite tuning, optimize hot paths 11 isomorphic performance fixes from deep audit (no behavior changes): - Eliminate double serialization: store_payload now accepts pre-serialized bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses Cow<[u8]> for zero-copy when compression is disabled. - Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas - Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT RETURNING in both issues.rs and merge_requests.rs - Replace INSERT + SELECT milestone upsert with RETURNING - Use prepare_cached for 5 hot-path queries in extractor.rs - Optimize compute_list_hash: index-sort + incremental SHA-256 instead of clone+sort+join+hash - Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity - Replace RandomState::new() in rand_jitter with atomic counter XOR nanos - Remove redundant per-note payload storage (discussion payload contains all notes already) - Change transform_issue to accept &GitLabIssue (avoids full struct clone) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 08:12:37 -05:00
Taylor Eernisse	f5b4a765b7	perf: Configurable rate limit, 429 auto-retry, concurrent project ingestion The sync pipeline was bottlenecked at 10 req/s (hardcoded) with sequential project processing and no retry on rate limiting. These changes target 3-5x throughput improvement. Rate limit configuration: - Add requestsPerSecond to SyncConfig (default 30.0, was hardcoded 10) - Pass configured rate through to GitLabClient::new from ingest - Floor rate at 0.1 rps in RateLimiter::new to prevent panic on Duration::from_secs_f64(1.0 / 0.0) — now reachable via user config 429 auto-retry: - Both request() and request_with_headers() retry up to 3 times on HTTP 429, respecting the retry-after header (default 60s) - Extract parse_retry_after helper, reused by handle_response fallback - After exhausting retries, the 429 error propagates as before - Improved JSON decode errors now include a response body preview Concurrent project ingestion: - Derive Clone on GitLabClient (cheap: shares Arc<Mutex<RateLimiter>> and reqwest::Client which is already Arc-backed) - Restructure project loop to use futures::stream::buffer_unordered with primary_concurrency (default 4) as the parallelism bound - Each project gets its own SQLite connection (WAL mode + busy_timeout handles concurrent writes) - Add show_spinner field to IngestDisplay to separate the per-project spinner from the sync-level stage spinner - Error aggregation defers failures: all successful projects get their summaries printed and results counted before returning the first error - Bump dependentConcurrency default from 2 to 8 for discussion prefetch Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:37:06 -05:00
Taylor Eernisse	a92e176bb6	fix(events): Handle nullable label and milestone in resource events GitLab returns null for the label/milestone fields on resource_label_events and resource_milestone_events when the referenced label or milestone has been deleted. This caused deserialization failures during sync. - Add migration 012 to recreate both event tables with nullable label_name, milestone_title, and milestone_id columns (SQLite requires table recreation to alter NOT NULL constraints) - Change GitLabLabelEvent.label and GitLabMilestoneEvent.milestone to Option<> in the Rust types - Update upsert functions to pass through None values correctly - Add tests for null label and null milestone deserialization Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:36:17 -05:00
Taylor Eernisse	deafa88af5	perf: Concurrent resource event fetching, remove unnecessary async client.rs: - fetch_all_resource_events() now uses tokio::try_join!() to fire all three API requests (state, label, milestone events) concurrently instead of awaiting each sequentially. For entities with many events, this reduces wall-clock time by up to ~3x since the three independent HTTP round-trips overlap. main.rs: - Removed async from handle_issues() and handle_mrs(). These functions perform only synchronous database queries and formatting; they never await anything. Removing the async annotation avoids the overhead of an unnecessary Future state machine and makes the sync nature of these code paths explicit. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 14:09:44 -05:00
Taylor Eernisse	a50fc78823	style: Apply cargo fmt and clippy fixes across codebase Automated formatting and lint corrections from parallel agent work: - cargo fmt: import reordering (alphabetical), line wrapping to respect max width, trailing comma normalization, destructuring alignment, function signature reformatting, match arm formatting - clippy (pedantic): Range::contains() instead of manual comparisons, i64::from() instead of `as i64` casts, .clamp() instead of .max().min() chains, let-chain refactors (if-let with &&), #[allow(clippy::too_many_arguments)] and #[allow(clippy::field_reassign_with_default)] where warranted - Removed trailing blank lines and extra whitespace No behavioral changes. All existing tests pass unmodified. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 13:01:59 -05:00
Taylor Eernisse	e73d2907dc	feat(client): Add Resource Events API endpoints with generic paginated fetcher Extends GitLabClient with methods for fetching resource events from GitLab's per-entity API endpoints. Adds a new impl block containing: - fetch_all_pages<T>: Generic paginated collector that handles x-next-page header parsing with fallback to page-size heuristics. Uses per_page=100 and respects the existing rate limiter via request_with_headers. Terminates when: (a) x-next-page header is absent/stale, (b) response is empty, or (c) page is not full. - Six typed endpoint methods: - fetch_issue_state_events / fetch_mr_state_events - fetch_issue_label_events / fetch_mr_label_events - fetch_issue_milestone_events / fetch_mr_milestone_events - fetch_all_resource_events: Convenience method that fetches all three event types for an entity (issue or merge_request) in sequence, returning a tuple of (state, label, milestone) event vectors. Routes to issue or MR endpoints based on entity_type string. All methods follow the existing client patterns: path formatting with gitlab_project_id and iid, error propagation via Result, and rate limiter integration through the shared request_with_headers path. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 12:07:19 -05:00
Taylor Eernisse	92ff255909	feat(types): Add GitLab Resource Event serde types with deserialization tests Adds six new types for deserializing responses from GitLab's three Resource Events API endpoints (state, label, milestone): - GitLabStateEvent: State transitions with optional user, source_commit, and source_merge_request reference - GitLabLabelEvent: Label add/remove events with nested GitLabLabelRef - GitLabMilestoneEvent: Milestone assignment changes with nested GitLabMilestoneRef - GitLabMergeRequestRef: Lightweight MR reference (iid, title, web_url) - GitLabLabelRef: Label metadata (id, name, color, description) - GitLabMilestoneRef: Milestone metadata (id, iid, title) All types derive Deserialize + Serialize and use Option<T> for nullable fields (user, source_commit, color, description) to match GitLab's API contract where these fields may be null. Includes 8 new test cases covering: - State events with/without user, with/without source_merge_request - Label events for add and remove actions, including null color handling - Milestone event deserialization - Standalone ref type deserialization (MR, label, milestone) Uses r##"..."## raw string delimiters where JSON contains hex color codes (#FF0000) that would conflict with r#"..."# delimiters. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 12:06:56 -05:00
Taylor Eernisse	d31d5292f2	fix(gitlab): Improve pagination heuristics and fix rate limiter lock contention Two targeted fixes to the GitLab API client: 1. Pagination: When the x-next-page header is missing but the current page returned a full page of results, heuristically advance to the next page instead of stopping. This fixes silent data truncation observed with certain GitLab instances that omit pagination headers on intermediate pages. The existing early-exit on empty or partial pages remains as the termination condition. 2. Rate limiter: Refactor the async acquire() method into a synchronous check_delay() that computes the required sleep duration and updates last_request time while holding the mutex, then releases the lock before sleeping. This eliminates holding the Mutex<RateLimiter> across an await point, which previously could block other request tasks unnecessarily during the sleep interval. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 15:46:05 -05:00
Taylor Eernisse	390f8a9288	refactor(core): Centralize timestamp parsing in core::time Duplicate ISO 8601 timestamp parsing functions existed in both discussion.rs and merge_request.rs transformers. This extracts iso_to_ms_strict() and iso_to_ms_opt_strict() into core::time as the single source of truth, and updates both transformer modules to use the shared implementations. Also removes the private now_ms() from merge_request.rs in favor of the existing core::time::now_ms(), and replaces the local parse_timestamp_opt() in discussion.rs with the public iso_to_ms() from core::time. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 08:41:34 -05:00
Taylor Eernisse	d33f24c91b	feat(transformers): Add MR transformer and polymorphic discussion support Introduces NormalizedMergeRequest transformer and updates discussion normalization to handle both issue and MR discussions polymorphically. New transformers: - NormalizedMergeRequest: Transforms API MergeRequest to database row, extracting labels/assignees/reviewers into separate collections for junction table insertion. Handles draft detection, detailed_merge_status preference over deprecated merge_status, and merge_user over merged_by. Discussion transformer updates: - NormalizedDiscussion now takes noteable_type ("Issue" | "MergeRequest") and noteable_id for polymorphic FK binding - normalize_discussions_for_issue(): Convenience wrapper for issues - normalize_discussions_for_mr(): Convenience wrapper for MRs - DiffNote position fields (type, line_range, SHA triplet) now extracted from API position object for code review context Design decisions: - Transformer returns (normalized_item, labels, assignees, reviewers) tuple for efficient batch insertion without re-querying - Timestamps converted to ms epoch for SQLite storage consistency - Optional fields use map() chains for clean null handling The polymorphic discussion approach allows reusing the same discussions and notes tables for both issues and MRs, with noteable_type + FK determining the parent relationship. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 22:45:29 -05:00
Taylor Eernisse	cc8c489fd2	feat(gitlab): Add MR and MR discussion API endpoints to client Extends GitLabClient with endpoints for fetching merge requests and their discussions, following the same patterns established for issues. New methods: - fetch_merge_requests(): Paginated MR listing with cursor support, using updated_after filter for incremental sync. Uses 'all' scope to include MRs where user is author/assignee/reviewer. - fetch_merge_requests_single_page(): Single page variant for callers managing their own pagination (used by parallel prefetch) - fetch_mr_discussions(): Paginated discussion listing for a single MR, returns full discussion trees with notes API design notes: - Uses keyset pagination (order_by=updated_at, keyset=true) for consistent results during sync operations - MR endpoint uses /merge_requests (not /mrs) per GitLab API naming - Discussion endpoint matches issue pattern for consistency - Per_page defaults to 100 (GitLab max) for efficiency The fetch_merge_requests_single_page method enables the parallel prefetch strategy used in mr_discussions.rs, where multiple MRs' discussions are fetched concurrently during the sweep phase. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 22:45:13 -05:00
Taylor Eernisse	a18908c377	feat(gitlab): Add MergeRequest and related types for API deserialization Extends GitLab type definitions with comprehensive merge request support, matching the API response structure for /projects/:id/merge_requests. New types: - MergeRequest: Full MR metadata including draft status, branch info, detailed_merge_status, merge_user (modern API fields replacing deprecated alternatives), and references for cross-project support - MrReviewer: Reviewer user info (MR-specific, distinct from assignees) - MrAssignee: Assignee user info with consistent structure - MrDiscussion: MR discussion wrapper for polymorphic handling - DiffNotePosition: Rich position data for code review comments with line ranges and SHA triplet for commit context Design decisions: - Use Option<T> for all nullable API fields to handle partial responses - Include deprecated fields (merged_by, merge_status) alongside modern alternatives for backward compatibility with older GitLab instances - DiffNotePosition uses Option for all fields since different position types (text/image/file) populate different subsets These types enable type-safe deserialization of GitLab MR API responses with full coverage of the fields needed for CP2 ingestion. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 22:44:58 -05:00
Taylor Eernisse	d9d749ac57	fix(discussion): Make NormalizedDiscussion polymorphic for MR support This is a P0 fix from the CP1-CP2 alignment audit. The original NormalizedDiscussion struct had issue_id as a non-optional i64 and hardcoded noteable_type to "Issue", making it incompatible with merge request discussions even though the database schema already supports both via nullable columns and a CHECK constraint. Changes: - Add NoteableRef enum with Issue(i64) and MergeRequest(i64) variants to provide compile-time safety against mixing up issue vs MR IDs - Change NormalizedDiscussion.issue_id from i64 to Option<i64> - Add NormalizedDiscussion.merge_request_id: Option<i64> - Update transform_discussion() signature to take NoteableRef instead of local_issue_id, deriving issue_id/merge_request_id/noteable_type from the enum variant - Update upsert_discussion() SQL to include merge_request_id column (now 12 parameters instead of 11) - Export NoteableRef from transformers module - Add test for MergeRequest discussion transformation - Update all existing tests to use NoteableRef::Issue(id) The database schema (migration 002) was forward-thinking and already supports both issue_id and merge_request_id as nullable columns with a CHECK constraint. This change prepares the application layer for CP2 merge request support without requiring any migrations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 17:00:49 -05:00
Taylor Eernisse	dd5eb04953	feat(gitlab): Implement GitLab REST API client and type definitions Provides a typed interface to the GitLab API with pagination support. src/gitlab/types.rs - API response type definitions: - GitLabIssue: Full issue payload with author, assignees, labels - GitLabDiscussion: Discussion thread with notes array - GitLabNote: Individual note with author, timestamps, body - GitLabAuthor/GitLabUser: User information with avatar URLs - GitLabProject: Project metadata from /api/v4/projects - GitLabVersion: GitLab instance version from /api/v4/version - GitLabNotePosition: Line-level position for diff notes - All types derive Deserialize for JSON parsing src/gitlab/client.rs - HTTP client with authentication: - Bearer token authentication from config - Base URL configuration for self-hosted instances - Paginated iteration via keyset or offset pagination - Automatic Link header parsing for next page URLs - Per-page limit control (default 100) - Methods: get_user(), get_version(), get_project() - Async stream for issues: list_issues_paginated() - Async stream for discussions: list_issue_discussions_paginated() - Respects GitLab rate limiting via response headers src/gitlab/transformers/ - API to database mapping: transformers/issue.rs - Issue transformation: - Maps GitLabIssue to IssueRow for database insert - Extracts milestone ID and due date - Normalizes author/assignee usernames - Preserves label IDs for junction table - Returns IssueWithMetadata including label/assignee lists transformers/discussion.rs - Discussion transformation: - Maps GitLabDiscussion to NormalizedDiscussion - Extracts thread metadata (resolvable, resolved) - Flattens notes to NormalizedNote with foreign keys - Handles system notes vs user notes - Preserves note position for diff discussions transformers/mod.rs - Re-exports all transformer types Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 11:28:21 -05:00
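
Note on commits d31d5292f2 and f5b4a765b7: the rate limiter change described there (compute the delay under the lock, release the lock, then sleep, with the rate floored at 0.1 rps) can be sketched roughly as follows. This is a minimal standard-library sketch, not the repository's code. The struct fields, the `wait_for_slot` helper, and the choice to reserve the next slot by storing `now + delay` are assumptions made for illustration, and a blocking `std::thread::sleep` stands in for the async sleep the real client would use.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the client's rate limiter; field names are assumed.
pub struct RateLimiter {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateLimiter {
    pub fn new(requests_per_second: f64) -> Self {
        // Floor the rate at 0.1 rps so 1.0 / rate stays finite; a rate of 0.0
        // would otherwise make Duration::from_secs_f64 panic on infinity.
        let rps = requests_per_second.max(0.1);
        Self {
            min_interval: Duration::from_secs_f64(1.0 / rps),
            last_request: None,
        }
    }

    /// Synchronous: returns how long the caller must sleep and records the
    /// slot it has claimed. Nothing here awaits or blocks.
    pub fn check_delay(&mut self, now: Instant) -> Duration {
        let delay = match self.last_request {
            Some(last) => (last + self.min_interval).saturating_duration_since(now),
            None => Duration::ZERO,
        };
        // Assumption: reserve the slot at `now + delay` so concurrent callers queue up.
        self.last_request = Some(now + delay);
        delay
    }
}

/// Hypothetical caller: the mutex guard is dropped before sleeping, so other
/// tasks can compute their own delay while this one waits.
pub fn wait_for_slot(limiter: &Mutex<RateLimiter>) {
    let delay = {
        let mut guard = limiter.lock().unwrap();
        guard.check_delay(Instant::now())
    }; // guard released here, before any sleep
    if !delay.is_zero() {
        std::thread::sleep(delay);
    }
}
```

Because `check_delay` takes the current time as a parameter, the spacing logic can be exercised without real sleeping by calling it twice with the same `Instant`.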

20 Commits
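
Note on commit 86a51cddef: the search/rrf change replaces `partial_cmp().unwrap_or()` with `total_cmp()` for f64 sorting. The sketch below shows the general idea under assumed names. `sort_by_score_desc` and the `(u64, f64)` result shape are hypothetical and not taken from the repository; only `f64::total_cmp` and `f64::partial_cmp` are real standard-library methods.

```rust
/// Sort hypothetical (document id, fused score) pairs by score, highest first.
///
/// `f64::partial_cmp` returns `None` when either side is NaN, so a comparator
/// built on it needs a fallback ordering. `f64::total_cmp` defines a total
/// order over every f64 value, NaN included, so no fallback is needed.
pub fn sort_by_score_desc(results: &mut [(u64, f64)]) {
    // Comparing b to a (rather than a to b) yields descending order.
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
}
```

With this comparator a NaN score ends up at one end of the list, depending on its sign bit, while the ordinary scores stay in descending order relative to one another.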