Commit Graph

22 Commits

Author SHA1 Message Date
Taylor Eernisse
95b7183add feat(who): expand expert + overlap queries with mr_file_changes and mr_reviewers
Chain: bd-jec (config flag) -> bd-2yo (fetch MR diffs) -> bd-3qn6 (rewrite who queries)

- Add fetch_mr_file_changes config option and --no-file-changes CLI flag
- Add GitLab MR diffs API fetch pipeline with watermark-based sync
- Create migration 020 for diffs_synced_for_updated_at watermark column
- Rewrite query_expert() and query_overlap() to use 4-signal UNION ALL:
  DiffNote reviewers, DiffNote MR authors, file-change authors, file-change reviewers
- Deduplicate across signal types via COUNT(DISTINCT CASE WHEN ... THEN mr_id END)
- Add insert_file_change test helper, 8 new who tests, all 397 tests pass
- Also includes: list performance migration 019, autocorrect module, README updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 13:35:14 -05:00
Taylor Eernisse
e6b880cbcb fix: prevent panics in robot-mode JSON output and arithmetic paths
Peer code review found multiple panic-reachable paths:

1. serde_json::to_string().unwrap() in 4 robot-mode output functions
   (who.rs, main.rs x3). If serialization ever failed (e.g., NaN from
   edge-case division), the CLI would panic with an unhelpful stack trace.
   Replaced with unwrap_or_else that emits a structured JSON error fallback.

2. encode_rowid() in chunk_ids.rs used unchecked multiplication
   (document_id * 1000). On extreme document IDs this could silently wrap
   in release mode, causing embedding rowid collisions. Now uses
   checked_mul + checked_add with a diagnostic panic message.

3. HTTP response body truncation at byte index 500 in client.rs could
   split a multi-byte UTF-8 character, causing a panic. Now uses
   floor_char_boundary(500) for safe truncation.

4. who.rs reviews mode: SQL used `m.author_username != ?1` which silently
   dropped MRs with NULL author_username (in SQL, NULL != anything
   evaluates to NULL, which a WHERE clause treats as false).
   Changed to `(m.author_username IS NULL OR m.author_username != ?1)`
   to match the pattern already used in expert mode.

5. handle_auth_test hardcoded exit code 5 for all errors regardless of
   type. Config not found (20), token not set (4), and network errors (8)
   all incorrectly returned 5. Now uses e.exit_code() from the actual
   LoreError, with proper suggestion hints in human mode.
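
Items 2 and 3 above can be sketched with the standard library alone. This is a minimal illustration, not the code in chunk_ids.rs or client.rs: the `chunk_index` parameter, the `CHUNK_MULTIPLIER` constant name, and the `truncate_utf8` helper are assumptions. The truncation helper hand-rolls the same idea as floor_char_boundary using is_char_boundary, so it does not depend on the toolchain version.

```rust
/// Multiplier taken from the commit text (document_id * 1000).
const CHUNK_MULTIPLIER: i64 = 1000;

/// Hypothetical signature: combines a document ID and a chunk index into
/// one rowid, panicking with a diagnostic message instead of wrapping.
fn encode_rowid(document_id: i64, chunk_index: i64) -> i64 {
    document_id
        .checked_mul(CHUNK_MULTIPLIER)
        .and_then(|base| base.checked_add(chunk_index))
        .unwrap_or_else(|| {
            panic!(
                "rowid overflow: document_id={}, chunk_index={}",
                document_id, chunk_index
            )
        })
}

/// Truncate to at most `max_bytes` without splitting a UTF-8 character.
fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // Walk back until we land on a character boundary (index 0 always is one).
    let mut end = max_bytes;
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

fn main() {
    println!("{}", encode_rowid(42, 7));
    println!("{}", truncate_utf8("héllo", 2));
}
```

Slicing with `&s[..500]` directly would panic whenever byte 500 falls inside a multi-byte character, which is the failure mode item 3 describes.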

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 07:55:20 -05:00
Taylor Eernisse
233eb546af feat: Add commit SHAs, closes_issues watermark, and PRD alignment
Migration 015 adds merge_commit_sha/squash_commit_sha to merge_requests
(Gate 4/5 prerequisites), closes_issues_synced_for_updated_at watermark
for incremental sync, and the missing idx_label_events_label index.

The MR transformer and ingestion pipeline now populate commit SHAs during
sync. The orchestrator uses watermark-based filtering for closes_issues
jobs instead of re-enqueuing all MRs every sync.

The Phase B PRD is updated to match the actual codebase: corrected
migration numbering (011-015), documented nullable label/milestone
fields (migration 012), watermark patterns (013), observability
infrastructure (014), simplified source_method values, and updated
entity_references schema to match implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-05 15:29:51 -05:00
Taylor Eernisse
db750e4fc5 fix: Graceful HTTP client fallbacks and overflow protection
HTTP client initialization (embedding/ollama.rs, gitlab/client.rs):
- Replace expect/panic with unwrap_or_else fallback to default Client
- Log warning when configured client fails to build
- Prevents crash on TLS/system configuration issues

Doctor command (cli/commands/doctor.rs):
- Handle reqwest Client::builder() failure in Ollama health check
- Return Warning status with descriptive message instead of panicking
- Ensures doctor command remains operational even with HTTP issues

These changes improve resilience when running in unusual environments
(containers with limited TLS, restrictive network policies, etc.)
without affecting normal operation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 11:21:40 -05:00
Taylor Eernisse
65583ed5d6 refactor: Remove redundant doc comments throughout codebase
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the "what" is clear from types)
- Implementation context (the "how" is clear from code)

Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs

Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.

Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:04:32 -05:00
Taylor Eernisse
26cf13248d feat(gitlab): Add MR closes_issues API endpoint and GitLabIssueRef type
Extends the GitLab client to fetch the list of issues that an MR will close
when merged, using the /projects/:id/merge_requests/:iid/closes_issues endpoint.

New type:
- GitLabIssueRef: Lightweight issue reference with id, iid, project_id, title,
  state, and web_url. Used for the closes_issues response which returns a list
  of issue summaries rather than full GitLabIssue objects.

New client method:
- fetch_mr_closes_issues(gitlab_project_id, iid): Returns Vec<GitLabIssueRef>
  for all issues that the MR's description/commits indicate will be closed.

This enables building the entity_references table from API data in addition to
parsing system notes, providing more reliable cross-reference discovery.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:03:30 -05:00
Taylor Eernisse
925ec9f574 fix: Retry loop safety, doctor model matching, regenerator robustness
Three defensive improvements from peer code review:

Replace unreachable!() in GitLab client retry loops:
Both request() and request_with_headers() had unreachable!() after
their for loops. While the logic was sound (the final iteration always
reaches the return/break), any refactor to the loop condition would
turn this into a runtime panic. Restructured both to store
last_response with explicit break, making the control flow
self-documenting and the .expect() message useful if ever violated.

Doctor model name comparison asymmetry:
Ollama model names were stripped of their tag (:latest, :v1.5) for
comparison, but the configured model name was compared as-is. A config
value like "nomic-embed-text:v1.5" would never match. Now strips the
tag from both sides before comparing.
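
A minimal sketch of the symmetric comparison described above, assuming the tag is everything after the first colon. The function names `strip_tag` and `model_matches` are hypothetical and not taken from doctor.rs.

```rust
/// Drop a trailing ":tag" (for example ":latest" or ":v1.5") if present.
fn strip_tag(model: &str) -> &str {
    match model.split_once(':') {
        Some((name, _tag)) => name,
        None => model,
    }
}

/// Compare the configured and installed model names with the tag
/// stripped from both sides.
fn model_matches(configured: &str, installed: &str) -> bool {
    strip_tag(configured) == strip_tag(installed)
}

fn main() {
    println!(
        "{}",
        model_matches("nomic-embed-text:v1.5", "nomic-embed-text:latest")
    );
}
```

Stripping only the installed name, as the old code did, would compare "nomic-embed-text:v1.5" against "nomic-embed-text" and never match.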

Regenerator savepoint cleanup and progress accuracy:
- upsert_document's error path did ROLLBACK TO but never RELEASE,
  leaving a dangling savepoint that could nest on the next call. Added
  RELEASE after rollback so the connection is clean.
- estimated_total for progress reporting was computed once at start but
  the dirty queue can grow during processing. Now recounts each loop
  iteration with max() so the progress fraction never goes backwards.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:16:54 -05:00
teernisse
86a51cddef fix: Project-scoped job claiming, structured rate-limit logging, RRF total_cmp
Targeted fixes across multiple subsystems:

dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
  preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
  (None returns global counts, Some(pid) returns per-project counts)

gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
  behavior, not warnings) and add structured fields (path, status_code)
  for better log filtering and aggregation

gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
  to epoch 0, making data quality issues visible in logs

ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx

search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
  eliminating the NaN edge case entirely (total_cmp defines a total
  order that includes NaN)
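
The search/rrf change can be illustrated as follows. This is a sketch only: the `sort_scores_desc` name and the bare `f64` slice are assumptions, since the real code presumably sorts scored results rather than raw floats.

```rust
/// Sort scores from highest to lowest using f64::total_cmp, which is a
/// total order and therefore never needs an unwrap or a fallback.
fn sort_scores_desc(scores: &mut [f64]) {
    scores.sort_by(|a, b| b.total_cmp(a));
}

fn main() {
    let mut scores = vec![0.25, f64::NAN, 0.75, 0.5];
    sort_scores_desc(&mut scores);
    println!("{:?}", scores);
}
```

With partial_cmp, comparing against NaN returns None, so the comparator has to invent an answer and the resulting order can be inconsistent. total_cmp gives every value, NaN included, a well-defined place.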

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 13:39:13 -05:00
Taylor Eernisse
ee5c5f9645 perf: Eliminate double serialization, add SQLite tuning, optimize hot paths
11 isomorphic performance fixes from deep audit (no behavior changes):

- Eliminate double serialization: store_payload now accepts pre-serialized
  bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
  Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
  RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
  of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
  all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)
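
The Cow<[u8]> point in the first bullet can be sketched like this. The `prepare_payload` name and the caller-supplied compressor closure are assumptions made for illustration; the actual store_payload signature is not shown in this log.

```rust
use std::borrow::Cow;

/// Return the bytes to store. Without a compressor the input slice is
/// borrowed as-is (no copy); with one, a new owned buffer is produced.
fn prepare_payload<'a>(
    bytes: &'a [u8],
    compressor: Option<&dyn Fn(&[u8]) -> Vec<u8>>,
) -> Cow<'a, [u8]> {
    match compressor {
        Some(compress) => Cow::Owned(compress(bytes)),
        None => Cow::Borrowed(bytes),
    }
}

fn main() {
    let raw = b"{\"id\":1}";
    let stored = prepare_payload(raw, None);
    // Borrowed variant: `stored` points at `raw`, nothing was allocated.
    println!("{} bytes, borrowed = {}", stored.len(), matches!(stored, Cow::Borrowed(_)));
}
```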

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 08:12:37 -05:00
Taylor Eernisse
f5b4a765b7 perf: Configurable rate limit, 429 auto-retry, concurrent project ingestion
The sync pipeline was bottlenecked at 10 req/s (hardcoded) with
sequential project processing and no retry on rate limiting. These
changes target 3-5x throughput improvement.

Rate limit configuration:
- Add requestsPerSecond to SyncConfig (default 30.0, was hardcoded 10)
- Pass configured rate through to GitLabClient::new from ingest
- Floor rate at 0.1 rps in RateLimiter::new to prevent a panic in
  Duration::from_secs_f64(1.0 / 0.0), which is now reachable via user config
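
A minimal sketch of the rate floor, assuming the limiter derives its minimum spacing as 1 / rate. The `min_interval` function and `MIN_RPS` constant are hypothetical names.

```rust
use std::time::Duration;

/// Floor from the commit text: never go below 0.1 requests per second.
const MIN_RPS: f64 = 0.1;

/// Convert a configured rate into the minimum spacing between requests.
/// Clamping first keeps 1.0 / rate finite, so from_secs_f64 cannot panic
/// on a zero, negative, or NaN rate.
fn min_interval(requests_per_second: f64) -> Duration {
    let rps = requests_per_second.max(MIN_RPS);
    Duration::from_secs_f64(1.0 / rps)
}

fn main() {
    println!("{:?}", min_interval(30.0));
    println!("{:?}", min_interval(0.0));
}
```

Without the floor, a configured rate of 0 yields 1.0 / 0.0 = infinity, and Duration::from_secs_f64 panics on non-finite input.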

429 auto-retry:
- Both request() and request_with_headers() retry up to 3 times on
  HTTP 429, respecting the retry-after header (default 60s)
- Extract parse_retry_after helper, reused by handle_response fallback
- After exhausting retries, the 429 error propagates as before
- Improved JSON decode errors now include a response body preview
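
The parse_retry_after helper might look roughly like the sketch below. The signature is an assumption, and only the delay-in-seconds form of the header is handled here; an HTTP-date value falls back to the 60s default.

```rust
/// Default wait from the commit text when the header is missing or unusable.
const DEFAULT_RETRY_AFTER_SECS: u64 = 60;

/// Parse a retry-after header value given in whole seconds.
fn parse_retry_after(header: Option<&str>) -> u64 {
    header
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_RETRY_AFTER_SECS)
}

fn main() {
    println!("{}", parse_retry_after(Some("5")));
    println!("{}", parse_retry_after(None));
}
```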

Concurrent project ingestion:
- Derive Clone on GitLabClient (cheap: shares Arc<Mutex<RateLimiter>>
  and reqwest::Client which is already Arc-backed)
- Restructure project loop to use futures::stream::buffer_unordered
  with primary_concurrency (default 4) as the parallelism bound
- Each project gets its own SQLite connection (WAL mode + busy_timeout
  handles concurrent writes)
- Add show_spinner field to IngestDisplay to separate the per-project
  spinner from the sync-level stage spinner
- Error aggregation defers failures: all successful projects get their
  summaries printed and results counted before returning the first error
- Bump dependentConcurrency default from 2 to 8 for discussion prefetch

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:37:06 -05:00
Taylor Eernisse
a92e176bb6 fix(events): Handle nullable label and milestone in resource events
GitLab returns null for the label/milestone fields on resource_label_events
and resource_milestone_events when the referenced label or milestone has
been deleted. This caused deserialization failures during sync.

- Add migration 012 to recreate both event tables with nullable
  label_name, milestone_title, and milestone_id columns (SQLite
  requires table recreation to alter NOT NULL constraints)
- Change GitLabLabelEvent.label and GitLabMilestoneEvent.milestone
  to Option<> in the Rust types
- Update upsert functions to pass through None values correctly
- Add tests for null label and null milestone deserialization

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:36:17 -05:00
Taylor Eernisse
deafa88af5 perf: Concurrent resource event fetching, remove unnecessary async
client.rs:
- fetch_all_resource_events() now uses tokio::try_join!() to fire all
  three API requests (state, label, milestone events) concurrently
  instead of awaiting each sequentially. For entities with many events,
  this reduces wall-clock time by up to ~3x since the three independent
  HTTP round-trips overlap.

main.rs:
- Removed async from handle_issues() and handle_mrs(). These functions
  perform only synchronous database queries and formatting; they never
  await anything. Removing the async annotation avoids the overhead of
  an unnecessary Future state machine and makes the sync nature of
  these code paths explicit.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 14:09:44 -05:00
Taylor Eernisse
a50fc78823 style: Apply cargo fmt and clippy fixes across codebase
Automated formatting and lint corrections from parallel agent work:

- cargo fmt: import reordering (alphabetical), line wrapping to respect
  max width, trailing comma normalization, destructuring alignment,
  function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
  i64::from() instead of `as i64` casts, .clamp() instead of
  .max().min() chains, let-chain refactors (if-let with &&),
  #[allow(clippy::too_many_arguments)] and
  #[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace

No behavioral changes. All existing tests pass unmodified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 13:01:59 -05:00
Taylor Eernisse
e73d2907dc feat(client): Add Resource Events API endpoints with generic paginated fetcher
Extends GitLabClient with methods for fetching resource events from
GitLab's per-entity API endpoints. Adds a new impl block containing:

- fetch_all_pages<T>: Generic paginated collector that handles
  x-next-page header parsing with fallback to page-size heuristics.
  Uses per_page=100 and respects the existing rate limiter via
  request_with_headers. Terminates when: (a) x-next-page header is
  absent/stale, (b) response is empty, or (c) page is not full.

- Six typed endpoint methods:
  - fetch_issue_state_events / fetch_mr_state_events
  - fetch_issue_label_events / fetch_mr_label_events
  - fetch_issue_milestone_events / fetch_mr_milestone_events

- fetch_all_resource_events: Convenience method that fetches all three
  event types for an entity (issue or merge_request) in sequence,
  returning a tuple of (state, label, milestone) event vectors.
  Routes to issue or MR endpoints based on entity_type string.
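
One possible reading of the fetch_all_pages loop, sketched without any HTTP code. The `Page` struct and the closure-based fetcher are assumptions made so the example is self-contained; the real method goes through request_with_headers. Here a fresh x-next-page value is followed when present, and the page-size heuristic is used otherwise.

```rust
/// One page of results plus the parsed x-next-page header, if any.
struct Page<T> {
    items: Vec<T>,
    next_page: Option<u32>,
}

/// Collect every page. `fetch_page` stands in for the HTTP request.
fn fetch_all_pages<T, E, F>(per_page: usize, mut fetch_page: F) -> Result<Vec<T>, E>
where
    F: FnMut(u32) -> Result<Page<T>, E>,
{
    let mut all = Vec::new();
    let mut page = 1u32;
    loop {
        let Page { items, next_page } = fetch_page(page)?;
        let count = items.len();
        all.extend(items);
        if count == 0 {
            break; // empty response
        }
        match next_page {
            // Header present and pointing forward: follow it.
            Some(next) if next > page => page = next,
            // Header absent or stale, but the page was full: guess page + 1.
            _ if count >= per_page => page += 1,
            // Header absent or stale and the page was partial: done.
            _ => break,
        }
    }
    Ok(all)
}

fn main() {
    let data: Vec<u32> = (0..250).collect();
    // Simulate a server that never sends x-next-page.
    let result: Result<Vec<u32>, String> = fetch_all_pages(100, |page| {
        let start = ((page - 1) as usize) * 100;
        let end = (start + 100).min(data.len());
        let items = if start < data.len() { data[start..end].to_vec() } else { Vec::new() };
        Ok(Page { items, next_page: None })
    });
    println!("{}", result.unwrap().len());
}
```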

All methods follow the existing client patterns: path formatting with
gitlab_project_id and iid, error propagation via Result, and rate
limiter integration through the shared request_with_headers path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 12:07:19 -05:00
Taylor Eernisse
92ff255909 feat(types): Add GitLab Resource Event serde types with deserialization tests
Adds six new types for deserializing responses from GitLab's three
Resource Events API endpoints (state, label, milestone):

- GitLabStateEvent: State transitions with optional user, source_commit,
  and source_merge_request reference
- GitLabLabelEvent: Label add/remove events with nested GitLabLabelRef
- GitLabMilestoneEvent: Milestone assignment changes with nested
  GitLabMilestoneRef
- GitLabMergeRequestRef: Lightweight MR reference (iid, title, web_url)
- GitLabLabelRef: Label metadata (id, name, color, description)
- GitLabMilestoneRef: Milestone metadata (id, iid, title)

All types derive Deserialize + Serialize and use Option<T> for nullable
fields (user, source_commit, color, description) to match GitLab's API
contract where these fields may be null.

Includes 8 new test cases covering:
- State events with/without user, with/without source_merge_request
- Label events for add and remove actions, including null color handling
- Milestone event deserialization
- Standalone ref type deserialization (MR, label, milestone)

Uses r##"..."## raw string delimiters where JSON contains hex color
codes (#FF0000) that would conflict with r#"..."# delimiters.
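
A short illustration of the delimiter issue. The JSON body is made up for the example, using the GitLabLabelRef field names listed above.

```rust
// With r#"..."# the sequence `"#` in front of FF0000 would end the
// literal early, so two hashes are used on each side instead.
const LABEL_JSON: &str = r##"{"id": 1, "name": "bug", "color": "#FF0000"}"##;

fn main() {
    println!("{}", LABEL_JSON);
}
```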

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 12:06:56 -05:00
Taylor Eernisse
d31d5292f2 fix(gitlab): Improve pagination heuristics and fix rate limiter lock contention
Two targeted fixes to the GitLab API client:

1. Pagination: When the x-next-page header is missing but the current
   page returned a full page of results, heuristically advance to the
   next page instead of stopping. This fixes silent data truncation
   observed with certain GitLab instances that omit pagination headers
   on intermediate pages. The existing early-exit on empty or partial
   pages remains as the termination condition.

2. Rate limiter: Refactor the async acquire() method into a synchronous
   check_delay() that computes the required sleep duration and updates
   last_request time while holding the mutex, then releases the lock
   before sleeping. This eliminates holding the Mutex<RateLimiter>
   across an await point, which previously could block other request
   tasks unnecessarily during the sleep interval.
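
The rate limiter change in item 2 can be sketched with std types. This is a synchronous stand-in, not the client's code: the real acquire() is async, while this version uses std::sync::Mutex and thread::sleep. The `check_delay_at` variant with an explicit clock value is an assumption added to make the behavior easy to test.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

struct RateLimiter {
    last_request: Option<Instant>,
    min_interval: Duration,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        RateLimiter { last_request: None, min_interval }
    }

    /// Compute how long the caller must wait and reserve its slot by
    /// recording the time at which its request will actually be sent.
    fn check_delay_at(&mut self, now: Instant) -> Option<Duration> {
        let interval = self.min_interval;
        let delay = self
            .last_request
            .and_then(|last| (last + interval).checked_duration_since(now))
            .filter(|d| !d.is_zero());
        self.last_request = Some(now + delay.unwrap_or_default());
        delay
    }

    fn check_delay(&mut self) -> Option<Duration> {
        self.check_delay_at(Instant::now())
    }
}

/// Lock, compute, unlock, then sleep. The guard is a temporary that is
/// dropped at the end of the `let` statement, so the lock is not held
/// while sleeping.
fn acquire(limiter: &Mutex<RateLimiter>) {
    let delay = limiter.lock().unwrap().check_delay();
    if let Some(d) = delay {
        thread::sleep(d);
    }
}

fn main() {
    let limiter = Mutex::new(RateLimiter::new(Duration::from_millis(10)));
    acquire(&limiter);
    acquire(&limiter);
    println!("two requests paced");
}
```

Because each caller records its own scheduled send time before releasing the lock, concurrent callers queue up behind one another without any of them blocking the mutex during its sleep.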

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:46:05 -05:00
Taylor Eernisse
390f8a9288 refactor(core): Centralize timestamp parsing in core::time
Duplicate ISO 8601 timestamp parsing functions existed in both
discussion.rs and merge_request.rs transformers. This extracts
iso_to_ms_strict() and iso_to_ms_opt_strict() into core::time
as the single source of truth, and updates both transformer
modules to use the shared implementations.

Also removes the private now_ms() from merge_request.rs in
favor of the existing core::time::now_ms(), and replaces the
local parse_timestamp_opt() in discussion.rs with the public
iso_to_ms() from core::time.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 08:41:34 -05:00
Taylor Eernisse
d33f24c91b feat(transformers): Add MR transformer and polymorphic discussion support
Introduces NormalizedMergeRequest transformer and updates discussion
normalization to handle both issue and MR discussions polymorphically.

New transformers:
- NormalizedMergeRequest: Transforms API MergeRequest to database row,
  extracting labels/assignees/reviewers into separate collections for
  junction table insertion. Handles draft detection, detailed_merge_status
  preference over deprecated merge_status, and merge_user over merged_by.

Discussion transformer updates:
- NormalizedDiscussion now takes noteable_type ("Issue" | "MergeRequest")
  and noteable_id for polymorphic FK binding
- normalize_discussions_for_issue(): Convenience wrapper for issues
- normalize_discussions_for_mr(): Convenience wrapper for MRs
- DiffNote position fields (type, line_range, SHA triplet) now extracted
  from API position object for code review context

Design decisions:
- Transformer returns (normalized_item, labels, assignees, reviewers)
  tuple for efficient batch insertion without re-querying
- Timestamps converted to ms epoch for SQLite storage consistency
- Optional fields use map() chains for clean null handling

The polymorphic discussion approach allows reusing the same discussions
and notes tables for both issues and MRs, with noteable_type + FK
determining the parent relationship.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 22:45:29 -05:00
Taylor Eernisse
cc8c489fd2 feat(gitlab): Add MR and MR discussion API endpoints to client
Extends GitLabClient with endpoints for fetching merge requests and
their discussions, following the same patterns established for issues.

New methods:
- fetch_merge_requests(): Paginated MR listing with cursor support,
  using updated_after filter for incremental sync. Uses 'all' scope
  to include MRs where user is author/assignee/reviewer.
- fetch_merge_requests_single_page(): Single page variant for callers
  managing their own pagination (used by parallel prefetch)
- fetch_mr_discussions(): Paginated discussion listing for a single MR,
  returns full discussion trees with notes

API design notes:
- Uses keyset pagination (order_by=updated_at, keyset=true) for
  consistent results during sync operations
- MR endpoint uses /merge_requests (not /mrs) per GitLab API naming
- Discussion endpoint matches issue pattern for consistency
- per_page defaults to 100 (GitLab max) for efficiency

The fetch_merge_requests_single_page method enables the parallel
prefetch strategy used in mr_discussions.rs, where multiple MRs'
discussions are fetched concurrently during the sweep phase.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 22:45:13 -05:00
Taylor Eernisse
a18908c377 feat(gitlab): Add MergeRequest and related types for API deserialization
Extends GitLab type definitions with comprehensive merge request support,
matching the API response structure for /projects/:id/merge_requests.

New types:
- MergeRequest: Full MR metadata including draft status, branch info,
  detailed_merge_status, merge_user (modern API fields replacing
  deprecated alternatives), and references for cross-project support
- MrReviewer: Reviewer user info (MR-specific, distinct from assignees)
- MrAssignee: Assignee user info with consistent structure
- MrDiscussion: MR discussion wrapper for polymorphic handling
- DiffNotePosition: Rich position data for code review comments with
  line ranges and SHA triplet for commit context

Design decisions:
- Use Option<T> for all nullable API fields to handle partial responses
- Include deprecated fields (merged_by, merge_status) alongside modern
  alternatives for backward compatibility with older GitLab instances
- DiffNotePosition uses Option for all fields since different position
  types (text/image/file) populate different subsets

These types enable type-safe deserialization of GitLab MR API responses
with full coverage of the fields needed for CP2 ingestion.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 22:44:58 -05:00
Taylor Eernisse
d9d749ac57 fix(discussion): Make NormalizedDiscussion polymorphic for MR support
This is a P0 fix from the CP1-CP2 alignment audit. The original
NormalizedDiscussion struct had issue_id as a non-optional i64 and
hardcoded noteable_type to "Issue", making it incompatible with merge
request discussions even though the database schema already supports
both via nullable columns and a CHECK constraint.

Changes:
- Add NoteableRef enum with Issue(i64) and MergeRequest(i64) variants
  to provide compile-time safety against mixing up issue vs MR IDs
- Change NormalizedDiscussion.issue_id from i64 to Option<i64>
- Add NormalizedDiscussion.merge_request_id: Option<i64>
- Update transform_discussion() signature to take NoteableRef instead
  of local_issue_id, deriving issue_id/merge_request_id/noteable_type
  from the enum variant
- Update upsert_discussion() SQL to include merge_request_id column
  (now 12 parameters instead of 11)
- Export NoteableRef from transformers module
- Add test for MergeRequest discussion transformation
- Update all existing tests to use NoteableRef::Issue(id)
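
A minimal sketch of the NoteableRef idea. The enum variants come from the list above; the `columns` helper and its tuple return are hypothetical, shown only to illustrate how issue_id, merge_request_id, and noteable_type can all be derived from the variant.

```rust
/// Which parent a discussion belongs to. Wrapping the ID in a variant
/// makes it impossible to pass an issue ID where an MR ID is expected.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NoteableRef {
    Issue(i64),
    MergeRequest(i64),
}

impl NoteableRef {
    /// Returns (issue_id, merge_request_id, noteable_type).
    /// Exactly one of the two IDs is ever set.
    fn columns(self) -> (Option<i64>, Option<i64>, &'static str) {
        match self {
            NoteableRef::Issue(id) => (Some(id), None, "Issue"),
            NoteableRef::MergeRequest(id) => (None, Some(id), "MergeRequest"),
        }
    }
}

fn main() {
    println!("{:?}", NoteableRef::Issue(7).columns());
    println!("{:?}", NoteableRef::MergeRequest(9).columns());
}
```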

The database schema (migration 002) was forward-thinking and already
supports both issue_id and merge_request_id as nullable columns with
a CHECK constraint. This change prepares the application layer for
CP2 merge request support without requiring any migrations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 17:00:49 -05:00
Taylor Eernisse
dd5eb04953 feat(gitlab): Implement GitLab REST API client and type definitions
Provides a typed interface to the GitLab API with pagination support.

src/gitlab/types.rs - API response type definitions:
- GitLabIssue: Full issue payload with author, assignees, labels
- GitLabDiscussion: Discussion thread with notes array
- GitLabNote: Individual note with author, timestamps, body
- GitLabAuthor/GitLabUser: User information with avatar URLs
- GitLabProject: Project metadata from /api/v4/projects
- GitLabVersion: GitLab instance version from /api/v4/version
- GitLabNotePosition: Line-level position for diff notes
- All types derive Deserialize for JSON parsing

src/gitlab/client.rs - HTTP client with authentication:
- Bearer token authentication from config
- Base URL configuration for self-hosted instances
- Paginated iteration via keyset or offset pagination
- Automatic Link header parsing for next page URLs
- Per-page limit control (default 100)
- Methods: get_user(), get_version(), get_project()
- Async stream for issues: list_issues_paginated()
- Async stream for discussions: list_issue_discussions_paginated()
- Respects GitLab rate limiting via response headers

src/gitlab/transformers/ - API to database mapping:

transformers/issue.rs - Issue transformation:
- Maps GitLabIssue to IssueRow for database insert
- Extracts milestone ID and due date
- Normalizes author/assignee usernames
- Preserves label IDs for junction table
- Returns IssueWithMetadata including label/assignee lists

transformers/discussion.rs - Discussion transformation:
- Maps GitLabDiscussion to NormalizedDiscussion
- Extracts thread metadata (resolvable, resolved)
- Flattens notes to NormalizedNote with foreign keys
- Handles system notes vs user notes
- Preserves note position for diff discussions

transformers/mod.rs - Re-exports all transformer types

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 11:28:21 -05:00