gitlore

Author	SHA1	Message	Date
teernisse	9ec1344945	feat(surgical-sync): add per-IID surgical sync pipeline with preflight validation Add the ability to sync specific issues or merge requests by IID without running a full incremental sync. This enables fast, targeted data refresh for individual entities — useful for agent workflows, debugging, and real-time investigation of specific issues or MRs. Architecture: - New CLI flags: --issue <IID> and --mr <IID> (repeatable, up to 100 total) scoped to a single project via -p/--project - Preflight phase validates all IIDs exist on GitLab before any DB writes, with TOCTOU-aware soft verification at ingest time - 6-stage pipeline: preflight -> fetch -> ingest -> dependents -> docs -> embed - Each stage is cancellation-aware via ShutdownSignal - Dedicated SyncRunRecorder extensions track surgical-specific counters (issues_fetched, mrs_ingested, docs_regenerated, etc.) New modules: - src/ingestion/surgical.rs: Core surgical fetch/ingest/dependent logic with preflight_fetch(), ingest_issue_by_iid(), ingest_mr_by_iid(), and fetch_dependents_for_{issue,mr}() - src/cli/commands/sync_surgical.rs: Full CLI orchestrator with progress spinners, human/robot output, and cancellation handling - src/embedding/pipeline.rs: embed_documents_by_ids() for scoped embedding - src/documents/regenerator.rs: regenerate_dirty_documents_for_sources() for scoped document regeneration Database changes: - Migration 027: Extends sync_runs with mode, phase, surgical_iids_json, per-entity counters, and cancelled_at column - New indexes: idx_sync_runs_mode_started, idx_sync_runs_status_phase_started GitLab client: - get_issue_by_iid() and get_mr_by_iid() single-entity fetch methods Error handling: - New SurgicalPreflightFailed error variant with entity_type, iid, project, and reason fields. Shares exit code 6 with GitLabNotFound. Includes comprehensive test coverage: - 645 lines of surgical ingestion tests (wiremock-based) - 184 lines of scoped embedding tests - 85 lines of scoped regeneration tests - 113 lines of GitLab client single-entity tests - 236 lines of sync_run surgical column/counter tests - Unit tests for SyncOptions, error codes, and CLI validation	2026-02-18 16:28:21 -05:00
teernisse	eef73decb5	fix(cli): timeline tag width, test env isolation, and logging verbosity Miscellaneous fixes across CLI and core modules: - Timeline: widen TAG_WIDTH from 10 to 11 to accommodate longer event type labels without truncation - render.rs: save and restore LORE_ICONS env var in glyph_mode test to prevent interference from the test environment leaking into or from other tests that set LORE_ICONS - logging.rs: adjust verbose=1 to info level (was debug), verbose=2 to debug — this reduces noise at -v while keeping -vv as the full debug experience - issues.rs, merge_requests.rs: use infodebug! macro consistently for ingestion summary logging Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 11:25:42 -05:00
teernisse	bb6660178c	feat(sync): per-project breakdown, status enrichment progress bars, and summary polish Add per-project detail rows beneath stage completion lines during multi-project syncs, showing itemized counts (issues/MRs, discussions, events, statuses, diffs) for each project. Previously, only aggregate totals were visible, making it hard to diagnose which project contributed what during a sync. Status enrichment gets proper progress bars replacing the old spinner-only display: StatusEnrichmentStarted now carries a total count so the CLI can render a determinate bar with rate and ETA. The enrichment SQL is tightened to use IS NOT comparisons for diff-only UPDATEs (skip rows where values haven't changed), and a follow-up touch_stmt ensures status_synced_at is updated even for unchanged rows so staleness detection works correctly. Other improvements: - New ProjectSummary struct aggregates per-project metrics during ingestion - SyncResult gains statuses_enriched + per-project summary vectors - "Already up to date" message when sync finds zero changes - Remove Arc<AtomicBool> tick_started pattern from docs/embed stages (enable_steady_tick is idempotent, the guard was unnecessary) - Progress bar styling: dim spinner, dark_gray track, per_sec + eta display - Tick intervals tightened from 100ms to 60ms for smoother animation - statuses_without_widget calculation uses fetch_result.statuses.len() instead of subtracting enriched (more accurate when some statuses lack work item widgets) - Status enrichment completion log downgraded from info to debug Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 11:25:33 -05:00
Taylor Eernisse	c6a5461d41	refactor(ingestion): compact log summaries and quieter shutdown messages Migrate all ingestion completion logs to use nonzero_summary() for compact, zero-suppressed output. Before: 8-14 individual key=value structured fields per completion message. After: a single summary field like '42 fetched · 3 labels · 12 notes' that only shows non-zero counters. Also downgrade all 'Shutdown requested...' messages from info! to debug!. These are emitted on every Ctrl+C and add noise to the partial results output that immediately follows. They remain visible at -vv for debugging graceful shutdown behavior. Affected modules: - issues.rs: issue ingestion completion - merge_requests.rs: MR ingestion completion, full-sync cursor reset - mr_discussions.rs: discussion ingestion completion - orchestrator.rs: project-level issue and MR completion summaries, all shutdown-requested checkpoints across discussion sync, resource events drain, closes-issues drain, and MR diffs drain Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:31:57 -05:00
Taylor Eernisse	a7f86b26e4	refactor(core): compact human log format, quieter lock lifecycle, nonzero_summary helper Three quality-of-life improvements to reduce log noise and improve readability: 1. logging.rs: Add CompactHumanFormat for stderr tracing output. Replaces the default format with a minimal 'HH:MM:SS LEVEL message key=value' layout — no span context, no full timestamps, no target module. The JSON file log layer is unaffected. This makes watching 'lore sync' output much cleaner. 2. lock.rs: Downgrade AppLock acquire/release messages from info! to debug!. Lock lifecycle events (acquired new, acquired existing, released) are operational bookkeeping that clutters normal output. They remain visible at -vv verbosity for troubleshooting. 3. ingestion/mod.rs: Add nonzero_summary() utility that formats named counters as a compact middle-dot-separated string, suppressing zero values. Produces output like '42 fetched · 3 labels · 12 notes' instead of verbose key=value structured fields. Returns 'nothing to update' when all values are zero. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:31:30 -05:00
teernisse	0aecbf33c0	feat(xref): extract cross-references from descriptions, user notes, and fix system note regex - Fix MENTIONED_RE/CLOSED_BY_RE to match real GitLab format ('mentioned in issue #N' / 'mentioned in merge request !N') - Add GITLAB_URL_RE + parse_url_refs() for full URL extraction - Add extract_refs_from_descriptions() -> source_method='description_parse' - Add extract_refs_from_user_notes() -> source_method='note_parse' - Wire both into orchestrator after system note extraction - 36 tests: regex fix, URL parsing, integration, idempotency	2026-02-13 17:19:36 -05:00
Taylor Eernisse	7e0e6a91f2	refactor: extract unit tests into separate _tests.rs files Move inline #[cfg(test)] mod tests { ... } blocks from 22 source files into dedicated _tests.rs companion files, wired via: #[cfg(test)] #[path = "module_tests.rs"] mod tests; This keeps implementation-focused source files leaner and more scannable while preserving full access to private items through `use super::*;`. Modules extracted: core: db, note_parser, payloads, project, references, sync_run, timeline_collect, timeline_expand, timeline_seed cli: list (55 tests), who (75 tests) documents: extractor (43 tests), regenerator embedding: change_detector, chunking gitlab: graphql (wiremock async tests), transformers/issue ingestion: dirty_tracker, discussions, issues, mr_diffs Also adds conflicts_with("explain_score") to the --detail flag in the who command to prevent mutually exclusive flags from being combined. All 629 unit tests pass. No behavior changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 10:54:02 -05:00
teernisse	83cd16c918	feat: implement per-note search and document pipeline - Add SourceType::Note with extract_note_document() and ParentMetadataCache - Migration 022: composite indexes for notes queries + author_id column - Migration 024: table rebuild adding 'note' to CHECK constraints, defense triggers - Migration 025: backfill existing non-system notes into dirty queue - Add lore notes CLI command with 17 filter options (author, path, resolution, etc.) - Support table/json/jsonl/csv output formats with field selection - Wire note dirty tracking through discussion and MR discussion ingestion - Fix test_migration_024_preserves_existing_data off-by-one (tested wrong migration) - Fix upsert_document_inner returning false for label/path-only changes	2026-02-12 13:31:24 -05:00
Taylor Eernisse	e9af529f6e	feat(ingestion): add progress reporting for status enrichment pipeline Previously the status enrichment phase (GraphQL work item status fetch) ran silently — users saw no feedback between "syncing issues" and the final enrichment summary. For projects with hundreds of issues and adaptive page-size retries, this felt like a hang. Changes across three layers: GraphQL (graphql.rs): - Extract fetch_issue_statuses_with_progress() accepting an optional on_page callback invoked after each paginated fetch with the running count of fetched IIDs - Original fetch_issue_statuses() preserved as a zero-cost delegation wrapper (no callback overhead) Orchestrator (orchestrator.rs): - Three new ProgressEvent variants: StatusEnrichmentStarted, StatusEnrichmentPageFetched, StatusEnrichmentWriting - Wire the page callback through to the new _with_progress fn CLI (ingest.rs): - Handle all three new events in the progress callback, updating both the per-project spinner and the stage bar with live counts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 10:22:20 -05:00
Taylor Eernisse	6b75697638	feat(ingestion): enrich issues with work item status from GraphQL API Add a "Phase 1.5" status enrichment step to the issue ingestion pipeline that fetches work item statuses via the GitLab GraphQL API after the standard REST API ingestion completes. Schema changes (migration 021): - Add status_name, status_category, status_color, status_icon_name, and status_synced_at columns to the issues table (all nullable) Ingestion pipeline changes: - New `enrich_issue_statuses_txn()` function that applies fetched statuses in a single transaction with two phases: clear stale statuses for issues that no longer have a status widget, then apply new/updated statuses from the GraphQL response - ProgressEvent variants for status enrichment (complete/skipped) - IngestProjectResult tracks enrichment metrics (seen, enriched, cleared, without_widget, partial_error_count, enrichment_mode, errors) - Robot mode JSON output includes per-project status enrichment details Configuration: - New `sync.fetchWorkItemStatus` config option (defaults true) to disable GraphQL status enrichment on instances without Premium/Ultimate - `LoreError::GitLabAuthFailed` now treated as permanent API error so status enrichment auth failures don't trigger retries Also removes the unnecessary nested SAVEPOINT in store_closes_issues_refs (already runs within the orchestrator's transaction context). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 08:09:21 -05:00
Taylor Eernisse	7d40a81512	fix(ingestion): remove nested transaction in upsert_mr_file_changes drain_mr_diffs in orchestrator.rs already wraps each MR diff store in an unchecked_transaction (alongside job completion and watermark update). upsert_mr_file_changes was also starting its own inner transaction via conn.unchecked_transaction(), causing every call to fail with "cannot start a transaction within a transaction". Remove the inner transaction management from upsert_mr_file_changes so it operates on whatever Connection (or Transaction deref'd to Connection) the caller provides. The caller in drain_mr_diffs owns the transaction boundary. Standalone callers (tests, future direct use) auto-commit each statement, which is correct for their use case. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 11:56:15 -05:00
Taylor Eernisse	dfa44e5bcd	fix(ingestion): label upsert reliability, init idempotency, and sync health Label upsert (issues + merge_requests): Replace INSERT ... ON CONFLICT DO UPDATE RETURNING with INSERT OR IGNORE + SELECT. The prior RETURNING-based approach relied on last_insert_rowid() matching the returned id, which is not guaranteed when ON CONFLICT triggers an update (SQLite may return 0). The new two-step approach is unambiguous and correctly tracks created_count. Init: Add ON CONFLICT(gitlab_project_id) DO UPDATE to the project insert so re-running `lore init` updates path/branch/url instead of failing with a unique constraint violation. MR discussions sync: Reset discussions_sync_attempts to 0 when clearing a sync health error, so previously-failed MRs get a fresh retry budget after successful sync. Count: format_number now handles negative numbers correctly by extracting the sign before inserting thousand-separators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 10:15:53 -05:00
Taylor Eernisse	53ef21d653	fix: propagate DB errors instead of silently swallowing them Replace .unwrap_or(), .ok(), and .filter_map(|r| r.ok()) patterns with proper error propagation using ? and rusqlite::OptionalExtension where the query may legitimately return no rows. Affected areas: - events_db::count_events: three count queries now propagate errors instead of defaulting to (0, 0) on failure - note_parser::extract_refs_from_system_notes: row iteration errors are now propagated instead of silently dropped via filter_map - note_parser::noteable_type_to_entity_type: unknown types now log a debug warning before defaulting to "issue" - payloads::store_payload/read_payload: use .optional()? instead of .ok() to distinguish "no row" from "query failed" - backoff::compute_next_attempt_at: use .clamp(0, 30) to guard against negative attempt_count, not just .min(30) - search::vector::max_chunks_per_document: returns Result<i64> with proper error propagation through .optional()?.flatten() - embedding::chunk_ids::decode_rowid: promote debug_assert to assert since negative rowids indicate data corruption worth failing fast on - ingestion::dirty_tracker::record_dirty_error: use .optional()? to handle missing dirty_sources row gracefully instead of hard error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 10:15:36 -05:00
Taylor Eernisse	6e82f723c3	fix(ingestion): unify store + watermark + job-complete in single transaction Previously, drain_resource_events, drain_mr_closes_issues, and drain_mr_diffs each opened a transaction only for the job-complete + watermark update, but the store operation ran outside that transaction. If the process crashed between the store and the watermark update, data would be persisted without the watermark advancing, causing silent duplicates on the next sync. Now each drain function opens the transaction before the store call and commits it only after both the store and the watermark update succeed. On error, the transaction is explicitly dropped so the connection is not left in a half-committed state. Also: - store_resource_events no longer manages its own transaction; the caller passes in a connection (which is actually the transaction) - upsert_mr_file_changes wraps DELETE + INSERT in a transaction internally - reset_discussion_watermarks now also clears diffs_synced_for_updated_at - Orchestrator error span now includes closes_issues_failed + mr_diffs_failed Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 14:33:47 -05:00
Taylor Eernisse	95b7183add	feat(who): expand expert + overlap queries with mr_file_changes and mr_reviewers Chain: bd-jec (config flag) -> bd-2yo (fetch MR diffs) -> bd-3qn6 (rewrite who queries) - Add fetch_mr_file_changes config option and --no-file-changes CLI flag - Add GitLab MR diffs API fetch pipeline with watermark-based sync - Create migration 020 for diffs_synced_for_updated_at watermark column - Rewrite query_expert() and query_overlap() to use 4-signal UNION ALL: DiffNote reviewers, DiffNote MR authors, file-change authors, file-change reviewers - Deduplicate across signal types via COUNT(DISTINCT CASE WHEN ... THEN mr_id END) - Add insert_file_change test helper, 8 new who tests, all 397 tests pass - Also includes: list performance migration 019, autocorrect module, README updates Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 13:35:14 -05:00
Taylor Eernisse	d3306114eb	fix(ingestion): pass ShutdownSignal into issue and MR pagination loops The orchestrator already accepted a ShutdownSignal but only checked it between phases (after all issues fetched, before discussions). The inner loops in ingest_issues() and ingest_merge_requests() consumed entire paginated streams without checking for cancellation. On a large initial sync (thousands of issues/MRs), Ctrl+C could be unresponsive for minutes while the current entity type finished draining. Now both functions accept &ShutdownSignal and check is_cancelled() at the top of each iteration, breaking out promptly and committing the cursor for whatever was already processed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 07:55:36 -05:00
Taylor Eernisse	121a634653	fix: critical data integrity — timeline dedup, discussion atomicity, index collision Three correctness bugs found via peer code review: 1. TimelineEvent PartialEq/Ord omitted entity_type — issue #42 and MR #42 with the same timestamp and event_type were treated as equal. In a BTreeSet or dedup, one would silently be dropped. Added entity_type to both PartialEq and Ord comparisons. 2. discussions.rs: store_payload() was called outside the transaction (on bare conn) while upsert_discussion/notes were inside. A crash between them left orphaned payload rows. Moved store_payload inside the unchecked_transaction block, matching mr_discussions.rs pattern. 3. Migration 017 created idx_issue_assignees_username(username, issue_id) but migration 005 already created the same index name with just (username). SQLite's IF NOT EXISTS silently skipped the composite version on every existing database. New migration 018 drops and recreates the index with correct composite columns. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 07:54:59 -05:00
Taylor Eernisse	f3f3560e0d	fix(ingestion): proper error propagation and transaction safety Three hardening improvements to the ingestion orchestrator: - Replace .unwrap_or(0) with ? on COUNT(*) queries for total_issues and total_mrs. These are simple aggregate queries that should never fail, but if they do (e.g. table missing after failed migration), propagating the error gives an actionable message instead of silently reporting 0 items. - Wrap store_closes_issues_refs in a SAVEPOINT with proper ROLLBACK/RELEASE. Previously, a failure mid-loop (e.g. on the 5th of 10 close-issue references) would leave partial refs committed. Now the entire batch is atomic. - Replace silent catch-all (_ => {}) arms in enqueue_resource_events and update_resource_event_watermark with explicit warnings for unknown entity_type values. Makes debugging easier when new entity types are added but the match arms aren't updated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 22:42:40 -05:00
Taylor Eernisse	405e5370dc	feat(sync): concurrent drains, atomic watermarks, graceful Ctrl+C shutdown Three fixes to the sync pipeline: 1. Atomic watermarks: wrap complete_job + update_watermark in a single SQLite transaction so crash between them can't leave partial state. 2. Concurrent drain loops: prefetch HTTP requests via join_all (batch size = dependent_concurrency), then write serially to DB. Reduces ~9K sequential requests from ~19 min to ~2.4 min. 3. Graceful shutdown: install Ctrl+C handler via ShutdownSignal (Arc<AtomicBool>), thread through orchestrator/CLI, release locked jobs on interrupt, record sync_run as "failed". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 11:22:04 -05:00
Taylor Eernisse	233eb546af	feat: Add commit SHAs, closes_issues watermark, and PRD alignment Migration 015 adds merge_commit_sha/squash_commit_sha to merge_requests (Gate 4/5 prerequisites), closes_issues_synced_for_updated_at watermark for incremental sync, and the missing idx_label_events_label index. The MR transformer and ingestion pipeline now populate commit SHAs during sync. The orchestrator uses watermark-based filtering for closes_issues jobs instead of re-enqueuing all MRs every sync. The Phase B PRD is updated to match the actual codebase: corrected migration numbering (011-015), documented nullable label/milestone fields (migration 012), watermark patterns (013), observability infrastructure (014), simplified source_method values, and updated entity_references schema to match implementation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-05 15:29:51 -05:00
Taylor Eernisse	65583ed5d6	refactor: Remove redundant doc comments throughout codebase Removes module-level doc comments (//! lines) and excessive inline doc comments that were duplicating information already evident from: - Function/struct names (self-documenting code) - Type signatures (the what is clear from types) - Implementation context (the how is clear from code) Affected modules: - cli/* - Removed command descriptions duplicating clap help text - core/* - Removed module headers and obvious function docs - documents/* - Removed extractor/regenerator/truncation docs - embedding/* - Removed pipeline and chunking docs - gitlab/* - Removed client and transformer docs (kept type definitions) - ingestion/* - Removed orchestrator and ingestion docs - search/* - Removed FTS and vector search docs Philosophy: Code should be self-documenting. Comments should explain "why" (business decisions, non-obvious constraints) not "what" (which the code itself shows). This change reduces noise and maintenance burden while keeping the codebase just as understandable. Retains comments for: - Non-obvious business logic - Important safety invariants - Complex algorithm explanations - Public API boundaries where generated docs matter Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:04:32 -05:00
Taylor Eernisse	a76dc8089e	feat(orchestrator): Integrate closes_issues fetching and cross-ref extraction Extends the MR ingestion pipeline to populate the entity_references table from multiple sources: 1. Resource state events (extract_refs_from_state_events): Called after draining the resource_events queue for both issues and MRs. Extracts "closes" relationships from the structured API data. 2. System notes (extract_refs_from_system_notes): Called during MR ingestion to parse "mentioned in" and "closed by" patterns from discussion note bodies. 3. MR closes_issues API (new): - enqueue_mr_closes_issues_jobs(): Queues jobs for all MRs - drain_mr_closes_issues(): Fetches closes_issues for each MR - Records cross-references with source_method='closes_issues_api' New progress events: - ClosesIssuesFetchStarted { total } - ClosesIssueFetched { current, total } - ClosesIssuesFetchComplete { fetched, failed } New result fields on IngestMrProjectResult: - closes_issues_fetched: Count of successful fetches - closes_issues_failed: Count of failed fetches The pipeline now comprehensively builds the relationship graph between issues and MRs, enabling queries like "what will close this issue?" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:03:40 -05:00
teernisse	86a51cddef	fix: Project-scoped job claiming, structured rate-limit logging, RRF total_cmp Targeted fixes across multiple subsystems: dependent_queue: - Add project_id parameter to claim_jobs() for project-scoped job claiming, preventing cross-project job theft during concurrent multi-project ingestion - Add project_id parameter to count_pending_jobs() with optional scoping (None returns global counts, Some(pid) returns per-project counts) gitlab/client: - Downgrade rate-limit log from warn to info (429s are expected operational behavior, not warnings) and add structured fields (path, status_code) for better log filtering and aggregation gitlab/transformers/discussion: - Add tracing::warn on invalid timestamp parse instead of silent fallback to epoch 0, making data quality issues visible in logs ingestion/merge_requests: - Remove duplicate doc comment on upsert_label_tx search/rrf: - Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting, eliminating the NaN edge case entirely (total_cmp treats NaN consistently) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 13:39:13 -05:00
teernisse	f6d19a9467	feat(sync): Instrument pipeline with tracing spans, run_id correlation, and metrics Add end-to-end observability to the sync and ingest pipelines: Sync command: - Generate UUID-based run_id for each sync invocation, propagated through all child spans for log correlation across stages - Accept MetricsLayer reference to extract hierarchical StageTiming data after pipeline completion for robot-mode performance output - Record sync runs in DB via SyncRunRecorder (start/succeed/fail lifecycle) - Wrap entire sync execution in a root tracing span with run_id field Ingest command: - Wrap run_ingest in an instrumented root span with run_id and resource_type - Add project path prefix to discussion progress bars for multi-project clarity - Reset resource_events_synced_for_updated_at on --full re-sync Sync status: - Expand from single last_run to configurable recent runs list (default 10) - Parse and expose StageTiming metrics from stored metrics_json - Add run_id, total_items_processed, total_errors to SyncRunInfo - Add mr_count to DataSummary for complete entity coverage Orchestrator: - Add #[instrument] with structured fields to issue and MR ingestion functions - Record items_processed, items_skipped, errors on span close for MetricsLayer - Emit granular progress events (IssuesFetchStarted, IssuesFetchComplete) - Pass project_id through to drain_resource_events for scoped job claiming Document regenerator and embedding pipeline: - Add #[instrument] spans with items_processed, items_skipped, errors fields - Record final counts on span close for metrics extraction Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 13:39:00 -05:00
Taylor Eernisse	ee5c5f9645	perf: Eliminate double serialization, add SQLite tuning, optimize hot paths 11 isomorphic performance fixes from deep audit (no behavior changes): - Eliminate double serialization: store_payload now accepts pre-serialized bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses Cow<[u8]> for zero-copy when compression is disabled. - Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas - Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT RETURNING in both issues.rs and merge_requests.rs - Replace INSERT + SELECT milestone upsert with RETURNING - Use prepare_cached for 5 hot-path queries in extractor.rs - Optimize compute_list_hash: index-sort + incremental SHA-256 instead of clone+sort+join+hash - Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity - Replace RandomState::new() in rand_jitter with atomic counter XOR nanos - Remove redundant per-note payload storage (discussion payload contains all notes already) - Change transform_issue to accept &GitLabIssue (avoids full struct clone) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 08:12:37 -05:00
Taylor Eernisse	4ee99c1677	fix: Propagate queue errors, eliminate format!-based SQL construction Two hardening changes to the dependent queue and orchestrator: - dependent_queue::fail_job now propagates the rusqlite error via ? instead of silently falling back to 0 attempts when the job row is missing. A missing job is a real bug that should surface, not be masked by unwrap_or(0) which would cause infinite retries at the base backoff interval. - orchestrator::enqueue_resource_events_for_entity_type replaces format!-based SQL ("SELECT {id_col} FROM {table}") with separate hardcoded queries per entity type. While the original values were not user-controlled, hardcoded SQL is clearer about intent and eliminates a class of injection risk entirely. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:36:45 -05:00
Taylor Eernisse	880ad1d3fa	refactor(events): Lift transaction control to callers, eliminate duplicated store functions events_db.rs: - Removed internal savepoints from upsert_state_events, upsert_label_events, and upsert_milestone_events. Each function previously created its own savepoint, making it impossible for callers to wrap all three in a single atomic transaction. - Changed signatures from &mut Connection to &Connection, since savepoints are no longer created internally. This makes the functions compatible with rusqlite::Transaction (which derefs to Connection), allowing callers to pass a transaction directly. orchestrator.rs: - Deleted the three store_*_events_tx() functions (store_state_events_tx, store_label_events_tx, store_milestone_events_tx) which were hand-duplicated copies of the events_db upsert functions, created as a workaround for the &mut Connection requirement. Now that events_db accepts &Connection, store_resource_events() calls the canonical upsert functions directly through the unchecked_transaction. - Replaced the max-iterations guard in drain_resource_events() with a HashSet-based deduplication of job IDs. The old guard used an arbitrary 2x multiplier on total_pending which could either terminate too early (if many retries were legitimate) or too late. The new approach precisely prevents reprocessing the same job within a single drain run, which is the actual invariant we need. Net effect: ~133 lines of duplicated SQL removed, single source of truth for event upsert logic, and callers control transaction scope. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 14:09:35 -05:00
Taylor Eernisse	bb75a9d228	fix(events): Resource events now run on incremental syncs, fix output and progress bar Three bugs fixed: 1. Early return in orchestrator when no discussions needed sync also skipped resource event enqueue+drain. On incremental syncs (the most common case), resource events were never fetched. Restructured to use if/else instead of early return so Step 4 always executes. 2. Ingest command JSON and human-readable output silently dropped resource_events_fetched/failed counts. Added to IngestJsonData and print_ingest_summary. 3. Progress bar reuse after finish_and_clear caused indicatif to silently ignore subsequent set_position/set_length calls. Added reset() call before reconfiguring the bar for resource events. Also removed stale comment referencing "unsafe" that didn't reflect the actual unchecked_transaction approach. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 13:06:35 -05:00
Taylor Eernisse	2bcd8db0e9	feat(events): Wire resource event fetching into sync pipeline (bd-1ep) Integrate resource event fetching as Step 4 of both issue and MR ingestion, gated behind the fetch_resource_events config flag. Orchestrator changes: - Add ProgressEvent variants: ResourceEventsFetchStarted, ResourceEventFetched, ResourceEventsFetchComplete - Add resource_events_fetched/failed fields to IngestProjectResult and IngestMrProjectResult - New enqueue_resource_events_for_entity_type() queries all issues/MRs for a project and enqueues resource_events jobs via the dependent queue (INSERT OR IGNORE for idempotency) - New drain_resource_events() claims jobs in batches, fetches state/label/milestone events from GitLab API, stores them atomically via unchecked_transaction, and handles failures with exponential backoff via fail_job() - Max-iterations guard prevents infinite retry loops within a single drain run - New store_resource_events() + per-type _tx helpers write events using prepared statements inside a single transaction - DrainResult struct tracks fetched/failed counts CLI ingest changes: - IngestResult gains resource_events_fetched/failed fields - Progress bar repurposed for resource event fetch phase (reuses discussion bar with updated template) - Accumulates event counts from both issue and MR ingestion CLI sync changes: - SyncResult gains resource_events_fetched/failed fields - Accumulates counts from both ingest stages - print_sync() conditionally displays event counts - Structured logging includes event counts Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 13:02:15 -05:00
Taylor Eernisse	a50fc78823	style: Apply cargo fmt and clippy fixes across codebase Automated formatting and lint corrections from parallel agent work: - cargo fmt: import reordering (alphabetical), line wrapping to respect max width, trailing comma normalization, destructuring alignment, function signature reformatting, match arm formatting - clippy (pedantic): Range::contains() instead of manual comparisons, i64::from() instead of `as i64` casts, .clamp() instead of .max().min() chains, let-chain refactors (if-let with &&), #[allow(clippy::too_many_arguments)] and #[allow(clippy::field_reassign_with_default)] where warranted - Removed trailing blank lines and extra whitespace No behavioral changes. All existing tests pass unmodified. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 13:01:59 -05:00
Taylor Eernisse	559f0702ad	feat(ingestion): Mark entities dirty on ingest for document regeneration Integrates the dirty tracking system into all four ingestion paths (issues, MRs, issue discussions, MR discussions). After each entity is upserted within its transaction, a corresponding dirty_queue entry is inserted so the document regenerator knows which documents need rebuilding. This ensures that document generation stays transactionally consistent with data changes: if the ingest transaction rolls back, the dirty marker rolls back too, preventing stale document regeneration attempts. Also updates GiError references to LoreError in these files as part of the codebase-wide rename, and adjusts issue discussion logging from info to debug level to reduce noise during normal sync runs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 15:46:51 -05:00
Taylor Eernisse	20edff4ab1	feat(documents): Add document generation pipeline with dirty tracking Implements the documents module that transforms raw ingested entities (issues, MRs, discussions) into searchable document blobs stored in the documents table. This is the foundation for both FTS5 lexical search and vector embedding. Key components: - documents::extractor: Renders entities into structured text documents. Issues include title, description, labels, milestone, assignees, and threaded discussion summaries. MRs additionally include source/target branches, reviewers, and approval status. Discussions are rendered with full note threading. - documents::regenerator: Drains the dirty_queue table to regenerate only documents whose source entities changed since last sync. Supports full rebuild mode (seeds all entities into dirty queue first) and project-scoped regeneration. - documents::truncation: Safety cap at 2MB per document to prevent pathological outliers from degrading FTS or embedding performance. - ingestion::dirty_tracker: Marks entities as dirty inside the ingestion transaction so document regeneration stays consistent with data changes. Uses INSERT OR IGNORE to deduplicate. - ingestion::discussion_queue: Queue-based discussion fetching that isolates individual discussion failures from the broader ingestion pipeline, preventing a single corrupt discussion from blocking an entire project sync. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 15:46:18 -05:00
Taylor Eernisse	8fe5feda7e	fix(ingestion): Move counter increments after transaction commit Ingestion counters (discussions_upserted, notes_upserted, discussions_fetched, diffnotes_count) were incremented before tx.commit(), meaning a failed commit would report inflated metrics. Counters now increment only after successful commit so reported numbers accurately reflect persisted state. Also simplifies the stale-removal guard in issue discussions: the received_first_response flag was unnecessary since an empty seen_discussion_ids list is safe to pass to remove_stale -- if there were no discussions, stale removal correctly sweeps all previously-stored discussions. The two separate code paths (empty vs populated) are collapsed into a single branch. Derives Default on IngestResult to eliminate verbose zero-init. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 08:42:11 -05:00
Taylor Eernisse	cd44e516e3	feat(ingestion): Implement MR sync with parallel discussion prefetch Adds complete merge request ingestion pipeline with a novel two-phase discussion sync strategy optimized for throughput. New modules: - merge_requests.rs: MR upsert with labels/assignees/reviewers handling, stale MR cleanup, and watermark-based incremental sync - mr_discussions.rs: Parallel prefetch strategy for MR discussions Two-phase MR discussion sync: 1. PREFETCH PHASE: Spawn concurrent tasks to fetch discussions for multiple MRs simultaneously (configurable concurrency, default 8). Transform and validate in parallel, storing results in memory. 2. WRITE PHASE: Serial database writes to avoid lock contention. Each MR's discussions written in a single transaction, with proper stale discussion cleanup. This approach achieves ~4-8x throughput vs serial fetching while maintaining database consistency. Transform errors are tracked per-MR to prevent partial writes from corrupting watermarks. Orchestrator updates: - ingest_merge_requests(): Coordinates MR fetch -> discussion sync flow - Progress callbacks emit MR-specific events for UI feedback - Respects --full flag to reset discussion watermarks for full resync The prefetch strategy is critical for MRs which typically have more discussions than issues, and where API latency dominates sync time. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 22:45:48 -05:00
Taylor Eernisse	d9d749ac57	fix(discussion): Make NormalizedDiscussion polymorphic for MR support This is a P0 fix from the CP1-CP2 alignment audit. The original NormalizedDiscussion struct had issue_id as a non-optional i64 and hardcoded noteable_type to "Issue", making it incompatible with merge request discussions even though the database schema already supports both via nullable columns and a CHECK constraint. Changes: - Add NoteableRef enum with Issue(i64) and MergeRequest(i64) variants to provide compile-time safety against mixing up issue vs MR IDs - Change NormalizedDiscussion.issue_id from i64 to Option<i64> - Add NormalizedDiscussion.merge_request_id: Option<i64> - Update transform_discussion() signature to take NoteableRef instead of local_issue_id, deriving issue_id/merge_request_id/noteable_type from the enum variant - Update upsert_discussion() SQL to include merge_request_id column (now 12 parameters instead of 11) - Export NoteableRef from transformers module - Add test for MergeRequest discussion transformation - Update all existing tests to use NoteableRef::Issue(id) The database schema (migration 002) was forward-thinking and already supports both issue_id and merge_request_id as nullable columns with a CHECK constraint. This change prepares the application layer for CP2 merge request support without requiring any migrations. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 17:00:49 -05:00
Taylor Eernisse	cd60350c6d	feat(ingestion): Implement cursor-based incremental sync from GitLab Provides efficient data synchronization with minimal API calls. src/ingestion/issues.rs - Issue sync logic: - Cursor-based incremental sync using updated_at timestamp - Fetches only issues modified since last sync - Configurable cursor rewind for overlap safety (default 2s) - Batched database writes with transaction wrapping - Upserts issues, labels, milestones, and assignees - Maintains issue_labels and issue_assignees junction tables - Returns IngestIssuesResult with counts and issues needing discussion sync - Identifies issues where discussion count changed src/ingestion/discussions.rs - Discussion sync logic: - Fetches discussions for issues that need sync - Compares discussion count vs stored to detect changes - Batched note insertion with raw payload preservation - Updates discussion metadata (resolved state, note counts) - Tracks sync state per discussion to enable incremental updates - Returns IngestDiscussionsResult with fetched/skipped counts src/ingestion/orchestrator.rs - Sync coordination: - Two-phase sync: issues first, then discussions - Progress callback support for CLI progress bars - ProgressEvent enum for fine-grained status updates: - IssueFetch, IssueProcess, DiscussionFetch, DiscussionSkip - Acquires sync lock before starting - Updates sync watermark on successful completion - Handles partial failures gracefully (watermark not updated) - Returns IngestProjectResult with detailed statistics The architecture supports future additions: - Merge request ingestion (parallel to issues) - Full-text search indexing hooks - Vector embedding pipeline integration Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-26 11:28:34 -05:00
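
The cursor rewind described in cd60350c6d ("Configurable cursor rewind for overlap safety (default 2s)") can be sketched as follows. This is a minimal illustration, not the code in src/ingestion/issues.rs: the function name `fetch_lower_bound_ms`, the millisecond representation of the cursor, and the constant name are assumptions made for the example. Only the 2-second default comes from the commit message.

```rust
/// Default rewind in seconds, per the commit message. The constant name is hypothetical.
pub const DEFAULT_REWIND_SECS: i64 = 2;

/// Hypothetical helper: given the stored `updated_at` cursor (in milliseconds since
/// the epoch) and a rewind window, return the lower bound to request from the API.
///
/// `None` means no cursor is stored yet, so the caller fetches everything.
/// The rewind deliberately re-fetches a small overlap so that rows updated within
/// the same timestamp window as the cursor are not missed; because ingestion
/// upserts, seeing a row twice is harmless.
pub fn fetch_lower_bound_ms(cursor_ms: Option<i64>, rewind_secs: i64) -> Option<i64> {
    cursor_ms.map(|ms| {
        // Saturating arithmetic avoids overflow on extreme inputs;
        // clamping at zero avoids a negative timestamp.
        ms.saturating_sub(rewind_secs.saturating_mul(1000)).max(0)
    })
}
```

The overlap trades a few redundant fetches for safety against clock granularity and same-instant updates. It only works because the write path is an upsert.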

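The `NoteableRef` change described in d9d749ac57 can be illustrated with a short sketch. The enum and its two variants are named in the commit message. The accessor methods below are hypothetical, and the `"MergeRequest"` type string is an assumption (the message only confirms that `"Issue"` was previously hardcoded). The point is that the nullable `issue_id` / `merge_request_id` pair and the `noteable_type` string are all derived from one enum value, so exactly one ID is ever set.

```rust
/// Which parent entity a discussion belongs to. Variants follow the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NoteableRef {
    Issue(i64),
    MergeRequest(i64),
}

impl NoteableRef {
    /// Hypothetical accessor: local issue ID, set only for issue discussions.
    pub fn issue_id(self) -> Option<i64> {
        match self {
            NoteableRef::Issue(id) => Some(id),
            NoteableRef::MergeRequest(_) => None,
        }
    }

    /// Hypothetical accessor: local MR ID, set only for merge request discussions.
    pub fn merge_request_id(self) -> Option<i64> {
        match self {
            NoteableRef::Issue(_) => None,
            NoteableRef::MergeRequest(id) => Some(id),
        }
    }

    /// Hypothetical accessor: the type string stored alongside the IDs.
    /// "MergeRequest" is assumed here; only "Issue" appears in the commit message.
    pub fn noteable_type(self) -> &'static str {
        match self {
            NoteableRef::Issue(_) => "Issue",
            NoteableRef::MergeRequest(_) => "MergeRequest",
        }
    }
}
```

Deriving all three columns from a single value means the database CHECK constraint mentioned in the commit can never be violated by a caller passing mismatched IDs.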
36 Commits