The seed, expand, and collect stages each had their own near-identical
resolve_entity_ref helper that converted internal DB IDs to full EntityRef
structs. This duplication made it easy for bug fixes to land in one copy
but not the others.
Extract a single public resolve_entity_ref into timeline.rs with an
optional project_id parameter:
- Some(project_id): scopes the lookup (used by seed, which knows the
project from the FTS result)
- None: unscoped lookup (used by expand, which traverses cross-project
references)
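A minimal sketch of the scoped vs. unscoped lookup, using an in-memory
slice in place of the database. EntityRow and this exact signature are
illustrative assumptions, not the actual timeline.rs code:

```rust
/// Hypothetical in-memory row standing in for a database record.
#[derive(Debug, PartialEq)]
pub struct EntityRow {
    pub id: i64,
    pub project_id: i64,
    pub iid: i64,
}

/// Finds a row by internal ID, restricted to one project when
/// `project_id` is `Some`, and across all projects when it is `None`.
pub fn resolve_entity_ref(
    rows: &[EntityRow],
    id: i64,
    project_id: Option<i64>,
) -> Option<&EntityRow> {
    rows.iter().find(|row| {
        // `map_or(true, ..)` turns the project filter into a no-op for `None`.
        row.id == id && project_id.map_or(true, |pid| row.project_id == pid)
    })
}
```

A scoped call with the wrong project returns None even when the ID
exists, while the unscoped call finds the row in any project.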
Also changes UnresolvedRef.target_iid from i64 to Option<i64>. Cross-
project references parsed from descriptions may not always carry an IID
(e.g. when the reference is malformed or the target was deleted). The
previous sentinel value of 0 was semantically incorrect since GitLab IIDs
start at 1.
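To illustrate why Option<i64> reads better than a 0 sentinel, here is a
small sketch. The struct is a simplified stand-in: only target_iid comes
from this change, target_project_path and describe() are hypothetical:

```rust
/// Simplified stand-in for the real UnresolvedRef struct.
#[derive(Debug, Clone, PartialEq)]
pub struct UnresolvedRef {
    /// Hypothetical field, present only for illustration.
    pub target_project_path: String,
    /// `None` when the parsed reference carried no usable IID.
    pub target_iid: Option<i64>,
}

/// Formats a reference for display without inventing an IID.
pub fn describe(r: &UnresolvedRef) -> String {
    match r.target_iid {
        Some(iid) => format!("{}#{}", r.target_project_path, iid),
        None => format!("{} (IID unknown)", r.target_project_path),
    }
}
```

The compiler forces every caller to handle the missing case, which a
sentinel of 0 could not do.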
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change detection queries (embedding/change_detector.rs):
- Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check
- SQLite now scans embedding_metadata once instead of three times
- Semantically identical: returns docs that need embedding when no
embedding exists, the hash changed, or the config does not match
Count queries (cli/commands/count.rs):
- Consolidate 3 separate COUNT queries for issues into a single query
using conditional aggregation (CASE WHEN state = 'x' THEN 1)
- Same optimization for MRs: 5 queries reduced to 1
Search filter queries (search/filters.rs):
- Replace N separate EXISTS clauses for label filtering with a single
IN() clause combined with a GROUP BY / HAVING COUNT pattern
- For multi-label AND queries, this reduces N subqueries to 1
FTS tokenization (search/fts.rs):
- Replace collect-into-Vec-then-join pattern with direct String building
- Pre-allocate capacity hint for result string
Discussion truncation (documents/truncation.rs):
- Calculate total length without allocating concatenated string first
- Only allocate full string when we know it fits within limit
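The length-first approach used for truncation (and the capacity hint
used for FTS string building) can be sketched as follows. join_if_fits
is a hypothetical helper, not the function in documents/truncation.rs:

```rust
/// Joins `parts` with `sep` only if the result fits within `limit` bytes.
/// The total length is computed arithmetically, so nothing is allocated
/// when the content turns out to be too long.
pub fn join_if_fits(parts: &[&str], sep: &str, limit: usize) -> Option<String> {
    if parts.is_empty() {
        return Some(String::new());
    }
    let total = parts.iter().map(|p| p.len()).sum::<usize>() + sep.len() * (parts.len() - 1);
    if total > limit {
        return None;
    }
    // Exact capacity hint: the loop below never reallocates.
    let mut out = String::with_capacity(total);
    for (i, part) in parts.iter().enumerate() {
        if i > 0 {
            out.push_str(sep);
        }
        out.push_str(part);
    }
    Some(out)
}
```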
Embedding pipeline (embedding/pipeline.rs):
- Add Vec::with_capacity hints for chunk work and cleared_docs hashset
- Reduces reallocations during embedding batch processing
Backoff calculation (core/backoff.rs):
- Replace unchecked addition with saturating_add to prevent overflow
- Add test case verifying overflow protection
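A sketch of overflow-safe backoff arithmetic, assuming a millisecond
delay made of a doubled base plus jitter. backoff_ms and its parameters
are hypothetical and do not mirror the real core/backoff.rs signature:

```rust
/// Base delay doubled once per attempt, plus jitter, in milliseconds.
/// Every step saturates at `u64::MAX` instead of wrapping or panicking.
pub fn backoff_ms(base_ms: u64, attempt: u32, jitter_ms: u64) -> u64 {
    let factor = 2u64.saturating_pow(attempt);
    base_ms.saturating_mul(factor).saturating_add(jitter_ms)
}
```

With unchecked arithmetic, a large attempt count would panic in debug
builds and wrap in release builds; here it clamps at the maximum.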
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces two new modules for extracting and storing entity cross-references
from GitLab data:
note_parser.rs:
- Parses system notes for "mentioned in" and "closed by" patterns
- Extracts cross-project references (group/project#42, group/project!123)
- Uses lazy-compiled regexes for performance
- Handles both issue (#) and MR (!) sigils
- Provides extract_refs_from_system_notes() for batch processing
references.rs:
- Extracts refs from resource_state_events table (API-sourced closes links)
- Provides insert_entity_reference() for storing discovered references
- Includes resolution helpers: resolve_issue_local_id, resolve_mr_local_id,
resolve_project_path for converting iids to internal IDs
- Enables cross-project reference resolution
These modules power the entity_references table, enabling features like
"find all MRs that close this issue" and "find all issues mentioned in this MR".
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three correctness fixes found during peer code review:
Embedding pipeline savepoint leak (HIGH severity):
The SAVEPOINT embed_page / RELEASE embed_page pattern had ~10 `?`
propagation points between them. Any error from record_embedding_error,
clear_document_embeddings, or store_embedding would exit the function
without rolling back, leaving the SQLite connection in a broken
transactional state and causing cascading failures for the rest of the
session. Fixed by extracting page processing into `embed_page()` and
wrapping with explicit rollback-on-error handling.
Dependent queue fail_job race (MEDIUM severity):
fail_job performed a SELECT followed by a separate UPDATE on the
attempts counter without a transaction. Under concurrent lock
reclamation, the attempts value could be read stale. Replaced with a
single atomic UPDATE that increments attempts and computes exponential
backoff entirely in SQL, also halving DB round-trips. Added explicit
error when the job no longer exists.
RRF duplicate document score inflation (MEDIUM severity):
If a retriever returned the same document_id multiple times, the RRF
score accumulated multiple rank contributions while the rank only
recorded the first occurrence. Moved the score accumulation inside the
`if is_none` guard so only the first occurrence per list contributes.
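The intended RRF behavior can be sketched as below. This version uses a
per-list HashSet rather than the `if is_none` guard from the actual fix,
and the function name and signature are assumptions:

```rust
use std::collections::{HashMap, HashSet};

/// Reciprocal rank fusion over ranked lists of document IDs.
/// Each document contributes at most once per list: 1 / (k + rank),
/// where rank is the 1-based position of its first occurrence.
pub fn rrf_scores(lists: &[Vec<i64>], k: f64) -> HashMap<i64, f64> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in lists {
        let mut seen: HashSet<i64> = HashSet::new();
        for (idx, &doc_id) in list.iter().enumerate() {
            // `insert` returns false for a repeat within this list.
            if !seen.insert(doc_id) {
                continue;
            }
            let rank = (idx + 1) as f64;
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    scores
}
```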
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Targeted fixes across multiple subsystems:
dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
(None returns global counts, Some(pid) returns per-project counts)
gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
behavior, not warnings) and add structured fields (path, status_code)
for better log filtering and aggregation
gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
to epoch 0, making data quality issues visible in logs
ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx
search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
eliminating the NaN edge case entirely (total_cmp treats NaN consistently)
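For reference, a minimal sketch of descending score sorting with
total_cmp. The (document_id, score) tuple shape is an assumption made
for illustration:

```rust
/// Sorts (document_id, score) pairs by score, highest first.
/// `f64::total_cmp` defines a total order, so a NaN score cannot make
/// the comparator inconsistent and no fallback ordering is needed.
pub fn sort_by_score_desc(results: &mut [(i64, f64)]) {
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
}
```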
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduce the foundational observability layer for the sync pipeline:
- MetricsLayer: Custom tracing subscriber layer that captures span timing
and structured fields, materializing them into a hierarchical
Vec<StageTiming> tree for robot-mode performance data output
- logging: Dual-layer subscriber infrastructure with configurable stderr
verbosity (-v/-vv/-vvv) and always-on JSON file logging with daily
rotation and configurable retention (default 30 days)
- SyncRunRecorder: Compile-time enforced lifecycle recorder for sync_runs
table (start -> succeed|fail), with correlation IDs and aggregate counts
- LoggingConfig: New config section for log_dir, retention_days, and
file_logging toggle
- get_log_dir(): Path helper for log directory resolution
- is_permanent_api_error(): Distinguish retryable vs permanent API failures
(only 404 is truly permanent; 403/auth errors may be environmental)
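One way a lifecycle like start -> succeed|fail can be enforced at
compile time is by consuming self in the terminal methods. The sketch
below shows that idea only; RunOutcome and the method signatures are
assumptions, and the real recorder writes to the sync_runs table:

```rust
/// Hypothetical final outcome of a run.
#[derive(Debug, PartialEq)]
pub enum RunOutcome {
    Succeeded { run_id: String, total_items_processed: u64 },
    Failed { run_id: String, error: String },
}

pub struct SyncRunRecorder {
    run_id: String,
}

impl SyncRunRecorder {
    pub fn start(run_id: &str) -> Self {
        Self { run_id: run_id.to_string() }
    }

    /// Takes `self` by value, so the recorder cannot be finished twice.
    pub fn succeed(self, total_items_processed: u64) -> RunOutcome {
        RunOutcome::Succeeded { run_id: self.run_id, total_items_processed }
    }

    /// Also consumes the recorder; calling `succeed` afterwards fails to compile.
    pub fn fail(self, error: &str) -> RunOutcome {
        RunOutcome::Failed { run_id: self.run_id, error: error.to_string() }
    }
}
```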
Database changes:
- Migration 013: Add resource_events_synced_for_updated_at watermark columns
to issues and merge_requests tables for incremental resource event sync
- Migration 014: Enrich sync_runs with run_id correlation ID, aggregate
counts (total_items_processed, total_errors), and run_id index
- Wrap file-based migrations in savepoints for rollback safety
Dependencies: Add uuid (run_id generation), tracing-appender (file logging)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
11 isomorphic performance fixes from deep audit (no behavior changes):
- Eliminate double serialization: store_payload now accepts pre-serialized
bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)
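As an illustration of the rand_jitter item, a cheap non-cryptographic
jitter source can be built from an atomic counter XORed with the clock's
sub-second nanoseconds. The signature and the modulo step are
assumptions, not the actual implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

static JITTER_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Returns a value in `0..max_ms` (or 0 when `max_ms` is 0).
/// Not uniformly distributed and not suitable for security purposes;
/// it only needs to spread retries apart.
pub fn rand_jitter(max_ms: u64) -> u64 {
    if max_ms == 0 {
        return 0;
    }
    let count = JITTER_COUNTER.fetch_add(1, Ordering::Relaxed);
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| u64::from(d.subsec_nanos()))
        .unwrap_or(0);
    (count ^ nanos) % max_ms
}
```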
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The sync pipeline was bottlenecked at 10 req/s (hardcoded) with
sequential project processing and no retry on rate limiting. These
changes target a 3-5x throughput improvement.
Rate limit configuration:
- Add requestsPerSecond to SyncConfig (default 30.0, was hardcoded 10)
- Pass configured rate through to GitLabClient::new from ingest
- Floor rate at 0.1 rps in RateLimiter::new to prevent panic on
Duration::from_secs_f64(1.0 / 0.0) — now reachable via user config
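The rate floor can be sketched like this. min_interval and the constant
name are hypothetical; the point is that the divisor can no longer be
zero, negative, or NaN by the time it reaches Duration::from_secs_f64:

```rust
use std::time::Duration;

/// Lowest accepted rate, per the 0.1 rps floor described above.
pub const MIN_REQUESTS_PER_SECOND: f64 = 0.1;

/// Converts a configured rate into the minimum gap between requests.
pub fn min_interval(requests_per_second: f64) -> Duration {
    // `f64::max` returns the non-NaN operand, so NaN also falls back
    // to the floor, as do zero and negative values.
    let rate = requests_per_second.max(MIN_REQUESTS_PER_SECOND);
    Duration::from_secs_f64(1.0 / rate)
}
```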
429 auto-retry:
- Both request() and request_with_headers() retry up to 3 times on
HTTP 429, respecting the retry-after header (default 60s)
- Extract parse_retry_after helper, reused by handle_response fallback
- After exhausting retries, the 429 error propagates as before
- Improved JSON decode errors now include a response body preview
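A possible shape for the retry-after helper, assuming it receives the
raw header value as an optional string. This sketch handles only the
delta-seconds form of the header and is not the actual client code:

```rust
use std::time::Duration;

/// Fallback wait when the header is missing or unparseable.
pub const DEFAULT_RETRY_AFTER_SECS: u64 = 60;

/// Parses a retry-after value given in whole seconds.
pub fn parse_retry_after(header: Option<&str>) -> Duration {
    let secs = header
        .and_then(|value| value.trim().parse::<u64>().ok())
        .unwrap_or(DEFAULT_RETRY_AFTER_SECS);
    Duration::from_secs(secs)
}
```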
Concurrent project ingestion:
- Derive Clone on GitLabClient (cheap: shares Arc<Mutex<RateLimiter>>
and reqwest::Client which is already Arc-backed)
- Restructure project loop to use buffer_unordered (from
futures::stream::StreamExt) with primary_concurrency (default 4) as the
parallelism bound
- Each project gets its own SQLite connection (WAL mode + busy_timeout
handles concurrent writes)
- Add show_spinner field to IngestDisplay to separate the per-project
spinner from the sync-level stage spinner
- Error aggregation defers failures: all successful projects get their
summaries printed and results counted before returning the first error
- Bump dependentConcurrency default from 2 to 8 for discussion prefetch
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two hardening changes to the dependent queue and orchestrator:
- dependent_queue::fail_job now propagates the rusqlite error via ?
instead of silently falling back to 0 attempts when the job row is
missing. A missing job is a real bug that should surface, not be
masked by unwrap_or(0) which would cause infinite retries at the
base backoff interval.
- orchestrator::enqueue_resource_events_for_entity_type replaces
format!-based SQL ("SELECT {id_col} FROM {table}") with separate
hardcoded queries per entity type. While the original values were
not user-controlled, hardcoded SQL is clearer about intent and
eliminates a class of injection risk entirely.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
GitLab returns null for the label/milestone fields on resource_label_events
and resource_milestone_events when the referenced label or milestone has
been deleted. This caused deserialization failures during sync.
- Add migration 012 to recreate both event tables with nullable
label_name, milestone_title, and milestone_id columns (SQLite
requires table recreation to alter NOT NULL constraints)
- Change GitLabLabelEvent.label and GitLabMilestoneEvent.milestone
to Option<> in the Rust types
- Update upsert functions to pass through None values correctly
- Add tests for null label and null milestone deserialization
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
events_db.rs:
- Removed internal savepoints from upsert_state_events,
upsert_label_events, and upsert_milestone_events. Each function
previously created its own savepoint, making it impossible for
callers to wrap all three in a single atomic transaction.
- Changed signatures from &mut Connection to &Connection, since
savepoints are no longer created internally. This makes the
functions compatible with rusqlite::Transaction (which derefs to
Connection), allowing callers to pass a transaction directly.
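The signature change relies on deref coercion. The sketch below mimics
it with stand-in types so it runs without rusqlite; Connection and
Transaction here are hypothetical and only model the Deref relationship:

```rust
use std::cell::Cell;
use std::ops::Deref;

/// Stand-in for a database connection; counts written rows.
#[derive(Default)]
pub struct Connection {
    pub rows_written: Cell<usize>,
}

/// Stand-in for a transaction that derefs to its connection.
pub struct Transaction<'conn> {
    pub conn: &'conn Connection,
}

impl Deref for Transaction<'_> {
    type Target = Connection;
    fn deref(&self) -> &Connection {
        self.conn
    }
}

/// Takes `&Connection`, so callers may pass a connection or `&tx`.
pub fn upsert_state_events(conn: &Connection, events: &[&str]) -> usize {
    conn.rows_written.set(conn.rows_written.get() + events.len());
    events.len()
}
```

Because the function no longer needs `&mut`, the caller decides whether
the write happens inside a transaction or directly on the connection.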
orchestrator.rs:
- Deleted the three store_*_events_tx() functions (store_state_events_tx,
store_label_events_tx, store_milestone_events_tx) which were
hand-duplicated copies of the events_db upsert functions, created as
a workaround for the &mut Connection requirement. Now that events_db
accepts &Connection, store_resource_events() calls the canonical
upsert functions directly through the unchecked_transaction.
- Replaced the max-iterations guard in drain_resource_events() with a
HashSet-based deduplication of job IDs. The old guard used an
arbitrary 2x multiplier on total_pending which could either terminate
too early (if many retries were legitimate) or too late. The new
approach precisely prevents reprocessing the same job within a single
drain run, which is the actual invariant we need.
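The deduplication invariant can be sketched as a loop that stops once a
claimed batch contains no unseen job IDs. The closure-based drain()
below is a simplified assumption, not the orchestrator's real code:

```rust
use std::collections::HashSet;

/// Repeatedly claims batches of job IDs and processes each ID at most
/// once. Stops when a batch is empty or holds only already-seen IDs.
pub fn drain<F>(mut claim: F) -> Vec<i64>
where
    F: FnMut() -> Vec<i64>,
{
    let mut seen = HashSet::new();
    let mut processed = Vec::new();
    loop {
        let mut any_new = false;
        for job_id in claim() {
            // `insert` returns true only the first time an ID is seen.
            if seen.insert(job_id) {
                any_new = true;
                processed.push(job_id);
            }
        }
        if !any_new {
            break;
        }
    }
    processed
}
```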
Net effect: ~133 lines of duplicated SQL removed, single source of
truth for event upsert logic, and callers control transaction scope.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two bug fixes:
1. extractor.rs: The content hash was computed on the pre-truncation
content, meaning the hash stored in the document didn't correspond
to the actual stored (truncated) content. This would cause change
detection to miss updates when content changed only within the
truncated portion. Hash is now computed after truncate_hard_cap()
so it always matches the persisted content.
2. dependent_queue.rs: claim_jobs() had a TOCTOU race between the
SELECT that found available jobs and the UPDATE that locked them.
Under concurrent callers, two drain runs could claim the same job.
Replaced with a single UPDATE ... RETURNING statement that
atomically selects and locks jobs in one operation.
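The ordering in the first fix (truncate, then hash) can be sketched as
follows. DefaultHasher stands in for the real content hash, and this
truncate_hard_cap body and build_document are illustrative assumptions:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Cuts `content` to at most `max_bytes`, backing up to a char boundary.
pub fn truncate_hard_cap(content: &str, max_bytes: usize) -> &str {
    if content.len() <= max_bytes {
        return content;
    }
    let mut end = max_bytes;
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    &content[..end]
}

/// Placeholder hash; the real pipeline uses its own hash function.
pub fn content_hash(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Hashes the truncated text, so the hash always describes what is stored.
pub fn build_document(raw: &str, max_bytes: usize) -> (String, u64) {
    let stored = truncate_hard_cap(raw, max_bytes).to_string();
    let hash = content_hash(&stored);
    (stored, hash)
}
```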
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New module src/core/dependent_queue.rs provides job queue operations
against the pending_dependent_fetches table. Designed for second-pass
fetches that depend on primary entity ingestion (resource events,
MR close references, MR file diffs).
Queue operations:
- enqueue_job: Idempotent INSERT OR IGNORE keyed on the UNIQUE
(project_id, entity_type, entity_iid, job_type) constraint.
Returns bool indicating whether the row was actually inserted.
- claim_jobs: Two-phase claim that first SELECTs available jobs
(unlocked, past retry window), then UPDATEs locked_at in batch. Orders by
enqueued_at ASC for FIFO processing within a job type.
- complete_job: DELETE the row on successful processing.
- fail_job: Increments attempts, calculates exponential backoff
(30s * 2^(attempts-1), capped at 480s), sets next_retry_at,
clears locked_at, and records the error message. Reads current
attempts via query with unwrap_or(0) fallback for robustness.
- reclaim_stale_locks: Clears locked_at on jobs locked longer than
a configurable threshold, recovering from worker crashes.
- count_pending_jobs: GROUP BY job_type aggregation for progress
reporting and stats display.
Registers both events_db and dependent_queue in src/core/mod.rs.
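For reference, the fail_job backoff schedule (30s * 2^(attempts-1),
capped at 480s) works out as in this sketch. retry_delay_secs and the
constant names are hypothetical:

```rust
pub const BASE_BACKOFF_SECS: u64 = 30;
pub const MAX_BACKOFF_SECS: u64 = 480;

/// Delay before the next retry, given the attempt count after increment.
pub fn retry_delay_secs(attempts: u32) -> u64 {
    // attempts = 1 gives exponent 0, i.e. the base delay.
    let exponent = attempts.saturating_sub(1);
    let factor = 2u64.saturating_pow(exponent);
    BASE_BACKOFF_SECS.saturating_mul(factor).min(MAX_BACKOFF_SECS)
}
```

The cap is reached on the fifth attempt (30, 60, 120, 240, 480) and the
delay stays at 480s from then on.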
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
New module src/core/events_db.rs provides database operations for
resource events:
- upsert_state_events: Batch INSERT OR REPLACE for state change events,
keyed on UNIQUE(gitlab_id, project_id). Wraps in a savepoint for
atomicity per entity batch. Maps GitLabStateEvent fields including
optional user, source_commit, and source_merge_request_iid.
- upsert_label_events: Same pattern for label add/remove events,
extracting label.name for denormalized storage.
- upsert_milestone_events: Same pattern for milestone assignment events,
storing both milestone.title and milestone.id.
All three upsert functions:
- Take &mut Connection (required for savepoint creation)
- Use prepare_cached for statement reuse across batch iterations
- Convert ISO timestamps via iso_to_ms_strict for ms-epoch storage
- Propagate rusqlite errors via the #[from] LoreError::Database path
- Return the count of events processed
Supporting functions:
- resolve_entity_ids: Maps entity_type string to (issue_id, MR_id) pair
with exactly-one-non-NULL invariant matching the CHECK constraints
- count_events: Queries all three event tables with conditional COUNT
aggregations, returning EventCounts struct. Uses unwrap_or((0, 0))
for graceful degradation when tables don't exist (pre-migration 011).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds a new boolean field to SyncConfig that controls whether resource
event fetching is performed during sync:
- SyncConfig.fetch_resource_events: defaults to true via serde
default_true helper, serialized as "fetchResourceEvents" in JSON
- SyncArgs.no_events: --no-events CLI flag that overrides the config
value to false when present
- SyncOptions.no_events: propagates the flag through the sync pipeline
- handle_sync_cmd: mutates loaded config when --no-events is set,
ensuring the flag takes effect regardless of config file contents
This follows the existing pattern established by --no-embed and
--no-docs flags, where CLI flags override config file defaults.
The config is loaded as mutable specifically to support this override.
Also adds "events" to the count command's entity type value_parser,
enabling `lore count events` (implementation in a separate commit).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Introduces five new tables that power temporal queries (timeline,
file-history, trace) via GitLab Resource Events APIs:
- resource_state_events: State transitions (opened/closed/reopened/merged/locked)
with actor tracking, source commit, and source MR references
- resource_label_events: Label add/remove history per entity
- resource_milestone_events: Milestone assignment changes per entity
- entity_references: Cross-reference table (Gate 2 prep) linking
source/target entity pairs with reference type and discovery method
- pending_dependent_fetches: Generic job queue for resource_events,
mr_closes_issues, and mr_diffs with exponential backoff retry
All event tables enforce entity exclusivity via CHECK constraints
(exactly one of issue_id or merge_request_id must be non-NULL).
Deduplication handled via UNIQUE indexes on (gitlab_id, project_id).
FK cascades ensure cleanup when parent entities are removed.
The dependent fetch queue uses a UNIQUE constraint on
(project_id, entity_type, entity_iid, job_type) for idempotent
enqueue, with partial indexes optimizing claim and retry queries.
Registered as migration 011 in the embedded MIGRATIONS array in db.rs.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add chunk_max_bytes and chunk_count columns to embedding_metadata to
support config drift detection and adaptive dedup sizing. Includes a
partial index on sentinel rows (chunk_index=0) to accelerate the drift
detection and max-chunk queries.
Also exports LATEST_SCHEMA_VERSION as a public constant derived from
the MIGRATIONS array length, replacing the previously hardcoded magic
number in the health check.
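Deriving the version from the array length might look like the sketch
below. The tuple layout and the placeholder entries are assumptions; the
real MIGRATIONS array embeds the project's migration files:

```rust
/// Placeholder (name, sql) pairs standing in for embedded migrations.
pub const MIGRATIONS: &[(&str, &str)] = &[
    ("001", "CREATE TABLE a (id INTEGER);"),
    ("002", "CREATE TABLE b (id INTEGER);"),
    ("003", "ALTER TABLE a ADD COLUMN note TEXT;"),
];

/// Always equals the number of registered migrations, so adding a
/// migration cannot leave a stale hardcoded version behind.
pub const LATEST_SCHEMA_VERSION: i32 = MIGRATIONS.len() as i32;
```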
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extend resolve_project() with a 4th cascade step: case-insensitive
substring match when exact, case-insensitive, and suffix matches all
fail. This allows shorthand like "typescript" to match
"vs/typescript-code" when unambiguous. Multi-match still returns an
error with all candidates listed.
Also change ambiguity errors from LoreError::Other to LoreError::Ambiguous
so they get the proper AMBIGUOUS error code (exit 18) instead of
INTERNAL_ERROR.
Includes tests for unambiguous substring, case-insensitive substring,
ambiguous substring, and suffix-preferred-over-substring ordering.
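The four-step cascade can be sketched over an in-memory list of project
paths. ResolveError, pick(), and the rule that any step with multiple
matches reports ambiguity are assumptions; the real resolve_project()
queries the database:

```rust
#[derive(Debug, PartialEq)]
pub enum ResolveError {
    NotFound,
    Ambiguous(Vec<String>),
}

/// None = no match at this step, so the cascade continues.
fn pick(matches: Vec<&str>) -> Option<Result<String, ResolveError>> {
    match matches.len() {
        0 => None,
        1 => Some(Ok(matches[0].to_string())),
        _ => Some(Err(ResolveError::Ambiguous(
            matches.iter().map(|m| m.to_string()).collect(),
        ))),
    }
}

pub fn resolve_project(paths: &[&str], query: &str) -> Result<String, ResolveError> {
    let q = query.to_lowercase();
    let suffix = format!("/{q}");

    // Steps in priority order: exact, case-insensitive, suffix, substring.
    let exact: Vec<&str> = paths.iter().copied().filter(|p| *p == query).collect();
    let case_insensitive: Vec<&str> =
        paths.iter().copied().filter(|p| p.to_lowercase() == q).collect();
    let by_suffix: Vec<&str> =
        paths.iter().copied().filter(|p| p.to_lowercase().ends_with(&suffix)).collect();
    let by_substring: Vec<&str> =
        paths.iter().copied().filter(|p| p.to_lowercase().contains(&q)).collect();

    for matches in [exact, case_insensitive, by_suffix, by_substring] {
        if let Some(result) = pick(matches) {
            return result;
        }
    }
    Err(ResolveError::NotFound)
}
```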
Co-Authored-By: Claude (us.anthropic.claude-opus-4-5-20251101-v1:0) <noreply@anthropic.com>
ConfigNotFound previously used exit code 2 which collides with clap's
usage error code. Remap it to exit 20 to avoid ambiguity. Also add
dedicated NotFound (exit 17) and Ambiguous (exit 18) error codes with
proper ErrorCode variants and Display implementations, replacing the
previous incorrect mapping of these errors to GitLabNotFound.
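The remapped codes can be summarized in a small sketch. Only the three
variants touched by this change are shown, and the enum is a reduced
stand-in for the real ErrorCode:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCode {
    ConfigNotFound,
    NotFound,
    Ambiguous,
}

impl ErrorCode {
    /// Process exit code for each variant, per this change.
    pub fn exit_code(self) -> i32 {
        match self {
            // Moved off 2, which clap uses for usage errors.
            ErrorCode::ConfigNotFound => 20,
            ErrorCode::NotFound => 17,
            ErrorCode::Ambiguous => 18,
        }
    }
}
```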
Co-Authored-By: Claude (us.anthropic.claude-opus-4-5-20251101-v1:0) <noreply@anthropic.com>
Mechanical rename of GiError -> LoreError across the core module to
match the project's rebranding from gitlab-inbox to gitlore/lore.
Updates the error enum name, all From impls, and the Result type alias.
Additionally introduces:
- New error variants for embedding pipeline: OllamaUnavailable,
OllamaModelNotFound, EmbeddingFailed, EmbeddingsNotBuilt. Each
includes actionable suggestions (e.g., "ollama serve", "ollama pull
nomic-embed-text") to guide users through recovery.
- New error codes 14-16 for programmatic handling of Ollama failures.
- Savepoint-based migration execution in db.rs: each migration now
runs inside a SQLite SAVEPOINT so a failed migration rolls back
cleanly without corrupting the schema_version tracking. Previously
a partial migration could leave the database in an inconsistent
state.
- core::backoff module: exponential backoff with jitter utility for
retry loops in the embedding pipeline and discussion queues.
- core::project module: helper for resolving project IDs and paths
from the local database, used by the document regenerator and
search filters.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Error suggestions now include concrete CLI examples so users
(and robot-mode consumers) can act immediately without consulting
docs. For instance, ConfigNotFound now shows the expected path
and the exact command to run, TokenNotSet shows the export syntax,
and Ambiguous shows the -p flag with example project paths.
Also fixes the error code for Ambiguous errors: it now maps to
GitLabNotFound instead of InternalError, since the entity exists
but the user needs to disambiguate -- not an internal failure.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Duplicate ISO 8601 timestamp parsing functions existed in both
discussion.rs and merge_request.rs transformers. This extracts
iso_to_ms_strict() and iso_to_ms_opt_strict() into core::time
as the single source of truth, and updates both transformer
modules to use the shared implementations.
Also removes the private now_ms() from merge_request.rs in
favor of the existing core::time::now_ms(), and replaces the
local parse_timestamp_opt() in discussion.rs with the public
iso_to_ms() from core::time.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Improves core infrastructure with robot-friendly error output and
faster lock release for better sync behavior.
Error handling improvements (error.rs):
- ErrorCode::exit_code(): Unique exit codes per error type (1-13)
for programmatic error handling in scripts/agents
- GiError::suggestion(): Helpful hints for common error recovery
- GiError::to_robot_error(): Structured JSON error conversion
- RobotError/RobotErrorOutput: Serializable error types with code,
message, and optional suggestion fields
Lock improvements (lock.rs):
- Heartbeat thread now polls every 100ms for release flag, only
updating database heartbeat at full interval (5s default)
- Eliminates 5-10s delay after sync completion when waiting for
heartbeat thread to notice release
- Reduces lock hold time after operation completes
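The polling change can be sketched as a thread that checks a release
flag at a short interval and only "beats" at the long one. The counter
stands in for the database heartbeat write, and spawn_heartbeat with its
parameters is a hypothetical shape, not the lock.rs API:

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Polls `released` every `poll`; increments `beats` every `interval`.
pub fn spawn_heartbeat(
    released: Arc<AtomicBool>,
    beats: Arc<AtomicU32>,
    poll: Duration,
    interval: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let mut last_beat = Instant::now();
        while !released.load(Ordering::SeqCst) {
            // Short sleep: release is noticed within roughly one poll period.
            thread::sleep(poll);
            if last_beat.elapsed() >= interval {
                beats.fetch_add(1, Ordering::SeqCst);
                last_beat = Instant::now();
            }
        }
    })
}
```

With a 100ms poll and a 5s interval, the thread exits shortly after the
flag is set instead of sleeping out the remainder of the interval.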
Database (db.rs):
- Bump expected schema version to 6 for MR migration
The exit code mapping enables shell scripts and CI/CD pipelines to
distinguish between configuration errors (2-4), GitLab API errors
(5-8), and database errors (9-11) for appropriate retry/alert logic.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>