11 isomorphic performance fixes from deep audit (no behavior changes):
- Eliminate double serialization: store_payload now accepts pre-serialized
bytes (&[u8]) instead of re-serializing from serde_json::Value. Uses
Cow<[u8]> for zero-copy when compression is disabled.
- Add SQLite cache_size (64MB) and mmap_size (256MB) pragmas
- Replace SELECT-then-INSERT label upserts with INSERT...ON CONFLICT
RETURNING in both issues.rs and merge_requests.rs
- Replace INSERT + SELECT milestone upsert with RETURNING
- Use prepare_cached for 5 hot-path queries in extractor.rs
- Optimize compute_list_hash: index-sort + incremental SHA-256 instead
of clone+sort+join+hash
- Pre-allocate embedding float-to-bytes buffer with Vec::with_capacity
- Replace RandomState::new() in rand_jitter with atomic counter XOR nanos
- Remove redundant per-note payload storage (discussion payload contains
all notes already)
- Change transform_issue to accept &GitLabIssue (avoids full struct clone)
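The zero-copy path in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the actual store_payload code: `encode_payload` and `compress` are hypothetical names, and the run-length encoder is only a dependency-free stand-in for whatever compressor the project uses.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for a real compressor: a naive run-length
/// encoding, used only so this sketch needs no external crates.
fn compress(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        let byte = input[i];
        let mut run = 1usize;
        // Extend the run while the next byte matches (capped at 255).
        while i + run < input.len() && input[i + run] == byte && run < 255 {
            run += 1;
        }
        out.push(run as u8);
        out.push(byte);
        i += run;
    }
    out
}

/// Returns the bytes that would be written to storage.
/// With compression disabled the caller's slice is borrowed as-is,
/// so no allocation or copy takes place.
fn encode_payload(json_bytes: &[u8], compress_enabled: bool) -> Cow<'_, [u8]> {
    if compress_enabled {
        Cow::Owned(compress(json_bytes))
    } else {
        Cow::Borrowed(json_bytes)
    }
}
```

The caller serializes once, passes the resulting slice in, and binds `&*encoded` to the INSERT statement in either case.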
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two bug fixes:
1. extractor.rs: The content hash was computed on the pre-truncation
content, meaning the hash stored in the document didn't correspond
to the actual stored (truncated) content. A change confined to the
portion that truncation discards would therefore alter the hash and
look like an update even though the persisted content was identical.
Hash is now computed after truncate_hard_cap() so it always matches
the persisted content.
2. dependent_queue.rs: claim_jobs() had a TOCTOU race between the
SELECT that found available jobs and the UPDATE that locked them.
Under concurrent callers, two drain runs could claim the same job.
Replaced with a single UPDATE ... RETURNING statement that
atomically selects and locks jobs in one operation.
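The ordering in fix 1 can be sketched as below. This is an illustration under assumptions: the signature of truncate_hard_cap() is guessed, `prepare_document` and `content_hash` are hypothetical names, and std's DefaultHasher stands in for whatever hash the extractor really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Assumed shape of the truncation helper: cut `content` to at most
/// `max_bytes`, backing up to the nearest UTF-8 character boundary.
fn truncate_hard_cap(content: &str, max_bytes: usize) -> &str {
    if content.len() <= max_bytes {
        return content;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    &content[..end]
}

/// Stand-in content hash (not the project's real hash function).
fn content_hash(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Truncate first, then hash exactly the bytes that will be persisted.
fn prepare_document(raw: &str, max_bytes: usize) -> (String, u64) {
    let stored = truncate_hard_cap(raw, max_bytes);
    (stored.to_owned(), content_hash(stored))
}
```

With this ordering, two inputs that differ only past the cap yield the same stored content and the same hash.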
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
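For reference, three of the clippy rewrites listed above look roughly like this. The function names and bounds are hypothetical and chosen only for illustration; they are not taken from the codebase.

```rust
/// Range::contains() instead of `per_page >= 1 && per_page <= 100`.
fn is_valid_page_size(per_page: u32) -> bool {
    (1..=100).contains(&per_page)
}

/// i64::from() instead of `id as i64`: the conversion is lossless, and
/// it stops compiling if the source type ever becomes one that could
/// truncate, whereas `as` would silently keep working.
fn to_row_id(id: u32) -> i64 {
    i64::from(id)
}

/// .clamp() instead of `requested.max(1).min(300)`.
/// Note that clamp panics if the lower bound exceeds the upper bound.
fn bounded_backoff_secs(requested: u64) -> u64 {
    requested.clamp(1, 300)
}
```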
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the documents module that transforms raw ingested entities
(issues, MRs, discussions) into searchable document blobs stored in
the documents table. This is the foundation for both FTS5 lexical
search and vector embedding.
Key components:
- documents::extractor: Renders entities into structured text documents.
Issues include title, description, labels, milestone, assignees, and
threaded discussion summaries. MRs additionally include source/target
branches, reviewers, and approval status. Discussions are rendered
with full note threading.
- documents::regenerator: Drains the dirty_queue table to regenerate
only documents whose source entities changed since last sync. Supports
full rebuild mode (seeds all entities into dirty queue first) and
project-scoped regeneration.
- documents::truncation: Safety cap at 2MB per document to prevent
pathological outliers from degrading FTS or embedding performance.
- ingestion::dirty_tracker: Marks entities as dirty inside the
ingestion transaction so document regeneration stays consistent
with data changes. Uses INSERT OR IGNORE to deduplicate.
- ingestion::discussion_queue: Queue-based discussion fetching that
isolates individual discussion failures from the broader ingestion
pipeline, preventing a single corrupt discussion from blocking
an entire project sync.
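The failure isolation described for ingestion::discussion_queue can be sketched in memory as below. This is an assumption-laden illustration: the real queue lives in SQLite, and `DrainReport`, `drain_discussion_queue`, and the `fetch` callback are hypothetical names.

```rust
use std::collections::VecDeque;

/// Outcome of one drain pass: which discussions synced, which failed.
#[derive(Debug, Default, PartialEq)]
struct DrainReport {
    succeeded: Vec<u64>,
    failed: Vec<(u64, String)>,
}

/// Processes every queued discussion id. A failure is recorded and the
/// loop moves on, so one bad discussion never aborts the whole pass.
fn drain_discussion_queue<F>(mut queue: VecDeque<u64>, mut fetch: F) -> DrainReport
where
    F: FnMut(u64) -> Result<(), String>,
{
    let mut report = DrainReport::default();
    while let Some(id) = queue.pop_front() {
        match fetch(id) {
            Ok(()) => report.succeeded.push(id),
            Err(reason) => report.failed.push((id, reason)),
        }
    }
    report
}
```

Failed entries can be retried on a later pass while the successful ones are already committed.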
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>