Commit Graph

6 Commits

Author SHA1 Message Date
teernisse
23efb15599 feat(truncation): add pre-truncation for oversized descriptions
Add pre_truncate_description() to prevent unbounded memory allocation when
processing pathologically large descriptions (e.g., 500MB base64 blobs in
issue descriptions).

Previously, the document extraction pipeline would:
1. Allocate memory for the entire description
2. Append to content buffer
3. Only truncate at the end via truncate_hard_cap()

For a 500MB description, this would allocate 500MB+ before truncation.

New approach:
1. Check description size BEFORE appending
2. If over limit, truncate at UTF-8 boundary immediately
3. Add human-readable marker: "[... description truncated from 500.0MB to 2.0MB ...]"
4. Log warning with original size for observability

Also adds format_bytes() helper for human-readable byte sizes (B, KB, MB).

This is applied to both issue and MR document extraction in extractor.rs,
protecting the embedding pipeline from OOM on malformed GitLab data.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-26 11:06:32 -05:00
Taylor Eernisse
45126f04a6 fix: document upsert project_id, truncation budget, and Ollama model matching
- regenerator: Include project_id in the ON CONFLICT UPDATE clause for
  document upserts. Previously, if a document moved between projects
  (e.g., during re-ingestion), the project_id would remain stale.

- truncation: Compute the omission marker ("N notes omitted") before
  checking whether first+last notes fit in the budget. The old order
  computed the marker after the budget check, meaning the marker's byte
  cost was unaccounted for and could cause over-budget output.

- ollama: Tighten model name matching to require either an exact match
  or a colon-delimited tag prefix (model == name or name starts with
  "model:"). The prior starts_with check would false-positive on
  "nomic-embed-text-v2" when looking for "nomic-embed-text". Tests
  updated to cover exact match, tagged, wrong model, and prefix
  false-positive cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 10:16:14 -05:00
Taylor Eernisse
72f1cafdcf perf: Optimize SQL queries and reduce allocations in hot paths
Change detection queries (embedding/change_detector.rs):
- Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check
- SQLite now scans embedding_metadata once instead of three times
- Semantically identical: returns docs needing embedding when no
  embedding exists, hash changed, or config mismatch

Count queries (cli/commands/count.rs):
- Consolidate 3 separate COUNT queries for issues into single query
  using conditional aggregation (CASE WHEN state = 'x' THEN 1)
- Same optimization for MRs: 5 queries reduced to 1

Search filter queries (search/filters.rs):
- Replace N separate EXISTS clauses for label filtering with single
  IN() clause with COUNT/GROUP BY HAVING pattern
- For multi-label AND queries, this reduces N subqueries to 1

FTS tokenization (search/fts.rs):
- Replace collect-into-Vec-then-join pattern with direct String building
- Pre-allocate capacity hint for result string

Discussion truncation (documents/truncation.rs):
- Calculate total length without allocating concatenated string first
- Only allocate full string when we know it fits within limit

Embedding pipeline (embedding/pipeline.rs):
- Add Vec::with_capacity hints for chunk work and cleared_docs hashset
- Reduces reallocations during embedding batch processing

Backoff calculation (core/backoff.rs):
- Replace unchecked addition with saturating_add to prevent overflow
- Add test case verifying overflow protection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 11:21:28 -05:00
Taylor Eernisse
65583ed5d6 refactor: Remove redundant doc comments throughout codebase
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)

Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs

Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.

Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:04:32 -05:00
Taylor Eernisse
a50fc78823 style: Apply cargo fmt and clippy fixes across codebase
Automated formatting and lint corrections from parallel agent work:

- cargo fmt: import reordering (alphabetical), line wrapping to respect
  max width, trailing comma normalization, destructuring alignment,
  function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
  i64::from() instead of `as i64` casts, .clamp() instead of
  .max().min() chains, let-chain refactors (if-let with &&),
  #[allow(clippy::too_many_arguments)] and
  #[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace

No behavioral changes. All existing tests pass unmodified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 13:01:59 -05:00
Taylor Eernisse
20edff4ab1 feat(documents): Add document generation pipeline with dirty tracking
Implements the documents module that transforms raw ingested entities
(issues, MRs, discussions) into searchable document blobs stored in
the documents table. This is the foundation for both FTS5 lexical
search and vector embedding.

Key components:

- documents::extractor: Renders entities into structured text documents.
  Issues include title, description, labels, milestone, assignees, and
  threaded discussion summaries. MRs additionally include source/target
  branches, reviewers, and approval status. Discussions are rendered
  with full note threading.

- documents::regenerator: Drains the dirty_queue table to regenerate
  only documents whose source entities changed since last sync. Supports
  full rebuild mode (seeds all entities into dirty queue first) and
  project-scoped regeneration.

- documents::truncation: Safety cap at 2MB per document to prevent
  pathological outliers from degrading FTS or embedding performance.

- ingestion::dirty_tracker: Marks entities as dirty inside the
  ingestion transaction so document regeneration stays consistent
  with data changes. Uses INSERT OR IGNORE to deduplicate.

- ingestion::discussion_queue: Queue-based discussion fetching that
  isolates individual discussion failures from the broader ingestion
  pipeline, preventing a single corrupt discussion from blocking
  an entire project sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:46:18 -05:00