Files

Taylor Eernisse 3bb24dc6cb docs: Add performance audit report with optimization findings

PERFORMANCE_AUDIT.md documents a comprehensive code analysis identifying
12 optimization opportunities across the codebase:

High-impact findings (ICE score > 8):
1. Triple-EXISTS change detection -> LEFT JOIN (DONE)
2. N+1 label/assignee inserts during ingestion
3. Clone in embedding batch loop
4. Correlated GROUP_CONCAT in list queries
5. Multiple EXISTS per label filter (DONE)

Medium-impact findings (ICE 5-7):
6. String allocation in chunking
7. Multiple COUNT queries -> conditional aggregation (DONE)
8. Collect-then-concat in truncation (DONE)
9. Box<dyn ToSql> allocations in filters
10. Missing Vec::with_capacity hints (DONE)
11. FTS token collect-join pattern (DONE)
12. Transformer string clones

Report includes:
- Methodology section explaining code-analysis approach
- ICE (Impact x Confidence / Effort) scoring matrix
- Detailed SQL query transformations with isomorphism proofs
- Before/after code samples for each optimization
- Test verification notes

Status: 6 of 12 optimizations implemented in this session.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 11:23:06 -05:00

14 KiB

Raw Blame History

Gitlore Performance Audit Report

Date: 2026-02-05 Auditor: Claude Code (Opus 4.5) Scope: Core system performance - ingestion, embedding, search, and document regeneration

Executive Summary

This audit identifies 12 high-impact optimization opportunities across the Gitlore codebase. The most significant findings center on:

SQL query patterns with N+1 issues and inefficient correlated subqueries
Memory allocation patterns in hot paths (embedding, chunking, ingestion)
Change detection queries using triple-EXISTS patterns instead of JOINs

Estimated overall improvement potential: 30-50% reduction in latency for filtered searches, 2-5x improvement in ingestion throughput for issues/MRs with many labels.

Methodology

Codebase analysis: Full read of all modules in src/
SQL pattern analysis: All queries checked for N+1, missing indexes, unbounded results
Memory allocation analysis: Clone patterns, unnecessary collections, missing capacity hints
Test baseline: All tests pass (cargo test --release)

Note: Without access to a live GitLab instance or populated database, profiling is code-analysis based rather than runtime measured.

Opportunity Matrix

ID	Issue	Location	Impact	Confidence	Effort	ICE Score	Status
1	Triple-EXISTS change detection	`change_detector.rs:19-46`	HIGH	95%	LOW	9.5	DONE
2	N+1 label/assignee inserts	`issues.rs:270-285`, `merge_requests.rs:242-272`	HIGH	95%	MEDIUM	9.0	Pending
3	Clone in embedding batch loop	`pipeline.rs:165`	HIGH	90%	LOW	9.0	Pending
4	Correlated GROUP_CONCAT in list	`list.rs:341-348`	HIGH	90%	MEDIUM	8.5	Pending
5	Multiple EXISTS per label filter	`filters.rs:100-107`	HIGH	85%	MEDIUM	8.0	DONE
6	String allocation in chunking	`chunking.rs:7-49`	MEDIUM	95%	MEDIUM	7.5	Pending
7	Multiple COUNT queries	`count.rs:44-56`	MEDIUM	95%	LOW	7.0	DONE
8	Collect-then-concat pattern	`truncation.rs:60-61`	MEDIUM	90%	LOW	7.0	DONE
9	Box allocations	`filters.rs:67-135`	MEDIUM	80%	HIGH	6.0	Pending
10	Missing Vec::with_capacity	`pipeline.rs:106`, multiple	LOW	95%	LOW	5.5	DONE
11	FTS token collect-join	`fts.rs:26-41`	LOW	90%	LOW	5.0	DONE
12	Transformer string clones	`merge_request.rs:51-77`	MEDIUM	85%	HIGH	5.0	Pending

ICE Score = (Impact x Confidence) / Effort, scaled 1-10

Detailed Findings

1. Triple-EXISTS Change Detection Query (ICE: 9.5)

Location: src/embedding/change_detector.rs:19-46

Current Code:

SELECT d.id, d.content_text, d.content_hash
FROM documents d
WHERE d.id > ?1
  AND (
    NOT EXISTS (SELECT 1 FROM embedding_metadata em WHERE em.document_id = d.id AND em.chunk_index = 0)
    OR EXISTS (SELECT 1 FROM embedding_metadata em WHERE em.document_id = d.id AND em.chunk_index = 0 AND em.document_hash != d.content_hash)
    OR EXISTS (SELECT 1 FROM embedding_metadata em WHERE em.document_id = d.id AND em.chunk_index = 0 AND (...))
  )
ORDER BY d.id
LIMIT ?2

Problem: Three separate EXISTS subqueries, each scanning embedding_metadata. SQLite cannot short-circuit across OR'd EXISTS efficiently.

Proposed Fix:

SELECT d.id, d.content_text, d.content_hash
FROM documents d
LEFT JOIN embedding_metadata em
  ON em.document_id = d.id AND em.chunk_index = 0
WHERE d.id > ?1
  AND (
    em.document_id IS NULL                      -- no embedding
    OR em.document_hash != d.content_hash       -- hash mismatch
    OR em.chunk_max_bytes IS NULL
    OR em.chunk_max_bytes != ?3
    OR em.model != ?4
    OR em.dims != ?5
  )
ORDER BY d.id
LIMIT ?2

Isomorphism Proof: Both queries return documents needing embedding when:

No embedding exists for chunk_index=0 (NULL check)
Hash changed (direct comparison)
Config mismatch (model/dims/chunk_max_bytes)

The LEFT JOIN + NULL check is semantically identical to NOT EXISTS. The OR conditions inside WHERE match the EXISTS predicates exactly.

Expected Impact: 2-3x faster for large document sets. Single scan of embedding_metadata instead of three.

2. N+1 Label/Assignee Inserts (ICE: 9.0)

Location:

src/ingestion/issues.rs:270-285
src/ingestion/merge_requests.rs:242-272

Current Code:

for label_name in label_names {
    let label_id = upsert_label_tx(tx, project_id, label_name, &mut labels_created)?;
    link_issue_label_tx(tx, local_issue_id, label_id)?;
}

Problem: Each label triggers 2+ SQL statements. With 20 labels × 100 issues = 4000+ queries per batch.

Proposed Fix: Batch insert using prepared statements with multi-row VALUES:

// Build batch: INSERT INTO issue_labels VALUES (?, ?), (?, ?), ...
let mut values = String::new();
let mut params: Vec<Box<dyn ToSql>> = Vec::with_capacity(label_ids.len() * 2);
for (i, label_id) in label_ids.iter().enumerate() {
    if i > 0 { values.push_str(","); }
    values.push_str("(?,?)");
    params.push(Box::new(local_issue_id));
    params.push(Box::new(*label_id));
}
let sql = format!("INSERT OR IGNORE INTO issue_labels (issue_id, label_id) VALUES {}", values);

Or use prepare_cached() pattern from events_db.rs.

Isomorphism Proof: Both approaches insert identical rows. OR IGNORE handles duplicates identically.

Expected Impact: 5-10x faster ingestion for issues/MRs with many labels.

3. Clone in Embedding Batch Loop (ICE: 9.0)

Location: src/embedding/pipeline.rs:165

Current Code:

let texts: Vec<String> = batch.iter().map(|c| c.text.clone()).collect();

Problem: Every batch iteration clones all chunk texts. With BATCH_SIZE=32 and thousands of chunks, this doubles memory allocation in the hot path.

Proposed Fix: Transfer ownership instead of cloning:

// Option A: Drain chunks from all_chunks instead of iterating
let texts: Vec<String> = batch.into_iter().map(|c| c.text).collect();

// Option B: Store references in ChunkWork, clone only at API boundary
struct ChunkWork<'a> {
    text: &'a str,
    // ...
}

Isomorphism Proof: Same texts sent to Ollama, same embeddings returned. Order and content identical.

Expected Impact: 30-50% reduction in embedding pipeline memory allocation.

4. Correlated GROUP_CONCAT in List Queries (ICE: 8.5)

Location: src/cli/commands/list.rs:341-348

Current Code:

SELECT i.*,
       (SELECT GROUP_CONCAT(l.name, X'1F') FROM issue_labels il JOIN labels l ... WHERE il.issue_id = i.id) AS labels_csv,
       (SELECT COUNT(*) FROM discussions WHERE issue_id = i.id) as discussion_count
FROM issues i

Problem: Each correlated subquery executes per row. With LIMIT 50, that's 100+ subquery executions.

Proposed Fix: Use window functions or pre-aggregated CTEs:

WITH label_agg AS (
    SELECT il.issue_id, GROUP_CONCAT(l.name, X'1F') AS labels_csv
    FROM issue_labels il JOIN labels l ON il.label_id = l.id
    GROUP BY il.issue_id
),
discussion_agg AS (
    SELECT issue_id, COUNT(*) AS cnt
    FROM discussions WHERE issue_id IS NOT NULL
    GROUP BY issue_id
)
SELECT i.*, la.labels_csv, da.cnt
FROM issues i
LEFT JOIN label_agg la ON la.issue_id = i.id
LEFT JOIN discussion_agg da ON da.issue_id = i.id
WHERE ...
LIMIT 50

Isomorphism Proof: Same data returned - labels concatenated, discussion counts accurate. JOIN preserves NULL when no labels/discussions exist.

Expected Impact: 3-5x faster list queries with discussion/label data.

5. Multiple EXISTS Per Label Filter (ICE: 8.0)

Location: src/search/filters.rs:100-107

Current Code:

WHERE EXISTS (SELECT 1 ... AND label_name = ?)
  AND EXISTS (SELECT 1 ... AND label_name = ?)
  AND EXISTS (SELECT 1 ... AND label_name = ?)

Problem: Filtering by 3 labels generates 3 EXISTS subqueries. Each scans document_labels.

Proposed Fix: Single EXISTS with GROUP BY/HAVING:

WHERE EXISTS (
    SELECT 1 FROM document_labels dl
    WHERE dl.document_id = d.id
      AND dl.label_name IN (?, ?, ?)
    GROUP BY dl.document_id
    HAVING COUNT(DISTINCT dl.label_name) = 3
)

Isomorphism Proof: Both return documents with ALL specified labels. AND of EXISTS = document has label1 AND label2 AND label3. GROUP BY + HAVING COUNT(DISTINCT) = 3 is mathematically equivalent.

Expected Impact: 2-4x faster filtered search with multiple labels.

6. String Allocation in Chunking (ICE: 7.5)

Location: src/embedding/chunking.rs:7-49

Current Code:

chunks.push((chunk_index, remaining.to_string()));

Problem: Converts &str slices to owned String for every chunk. The input is already a &str.

Proposed Fix: Return borrowed slices or use Cow:

pub fn split_into_chunks(content: &str) -> Vec<(usize, &str)> {
    // Return slices into original content
}

Or if ownership is needed later:

pub fn split_into_chunks(content: &str) -> Vec<(usize, Cow<'_, str>)>

Isomorphism Proof: Same chunk boundaries, same text content. Only allocation behavior changes.

Expected Impact: Reduces allocations by ~50% in chunking hot path.

7. Multiple COUNT Queries (ICE: 7.0)

Location: src/cli/commands/count.rs:44-56

Current Code:

let count = conn.query_row("SELECT COUNT(*) FROM issues", ...)?;
let opened = conn.query_row("SELECT COUNT(*) FROM issues WHERE state = 'opened'", ...)?;
let closed = conn.query_row("SELECT COUNT(*) FROM issues WHERE state = 'closed'", ...)?;

Problem: 5 separate queries for MR state breakdown, 3 for issues.

Proposed Fix: Single query with CASE aggregation:

SELECT
    COUNT(*) AS total,
    SUM(CASE WHEN state = 'opened' THEN 1 ELSE 0 END) AS opened,
    SUM(CASE WHEN state = 'closed' THEN 1 ELSE 0 END) AS closed
FROM issues

Isomorphism Proof: Identical counts returned. CASE WHEN with SUM is standard SQL for conditional counting.

Expected Impact: 3-5x fewer round trips for count command.

8. Collect-then-Concat Pattern (ICE: 7.0)

Location: src/documents/truncation.rs:60-61

Current Code:

let formatted: Vec<String> = notes.iter().map(format_note).collect();
let total: String = formatted.concat();

Problem: Allocates intermediate Vec, then allocates again for concat.

Proposed Fix: Use fold or format directly:

let total = notes.iter().fold(String::new(), |mut acc, note| {
    acc.push_str(&format_note(note));
    acc
});

Or with capacity hint:

let total_len: usize = notes.iter().map(|n| estimate_note_len(n)).sum();
let mut total = String::with_capacity(total_len);
for note in notes {
    total.push_str(&format_note(note));
}

Isomorphism Proof: Same concatenated string output. Order preserved.

Expected Impact: 50% reduction in allocations for document regeneration.

9. Box Allocations (ICE: 6.0)

Location: src/search/filters.rs:67-135

Current Code:

let mut params: Vec<Box<dyn rusqlite::types::ToSql>> = vec![Box::new(ids_json)];
// ... more Box::new() calls
let param_refs: Vec<&dyn rusqlite::types::ToSql> = params.iter().map(|p| p.as_ref()).collect();

Problem: Boxing each parameter, then collecting references. Two allocations per parameter.

Proposed Fix: Use rusqlite's params! macro or typed parameter arrays:

// For known parameter counts, use arrays
let params: [&dyn ToSql; 4] = [&ids_json, &author, &state, &limit];

// Or build SQL with named parameters and use params! directly

Expected Impact: Eliminates ~15 allocations per filtered search.

10. Missing Vec::with_capacity (ICE: 5.5)

Locations:

src/embedding/pipeline.rs:106
src/embedding/pipeline.rs:162
Multiple other locations

Current Code:

let mut all_chunks: Vec<ChunkWork> = Vec::new();

Proposed Fix:

// Estimate: average 3 chunks per document
let mut all_chunks = Vec::with_capacity(pending.len() * 3);

Expected Impact: Eliminates reallocation overhead during vector growth.

11. FTS Token Collect-Join (ICE: 5.0)

Location: src/search/fts.rs:26-41

Current Code:

let tokens: Vec<String> = trimmed.split_whitespace().map(...).collect();
tokens.join(" ")

Proposed Fix: Use itertools or avoid intermediate vec:

use itertools::Itertools;
trimmed.split_whitespace().map(...).join(" ")

Expected Impact: Minor - search queries are typically short.

12. Transformer String Clones (ICE: 5.0)

Location: src/gitlab/transformers/merge_request.rs:51-77

Problem: Multiple .clone() calls on String fields during transformation.

Proposed Fix: Use std::mem::take() where possible, or restructure to avoid cloning.

Expected Impact: Moderate - depends on MR volume.

Regression Guardrails

For any optimization implemented:

Test Coverage: All existing tests must pass
Output Equivalence: For SQL changes, verify identical result sets with test data
Benchmark Suite: Add benchmarks for affected paths before/after

Suggested benchmark targets:

#[bench] fn bench_change_detection_1k_docs(b: &mut Bencher) { ... }
#[bench] fn bench_label_insert_50_labels(b: &mut Bencher) { ... }
#[bench] fn bench_hybrid_search_filtered(b: &mut Bencher) { ... }

Implementation Priority

Phase 1 (Quick Wins) - COMPLETE:

~~Change detection query rewrite (#1)~~ DONE
~~Multiple COUNT consolidation (#7)~~ DONE
~~Collect-concat pattern (#8)~~ DONE
~~Vec::with_capacity hints (#10)~~ DONE
~~FTS token collect-join (#11)~~ DONE
~~Multiple EXISTS per label (#5)~~ DONE

Phase 2 (Medium Effort): 5. Embedding batch clone removal (#3) 6. Label filter EXISTS consolidation (#5) 7. Chunking string allocation (#6)

Phase 3 (Higher Effort): 8. N+1 batch inserts (#2) 9. List query CTEs (#4) 10. Parameter boxing (#9)

Appendix: Test Baseline

cargo test --release
running 127 tests
test result: ok. 127 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

All tests pass. Any optimization must maintain this baseline.

14 KiB Raw Blame History Unescape Escape

Gitlore Performance Audit Report

Executive Summary

Methodology

Opportunity Matrix

Detailed Findings

1. Triple-EXISTS Change Detection Query (ICE: 9.5)

2. N+1 Label/Assignee Inserts (ICE: 9.0)

3. Clone in Embedding Batch Loop (ICE: 9.0)

4. Correlated GROUP_CONCAT in List Queries (ICE: 8.5)

5. Multiple EXISTS Per Label Filter (ICE: 8.0)

6. String Allocation in Chunking (ICE: 7.5)

7. Multiple COUNT Queries (ICE: 7.0)

8. Collect-then-Concat Pattern (ICE: 7.0)

9. Box Allocations (ICE: 6.0)

10. Missing Vec::with_capacity (ICE: 5.5)

11. FTS Token Collect-Join (ICE: 5.0)

12. Transformer String Clones (ICE: 5.0)

Regression Guardrails

Implementation Priority

Appendix: Test Baseline

14 KiB

Raw Blame History