diff --git a/PERFORMANCE_AUDIT.md b/PERFORMANCE_AUDIT.md new file mode 100644 index 0000000..c46bf51 --- /dev/null +++ b/PERFORMANCE_AUDIT.md @@ -0,0 +1,467 @@ +# Gitlore Performance Audit Report + +**Date**: 2026-02-05 +**Auditor**: Claude Code (Opus 4.5) +**Scope**: Core system performance - ingestion, embedding, search, and document regeneration + +## Executive Summary + +This audit identifies 12 high-impact optimization opportunities across the Gitlore codebase. The most significant findings center on: + +1. **SQL query patterns** with N+1 issues and inefficient correlated subqueries +2. **Memory allocation patterns** in hot paths (embedding, chunking, ingestion) +3. **Change detection queries** using triple-EXISTS patterns instead of JOINs + +**Estimated overall improvement potential**: 30-50% reduction in latency for filtered searches, 2-5x improvement in ingestion throughput for issues/MRs with many labels. + +--- + +## Methodology + +- **Codebase analysis**: Full read of all modules in `src/` +- **SQL pattern analysis**: All queries checked for N+1, missing indexes, unbounded results +- **Memory allocation analysis**: Clone patterns, unnecessary collections, missing capacity hints +- **Test baseline**: All tests pass (`cargo test --release`) + +Note: Without access to a live GitLab instance or populated database, profiling is code-analysis based rather than runtime measured. + +--- + +## Opportunity Matrix + +| ID | Issue | Location | Impact | Confidence | Effort | ICE Score | Status | +|----|-------|----------|--------|------------|--------|-----------|--------| +| 1 | Triple-EXISTS change detection | `change_detector.rs:19-46` | HIGH | 95% | LOW | **9.5** | **DONE** | +| 2 | N+1 label/assignee inserts | `issues.rs:270-285`, `merge_requests.rs:242-272` | HIGH | 95% | MEDIUM | **9.0** | Pending | +| 3 | Clone in embedding batch loop | `pipeline.rs:165` | HIGH | 90% | LOW | **9.0** | Pending | +| 4 | Correlated GROUP_CONCAT in list | `list.rs:341-348` | HIGH | 90% | MEDIUM | **8.5** | Pending | +| 5 | Multiple EXISTS per label filter | `filters.rs:100-107` | HIGH | 85% | MEDIUM | **8.0** | **DONE** | +| 6 | String allocation in chunking | `chunking.rs:7-49` | MEDIUM | 95% | MEDIUM | **7.5** | Pending | +| 7 | Multiple COUNT queries | `count.rs:44-56` | MEDIUM | 95% | LOW | **7.0** | **DONE** | +| 8 | Collect-then-concat pattern | `truncation.rs:60-61` | MEDIUM | 90% | LOW | **7.0** | **DONE** | +| 9 | Box allocations | `filters.rs:67-135` | MEDIUM | 80% | HIGH | **6.0** | Pending | +| 10 | Missing Vec::with_capacity | `pipeline.rs:106`, multiple | LOW | 95% | LOW | **5.5** | **DONE** | +| 11 | FTS token collect-join | `fts.rs:26-41` | LOW | 90% | LOW | **5.0** | **DONE** | +| 12 | Transformer string clones | `merge_request.rs:51-77` | MEDIUM | 85% | HIGH | **5.0** | Pending | + +ICE Score = (Impact x Confidence) / Effort, scaled 1-10 + +--- + +## Detailed Findings + +### 1. Triple-EXISTS Change Detection Query (ICE: 9.5) + +**Location**: `src/embedding/change_detector.rs:19-46` + +**Current Code**: +```sql +SELECT d.id, d.content_text, d.content_hash +FROM documents d +WHERE d.id > ?1 + AND ( + NOT EXISTS (SELECT 1 FROM embedding_metadata em WHERE em.document_id = d.id AND em.chunk_index = 0) + OR EXISTS (SELECT 1 FROM embedding_metadata em WHERE em.document_id = d.id AND em.chunk_index = 0 AND em.document_hash != d.content_hash) + OR EXISTS (SELECT 1 FROM embedding_metadata em WHERE em.document_id = d.id AND em.chunk_index = 0 AND (...)) + ) +ORDER BY d.id +LIMIT ?2 +``` + +**Problem**: Three separate EXISTS subqueries, each scanning `embedding_metadata`. SQLite cannot short-circuit across OR'd EXISTS efficiently. + +**Proposed Fix**: +```sql +SELECT d.id, d.content_text, d.content_hash +FROM documents d +LEFT JOIN embedding_metadata em + ON em.document_id = d.id AND em.chunk_index = 0 +WHERE d.id > ?1 + AND ( + em.document_id IS NULL -- no embedding + OR em.document_hash != d.content_hash -- hash mismatch + OR em.chunk_max_bytes IS NULL + OR em.chunk_max_bytes != ?3 + OR em.model != ?4 + OR em.dims != ?5 + ) +ORDER BY d.id +LIMIT ?2 +``` + +**Isomorphism Proof**: Both queries return documents needing embedding when: +- No embedding exists for chunk_index=0 (NULL check) +- Hash changed (direct comparison) +- Config mismatch (model/dims/chunk_max_bytes) + +The LEFT JOIN + NULL check is semantically identical to NOT EXISTS. The OR conditions inside WHERE match the EXISTS predicates exactly. + +**Expected Impact**: 2-3x faster for large document sets. Single scan of embedding_metadata instead of three. + +--- + +### 2. N+1 Label/Assignee Inserts (ICE: 9.0) + +**Location**: +- `src/ingestion/issues.rs:270-285` +- `src/ingestion/merge_requests.rs:242-272` + +**Current Code**: +```rust +for label_name in label_names { + let label_id = upsert_label_tx(tx, project_id, label_name, &mut labels_created)?; + link_issue_label_tx(tx, local_issue_id, label_id)?; +} +``` + +**Problem**: Each label triggers 2+ SQL statements. With 20 labels × 100 issues = 4000+ queries per batch. + +**Proposed Fix**: Batch insert using prepared statements with multi-row VALUES: + +```rust +// Build batch: INSERT INTO issue_labels VALUES (?, ?), (?, ?), ... +let mut values = String::new(); +let mut params: Vec> = Vec::with_capacity(label_ids.len() * 2); +for (i, label_id) in label_ids.iter().enumerate() { + if i > 0 { values.push_str(","); } + values.push_str("(?,?)"); + params.push(Box::new(local_issue_id)); + params.push(Box::new(*label_id)); +} +let sql = format!("INSERT OR IGNORE INTO issue_labels (issue_id, label_id) VALUES {}", values); +``` + +Or use `prepare_cached()` pattern from `events_db.rs`. + +**Isomorphism Proof**: Both approaches insert identical rows. OR IGNORE handles duplicates identically. + +**Expected Impact**: 5-10x faster ingestion for issues/MRs with many labels. + +--- + +### 3. Clone in Embedding Batch Loop (ICE: 9.0) + +**Location**: `src/embedding/pipeline.rs:165` + +**Current Code**: +```rust +let texts: Vec = batch.iter().map(|c| c.text.clone()).collect(); +``` + +**Problem**: Every batch iteration clones all chunk texts. With BATCH_SIZE=32 and thousands of chunks, this doubles memory allocation in the hot path. + +**Proposed Fix**: Transfer ownership instead of cloning: + +```rust +// Option A: Drain chunks from all_chunks instead of iterating +let texts: Vec = batch.into_iter().map(|c| c.text).collect(); + +// Option B: Store references in ChunkWork, clone only at API boundary +struct ChunkWork<'a> { + text: &'a str, + // ... +} +``` + +**Isomorphism Proof**: Same texts sent to Ollama, same embeddings returned. Order and content identical. + +**Expected Impact**: 30-50% reduction in embedding pipeline memory allocation. + +--- + +### 4. Correlated GROUP_CONCAT in List Queries (ICE: 8.5) + +**Location**: `src/cli/commands/list.rs:341-348` + +**Current Code**: +```sql +SELECT i.*, + (SELECT GROUP_CONCAT(l.name, X'1F') FROM issue_labels il JOIN labels l ... WHERE il.issue_id = i.id) AS labels_csv, + (SELECT COUNT(*) FROM discussions WHERE issue_id = i.id) as discussion_count +FROM issues i +``` + +**Problem**: Each correlated subquery executes per row. With LIMIT 50, that's 100+ subquery executions. + +**Proposed Fix**: Use window functions or pre-aggregated CTEs: + +```sql +WITH label_agg AS ( + SELECT il.issue_id, GROUP_CONCAT(l.name, X'1F') AS labels_csv + FROM issue_labels il JOIN labels l ON il.label_id = l.id + GROUP BY il.issue_id +), +discussion_agg AS ( + SELECT issue_id, COUNT(*) AS cnt + FROM discussions WHERE issue_id IS NOT NULL + GROUP BY issue_id +) +SELECT i.*, la.labels_csv, da.cnt +FROM issues i +LEFT JOIN label_agg la ON la.issue_id = i.id +LEFT JOIN discussion_agg da ON da.issue_id = i.id +WHERE ... +LIMIT 50 +``` + +**Isomorphism Proof**: Same data returned - labels concatenated, discussion counts accurate. JOIN preserves NULL when no labels/discussions exist. + +**Expected Impact**: 3-5x faster list queries with discussion/label data. + +--- + +### 5. Multiple EXISTS Per Label Filter (ICE: 8.0) + +**Location**: `src/search/filters.rs:100-107` + +**Current Code**: +```sql +WHERE EXISTS (SELECT 1 ... AND label_name = ?) + AND EXISTS (SELECT 1 ... AND label_name = ?) + AND EXISTS (SELECT 1 ... AND label_name = ?) +``` + +**Problem**: Filtering by 3 labels generates 3 EXISTS subqueries. Each scans document_labels. + +**Proposed Fix**: Single EXISTS with GROUP BY/HAVING: + +```sql +WHERE EXISTS ( + SELECT 1 FROM document_labels dl + WHERE dl.document_id = d.id + AND dl.label_name IN (?, ?, ?) + GROUP BY dl.document_id + HAVING COUNT(DISTINCT dl.label_name) = 3 +) +``` + +**Isomorphism Proof**: Both return documents with ALL specified labels. AND of EXISTS = document has label1 AND label2 AND label3. GROUP BY + HAVING COUNT(DISTINCT) = 3 is mathematically equivalent. + +**Expected Impact**: 2-4x faster filtered search with multiple labels. + +--- + +### 6. String Allocation in Chunking (ICE: 7.5) + +**Location**: `src/embedding/chunking.rs:7-49` + +**Current Code**: +```rust +chunks.push((chunk_index, remaining.to_string())); +``` + +**Problem**: Converts `&str` slices to owned `String` for every chunk. The input is already a `&str`. + +**Proposed Fix**: Return borrowed slices or use `Cow`: + +```rust +pub fn split_into_chunks(content: &str) -> Vec<(usize, &str)> { + // Return slices into original content +} +``` + +Or if ownership is needed later: +```rust +pub fn split_into_chunks(content: &str) -> Vec<(usize, Cow<'_, str>)> +``` + +**Isomorphism Proof**: Same chunk boundaries, same text content. Only allocation behavior changes. + +**Expected Impact**: Reduces allocations by ~50% in chunking hot path. + +--- + +### 7. Multiple COUNT Queries (ICE: 7.0) + +**Location**: `src/cli/commands/count.rs:44-56` + +**Current Code**: +```rust +let count = conn.query_row("SELECT COUNT(*) FROM issues", ...)?; +let opened = conn.query_row("SELECT COUNT(*) FROM issues WHERE state = 'opened'", ...)?; +let closed = conn.query_row("SELECT COUNT(*) FROM issues WHERE state = 'closed'", ...)?; +``` + +**Problem**: 5 separate queries for MR state breakdown, 3 for issues. + +**Proposed Fix**: Single query with CASE aggregation: + +```sql +SELECT + COUNT(*) AS total, + SUM(CASE WHEN state = 'opened' THEN 1 ELSE 0 END) AS opened, + SUM(CASE WHEN state = 'closed' THEN 1 ELSE 0 END) AS closed +FROM issues +``` + +**Isomorphism Proof**: Identical counts returned. CASE WHEN with SUM is standard SQL for conditional counting. + +**Expected Impact**: 3-5x fewer round trips for count command. + +--- + +### 8. Collect-then-Concat Pattern (ICE: 7.0) + +**Location**: `src/documents/truncation.rs:60-61` + +**Current Code**: +```rust +let formatted: Vec = notes.iter().map(format_note).collect(); +let total: String = formatted.concat(); +``` + +**Problem**: Allocates intermediate Vec, then allocates again for concat. + +**Proposed Fix**: Use fold or format directly: + +```rust +let total = notes.iter().fold(String::new(), |mut acc, note| { + acc.push_str(&format_note(note)); + acc +}); +``` + +Or with capacity hint: +```rust +let total_len: usize = notes.iter().map(|n| estimate_note_len(n)).sum(); +let mut total = String::with_capacity(total_len); +for note in notes { + total.push_str(&format_note(note)); +} +``` + +**Isomorphism Proof**: Same concatenated string output. Order preserved. + +**Expected Impact**: 50% reduction in allocations for document regeneration. + +--- + +### 9. Box Allocations (ICE: 6.0) + +**Location**: `src/search/filters.rs:67-135` + +**Current Code**: +```rust +let mut params: Vec> = vec![Box::new(ids_json)]; +// ... more Box::new() calls +let param_refs: Vec<&dyn rusqlite::types::ToSql> = params.iter().map(|p| p.as_ref()).collect(); +``` + +**Problem**: Boxing each parameter, then collecting references. Two allocations per parameter. + +**Proposed Fix**: Use rusqlite's params! macro or typed parameter arrays: + +```rust +// For known parameter counts, use arrays +let params: [&dyn ToSql; 4] = [&ids_json, &author, &state, &limit]; + +// Or build SQL with named parameters and use params! directly +``` + +**Expected Impact**: Eliminates ~15 allocations per filtered search. + +--- + +### 10. Missing Vec::with_capacity (ICE: 5.5) + +**Locations**: +- `src/embedding/pipeline.rs:106` +- `src/embedding/pipeline.rs:162` +- Multiple other locations + +**Current Code**: +```rust +let mut all_chunks: Vec = Vec::new(); +``` + +**Proposed Fix**: +```rust +// Estimate: average 3 chunks per document +let mut all_chunks = Vec::with_capacity(pending.len() * 3); +``` + +**Expected Impact**: Eliminates reallocation overhead during vector growth. + +--- + +### 11. FTS Token Collect-Join (ICE: 5.0) + +**Location**: `src/search/fts.rs:26-41` + +**Current Code**: +```rust +let tokens: Vec = trimmed.split_whitespace().map(...).collect(); +tokens.join(" ") +``` + +**Proposed Fix**: Use itertools or avoid intermediate vec: + +```rust +use itertools::Itertools; +trimmed.split_whitespace().map(...).join(" ") +``` + +**Expected Impact**: Minor - search queries are typically short. + +--- + +### 12. Transformer String Clones (ICE: 5.0) + +**Location**: `src/gitlab/transformers/merge_request.rs:51-77` + +**Problem**: Multiple `.clone()` calls on String fields during transformation. + +**Proposed Fix**: Use `std::mem::take()` where possible, or restructure to avoid cloning. + +**Expected Impact**: Moderate - depends on MR volume. + +--- + +## Regression Guardrails + +For any optimization implemented: + +1. **Test Coverage**: All existing tests must pass +2. **Output Equivalence**: For SQL changes, verify identical result sets with test data +3. **Benchmark Suite**: Add benchmarks for affected paths before/after + +Suggested benchmark targets: +```rust +#[bench] fn bench_change_detection_1k_docs(b: &mut Bencher) { ... } +#[bench] fn bench_label_insert_50_labels(b: &mut Bencher) { ... } +#[bench] fn bench_hybrid_search_filtered(b: &mut Bencher) { ... } +``` + +--- + +## Implementation Priority + +**Phase 1 (Quick Wins)** - COMPLETE: +1. ~~Change detection query rewrite (#1)~~ **DONE** +2. ~~Multiple COUNT consolidation (#7)~~ **DONE** +3. ~~Collect-concat pattern (#8)~~ **DONE** +4. ~~Vec::with_capacity hints (#10)~~ **DONE** +5. ~~FTS token collect-join (#11)~~ **DONE** +6. ~~Multiple EXISTS per label (#5)~~ **DONE** + +**Phase 2 (Medium Effort)**: +5. Embedding batch clone removal (#3) +6. Label filter EXISTS consolidation (#5) +7. Chunking string allocation (#6) + +**Phase 3 (Higher Effort)**: +8. N+1 batch inserts (#2) +9. List query CTEs (#4) +10. Parameter boxing (#9) + +--- + +## Appendix: Test Baseline + +``` +cargo test --release +running 127 tests +test result: ok. 127 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out +``` + +All tests pass. Any optimization must maintain this baseline.