gitlore/docs/performance-audit-2026-02-12.md
teernisse acc5e12e3d perf: force partial index for DiffNote queries, batch stats counts
Query optimizer fixes for the `who` and `stats` commands based on
a systematic performance audit of the SQLite query plans.

who command (expert/reviews/detail modes):
- Add INDEXED BY idx_notes_diffnote_path_created hints to all DiffNote
  queries. SQLite's planner was selecting idx_notes_system (38% of rows)
  over the far more selective partial index (9.3% of rows). Measured
  50-133x speedup on expert queries, 26x on reviews queries.
- Reorder JOIN clauses in detail mode's MR-author sub-select to match
  the index scan direction (notes -> discussions -> merge_requests).

stats command:
- Replace 12+ sequential COUNT(*) queries with conditional aggregates
  (COALESCE + SUM + CASE). Documents, dirty_sources,
  pending_discussion_fetches, and pending_dependent_fetches tables each
  scanned once instead of 2-3 times. Measured 1.7x speedup
  (109ms -> 65ms warm cache).
- Switch FTS document count from COUNT(*) on the virtual table to
  COUNT(*) on documents_fts_docsize shadow table (B-tree scan vs FTS5
  virtual table overhead). Measured 19x speedup for that single query.

Database: 61652 docs, 282K notes, 211K discussions, 1.5GB.
2026-02-12 11:21:00 -05:00


Deep Performance Audit Report

Date: 2026-02-12
Branch: perf-audit (e9bacc94)
Parent: 039ab1c2 (master, v0.6.1)


Methodology

  1. Baseline — measured p50/p95 latency for all major commands with warm cache
  2. Profile — used macOS sample profiler and EXPLAIN QUERY PLAN to identify hotspots
  3. Golden output — captured exact numeric outputs before changes as equivalence oracle
  4. One lever per change — each optimization isolated and independently benchmarked
  5. Revert threshold — any optimization <1.1x speedup reverted per audit rules
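
The profiling step above leans on EXPLAIN QUERY PLAN to see which index a query actually uses. A minimal sketch of that reading, using Python's stdlib sqlite3 for brevity; the table and index here are simplified stand-ins, not the real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE notes (id INTEGER PRIMARY KEY, noteable_type TEXT, type TEXT, path TEXT);
    CREATE INDEX idx_notes_type ON notes(type);
""")
# EXPLAIN QUERY PLAN returns rows whose last column describes the scan
# strategy the planner chose for each step.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM notes WHERE type = 'DiffNote'"
).fetchall()
for row in plan:
    print(row[-1])  # e.g. "SEARCH notes USING INDEX idx_notes_type (type=?)"
```

Any step reported as "SCAN" rather than "SEARCH ... USING INDEX" is a candidate hotspot.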

Baseline Measurements (warm cache, release build)

| Command | Latency | Notes |
|---|---|---|
| who --path src/core/db.rs (expert) | 2200ms | Hotspot |
| who --active | 83-93ms | Acceptable |
| who workload | 22ms | Fast |
| stats | 107-112ms | Hotspot |
| search "authentication" | 1030ms | Hotspot (library-level) |
| list issues -n 50 | ~40ms | Fast |

Optimization 1: INDEXED BY for DiffNote Queries

Target: src/cli/commands/who.rs — expert and reviews query paths

Problem: SQLite's query planner chose idx_notes_system (matching 38% of rows, 106K) over idx_notes_diffnote_path_created (matching 9.3% of rows, 26K) for path-filtered DiffNote queries. The partial index, defined WHERE noteable_type = 'MergeRequest' AND type = 'DiffNote', is far more selective, but the planner's cost model never picked it.

Change: Added INDEXED BY idx_notes_diffnote_path_created to all 8 SQL queries across query_expert, query_expert_details, query_reviews, build_path_query (probes 1 & 2), and suffix_probe.
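
A minimal sketch of the INDEXED BY pattern against a partial index, using Python's stdlib sqlite3; the schema is a simplified stand-in for the real notes table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE notes (
        id INTEGER PRIMARY KEY,
        noteable_type TEXT,
        type TEXT,
        path TEXT,
        created_at TEXT
    );
    -- Partial index analogous to idx_notes_diffnote_path_created.
    CREATE INDEX idx_notes_diffnote_path_created
        ON notes(path, created_at)
        WHERE noteable_type = 'MergeRequest' AND type = 'DiffNote';
""")
# INDEXED BY pins the planner to the named index; SQLite raises an error
# instead of silently falling back if the index cannot satisfy the query.
# The WHERE clause must imply the partial index's own WHERE clause.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM notes INDEXED BY idx_notes_diffnote_path_created
    WHERE noteable_type = 'MergeRequest' AND type = 'DiffNote' AND path = ?
""", ("src/core/db.rs",)).fetchall()
print(plan[0][-1])  # e.g. "SEARCH notes USING INDEX idx_notes_diffnote_path_created (path=?)"
```

The fail-loud behavior is the main design trade-off: if a schema migration drops or renames the index, affected queries error out rather than quietly regressing to a slower plan.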

Results:

| Query | Before | After | Speedup |
|---|---|---|---|
| expert (specific path) | 2200ms | 56-58ms | 38x |
| expert (broad path) | 2200ms | 83ms | 26x |
| reviews | 1800ms | 24ms | 75x |

Isomorphism proof: INDEXED BY only changes which index the planner uses, not the query semantics. Same rows matched, same ordering, same output. Verified by golden output comparison across 5+ runs.


Optimization 2: Conditional Aggregates in Stats

Target: src/cli/commands/stats.rs

Problem: 12+ sequential COUNT(*) queries each requiring a full table scan of documents (61K rows). Each scan touched the same pages but couldn't share work.

Changes:

  • Documents: 5 sequential COUNTs -> 1 query with SUM(CASE WHEN ... THEN 1 END)
  • FTS count: SELECT COUNT(*) FROM documents_fts (virtual table, slow) -> SELECT COUNT(*) FROM documents_fts_docsize (shadow B-tree table, 19x faster)
  • Embeddings: 2 queries -> 1 with COUNT(DISTINCT document_id), COUNT(*)
  • Dirty sources: 2 queries -> 1 with conditional aggregates
  • Pending fetches: 2 queries -> 1 each (discussions, dependents)
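
The consolidation pattern can be sketched as follows, using Python's stdlib sqlite3; the kind and truncated columns are invented stand-ins for the real documents schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, kind TEXT, truncated INTEGER);
    INSERT INTO documents (kind, truncated) VALUES
        ('issue', 0), ('issue', 1), ('mr', 0), ('discussion', 0), ('discussion', 1);
""")
# One table scan replaces three separate COUNT(*) queries. COALESCE guards
# the NULL that SUM returns when no row matches a CASE arm.
total, issues, truncated = con.execute("""
    SELECT COUNT(*),
           COALESCE(SUM(CASE WHEN kind = 'issue' THEN 1 END), 0),
           COALESCE(SUM(CASE WHEN truncated = 1 THEN 1 END), 0)
    FROM documents
""").fetchone()
print(total, issues, truncated)  # 5 2 2
```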

Results:

| Metric | Before | After | Speedup |
|---|---|---|---|
| Warm median | 112ms | 66ms | 1.70x |
| Cold | 1220ms | ~700ms | ~1.7x |

Golden output verified:

total:61652, issues:8241, mrs:10018, discussions:43393, truncated:63
fts:61652, embedded:61652, chunks:88161

All values match exactly across before/after runs.

Isomorphism proof: SUM(CASE WHEN x THEN 1 END) counts exactly the rows that COUNT(*) with WHERE x counts; the one edge case is that SUM yields NULL over zero matching rows, which the COALESCE(..., 0) wrapper normalizes to the 0 that COUNT would return. The FTS5 shadow table documents_fts_docsize has exactly one row per indexed document by FTS5's design, so COUNT(*) on it equals the virtual table count.
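
Assuming an FTS5-enabled SQLite build, the shadow-table equivalence can be checked directly; the table below is a toy stand-in for documents_fts (Python stdlib sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE documents_fts USING fts5(body)")
con.executemany("INSERT INTO documents_fts (body) VALUES (?)",
                [("alpha beta",), ("beta gamma",), ("gamma delta",)])
# FTS5 maintains one documents_fts_docsize row per indexed document, so the
# two counts agree, but the shadow table is a plain B-tree and avoids the
# virtual-table machinery entirely.
virtual = con.execute("SELECT COUNT(*) FROM documents_fts").fetchone()[0]
shadow = con.execute("SELECT COUNT(*) FROM documents_fts_docsize").fetchone()[0]
print(virtual, shadow)  # 3 3
```

One caveat: the _docsize shadow table is absent when the FTS table is created with the columnsize=0 option, so this substitution only holds for the default configuration.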


Investigation: Two-Phase FTS Search (REVERTED)

Target: src/search/fts.rs, src/cli/commands/search.rs

Hypothesis: FTS5 snippet() generation is expensive. Splitting search into Phase 1 (score-only MATCH+bm25) and Phase 2 (snippet for filtered results only) should reduce work.

Implementation: Created fetch_fts_snippets() that retrieves snippets only for post-filter document IDs via json_each() join.
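
A sketch of the (since-reverted) Phase 2 shape, using Python's stdlib sqlite3 for brevity; the table and names here are illustrative, not the actual fetch_fts_snippets() code:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs_fts USING fts5(body)")
con.executemany("INSERT INTO docs_fts (rowid, body) VALUES (?, ?)",
                [(1, "alpha beta"), (2, "beta gamma"), (3, "gamma beta")])
# Phase 2: generate snippets only for the rowids that survived filtering,
# passed in as a JSON array and expanded server-side with json_each().
keep = [1, 3]
rows = con.execute("""
    SELECT docs_fts.rowid,
           snippet(docs_fts, 0, '[', ']', '...', 5)
    FROM docs_fts, json_each(?) AS ids
    WHERE docs_fts MATCH 'beta' AND docs_fts.rowid = ids.value
""", (json.dumps(keep),)).fetchall()
print(rows)
```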

Results:

| Metric | Before | After | Improvement |
|---|---|---|---|
| search (limit 20) | 1030ms | 995ms | 3.5% |

Decision: Reverted. Per audit rules, <1.1x speedup does not justify added code complexity.

Root cause: The bottleneck is not snippet generation but MATCH + bm25() scoring itself. Profiling showed strspn (FTS5 tokenizer) and memmove as the top CPU consumers. The same query runs in 30ms on system sqlite3 but 1030ms in rusqlite's bundled SQLite — a ~125x gap despite both being SQLite 3.51.x compiled at -O3.


Library-Level Finding: Bundled SQLite FTS5 Performance

Observation: FTS5 MATCH+bm25 queries are ~125x slower in rusqlite's bundled SQLite vs system sqlite3.

| Environment | Query Time | Notes |
|---|---|---|
| System sqlite3 (macOS) | 30ms (with snippet), 8ms (without) | Same .db file |
| rusqlite bundled | 1030ms | features = ["bundled"], OPT_LEVEL=3 |

Profiler data (macOS sample):

  • Top hotspot: strspn in FTS5 tokenizer
  • Secondary: memmove in FTS5 internals
  • Scaling: roughly linear in result count (limit 5 = 497ms, limit 20 = 995ms, ~33ms per additional result)

Possible causes:

  • Bundled SQLite compiled without platform-specific optimizations (SIMD, etc.)
  • Different memory allocator behavior
  • Missing compile-time tuning flags

Recommendation for future: Investigate switching from features = ["bundled"] to system SQLite linkage, or audit the bundled compile flags in the libsqlite3-sys build script.


Exploration Agent Findings (Informational)

Four parallel exploration agents surveyed the entire codebase. Key findings beyond what was already addressed:

Ingestion Pipeline

  • Serial DB writes in async context (acceptable — rusqlite is synchronous)
  • Label ingestion uses individual inserts (potential batch optimization, low priority)

CLI / GitLab Client

  • GraphQL client recreated per call (client.rs:98-100), defeating connection-pool reuse — minor
  • Double JSON deserialization in GraphQL responses — medium priority
  • N+1 subqueries in list command (list.rs:408-423) — 4 correlated subqueries per row

Search / Embedding

  • No N+1 patterns, no O(n^2) algorithms
  • Chunking is O(n) single-pass with proper UTF-8 safety
  • Ollama concurrency model is sound (parallel HTTP, serial DB writes)

Database / Documents

  • O(n^2) prefix sum in truncation.rs — low traffic path
  • String allocation patterns in extractors — micro-optimization territory

Opportunity Matrix

| Candidate | Impact | Confidence | Effort | Score | Status |
|---|---|---|---|---|---|
| INDEXED BY for DiffNote | Very High | High | Low | 9.0 | Shipped |
| Stats conditional aggregates | Medium | High | Low | 7.0 | Shipped |
| Bundled SQLite FTS5 | Very High | Medium | High | 5.0 | Documented |
| List N+1 subqueries | Medium | Medium | Medium | 4.0 | Backlog |
| GraphQL double deser | Low | Medium | Low | 3.5 | Backlog |
| Truncation O(n^2) | Low | High | Low | 3.0 | Backlog |

Files Modified

| File | Change |
|---|---|
| src/cli/commands/who.rs | INDEXED BY hints on 8 SQL queries |
| src/cli/commands/stats.rs | Conditional aggregates, FTS5 shadow table, merged queries |

Quality Gates

  • All 603 tests pass
  • cargo clippy --all-targets -- -D warnings clean
  • cargo fmt --check clean
  • Golden output verified for both optimizations