1. fetch_open_threads: replace N+1 loop (2 queries per thread) with a
single query using correlated subqueries for note_count and started_by.
2. extract_key_decisions: track consumed notes so the same note is not
matched to multiple events, preventing duplicate decision entries.
3. build_timeline_excerpt_from_pipeline: log tracing::warn on seed/collect
failures instead of silently returning empty timeline.
NEAR is an FTS5 function (NEAR(term1 term2, N)), not an infix operator like
AND/OR/NOT. Passing it through unquoted in Safe mode was incorrect - it would
be treated as a literal term rather than a function call.
Users who need NEAR proximity search should use FtsQueryMode::Raw which
passes the query through verbatim to FTS5.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update test_raw_mode_leading_wildcard_falls_back_to_safe to match the
actual Safe mode behavior: OR is a recognized FTS5 boolean operator and
passes through unquoted, so the expected output is '"*" OR "auth"' not
'"*" "OR" "auth"'. The previous assertion was incorrect since the Safe
mode operator-passthrough logic was added.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FTS5 boolean operators (AND, OR, NOT, NEAR) are case-sensitive uppercase
keywords that must appear unquoted in the query string. Previously, the
user-friendly query builder would double-quote every token, causing
queries like "switch AND health" to search for the literal word "AND"
instead of using it as a boolean conjunction.
Adds a FTS5_OPERATORS constant and checks each token against it before
quoting, allowing natural boolean search syntax to work as expected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add input validation for Raw FTS query mode to prevent expensive or
malformed queries from reaching SQLite FTS5:
- Reject unbalanced double quotes (would cause FTS5 syntax error)
- Reject leading wildcard-only queries ("*", "* OR ...") that trigger
expensive full-table scans
- Reject empty/whitespace-only queries
- Invalid raw input falls back to Safe mode automatically instead of
erroring, so callers never see FTS5 parse failures
The Safe mode already escapes all tokens with double-quote wrapping
and handles embedded quotes via doubling. Raw mode now has a
validation layer on top.
All queries remain parameterized (?1, ?2) — user input never enters
SQL strings directly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change OllamaClient::embed_batch to accept &[&str] instead of
Vec<String>. The EmbedRequest struct now borrows both model name and
input texts, eliminating per-batch cloning of chunk text (up to 32KB
per chunk x 32 chunks per batch). Serialization output is identical
since serde serializes &str and String to the same JSON.
In hybrid search, defer the RrfResult->HybridResult mapping until
after filter+take, so only `limit` items (typically 20) are
constructed instead of up to 1,500 at RECALL_CAP. Also switch
filtered_ids to into_iter() to avoid an extra .copied() pass.
Switch FTS search_fts from prepare() to prepare_cached() for statement
reuse across repeated searches. Benchmarked at ~1.6x faster.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change detection queries (embedding/change_detector.rs):
- Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check
- SQLite now scans embedding_metadata once instead of three times
- Semantically identical: returns docs needing embedding when no
embedding exists, hash changed, or config mismatch
Count queries (cli/commands/count.rs):
- Consolidate 3 separate COUNT queries for issues into single query
using conditional aggregation (CASE WHEN state = 'x' THEN 1)
- Same optimization for MRs: 5 queries reduced to 1
Search filter queries (search/filters.rs):
- Replace N separate EXISTS clauses for label filtering with single
IN() clause with COUNT/GROUP BY HAVING pattern
- For multi-label AND queries, this reduces N subqueries to 1
FTS tokenization (search/fts.rs):
- Replace collect-into-Vec-then-join pattern with direct String building
- Pre-allocate capacity hint for result string
Discussion truncation (documents/truncation.rs):
- Calculate total length without allocating concatenated string first
- Only allocate full string when we know it fits within limit
Embedding pipeline (embedding/pipeline.rs):
- Add Vec::with_capacity hints for chunk work and cleared_docs hashset
- Reduces reallocations during embedding batch processing
Backoff calculation (core/backoff.rs):
- Replace unchecked addition with saturating_add to prevent overflow
- Add test case verifying overflow protection
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)
Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs
Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.
Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Automated formatting and lint corrections from parallel agent work:
- cargo fmt: import reordering (alphabetical), line wrapping to respect
max width, trailing comma normalization, destructuring alignment,
function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
i64::from() instead of `as i64` casts, .clamp() instead of
.max().min() chains, let-chain refactors (if-let with &&),
#[allow(clippy::too_many_arguments)] and
#[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace
No behavioral changes. All existing tests pass unmodified.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements the search module providing three search modes:
- Lexical (FTS5): Full-text search using SQLite FTS5 with safe query
sanitization. User queries are automatically tokenized and wrapped
in proper FTS5 syntax. Supports a "raw" mode for power users who
want direct FTS5 query syntax (NEAR, column filters, etc.).
- Semantic (vector): Embeds the search query via Ollama, then performs
cosine similarity search against stored document embeddings. Results
are deduplicated by doc_id since documents may have multiple chunks.
- Hybrid (default): Executes both lexical and semantic searches in
parallel, then fuses results using Reciprocal Rank Fusion (RRF) with
k=60. This avoids the complexity of score normalization while
producing high-quality merged rankings. Gracefully degrades to
lexical-only when embeddings are unavailable.
Additional components:
- search::filters: Post-retrieval filtering by source_type, author,
project, labels (AND logic), file path prefix, created_after, and
updated_after. Date filters accept relative formats (7d, 2w) and
ISO dates.
- search::rrf: Reciprocal Rank Fusion implementation with configurable
k parameter and optional explain mode that annotates each result
with its component ranks and fusion score breakdown.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>