Commit Graph

19 Commits

Author SHA1 Message Date
Taylor Eernisse
ebf64816c9 fix(search): correct FTS5 raw mode fallback test assertion
Update test_raw_mode_leading_wildcard_falls_back_to_safe to match the
actual Safe mode behavior: OR is a recognized FTS5 boolean operator and
passes through unquoted, so the expected output is '"*" OR "auth"' not
'"*" "OR" "auth"'. The previous assertion was incorrect since the Safe
mode operator-passthrough logic was added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 22:34:01 -05:00
teernisse
59f65b127a fix(search): pass FTS5 boolean operators through unquoted
FTS5 boolean operators (AND, OR, NOT, NEAR) are case-sensitive uppercase
keywords that must appear unquoted in the query string. Previously, the
user-friendly query builder would double-quote every token, causing
queries like "switch AND health" to search for the literal word "AND"
instead of using it as a boolean conjunction.

Adds an FTS5_OPERATORS constant and checks each token against it before
quoting, allowing natural boolean search syntax to work as expected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 14:56:29 -05:00
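
Illustration for 59f65b127a: a minimal sketch of the operator passthrough, not the project's actual builder. The function name to_safe_fts_query and the whole-token matching rule are assumptions; only the FTS5_OPERATORS constant and the quoting behavior come from the commit message.

```rust
/// FTS5 keywords that pass through unquoted. Matching is case-sensitive,
/// so lowercase "and" is still treated as an ordinary search term.
const FTS5_OPERATORS: &[&str] = &["AND", "OR", "NOT", "NEAR"];

/// Hypothetical safe-mode builder: every whitespace-separated token is
/// double-quoted (embedded quotes doubled) unless it is a known operator.
fn to_safe_fts_query(input: &str) -> String {
    let mut out = String::with_capacity(input.len() + 8);
    for token in input.split_whitespace() {
        if !out.is_empty() {
            out.push(' ');
        }
        if FTS5_OPERATORS.contains(&token) {
            // Operator: emit as-is so FTS5 parses it as a keyword.
            out.push_str(token);
        } else {
            // Ordinary term: wrap in quotes and double any embedded quote.
            out.push('"');
            out.push_str(&token.replace('"', "\"\""));
            out.push('"');
        }
    }
    out
}
```

Under these assumptions, `switch AND health` keeps AND as a conjunction, while `switch and health` quotes all three tokens.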
teernisse
f439c42b3d chore: add gitignore for mock-seed, roam CI workflow, formatting
- Add tools/mock-seed/ to .gitignore
- Add .github/workflows/roam.yml CI workflow
- Add .roam/fitness.yaml architectural fitness rules
- Rustfmt formatting fixes in show.rs and vector.rs
- Beads sync

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 13:50:30 -05:00
teernisse
e6771709f1 refactor(core): extract path_resolver module, fix old_path matching in who
Extract shared path resolution logic from who.rs into a new
core::path_resolver module for cross-module reuse. Functions moved:
escape_like, normalize_repo_path, PathQuery, SuffixResult,
build_path_query, suffix_probe. Duplicate escape_like copies removed
from list.rs, project.rs, and filters.rs — all now import from
path_resolver.

Additionally fixes two bugs in query_expert_details() and
query_overlap() where only position_new_path was checked (missing
old_path matches for renamed files) and state filter excluded 'closed'
MRs despite the main scoring query including them with a decay
multiplier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 13:50:14 -05:00
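
Illustration for e6771709f1: one plausible shape for a shared escape_like helper. The real body in core::path_resolver is not shown in the log, so the backslash escape character and the set of escaped characters are assumptions.

```rust
/// Escape SQL LIKE metacharacters so user-supplied path fragments match
/// literally. Assumes the query is written with `ESCAPE '\'`.
fn escape_like(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        // `%` and `_` are LIKE wildcards; `\` is the escape character itself.
        if matches!(ch, '\\' | '%' | '_') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

Keeping a single copy of a helper like this means every LIKE query escapes input the same way, which is the stated motivation for removing the duplicates.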
teernisse
6e55b2470d bugfix: DB column and size issues
2026-02-13 11:11:35 -05:00
Taylor Eernisse
53ef21d653 fix: propagate DB errors instead of silently swallowing them
Replace .unwrap_or(), .ok(), and .filter_map(|r| r.ok()) patterns with
proper error propagation using ? and rusqlite::OptionalExtension where
the query may legitimately return no rows.

Affected areas:
- events_db::count_events: three count queries now propagate errors
  instead of defaulting to (0, 0) on failure
- note_parser::extract_refs_from_system_notes: row iteration errors
  are now propagated instead of silently dropped via filter_map
- note_parser::noteable_type_to_entity_type: unknown types now log a
  debug warning before defaulting to "issue"
- payloads::store_payload/read_payload: use .optional()? instead of
  .ok() to distinguish "no row" from "query failed"
- backoff::compute_next_attempt_at: use .clamp(0, 30) to guard against
  negative attempt_count, not just .min(30)
- search::vector::max_chunks_per_document: returns Result<i64> with
  proper error propagation through .optional()?.flatten()
- embedding::chunk_ids::decode_rowid: promote debug_assert to assert
  since negative rowids indicate data corruption worth failing fast on
- ingestion::dirty_tracker::record_dirty_error: use .optional()? to
  handle missing dirty_sources row gracefully instead of hard error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 10:15:36 -05:00
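
Illustration for 53ef21d653: a sketch of why .clamp(0, 30) is safer than .min(30) for the backoff exponent. BASE_DELAY_MS and the millisecond units are hypothetical; the log only names compute_next_attempt_at and the clamp bounds.

```rust
/// Hypothetical base delay; the project's real constant is not shown in the log.
const BASE_DELAY_MS: i64 = 1_000;

/// Exponential backoff delay for a given attempt count.
fn backoff_delay_ms(attempt_count: i64) -> i64 {
    // clamp(0, 30) bounds both ends: a negative count becomes 0 and a large
    // count caps at 30. With only .min(30), a negative value would reach the
    // shift below and produce an invalid shift amount.
    let exponent = attempt_count.clamp(0, 30) as u32;
    BASE_DELAY_MS.saturating_mul(1_i64 << exponent)
}

/// Next attempt time; saturating_add avoids overflow near i64::MAX.
fn compute_next_attempt_at(now_ms: i64, attempt_count: i64) -> i64 {
    now_ms.saturating_add(backoff_delay_ms(attempt_count))
}
```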
Taylor Eernisse
b168a58134 fix(search): cap vector search k-value and add rowid assertion
The vector search multiplier could grow unbounded on documents with
many chunks, producing enormous k values that cause SQLite to scan
far more rows than necessary. Clamp the multiplier to [8, 200] and
cap k at 10,000 to prevent degenerate performance on large corpora.

Also adds a debug_assert in decode_rowid to catch negative rowids
early — these indicate a bug in the encoding pipeline and should
fail fast rather than silently produce garbage document IDs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 14:34:05 -05:00
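
Illustration for b168a58134: a sketch of the bounded k computation. The bounds [8, 200] and 10,000 come from the commit message; deriving the multiplier directly from the maximum chunk count is an assumption, as are the function and constant names.

```rust
const MIN_MULTIPLIER: usize = 8;
const MAX_MULTIPLIER: usize = 200;
const MAX_K: usize = 10_000;

/// Number of nearest neighbors to request so that, after deduplicating
/// chunks by document, roughly `limit` documents remain.
/// Assumption: the over-fetch multiplier is the max chunk count itself.
fn compute_k(limit: usize, max_chunks_per_document: usize) -> usize {
    let multiplier = max_chunks_per_document.clamp(MIN_MULTIPLIER, MAX_MULTIPLIER);
    // saturating_mul guards the product; min() applies the hard cap.
    limit.saturating_mul(multiplier).min(MAX_K)
}
```

With a limit of 20, a corpus whose largest document has 5,000 chunks requests 4,000 rows instead of 100,000.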
Taylor Eernisse
940a96375a refactor(search): rename --after/--updated-after to --since/--updated-since
The --since naming is more intuitive (matches git log --since) and
consistent with the list commands which already use --since. Renames
the CLI flags, SearchCliFilters fields, SearchFilters fields,
autocorrect registry, and robot-docs manifest. No behavioral change.

Affected paths:
- cli/mod.rs: SearchArgs field + clap attribute rename
- cli/commands/search.rs: SearchCliFilters + run_search plumbing
- search/filters.rs: SearchFilters struct + apply_filters logic
- main.rs: handle_search + robot-docs JSON
- cli/autocorrect.rs: COMMAND_FLAGS entry for search

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 14:33:24 -05:00
Taylor Eernisse
435a208c93 perf: eliminate unnecessary clones and pre-allocate collections
Three micro-optimizations with zero behavioral change:

1. timeline_collect.rs: Reorder format!() before enum construction so
   the owned String moves into the variant directly, eliminating
   .clone() on state, label, and milestone strings in StateChanged,
   LabelAdded/Removed, and MilestoneSet/Removed event paths.

2. pipeline.rs: Use Arc<str> for doc_hash shared across a document's
   chunks instead of cloning the full String per chunk. Also remove
   redundant embed_buf.reserve() since extend_from_slice already
   handles growth and the buffer is reused across iterations.

3. rrf.rs: Pre-allocate HashMap with combined vector+fts result count
   via with_capacity() to avoid rehashing during RRF score accumulation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 08:08:14 -05:00
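
Illustration for 435a208c93: a sketch of items 2 and 3, sharing one hash allocation across chunks via Arc<str> and pre-sizing a HashMap. ChunkWork, build_chunk_work, and new_score_map are hypothetical names, not the project's types.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical per-chunk work item; the real struct is not shown in the log.
struct ChunkWork {
    chunk_index: usize,
    doc_hash: Arc<str>,
}

/// Build work items that all point at one shared copy of the document hash.
fn build_chunk_work(doc_hash: &str, chunk_count: usize) -> Vec<ChunkWork> {
    // One allocation for the hash; each chunk only bumps a reference count.
    let shared: Arc<str> = Arc::from(doc_hash);
    let mut work = Vec::with_capacity(chunk_count);
    for chunk_index in 0..chunk_count {
        work.push(ChunkWork {
            chunk_index,
            doc_hash: Arc::clone(&shared),
        });
    }
    work
}

/// Pre-size the score map for the combined result count to avoid rehashing.
fn new_score_map(vector_len: usize, fts_len: usize) -> HashMap<i64, f64> {
    HashMap::with_capacity(vector_len + fts_len)
}
```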
Taylor Eernisse
5786d7f4b6 fix: defensive hardening — lock release logging, SQLite param guard, vector cast
Three defensive improvements found via peer code review:

1. lock.rs: Lock release errors were silently discarded with `let _ =`.
   If the DELETE failed (disk full, corruption), the lock stayed in the
   database with no diagnostic. Next sync would require --force with no
   clue why. Now logs with error!() including the underlying error message.

2. filters.rs: Dynamic SQL label filter construction had no upper bound
   on bind parameters. With many combined filters, param_idx + labels.len()
   could exceed SQLite's bind-parameter limit (999 by default before
   SQLite 3.32.0), producing an opaque error. Added a guard that caps
   labels at 900 - param_idx.

3. vector.rs: max_chunks_per_document returned i64 which was cast to
   usize. A negative value from a corrupt database would wrap to a huge
   number, causing overflow in the multiplier calculation. Now clamped
   to .max(1) and cast via unsigned_abs().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 07:55:54 -05:00
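
Illustration for 5786d7f4b6: a sketch of items 2 and 3. The 900 budget, .max(1), and unsigned_abs() come from the commit message; the function names and the use of saturating_sub are assumptions.

```rust
/// Bind-parameter budget used by the guard (kept below SQLite's limit).
const PARAM_BUDGET: usize = 900;

/// Keep only as many labels as still fit in the parameter budget.
fn cap_labels(labels: &[String], param_idx: usize) -> &[String] {
    // saturating_sub avoids underflow if earlier filters already used
    // more than the whole budget.
    let max_labels = PARAM_BUDGET.saturating_sub(param_idx);
    &labels[..labels.len().min(max_labels)]
}

/// Convert a chunk count read from the database into a usize without
/// letting a negative (corrupt) value wrap to a huge number.
fn safe_chunk_count(raw: i64) -> usize {
    // max(1) makes the value positive, so unsigned_abs() is a lossless cast.
    raw.max(1).unsigned_abs() as usize
}
```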
Taylor Eernisse
8cf14fb69b feat(search): sanitize raw FTS5 queries with safe fallback
Add input validation for Raw FTS query mode to prevent expensive or
malformed queries from reaching SQLite FTS5:

- Reject unbalanced double quotes (would cause FTS5 syntax error)
- Reject leading wildcard-only queries ("*", "* OR ...") that trigger
  expensive full-table scans
- Reject empty/whitespace-only queries
- Invalid raw input falls back to Safe mode automatically instead of
  erroring, so callers never see FTS5 parse failures

The Safe mode already escapes all tokens with double-quote wrapping
and handles embedded quotes via doubling. Raw mode now has a
validation layer on top.

All queries remain parameterized (?1, ?2) — user input never enters
SQL strings directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 22:42:17 -05:00
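
Illustration for 8cf14fb69b: a sketch of the three rejection rules for Raw mode. The function name is hypothetical, and treating "leading wildcard" as "the first whitespace-separated token is exactly *" is an assumption about how the check is written.

```rust
/// Returns true if a raw FTS5 query passes the basic sanity checks.
/// A caller would fall back to Safe mode when this returns false.
fn is_valid_raw_query(raw: &str) -> bool {
    let trimmed = raw.trim();

    // Rule 1: empty or whitespace-only input.
    if trimmed.is_empty() {
        return false;
    }

    // Rule 2: an odd number of double quotes means one is unbalanced.
    if trimmed.matches('"').count() % 2 != 0 {
        return false;
    }

    // Rule 3: a bare leading wildcard such as "*" or "* OR ...".
    if trimmed.split_whitespace().next() == Some("*") {
        return false;
    }

    true
}
```

A prefix query such as `auth*` still passes, since only a bare leading `*` is rejected here.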
Taylor Eernisse
3e9cf2358e perf(search+embed): zero-copy embedding API and deferred RRF mapping
Change OllamaClient::embed_batch to accept &[&str] instead of
Vec<String>. The EmbedRequest struct now borrows both model name and
input texts, eliminating per-batch cloning of chunk text (up to 32KB
per chunk x 32 chunks per batch). Serialization output is identical
since serde serializes &str and String to the same JSON.

In hybrid search, defer the RrfResult->HybridResult mapping until
after filter+take, so only `limit` items (typically 20) are
constructed instead of up to 1,500 at RECALL_CAP. Also switch
filtered_ids to into_iter() to avoid an extra .copied() pass.

Switch FTS search_fts from prepare() to prepare_cached() for statement
reuse across repeated searches. Benchmarked at ~1.6x faster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-05 17:35:53 -05:00
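
Illustration for 3e9cf2358e: a sketch of the deferred mapping in hybrid search. The struct fields and the finalize function are hypothetical; the point is only the ordering of filter, take, and map.

```rust
use std::collections::HashSet;

/// Hypothetical shapes; the real RrfResult and HybridResult are not shown.
struct RrfResult {
    document_id: i64,
    rrf_score: f64,
}

struct HybridResult {
    document_id: i64,
    score: f64,
}

/// Filter and truncate first, then build output structs for survivors only.
fn finalize(ranked: Vec<RrfResult>, allowed: &HashSet<i64>, limit: usize) -> Vec<HybridResult> {
    ranked
        .into_iter()
        .filter(|r| allowed.contains(&r.document_id))
        .take(limit)
        // Iterators are lazy, so this closure runs at most `limit` times.
        .map(|r| HybridResult {
            document_id: r.document_id,
            score: r.rrf_score,
        })
        .collect()
}
```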
Taylor Eernisse
72f1cafdcf perf: Optimize SQL queries and reduce allocations in hot paths
Change detection queries (embedding/change_detector.rs):
- Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check
- SQLite now scans embedding_metadata once instead of three times
- Semantically identical: returns docs needing embedding when no
  embedding exists, hash changed, or config mismatch

Count queries (cli/commands/count.rs):
- Consolidate 3 separate COUNT queries for issues into single query
  using conditional aggregation (CASE WHEN state = 'x' THEN 1)
- Same optimization for MRs: 5 queries reduced to 1

Search filter queries (search/filters.rs):
- Replace N separate EXISTS clauses for label filtering with single
  IN() clause with COUNT/GROUP BY HAVING pattern
- For multi-label AND queries, this reduces N subqueries to 1

FTS tokenization (search/fts.rs):
- Replace collect-into-Vec-then-join pattern with direct String building
- Pre-allocate capacity hint for result string

Discussion truncation (documents/truncation.rs):
- Calculate total length without allocating concatenated string first
- Only allocate full string when we know it fits within limit

Embedding pipeline (embedding/pipeline.rs):
- Add Vec::with_capacity hints for chunk work and cleared_docs hashset
- Reduces reallocations during embedding batch processing

Backoff calculation (core/backoff.rs):
- Replace unchecked addition with saturating_add to prevent overflow
- Add test case verifying overflow protection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 11:21:28 -05:00
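
Illustration for 72f1cafdcf: a sketch of the discussion truncation idea, measuring the joined length before allocating. The function names, the separator parameter, and the Option return are assumptions, not the project's API.

```rust
/// Byte length the notes would have if joined by `separator`,
/// computed without building the joined string.
fn joined_len(notes: &[&str], separator: &str) -> usize {
    let text: usize = notes.iter().map(|n| n.len()).sum();
    // n notes need n - 1 separators; saturating_sub handles the empty case.
    text + separator.len() * notes.len().saturating_sub(1)
}

/// Allocate the full string only when it is known to fit within the limit.
fn join_if_fits(notes: &[&str], separator: &str, max_bytes: usize) -> Option<String> {
    if joined_len(notes, separator) > max_bytes {
        return None;
    }
    Some(notes.join(separator))
}
```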
Taylor Eernisse
65583ed5d6 refactor: Remove redundant doc comments throughout codebase
Removes module-level doc comments (//! lines) and excessive inline doc
comments that were duplicating information already evident from:
- Function/struct names (self-documenting code)
- Type signatures (the what is clear from types)
- Implementation context (the how is clear from code)

Affected modules:
- cli/* - Removed command descriptions duplicating clap help text
- core/* - Removed module headers and obvious function docs
- documents/* - Removed extractor/regenerator/truncation docs
- embedding/* - Removed pipeline and chunking docs
- gitlab/* - Removed client and transformer docs (kept type definitions)
- ingestion/* - Removed orchestrator and ingestion docs
- search/* - Removed FTS and vector search docs

Philosophy: Code should be self-documenting. Comments should explain
"why" (business decisions, non-obvious constraints) not "what" (which
the code itself shows). This change reduces noise and maintenance burden
while keeping the codebase just as understandable.

Retains comments for:
- Non-obvious business logic
- Important safety invariants
- Complex algorithm explanations
- Public API boundaries where generated docs matter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-05 00:04:32 -05:00
Taylor Eernisse
1fdc6d03cc fix: Savepoint leak in embedding pipeline, atomic fail_job, RRF dedup
Three correctness fixes found during peer code review:

Embedding pipeline savepoint leak (HIGH severity):
The SAVEPOINT embed_page / RELEASE embed_page pattern had ~10 `?`
propagation points between them. Any error from record_embedding_error,
clear_document_embeddings, or store_embedding would exit the function
without rolling back, leaving the SQLite connection in a broken
transactional state and causing cascading failures for the rest of the
session. Fixed by extracting page processing into `embed_page()` and
wrapping with explicit rollback-on-error handling.

Dependent queue fail_job race (MEDIUM severity):
fail_job performed a SELECT followed by a separate UPDATE on the
attempts counter without a transaction. Under concurrent lock
reclamation, the attempts value could be read stale. Replaced with a
single atomic UPDATE that increments attempts and computes exponential
backoff entirely in SQL, also halving DB round-trips. Added explicit
error when the job no longer exists.

RRF duplicate document score inflation (MEDIUM severity):
If a retriever returned the same document_id multiple times, the RRF
score accumulated multiple rank contributions while the rank only
recorded the first occurrence. Moved the score accumulation inside the
`if is_none` guard so only the first occurrence per list contributes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 14:16:38 -05:00
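
Illustration for 1fdc6d03cc: a sketch of Reciprocal Rank Fusion in which only the first occurrence of a document per list contributes. It uses a per-list HashSet rather than the `is_none` guard the commit describes, and assumes 1-based ranks with k = 60; rrf_fuse is a hypothetical name.

```rust
use std::collections::{HashMap, HashSet};

const RRF_K: f64 = 60.0;

/// Fuse ranked lists of document IDs into (document_id, score) pairs,
/// sorted by descending score, with ties broken by ascending ID.
fn rrf_fuse(lists: &[Vec<i64>]) -> Vec<(i64, f64)> {
    let capacity: usize = lists.iter().map(Vec::len).sum();
    let mut scores: HashMap<i64, f64> = HashMap::with_capacity(capacity);

    for list in lists {
        let mut seen = HashSet::with_capacity(list.len());
        for (idx, &doc_id) in list.iter().enumerate() {
            // insert() returns false for a repeat, so later duplicates in the
            // same list add nothing to the score.
            if !seen.insert(doc_id) {
                continue;
            }
            let rank = (idx + 1) as f64;
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }

    let mut fused: Vec<(i64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Without the duplicate check, a document returned twice by one retriever would collect two rank contributions from that retriever alone.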
teernisse
86a51cddef fix: Project-scoped job claiming, structured rate-limit logging, RRF total_cmp
Targeted fixes across multiple subsystems:

dependent_queue:
- Add project_id parameter to claim_jobs() for project-scoped job claiming,
  preventing cross-project job theft during concurrent multi-project ingestion
- Add project_id parameter to count_pending_jobs() with optional scoping
  (None returns global counts, Some(pid) returns per-project counts)

gitlab/client:
- Downgrade rate-limit log from warn to info (429s are expected operational
  behavior, not warnings) and add structured fields (path, status_code)
  for better log filtering and aggregation

gitlab/transformers/discussion:
- Add tracing::warn on invalid timestamp parse instead of silent fallback
  to epoch 0, making data quality issues visible in logs

ingestion/merge_requests:
- Remove duplicate doc comment on upsert_label_tx

search/rrf:
- Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting,
  eliminating the NaN edge case entirely (total_cmp treats NaN consistently)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 13:39:13 -05:00
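
Illustration for 86a51cddef: a sketch of the search/rrf change. The tuple layout and function name are hypothetical; f64::total_cmp is the standard-library method the commit refers to.

```rust
/// Sort (document_id, score) pairs by descending score.
fn sort_by_score_desc(results: &mut [(i64, f64)]) {
    // partial_cmp returns None when either side is NaN, which forces an
    // arbitrary fallback. total_cmp defines a total order over all f64
    // values, so the comparator never needs one.
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
}
```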
Taylor Eernisse
a50fc78823 style: Apply cargo fmt and clippy fixes across codebase
Automated formatting and lint corrections from parallel agent work:

- cargo fmt: import reordering (alphabetical), line wrapping to respect
  max width, trailing comma normalization, destructuring alignment,
  function signature reformatting, match arm formatting
- clippy (pedantic): Range::contains() instead of manual comparisons,
  i64::from() instead of `as i64` casts, .clamp() instead of
  .max().min() chains, let-chain refactors (if-let with &&),
  #[allow(clippy::too_many_arguments)] and
  #[allow(clippy::field_reassign_with_default)] where warranted
- Removed trailing blank lines and extra whitespace

No behavioral changes. All existing tests pass unmodified.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 13:01:59 -05:00
Taylor Eernisse
7d07f95d4c fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures
Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:

- Config drift detection: find_pending_documents and
  count_pending_documents now take model_name and compare
  chunk_max_bytes, model, and dims against stored metadata. Documents
  embedded with stale config are automatically re-queued.

- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
  are skipped with a sentinel error recorded in embedding_metadata,
  preventing both rowid collision and infinite re-processing loops.

- Deferred clearing: old embeddings are no longer cleared before
  attempting new ones. clear_document_embeddings is deferred until the
  first successful chunk embedding, so if all chunks fail the document
  retains its previous embeddings rather than losing all data.

- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
  so a crash mid-page rolls back atomically instead of leaving partial
  state (cleared embeddings with no replacements).

- Per-chunk retry on context overflow: when a batch fails with a
  context-length error, each chunk is retried individually so one
  oversized chunk doesn't poison the entire batch.

- Adaptive dedup in vector search: replaces the static 3x over-fetch
  multiplier with a dynamic one based on actual max chunks per document
  (using the new chunk_count column with a fallback COUNT query for
  pre-migration data). Also replaces partial_cmp with total_cmp for
  f64 distance sorting.

- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
  embedding_metadata to support config drift detection and adaptive
  dedup without runtime queries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:35:08 -05:00
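
Illustration for 7d07f95d4c: a sketch of why a document with CHUNK_ROWID_MULTIPLIER or more chunks must be skipped. The encoding document_id * multiplier + chunk_index and the value 1,000 are assumptions; the log names the constant but shows neither its value nor the formula.

```rust
/// Hypothetical value; the real CHUNK_ROWID_MULTIPLIER is not shown in the log.
const CHUNK_ROWID_MULTIPLIER: i64 = 1_000;

/// Pack (document_id, chunk_index) into one rowid, or None if the chunk
/// index would spill into the next document's rowid range.
fn encode_rowid(document_id: i64, chunk_index: i64) -> Option<i64> {
    if !(0..CHUNK_ROWID_MULTIPLIER).contains(&chunk_index) {
        return None;
    }
    document_id
        .checked_mul(CHUNK_ROWID_MULTIPLIER)?
        .checked_add(chunk_index)
}

/// Unpack a rowid. A negative rowid can only come from corrupt data.
fn decode_rowid(rowid: i64) -> (i64, i64) {
    assert!(rowid >= 0, "negative rowid indicates data corruption");
    (rowid / CHUNK_ROWID_MULTIPLIER, rowid % CHUNK_ROWID_MULTIPLIER)
}
```

Under this encoding, chunk 1,000 of document 42 would have the same rowid as chunk 0 of document 43, which is the collision the overflow guard prevents.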
Taylor Eernisse
d5bdb24b0f feat(search): Add hybrid search engine with FTS5, vector, and RRF fusion
Implements the search module providing three search modes:

- Lexical (FTS5): Full-text search using SQLite FTS5 with safe query
  sanitization. User queries are automatically tokenized and wrapped
  in proper FTS5 syntax. Supports a "raw" mode for power users who
  want direct FTS5 query syntax (NEAR, column filters, etc.).

- Semantic (vector): Embeds the search query via Ollama, then performs
  cosine similarity search against stored document embeddings. Results
  are deduplicated by doc_id since documents may have multiple chunks.

- Hybrid (default): Executes both lexical and semantic searches in
  parallel, then fuses results using Reciprocal Rank Fusion (RRF) with
  k=60. This avoids the complexity of score normalization while
  producing high-quality merged rankings. Gracefully degrades to
  lexical-only when embeddings are unavailable.

Additional components:

- search::filters: Post-retrieval filtering by source_type, author,
  project, labels (AND logic), file path prefix, created_after, and
  updated_after. Date filters accept relative formats (7d, 2w) and
  ISO dates.

- search::rrf: Reciprocal Rank Fusion implementation with configurable
  k parameter and optional explain mode that annotates each result
  with its component ranks and fusion score breakdown.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:46:42 -05:00
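
Illustration for d5bdb24b0f: a sketch of parsing the relative date formats mentioned for search::filters (7d, 2w). The function name and the choice to return a day count are assumptions; ISO dates would be handled by a separate path not shown here.

```rust
/// Parse a relative duration such as "7d" or "2w" into a number of days.
/// Returns None for anything else, including ISO dates.
fn parse_relative_days(input: &str) -> Option<u32> {
    let input = input.trim();
    // The last character is the unit; everything before it is the count.
    let unit = input.chars().last()?;
    let number = &input[..input.len() - unit.len_utf8()];
    let n: u32 = number.parse().ok()?;
    match unit {
        'd' => Some(n),
        'w' => n.checked_mul(7),
        _ => None,
    }
}
```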