gitlore

Author	SHA1	Message	Date
Taylor Eernisse	ebf64816c9	fix(search): correct FTS5 raw mode fallback test assertion Update test_raw_mode_leading_wildcard_falls_back_to_safe to match the actual Safe mode behavior: OR is a recognized FTS5 boolean operator and passes through unquoted, so the expected output is '"*" OR "auth"' not '"*" "OR" "auth"'. The previous assertion was incorrect since the Safe mode operator-passthrough logic was added. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 22:34:01 -05:00
teernisse	59f65b127a	fix(search): pass FTS5 boolean operators through unquoted FTS5 boolean operators (AND, OR, NOT, NEAR) are case-sensitive uppercase keywords that must appear unquoted in the query string. Previously, the user-friendly query builder would double-quote every token, causing queries like "switch AND health" to search for the literal word "AND" instead of using it as a boolean conjunction. Adds a FTS5_OPERATORS constant and checks each token against it before quoting, allowing natural boolean search syntax to work as expected. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 14:56:29 -05:00
teernisse	f439c42b3d	chore: add gitignore for mock-seed, roam CI workflow, formatting - Add tools/mock-seed/ to .gitignore - Add .github/workflows/roam.yml CI workflow - Add .roam/fitness.yaml architectural fitness rules - Rustfmt formatting fixes in show.rs and vector.rs - Beads sync Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 13:50:30 -05:00
teernisse	e6771709f1	refactor(core): extract path_resolver module, fix old_path matching in who Extract shared path resolution logic from who.rs into a new core::path_resolver module for cross-module reuse. Functions moved: escape_like, normalize_repo_path, PathQuery, SuffixResult, build_path_query, suffix_probe. Duplicate escape_like copies removed from list.rs, project.rs, and filters.rs — all now import from path_resolver. Additionally fixes two bugs in query_expert_details() and query_overlap() where only position_new_path was checked (missing old_path matches for renamed files) and state filter excluded 'closed' MRs despite the main scoring query including them with a decay multiplier. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 13:50:14 -05:00
teernisse	6e55b2470d	bugfix: DB column and size issues	2026-02-13 11:11:35 -05:00
Taylor Eernisse	53ef21d653	fix: propagate DB errors instead of silently swallowing them Replace .unwrap_or(), .ok(), and .filter_map(|r| r.ok()) patterns with proper error propagation using ? and rusqlite::OptionalExtension where the query may legitimately return no rows. Affected areas: - events_db::count_events: three count queries now propagate errors instead of defaulting to (0, 0) on failure - note_parser::extract_refs_from_system_notes: row iteration errors are now propagated instead of silently dropped via filter_map - note_parser::noteable_type_to_entity_type: unknown types now log a debug warning before defaulting to "issue" - payloads::store_payload/read_payload: use .optional()? instead of .ok() to distinguish "no row" from "query failed" - backoff::compute_next_attempt_at: use .clamp(0, 30) to guard against negative attempt_count, not just .min(30) - search::vector::max_chunks_per_document: returns Result<i64> with proper error propagation through .optional()?.flatten() - embedding::chunk_ids::decode_rowid: promote debug_assert to assert since negative rowids indicate data corruption worth failing fast on - ingestion::dirty_tracker::record_dirty_error: use .optional()? to handle missing dirty_sources row gracefully instead of hard error Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 10:15:36 -05:00
Taylor Eernisse	b168a58134	fix(search): cap vector search k-value and add rowid assertion The vector search multiplier could grow unbounded on documents with many chunks, producing enormous k values that cause SQLite to scan far more rows than necessary. Clamp the multiplier to [8, 200] and cap k at 10,000 to prevent degenerate performance on large corpora. Also adds a debug_assert in decode_rowid to catch negative rowids early — these indicate a bug in the encoding pipeline and should fail fast rather than silently produce garbage document IDs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 14:34:05 -05:00
Taylor Eernisse	940a96375a	refactor(search): rename --after/--updated-after to --since/--updated-since The --since naming is more intuitive (matches git log --since) and consistent with the list commands which already use --since. Renames the CLI flags, SearchCliFilters fields, SearchFilters fields, autocorrect registry, and robot-docs manifest. No behavioral change. Affected paths: - cli/mod.rs: SearchArgs field + clap attribute rename - cli/commands/search.rs: SearchCliFilters + run_search plumbing - search/filters.rs: SearchFilters struct + apply_filters logic - main.rs: handle_search + robot-docs JSON - cli/autocorrect.rs: COMMAND_FLAGS entry for search Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 14:33:24 -05:00
Taylor Eernisse	435a208c93	perf: eliminate unnecessary clones and pre-allocate collections Three micro-optimizations with zero behavioral change: 1. timeline_collect.rs: Reorder format!() before enum construction so the owned String moves into the variant directly, eliminating .clone() on state, label, and milestone strings in StateChanged, LabelAdded/Removed, and MilestoneSet/Removed event paths. 2. pipeline.rs: Use Arc<str> for doc_hash shared across a document's chunks instead of cloning the full String per chunk. Also remove redundant embed_buf.reserve() since extend_from_slice already handles growth and the buffer is reused across iterations. 3. rrf.rs: Pre-allocate HashMap with combined vector+fts result count via with_capacity() to avoid rehashing during RRF score accumulation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 08:08:14 -05:00
Taylor Eernisse	5786d7f4b6	fix: defensive hardening — lock release logging, SQLite param guard, vector cast Three defensive improvements found via peer code review: 1. lock.rs: Lock release errors were silently discarded with `let _ =`. If the DELETE failed (disk full, corruption), the lock stayed in the database with no diagnostic. Next sync would require --force with no clue why. Now logs with error!() including the underlying error message. 2. filters.rs: Dynamic SQL label filter construction had no upper bound on bind parameters. With many combined filters, param_idx + labels.len() could exceed SQLite's 999-parameter limit, producing an opaque error. Added a guard that caps labels at 900 - param_idx. 3. vector.rs: max_chunks_per_document returned i64 which was cast to usize. A negative value from a corrupt database would wrap to a huge number, causing overflow in the multiplier calculation. Now clamped to .max(1) and cast via unsigned_abs(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-08 07:55:54 -05:00
Taylor Eernisse	8cf14fb69b	feat(search): sanitize raw FTS5 queries with safe fallback Add input validation for Raw FTS query mode to prevent expensive or malformed queries from reaching SQLite FTS5: - Reject unbalanced double quotes (would cause FTS5 syntax error) - Reject leading wildcard-only queries ("*", "* OR ...") that trigger expensive full-table scans - Reject empty/whitespace-only queries - Invalid raw input falls back to Safe mode automatically instead of erroring, so callers never see FTS5 parse failures The Safe mode already escapes all tokens with double-quote wrapping and handles embedded quotes via doubling. Raw mode now has a validation layer on top. All queries remain parameterized (?1, ?2) — user input never enters SQL strings directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 22:42:17 -05:00
Taylor Eernisse	3e9cf2358e	perf(search+embed): zero-copy embedding API and deferred RRF mapping Change OllamaClient::embed_batch to accept &[&str] instead of Vec<String>. The EmbedRequest struct now borrows both model name and input texts, eliminating per-batch cloning of chunk text (up to 32KB per chunk x 32 chunks per batch). Serialization output is identical since serde serializes &str and String to the same JSON. In hybrid search, defer the RrfResult->HybridResult mapping until after filter+take, so only `limit` items (typically 20) are constructed instead of up to 1,500 at RECALL_CAP. Also switch filtered_ids to into_iter() to avoid an extra .copied() pass. Switch FTS search_fts from prepare() to prepare_cached() for statement reuse across repeated searches. Benchmarked at ~1.6x faster. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-05 17:35:53 -05:00
Taylor Eernisse	72f1cafdcf	perf: Optimize SQL queries and reduce allocations in hot paths Change detection queries (embedding/change_detector.rs): - Replace triple-EXISTS subquery pattern with LEFT JOIN + NULL check - SQLite now scans embedding_metadata once instead of three times - Semantically identical: returns docs needing embedding when no embedding exists, hash changed, or config mismatch Count queries (cli/commands/count.rs): - Consolidate 3 separate COUNT queries for issues into single query using conditional aggregation (CASE WHEN state = 'x' THEN 1) - Same optimization for MRs: 5 queries reduced to 1 Search filter queries (search/filters.rs): - Replace N separate EXISTS clauses for label filtering with single IN() clause with COUNT/GROUP BY HAVING pattern - For multi-label AND queries, this reduces N subqueries to 1 FTS tokenization (search/fts.rs): - Replace collect-into-Vec-then-join pattern with direct String building - Pre-allocate capacity hint for result string Discussion truncation (documents/truncation.rs): - Calculate total length without allocating concatenated string first - Only allocate full string when we know it fits within limit Embedding pipeline (embedding/pipeline.rs): - Add Vec::with_capacity hints for chunk work and cleared_docs hashset - Reduces reallocations during embedding batch processing Backoff calculation (core/backoff.rs): - Replace unchecked addition with saturating_add to prevent overflow - Add test case verifying overflow protection Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 11:21:28 -05:00
Taylor Eernisse	65583ed5d6	refactor: Remove redundant doc comments throughout codebase Removes module-level doc comments (//! lines) and excessive inline doc comments that were duplicating information already evident from: - Function/struct names (self-documenting code) - Type signatures (the what is clear from types) - Implementation context (the how is clear from code) Affected modules: - cli/* - Removed command descriptions duplicating clap help text - core/* - Removed module headers and obvious function docs - documents/* - Removed extractor/regenerator/truncation docs - embedding/* - Removed pipeline and chunking docs - gitlab/* - Removed client and transformer docs (kept type definitions) - ingestion/* - Removed orchestrator and ingestion docs - search/* - Removed FTS and vector search docs Philosophy: Code should be self-documenting. Comments should explain "why" (business decisions, non-obvious constraints) not "what" (which the code itself shows). This change reduces noise and maintenance burden while keeping the codebase just as understandable. Retains comments for: - Non-obvious business logic - Important safety invariants - Complex algorithm explanations - Public API boundaries where generated docs matter Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 00:04:32 -05:00
Taylor Eernisse	1fdc6d03cc	fix: Savepoint leak in embedding pipeline, atomic fail_job, RRF dedup Three correctness fixes found during peer code review: Embedding pipeline savepoint leak (HIGH severity): The SAVEPOINT embed_page / RELEASE embed_page pattern had ~10 `?` propagation points between them. Any error from record_embedding_error, clear_document_embeddings, or store_embedding would exit the function without rolling back, leaving the SQLite connection in a broken transactional state and causing cascading failures for the rest of the session. Fixed by extracting page processing into `embed_page()` and wrapping with explicit rollback-on-error handling. Dependent queue fail_job race (MEDIUM severity): fail_job performed a SELECT followed by a separate UPDATE on the attempts counter without a transaction. Under concurrent lock reclamation, the attempts value could be read stale. Replaced with a single atomic UPDATE that increments attempts and computes exponential backoff entirely in SQL, also halving DB round-trips. Added explicit error when the job no longer exists. RRF duplicate document score inflation (MEDIUM severity): If a retriever returned the same document_id multiple times, the RRF score accumulated multiple rank contributions while the rank only recorded the first occurrence. Moved the score accumulation inside the `if is_none` guard so only the first occurrence per list contributes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 14:16:38 -05:00
teernisse	86a51cddef	fix: Project-scoped job claiming, structured rate-limit logging, RRF total_cmp Targeted fixes across multiple subsystems: dependent_queue: - Add project_id parameter to claim_jobs() for project-scoped job claiming, preventing cross-project job theft during concurrent multi-project ingestion - Add project_id parameter to count_pending_jobs() with optional scoping (None returns global counts, Some(pid) returns per-project counts) gitlab/client: - Downgrade rate-limit log from warn to info (429s are expected operational behavior, not warnings) and add structured fields (path, status_code) for better log filtering and aggregation gitlab/transformers/discussion: - Add tracing::warn on invalid timestamp parse instead of silent fallback to epoch 0, making data quality issues visible in logs ingestion/merge_requests: - Remove duplicate doc comment on upsert_label_tx search/rrf: - Replace partial_cmp().unwrap_or() with total_cmp() for f64 sorting, eliminating the NaN edge case entirely (total_cmp treats NaN consistently) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 13:39:13 -05:00
Taylor Eernisse	a50fc78823	style: Apply cargo fmt and clippy fixes across codebase Automated formatting and lint corrections from parallel agent work: - cargo fmt: import reordering (alphabetical), line wrapping to respect max width, trailing comma normalization, destructuring alignment, function signature reformatting, match arm formatting - clippy (pedantic): Range::contains() instead of manual comparisons, i64::from() instead of `as i64` casts, .clamp() instead of .max().min() chains, let-chain refactors (if-let with &&), #[allow(clippy::too_many_arguments)] and #[allow(clippy::field_reassign_with_default)] where warranted - Removed trailing blank lines and extra whitespace No behavioral changes. All existing tests pass unmodified. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 13:01:59 -05:00
Taylor Eernisse	7d07f95d4c	fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from 500 to 200 to stay within nomic-embed-text's 8,192-token context window. This commit addresses all downstream consequences of that reduction: - Config drift detection: find_pending_documents and count_pending_documents now take model_name and compare chunk_max_bytes, model, and dims against stored metadata. Documents embedded with stale config are automatically re-queued. - Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks are skipped with a sentinel error recorded in embedding_metadata, preventing both rowid collision and infinite re-processing loops. - Deferred clearing: old embeddings are no longer cleared before attempting new ones. clear_document_embeddings is deferred until the first successful chunk embedding, so if all chunks fail the document retains its previous embeddings rather than losing all data. - Savepoints: each page of DB writes is wrapped in a SQLite savepoint so a crash mid-page rolls back atomically instead of leaving partial state (cleared embeddings with no replacements). - Per-chunk retry on context overflow: when a batch fails with a context-length error, each chunk is retried individually so one oversized chunk doesn't poison the entire batch. - Adaptive dedup in vector search: replaces the static 3x over-fetch multiplier with a dynamic one based on actual max chunks per document (using the new chunk_count column with a fallback COUNT query for pre-migration data). Also replaces partial_cmp with total_cmp for f64 distance sorting. - Stores chunk_max_bytes and chunk_count (on sentinel rows) in embedding_metadata to support config drift detection and adaptive dedup without runtime queries. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 09:35:08 -05:00
Taylor Eernisse	d5bdb24b0f	feat(search): Add hybrid search engine with FTS5, vector, and RRF fusion Implements the search module providing three search modes: - Lexical (FTS5): Full-text search using SQLite FTS5 with safe query sanitization. User queries are automatically tokenized and wrapped in proper FTS5 syntax. Supports a "raw" mode for power users who want direct FTS5 query syntax (NEAR, column filters, etc.). - Semantic (vector): Embeds the search query via Ollama, then performs cosine similarity search against stored document embeddings. Results are deduplicated by doc_id since documents may have multiple chunks. - Hybrid (default): Executes both lexical and semantic searches in parallel, then fuses results using Reciprocal Rank Fusion (RRF) with k=60. This avoids the complexity of score normalization while producing high-quality merged rankings. Gracefully degrades to lexical-only when embeddings are unavailable. Additional components: - search::filters: Post-retrieval filtering by source_type, author, project, labels (AND logic), file path prefix, created_after, and updated_after. Date filters accept relative formats (7d, 2w) and ISO dates. - search::rrf: Reciprocal Rank Fusion implementation with configurable k parameter and optional explain mode that annotates each result with its component ranks and fusion score breakdown. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-30 15:46:42 -05:00
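
The RRF behavior described in commits d5bdb24b0f, 1fdc6d03cc, 86a51cddef, and 435a208c93 can be illustrated with a minimal sketch. This is not the repository's rrf.rs: the function name `rrf_fuse`, the use of `i64` document ids, the `HashSet`-based duplicate check, and the tie-break on id are all assumptions made for illustration. Only the ideas come from the commit messages: k=60, one contribution per document per list, a pre-sized HashMap, and total_cmp for sorting.

```rust
use std::collections::{HashMap, HashSet};

/// Fuse ranked lists of document ids with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank) per document, with rank starting at 1.
/// Hypothetical helper; the real signature in the codebase may differ.
fn rrf_fuse(lists: &[Vec<i64>], k: f64) -> Vec<(i64, f64)> {
    // Pre-size the map with the combined result count to avoid rehashing.
    let total: usize = lists.iter().map(Vec::len).sum();
    let mut scores: HashMap<i64, f64> = HashMap::with_capacity(total);

    for list in lists {
        let mut seen: HashSet<i64> = HashSet::with_capacity(list.len());
        for (idx, &doc_id) in list.iter().enumerate() {
            // Only the first occurrence of a document in a list contributes,
            // so a retriever that repeats an id cannot inflate its score.
            if seen.insert(doc_id) {
                *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + idx as f64 + 1.0);
            }
        }
    }

    let mut fused: Vec<(i64, f64)> = scores.into_iter().collect();
    // total_cmp defines a total order over f64 (including NaN);
    // ties are broken on document id (an assumption) for deterministic output.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Under these assumptions, `rrf_fuse(&[vec![1, 2, 2, 3], vec![2, 4, 1]], 60.0)` ranks the documents 2, 1, 4, 3: documents 1 and 2 appear in both lists, and the repeated 2 in the first list is counted once.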

19 Commits
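
The Safe mode query building described in commits 59f65b127a and 8cf14fb69b (quote every token, double embedded quotes, pass uppercase FTS5 operators through unquoted) might look roughly like the sketch below. The constant name FTS5_OPERATORS appears in the commit message; the function name `to_safe_fts_query` and the whitespace tokenization are assumptions, and the actual search/fts.rs may handle more cases.

```rust
/// FTS5 boolean keywords. They are case-sensitive (uppercase only)
/// and must appear unquoted to act as operators.
const FTS5_OPERATORS: [&str; 4] = ["AND", "OR", "NOT", "NEAR"];

/// Build a "safe" FTS5 MATCH string from free-form user input.
/// Hypothetical helper: every whitespace-separated token is wrapped in
/// double quotes (embedded quotes are doubled), except recognized operators.
fn to_safe_fts_query(input: &str) -> String {
    // Build the result directly with a capacity hint instead of
    // collecting tokens into a Vec and joining them.
    let mut out = String::with_capacity(input.len() + 8);
    for token in input.split_whitespace() {
        if !out.is_empty() {
            out.push(' ');
        }
        if FTS5_OPERATORS.contains(&token) {
            // Operator: pass through unquoted so FTS5 treats it as a keyword.
            out.push_str(token);
        } else {
            // Ordinary term: quote it, doubling any embedded double quotes.
            out.push('"');
            out.push_str(&token.replace('"', "\"\""));
            out.push('"');
        }
    }
    out
}
```

With this sketch, `switch AND health` becomes `"switch" AND "health"`, while lowercase `and` is quoted like any other term. The resulting string would still be passed to SQLite as a bound parameter rather than spliced into SQL.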