Checkpoint 3: Search & Sync MVP
Note: The project was renamed from "gitlab-inbox" to "gitlore" and the CLI from "gi" to "lore". References to "gi" in this document should be read as "lore".
Status: Planning
Prerequisite: Checkpoints 0, 1, 2 complete (issues, MRs, discussions ingested)
Goal: Deliver working semantic + lexical hybrid search with efficient incremental sync
This checkpoint consolidates SPEC.md checkpoints 3A, 3B, 4, and 5 into a unified implementation plan. The work is structured for parallel agent execution where dependencies allow.
All code integrates with existing gitlore infrastructure:
- Error handling via `GiError` and `ErrorCode` in `src/core/error.rs`
- CLI patterns matching `src/cli/commands/*.rs` (run functions, JSON/human output)
- Database via `rusqlite::Connection` with migrations in `migrations/`
- Config via `src/core/config.rs` (`EmbeddingConfig` already defined)
- Robot mode JSON with the `{"ok": true, "data": {...}}` pattern
Executive Summary
Deliverables:
- Document generation from issues/MRs/discussions with FTS5 indexing
- Ollama-powered embedding pipeline with sqlite-vec storage
- Hybrid search (RRF-ranked vector + lexical) with rich filtering
- Orchestrated `gi sync` command with incremental re-embedding
Key Design Decisions:
- Documents are the search unit (not raw entities)
- FTS5 works standalone when Ollama unavailable (graceful degradation)
- sqlite-vec `rowid = documents.id` for simple joins
- RRF ranking avoids score normalization complexity
- Queue-based discussion fetching isolates failures
- FTS5 query sanitization prevents syntax errors from user input
- Exponential backoff on all queues prevents hot-loop retries
- Transient embed failures trigger graceful degradation (not hard errors)
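The RRF decision above can be made concrete: each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in, so BM25 scores and cosine distances never need to be put on a common scale. A minimal sketch (the `k = 60` constant and the function shape here are illustrative, not the final `rank_rrf` signature):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
/// where rank is 1-based. Documents absent from a list contribute nothing.
fn rank_rrf(lists: &[Vec<i64>], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in lists {
        for (i, doc_id) in list.iter().enumerate() {
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut ranked: Vec<(i64, f64)> = scores.into_iter().collect();
    // Descending by score; break ties by id for deterministic output.
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    ranked
}
```

Because only ranks matter, a document that places mid-list in both vector and FTS results can outrank one that tops a single list, which is exactly the hybrid behavior we want.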
Phase 1: Schema Foundation
1.1 Documents Schema (Migration 007)
File: migrations/007_documents.sql
-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
id INTEGER PRIMARY KEY,
source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
source_id INTEGER NOT NULL, -- local DB id in the source table
project_id INTEGER NOT NULL REFERENCES projects(id),
author_username TEXT, -- for discussions: first note author
label_names TEXT, -- JSON array (display/debug only)
created_at INTEGER, -- ms epoch UTC
updated_at INTEGER, -- ms epoch UTC
url TEXT,
title TEXT, -- null for discussions
content_text TEXT NOT NULL, -- canonical text for embedding/search
content_hash TEXT NOT NULL, -- SHA-256 for change detection
labels_hash TEXT NOT NULL DEFAULT '', -- SHA-256 over sorted labels (write optimization)
paths_hash TEXT NOT NULL DEFAULT '', -- SHA-256 over sorted paths (write optimization)
is_truncated INTEGER NOT NULL DEFAULT 0,
truncated_reason TEXT CHECK (
truncated_reason IN ('token_limit_middle_drop','single_note_oversized','first_last_oversized')
OR truncated_reason IS NULL
),
UNIQUE(source_type, source_id)
);
CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
CREATE INDEX idx_documents_hash ON documents(content_hash);
-- Fast label filtering (indexed exact-match)
CREATE TABLE document_labels (
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
label_name TEXT NOT NULL,
PRIMARY KEY(document_id, label_name)
) WITHOUT ROWID;
CREATE INDEX idx_document_labels_label ON document_labels(label_name);
-- Fast path filtering (DiffNote file paths)
CREATE TABLE document_paths (
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
path TEXT NOT NULL,
PRIMARY KEY(document_id, path)
) WITHOUT ROWID;
CREATE INDEX idx_document_paths_path ON document_paths(path);
-- Queue for incremental document regeneration (with retry tracking)
-- Uses next_attempt_at for index-friendly backoff queries
CREATE TABLE dirty_sources (
source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
source_id INTEGER NOT NULL,
queued_at INTEGER NOT NULL, -- ms epoch UTC
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
next_attempt_at INTEGER, -- ms epoch UTC; NULL means ready immediately
PRIMARY KEY(source_type, source_id)
);
CREATE INDEX idx_dirty_sources_next_attempt ON dirty_sources(next_attempt_at);
-- Resumable queue for dependent discussion fetching
-- Uses next_attempt_at for index-friendly backoff queries
CREATE TABLE pending_discussion_fetches (
project_id INTEGER NOT NULL REFERENCES projects(id),
noteable_type TEXT NOT NULL, -- 'Issue' | 'MergeRequest'
noteable_iid INTEGER NOT NULL,
queued_at INTEGER NOT NULL, -- ms epoch UTC
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER,
last_error TEXT,
next_attempt_at INTEGER, -- ms epoch UTC; NULL means ready immediately
PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_next_attempt ON pending_discussion_fetches(next_attempt_at);
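Both queues share the same backoff contract: a ready item has `next_attempt_at IS NULL OR next_attempt_at <= now`, which the index serves directly, and each failure pushes `next_attempt_at` out exponentially. A sketch of the scheduling math (the 30s base and 1h cap are illustrative defaults, not mandated by the schema):

```rust
/// Compute the retry time for a failed queue item, in ms epoch UTC.
/// `attempt_count` is the number of failures so far (>= 1 when scheduling).
/// Delay doubles per attempt from a base and is capped to avoid unbounded waits.
fn next_attempt_at(now_ms: i64, attempt_count: u32) -> i64 {
    const BASE_MS: i64 = 30_000;    // 30s after the first failure (assumed)
    const MAX_MS: i64 = 3_600_000;  // 1h ceiling between retries (assumed)
    let exp = attempt_count.saturating_sub(1).min(20); // bound the shift
    let delay = BASE_MS.saturating_mul(1i64 << exp);
    now_ms + delay.min(MAX_MS)
}
```

The dequeue query then becomes `SELECT ... WHERE next_attempt_at IS NULL OR next_attempt_at <= ? ORDER BY next_attempt_at LIMIT ?`, so retries never hot-loop and never require a table scan.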
Acceptance Criteria:
- Migration applies cleanly on fresh DB
- Migration applies cleanly after CP2 schema
- All foreign keys enforced
- Indexes created
- `labels_hash` and `paths_hash` columns present for write optimization
- `next_attempt_at` indexed for efficient backoff queries
1.2 FTS5 Index (Migration 008)
File: migrations/008_fts5.sql
-- Full-text search with porter stemmer and prefix indexes for type-ahead
CREATE VIRTUAL TABLE documents_fts USING fts5(
title,
content_text,
content='documents',
content_rowid='id',
tokenize='porter unicode61',
prefix='2 3 4'
);
-- Keep FTS in sync via triggers
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
END;
-- Only rebuild FTS when searchable text actually changes (not metadata-only updates)
CREATE TRIGGER documents_au AFTER UPDATE ON documents
WHEN old.title IS NOT new.title OR old.content_text != new.content_text
BEGIN
INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
VALUES('delete', old.id, old.title, old.content_text);
INSERT INTO documents_fts(rowid, title, content_text)
VALUES (new.id, new.title, new.content_text);
END;
Acceptance Criteria:
- `documents_fts` created as virtual table
- Triggers fire on insert/update/delete
- Update trigger only fires when title or content_text changes (not metadata-only updates)
- FTS row count matches documents count after bulk insert
- Prefix search works for type-ahead UX
1.3 Embeddings Schema (Migration 009)
File: migrations/009_embeddings.sql
-- NOTE: sqlite-vec vec0 virtual tables cannot participate in FK cascades.
-- We must use an explicit trigger to delete orphan embeddings when documents
-- are deleted. See documents_embeddings_ad trigger below.
-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
embedding float[768]
);
-- Embedding provenance + change detection
CREATE TABLE embedding_metadata (
document_id INTEGER PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
model TEXT NOT NULL, -- 'nomic-embed-text'
dims INTEGER NOT NULL, -- 768
content_hash TEXT NOT NULL, -- copied from documents.content_hash
created_at INTEGER NOT NULL, -- ms epoch UTC
last_error TEXT, -- error message from last failed attempt
attempt_count INTEGER NOT NULL DEFAULT 0,
last_attempt_at INTEGER -- ms epoch UTC
);
CREATE INDEX idx_embedding_metadata_errors
ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
CREATE INDEX idx_embedding_metadata_hash ON embedding_metadata(content_hash);
-- CRITICAL: Delete orphan embeddings when documents are deleted.
-- vec0 virtual tables don't support FK ON DELETE CASCADE, so we need this trigger.
-- embedding_metadata has ON DELETE CASCADE, so only vec0 needs explicit cleanup
CREATE TRIGGER documents_embeddings_ad AFTER DELETE ON documents BEGIN
DELETE FROM embeddings WHERE rowid = old.id;
END;
Acceptance Criteria:
- `embeddings` vec0 table created
- `embedding_metadata` tracks provenance
- Error tracking fields present for retry logic
- Orphan cleanup trigger fires on document deletion
Dependencies:
- Requires sqlite-vec extension loaded at runtime
- Extension loading already happens in `src/core/db.rs`
- Migration runner must load sqlite-vec before applying migrations (including on fresh DB)
Phase 2: Document Generation
2.1 Document Module Structure
New module: src/documents/
src/documents/
├── mod.rs # Module exports
├── extractor.rs # Document extraction from entities
├── truncation.rs # Note-boundary aware truncation
└── regenerator.rs # Dirty source processing
File: src/documents/mod.rs
//! Document generation and management.
//!
//! Extracts searchable documents from issues, MRs, and discussions.
mod extractor;
mod regenerator;
mod truncation;
pub use extractor::{
extract_discussion_document, extract_issue_document, extract_mr_document,
DocumentData, SourceType,
};
// Note: extract_*_document() return Result<Option<DocumentData>>
// None means the source entity was deleted from the database
pub use regenerator::regenerate_dirty_documents;
pub use truncation::{truncate_content, TruncationResult};
Update src/lib.rs:
pub mod documents; // Add to existing modules
2.2 Document Types
File: src/documents/extractor.rs
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
/// Source type for documents.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum SourceType {
Issue,
MergeRequest,
Discussion,
}
impl SourceType {
pub fn as_str(&self) -> &'static str {
match self {
Self::Issue => "issue",
Self::MergeRequest => "merge_request",
Self::Discussion => "discussion",
}
}
/// Parse from CLI input, accepting common aliases.
///
/// Accepts: "issue", "mr", "merge_request", "discussion"
pub fn parse(s: &str) -> Option<Self> {
match s.to_lowercase().as_str() {
"issue" | "issues" => Some(Self::Issue),
"mr" | "mrs" | "merge_request" | "merge_requests" => Some(Self::MergeRequest),
"discussion" | "discussions" => Some(Self::Discussion),
_ => None,
}
}
}
impl std::fmt::Display for SourceType {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{}", self.as_str())
}
}
/// Generated document ready for storage.
#[derive(Debug, Clone)]
pub struct DocumentData {
pub source_type: SourceType,
pub source_id: i64,
pub project_id: i64,
pub author_username: Option<String>,
pub labels: Vec<String>,
pub paths: Vec<String>, // DiffNote file paths
pub labels_hash: String, // SHA-256 over sorted labels (write optimization)
pub paths_hash: String, // SHA-256 over sorted paths (write optimization)
pub created_at: i64,
pub updated_at: i64,
pub url: Option<String>,
pub title: Option<String>,
pub content_text: String,
pub content_hash: String,
pub is_truncated: bool,
pub truncated_reason: Option<String>,
}
/// Compute SHA-256 hash of content.
pub fn compute_content_hash(content: &str) -> String {
let mut hasher = Sha256::new();
hasher.update(content.as_bytes());
format!("{:x}", hasher.finalize())
}
/// Compute SHA-256 hash over a sorted list of strings.
/// Used for labels_hash and paths_hash to detect changes efficiently.
pub fn compute_list_hash(items: &[String]) -> String {
let mut sorted = items.to_vec();
sorted.sort();
let joined = sorted.join("\n");
compute_content_hash(&joined)
}
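The three hashes exist so a regeneration pass can decide how much work a document needs without diffing content. A sketch of that decision (the `StoredHashes` shape and `classify_write` name are illustrative, not part of the spec):

```rust
/// Hashes stored for an existing document row (illustrative shape).
struct StoredHashes {
    content_hash: String,
    labels_hash: String,
    paths_hash: String,
}

/// Work a regeneration pass must do for one document.
#[derive(Debug, PartialEq)]
enum WriteAction {
    /// Nothing changed: no row write, no FTS churn, no re-embed.
    Skip,
    /// Only labels/paths changed: update side tables, keep the embedding.
    MetadataOnly,
    /// Content changed: rewrite the row, FTS triggers fire, queue re-embed.
    FullRewrite,
}

fn classify_write(
    stored: &StoredHashes,
    new_content_hash: &str,
    new_labels_hash: &str,
    new_paths_hash: &str,
) -> WriteAction {
    if stored.content_hash != new_content_hash {
        WriteAction::FullRewrite
    } else if stored.labels_hash != new_labels_hash || stored.paths_hash != new_paths_hash {
        WriteAction::MetadataOnly
    } else {
        WriteAction::Skip
    }
}
```

This is why `labels_hash` and `paths_hash` are called write optimizations: a label-only edit never touches `content_text`, so it never invalidates the embedding.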
Document Formats:
All document types use consistent header format for better search relevance and context:
| Source | content_text |
|---|---|
| Issue | Structured header + description (see below) |
| MR | Structured header + description (see below) |
| Discussion | Full thread with header (see below) |
Issue Document Format:
[[Issue]] #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234
Labels: ["bug", "auth"]
State: opened
Author: @johndoe
--- Description ---
We need to modernize our authentication system...
MR Document Format:
[[MergeRequest]] !456: Implement JWT authentication
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/merge_requests/456
Labels: ["feature", "auth"]
State: opened
Author: @johndoe
Source: feature/jwt-auth -> main
--- Description ---
This MR implements JWT-based authentication as discussed in #234...
Discussion Document Format:
[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"]
--- Thread ---
@johndoe (2024-03-15):
I think we should move to JWT-based auth...
@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
Acceptance Criteria:
- Issue document: structured header with `[[Issue]]` prefix, project, URL, labels, state, author, then description
- MR document: structured header with `[[MergeRequest]]` prefix, project, URL, labels, state, author, branches, then description
- Discussion document: includes parent type+title, project, URL, labels, files, then thread
- System notes (is_system=1) excluded from discussion content
- DiffNote file paths extracted to paths vector
- Labels extracted to labels vector
- SHA-256 hash computed from content_text
- Headers use consistent separator lines (`--- Description ---`, `--- Thread ---`)
2.3 Truncation Logic
File: src/documents/truncation.rs
/// Maximum content length (~8,000 tokens at 4 chars/token estimate).
pub const MAX_CONTENT_CHARS: usize = 32_000;
/// Truncation result with metadata.
#[derive(Debug, Clone)]
pub struct TruncationResult {
pub content: String,
pub is_truncated: bool,
pub reason: Option<TruncationReason>,
}
/// Reason for truncation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruncationReason {
TokenLimitMiddleDrop,
SingleNoteOversized,
FirstLastOversized,
}
impl TruncationReason {
pub fn as_str(&self) -> &'static str {
match self {
Self::TokenLimitMiddleDrop => "token_limit_middle_drop",
Self::SingleNoteOversized => "single_note_oversized",
Self::FirstLastOversized => "first_last_oversized",
}
}
}
/// Truncate content at note boundaries.
///
/// Rules:
/// - Max content: 32,000 characters
/// - Truncate at NOTE boundaries (never mid-note)
/// - Preserve first N notes and last M notes
/// - Drop from middle, insert marker
pub fn truncate_content(notes: &[NoteContent], max_chars: usize) -> TruncationResult {
// Implementation handles edge cases per table below
todo!()
}
/// Note content for truncation.
pub struct NoteContent {
pub author: String,
pub date: String,
pub body: String,
}
Edge Cases:
| Scenario | Handling |
|---|---|
| Single note > 32000 chars | Truncate at char boundary, append [truncated], reason = single_note_oversized |
| First + last note > 32000 | Keep only first note (truncated if needed), reason = first_last_oversized |
| Only one note | Truncate at char boundary if needed |
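The common middle-drop path can be sketched in pure Rust over pre-rendered notes. This is an illustrative sketch, not the final `truncate_content`: the reason bookkeeping and the first+last-oversized rule from the table above are left to the real implementation, and the 64-char marker budget is an assumption.

```rust
/// Round a byte index down to a UTF-8 char boundary (slicing at an
/// arbitrary byte offset would panic on multi-byte characters).
fn floor_boundary(s: &str, mut i: usize) -> usize {
    if i >= s.len() {
        return s.len();
    }
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Middle-drop truncation: keep the first and last note, grow from the
/// front while whole notes fit, drop the middle and insert a marker.
/// Returns the joined content and whether truncation occurred.
fn truncate_notes(notes: &[String], max_chars: usize) -> (String, bool) {
    let sep = "\n\n";
    let full = notes.join(sep);
    if full.len() <= max_chars {
        return (full, false);
    }
    if notes.len() < 2 {
        // Single oversized note: cut at a char boundary.
        let cut = floor_boundary(&full, max_chars);
        return (format!("{}[truncated]", &full[..cut]), true);
    }
    let marker_budget = 64; // rough reserve for the omission marker (assumed)
    let tail = notes.last().unwrap().as_str();
    let mut head: Vec<&str> = vec![notes[0].as_str()];
    let mut used = notes[0].len() + tail.len() + marker_budget;
    for note in &notes[1..notes.len() - 1] {
        if used + note.len() + sep.len() > max_chars {
            break;
        }
        head.push(note);
        used += note.len() + sep.len();
    }
    let omitted = notes.len() - head.len() - 1;
    let body = format!(
        "{}\n\n[... {} notes omitted for length ...]\n\n{}",
        head.join(sep),
        omitted,
        tail
    );
    (body, true)
}
```

Notes are only ever dropped whole, so no note is cut mid-content except in the single-oversized-note case.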
Acceptance Criteria:
- Notes never cut mid-content
- First and last notes preserved when possible
- Truncation marker `\n\n[... N notes omitted for length ...]\n\n` inserted
- Metadata fields set correctly
- Edge cases handled per table above
2.4 CLI: gi generate-docs (Incremental by Default)
File: src/cli/commands/generate_docs.rs
//! Generate documents command - create searchable documents from entities.
//!
//! By default, runs incrementally (processes only dirty_sources queue).
//! Use --full to regenerate all documents from scratch.
use rusqlite::Connection;
use serde::Serialize;
use crate::core::error::Result;
use crate::documents::{DocumentData, SourceType};
use crate::Config;
/// Result of document generation.
#[derive(Debug, Default, Serialize)]
pub struct GenerateDocsResult {
pub issues: usize,
pub mrs: usize,
pub discussions: usize,
pub total: usize,
pub truncated: usize,
pub skipped: usize, // Unchanged documents
}
/// Chunk size for --full mode transactions.
/// Balances throughput against WAL file growth and memory pressure.
const FULL_MODE_CHUNK_SIZE: usize = 2000;
/// Run document generation (incremental by default).
///
/// Incremental mode (default):
/// - Processes only items in dirty_sources queue
/// - Fast for routine syncs
///
/// Full mode (--full):
/// - Regenerates ALL documents from scratch
/// - Uses chunked transactions (2k docs/tx) to bound WAL growth
/// - Use when schema changes or after migration
pub fn run_generate_docs(
config: &Config,
full: bool,
project_filter: Option<&str>,
) -> Result<GenerateDocsResult> {
if full {
// Full mode: regenerate everything using chunked transactions
//
// Using chunked transactions instead of a single giant transaction:
// - Bounds WAL file growth (single 50k-doc tx could balloon WAL)
// - Reduces memory pressure from statement caches
// - Allows progress reporting between chunks
// - Crash partway through leaves partial but consistent state
//
// Steps per chunk:
// 1. BEGIN IMMEDIATE transaction
// 2. Query next batch of sources (issues/MRs/discussions)
// 3. For each: generate document, compute hash
// 4. Upsert into `documents` table (FTS triggers auto-fire)
// 5. Populate `document_labels` and `document_paths`
// 6. COMMIT
// 7. Report progress, loop to next chunk
//
// After all chunks:
// 8. Single final transaction for FTS rebuild:
// INSERT INTO documents_fts(documents_fts) VALUES('rebuild')
//
// Example implementation:
let mut conn = open_db(config)?; // mut: rusqlite transactions need &mut Connection
let mut result = GenerateDocsResult::default();
let mut offset = 0;
loop {
// Process issues in chunks
let issues: Vec<Issue> = query_issues(&conn, project_filter, FULL_MODE_CHUNK_SIZE, offset)?;
if issues.is_empty() { break; }
let tx = conn.transaction()?;
for issue in &issues {
let doc = generate_issue_document(issue)?;
upsert_document(&tx, &doc)?;
result.issues += 1;
}
tx.commit()?;
offset += issues.len();
// Report progress here if using indicatif
}
// Similar chunked loops for MRs and discussions...
// Final FTS rebuild in its own transaction
let tx = conn.transaction()?;
tx.execute(
"INSERT INTO documents_fts(documents_fts) VALUES('rebuild')",
[],
)?;
tx.commit()?;
} else {
// Incremental mode: process dirty_sources only
// 1. Query dirty_sources (bounded by LIMIT)
// 2. Regenerate only those documents
// 3. Clear from dirty_sources after processing
}
todo!()
}
/// Print human-readable output.
pub fn print_generate_docs(result: &GenerateDocsResult) {
println!("Document generation complete:");
println!(" Issues: {:>6} documents", result.issues);
println!(" MRs: {:>6} documents", result.mrs);
println!(" Discussions: {:>6} documents", result.discussions);
println!(" ─────────────────────");
println!(" Total: {:>6} documents", result.total);
if result.truncated > 0 {
println!(" Truncated: {:>6}", result.truncated);
}
if result.skipped > 0 {
println!(" Skipped: {:>6} (unchanged)", result.skipped);
}
}
/// Print JSON output for robot mode.
pub fn print_generate_docs_json(result: &GenerateDocsResult) {
let output = serde_json::json!({
"ok": true,
"data": result
});
println!("{}", serde_json::to_string_pretty(&output).unwrap());
}
CLI integration in src/cli/mod.rs:
/// Generate-docs subcommand arguments.
#[derive(Args)]
pub struct GenerateDocsArgs {
/// Regenerate ALL documents (not just dirty queue)
#[arg(long)]
full: bool,
/// Only generate for specific project
#[arg(long)]
project: Option<String>,
}
Acceptance Criteria:
- Creates document for each issue
- Creates document for each MR
- Creates document for each discussion
- Default mode processes dirty_sources queue only (incremental)
- `--full` regenerates all documents from scratch
- `--full` uses chunked transactions (2k docs/tx) to bound WAL growth
- Final FTS rebuild after all chunks complete
- Progress bar in human mode (via `indicatif`)
- JSON output in robot mode
Phase 3: Lexical Search
3.1 Search Module Structure
New module: src/search/
src/search/
├── mod.rs # Module exports
├── fts.rs # FTS5 search
├── vector.rs # Vector search (sqlite-vec)
├── hybrid.rs # Combined hybrid search
├── rrf.rs # Reciprocal Rank Fusion ranking
└── filters.rs # Filter parsing and application
File: src/search/mod.rs
//! Search functionality for documents.
//!
//! Supports lexical (FTS5), semantic (vector), and hybrid search.
mod filters;
mod fts;
mod hybrid;
mod rrf;
mod vector;
pub use filters::{SearchFilters, PathFilter, apply_filters};
pub use fts::{search_fts, to_fts_query, FtsResult, FtsQueryMode, generate_fallback_snippet, get_result_snippet};
pub use hybrid::{search_hybrid, HybridResult, SearchMode};
pub use rrf::{rank_rrf, RrfResult};
pub use vector::{search_vector, VectorResult};
3.2 FTS5 Search Function
File: src/search/fts.rs
use rusqlite::Connection;
use crate::core::error::Result;
/// FTS search result.
#[derive(Debug, Clone)]
pub struct FtsResult {
pub document_id: i64,
pub rank: f64, // BM25 score (lower = better match)
pub snippet: String, // Context snippet around match
}
/// Generate fallback snippet for semantic-only results.
///
/// When FTS snippets aren't available (semantic-only mode), this generates
/// a context snippet by truncating the document content. Useful for displaying
/// search results without FTS hits.
///
/// Args:
/// content_text: Full document content
/// max_chars: Maximum snippet length (default 200)
///
/// Returns a truncated string with ellipsis if truncated.
pub fn generate_fallback_snippet(content_text: &str, max_chars: usize) -> String {
let trimmed = content_text.trim();
if trimmed.len() <= max_chars {
return trimmed.to_string();
}
// Back off to a char boundary first: slicing at an arbitrary byte
// offset panics on multi-byte UTF-8 characters.
let mut cut = max_chars;
while !trimmed.is_char_boundary(cut) {
cut -= 1;
}
// Then find a word boundary to avoid cutting mid-word.
let truncation_point = trimmed[..cut]
.rfind(|c: char| c.is_whitespace())
.unwrap_or(cut);
format!("{}...", &trimmed[..truncation_point])
}
/// Get snippet for search result, preferring FTS when available.
///
/// Priority:
/// 1. FTS snippet (if document matched FTS query)
/// 2. Fallback: truncated content_text
pub fn get_result_snippet(
fts_snippet: Option<&str>,
content_text: &str,
) -> String {
match fts_snippet {
Some(snippet) if !snippet.is_empty() => snippet.to_string(),
_ => generate_fallback_snippet(content_text, 200),
}
}
/// FTS query parsing mode.
#[derive(Debug, Clone, Copy, Default)]
pub enum FtsQueryMode {
/// Safe parsing (default): escapes dangerous syntax but preserves
/// trailing `*` for obvious prefix queries (type-ahead UX).
#[default]
Safe,
/// Raw mode: passes user MATCH syntax through unchanged.
/// Use with caution - invalid syntax will cause FTS5 errors.
Raw,
}
/// Convert user query to FTS5-safe MATCH expression.
///
/// FTS5 MATCH syntax has special characters that cause errors if passed raw:
/// - `-` (NOT operator)
/// - `"` (phrase quotes)
/// - `:` (column filter)
/// - `*` (prefix)
/// - `AND`, `OR`, `NOT` (operators)
///
/// Strategy for Safe mode:
/// - Wrap each whitespace-delimited token in double quotes
/// - Escape internal quotes by doubling them
/// - PRESERVE trailing `*` for simple prefix queries (alphanumeric tokens)
/// - This forces FTS5 to treat tokens as literals while allowing type-ahead
///
/// Raw mode passes the query through unchanged for power users who want
/// full FTS5 syntax (phrase queries, column scopes, boolean operators).
///
/// Examples (Safe mode):
/// - "auth error" -> `"auth" "error"` (implicit AND)
/// - "auth*" -> `"auth"*` (prefix preserved!)
/// - "jwt_token*" -> `"jwt_token"*` (prefix preserved!)
/// - "C++" -> `"C++"` (special chars preserved, no prefix)
/// - "don't panic" -> `"don't" "panic"` (apostrophe preserved)
/// - "-DWITH_SSL" -> `"-DWITH_SSL"` (leading dash neutralized)
pub fn to_fts_query(raw: &str, mode: FtsQueryMode) -> String {
if matches!(mode, FtsQueryMode::Raw) {
return raw.trim().to_string();
}
raw.split_whitespace()
.map(|token| {
let t = token.trim();
if t.is_empty() {
return "\"\"".to_string();
}
// Detect simple prefix queries: alphanumeric/underscore followed by *
// e.g., "auth*", "jwt_token*", "user123*"
let is_prefix = t.ends_with('*')
&& t.len() > 1
&& t[..t.len() - 1]
.chars()
.all(|c| c.is_ascii_alphanumeric() || c == '_');
// Escape internal double quotes by doubling them
let escaped = t.replace('"', "\"\"");
if is_prefix {
// Strip trailing *, quote the core, then re-add *
let core = &escaped[..escaped.len() - 1];
format!("\"{}\"*", core)
} else {
format!("\"{}\"", escaped)
}
})
.collect::<Vec<_>>()
.join(" ")
}
/// Search documents using FTS5.
///
/// Returns matching document IDs with BM25 rank scores and snippets.
/// Lower rank values indicate better matches.
/// Uses bm25() explicitly (not the `rank` alias) and snippet() for context.
///
/// IMPORTANT: User input is sanitized via `to_fts_query()` to prevent
/// FTS5 syntax errors from special characters while preserving prefix search.
pub fn search_fts(
conn: &Connection,
query: &str,
limit: usize,
mode: FtsQueryMode,
) -> Result<Vec<FtsResult>> {
if query.trim().is_empty() {
return Ok(Vec::new());
}
let safe_query = to_fts_query(query, mode);
let mut stmt = conn.prepare(
"SELECT rowid,
bm25(documents_fts),
snippet(documents_fts, 1, '<mark>', '</mark>', '...', 64)
FROM documents_fts
WHERE documents_fts MATCH ?
ORDER BY bm25(documents_fts)
LIMIT ?"
)?;
let results = stmt
.query_map(rusqlite::params![safe_query, limit as i64], |row| {
Ok(FtsResult {
document_id: row.get(0)?,
rank: row.get(1)?,
snippet: row.get(2)?,
})
})?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(results)
}
Acceptance Criteria:
- Returns matching document IDs with BM25 rank
- Porter stemming works (search/searching match)
- Prefix search works (type-ahead UX): `auth*` returns results starting with "auth"
- Empty query returns empty results
- Nonsense query returns empty results
- Special characters in query don't cause FTS5 syntax errors (`-`, `"`, `:`, `*`)
- Query `-DWITH_SSL` returns results (not treated as NOT operator)
- Query `C++` returns results (special chars preserved)
- Safe mode preserves trailing `*` on alphanumeric tokens
- Raw mode (`--fts-mode=raw`) passes query through unchanged
3.3 Search Filters
File: src/search/filters.rs
use rusqlite::Connection;
use crate::core::error::Result;
use crate::documents::SourceType;
/// Maximum allowed limit for search results.
const MAX_SEARCH_LIMIT: usize = 100;
/// Default limit for search results.
const DEFAULT_SEARCH_LIMIT: usize = 20;
/// Search filters applied post-retrieval.
#[derive(Debug, Clone, Default)]
pub struct SearchFilters {
pub source_type: Option<SourceType>,
pub author: Option<String>,
pub project_id: Option<i64>,
pub after: Option<i64>, // ms epoch
pub labels: Vec<String>, // AND logic
pub path: Option<PathFilter>,
pub limit: usize, // Default 20, max 100
}
impl SearchFilters {
/// Check if any filter is set (used for adaptive recall).
pub fn has_any_filter(&self) -> bool {
self.source_type.is_some()
|| self.author.is_some()
|| self.project_id.is_some()
|| self.after.is_some()
|| !self.labels.is_empty()
|| self.path.is_some()
}
/// Clamp limit to valid range [1, MAX_SEARCH_LIMIT].
pub fn clamp_limit(&self) -> usize {
if self.limit == 0 {
DEFAULT_SEARCH_LIMIT
} else {
self.limit.min(MAX_SEARCH_LIMIT)
}
}
}
/// Path filter with prefix or exact match.
#[derive(Debug, Clone)]
pub enum PathFilter {
Prefix(String), // Trailing `/` -> LIKE 'path/%'
Exact(String), // No trailing `/` -> = 'path'
}
impl PathFilter {
pub fn from_str(s: &str) -> Self {
if s.ends_with('/') {
Self::Prefix(s.to_string())
} else {
Self::Exact(s.to_string())
}
}
}
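The `Prefix` variant is applied as a SQL LIKE pattern, which means `%`, `_`, and the escape character itself must be escaped in user-supplied paths, and the query must carry an explicit `ESCAPE` clause since SQLite's LIKE has none by default. A small sketch of the pattern builder (the `like_prefix_pattern` name is illustrative):

```rust
/// Build a LIKE pattern matching paths under `prefix`, escaping the LIKE
/// wildcards `%` and `_` (and `\` itself). Must be paired with
/// `... LIKE ? ESCAPE '\'` in the SQL, or the escapes are meaningless.
fn like_prefix_pattern(prefix: &str) -> String {
    let escaped = prefix
        .replace('\\', "\\\\") // escape the escape character first
        .replace('%', "\\%")
        .replace('_', "\\_");
    format!("{}%", escaped)
}
```

Escaping the backslash first matters: doing it last would corrupt the escapes just inserted for `%` and `_`.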
/// Apply filters to document IDs, returning filtered set.
///
/// IMPORTANT: Preserves ranking order from input document_ids.
/// Filters must not reorder results - maintain the RRF/search ranking.
///
/// Uses JSON1 extension for efficient ordered ID passing:
/// - Passes document_ids as JSON array: `[1,2,3,...]`
/// - Uses `json_each()` to expand into rows with `key` as position
/// - JOINs with documents table and applies filters
/// - Orders by original position to preserve ranking
pub fn apply_filters(
conn: &Connection,
document_ids: &[i64],
filters: &SearchFilters,
) -> Result<Vec<i64>> {
if document_ids.is_empty() {
return Ok(Vec::new());
}
// Build JSON array of document IDs
let ids_json = serde_json::to_string(document_ids)?;
// Build dynamic WHERE clauses
let mut conditions: Vec<String> = Vec::new();
let mut params: Vec<Box<dyn rusqlite::ToSql>> = Vec::new();
// Always bind the JSON array first
params.push(Box::new(ids_json));
if let Some(ref source_type) = filters.source_type {
conditions.push("d.source_type = ?".into());
params.push(Box::new(source_type.as_str().to_string()));
}
if let Some(ref author) = filters.author {
conditions.push("d.author_username = ?".into());
params.push(Box::new(author.clone()));
}
if let Some(project_id) = filters.project_id {
conditions.push("d.project_id = ?".into());
params.push(Box::new(project_id));
}
if let Some(after) = filters.after {
conditions.push("d.created_at >= ?".into());
params.push(Box::new(after));
}
// Labels: AND logic - all labels must be present
for label in &filters.labels {
conditions.push(
"EXISTS (SELECT 1 FROM document_labels dl WHERE dl.document_id = d.id AND dl.label_name = ?)".into()
);
params.push(Box::new(label.clone()));
}
// Path filter
if let Some(ref path_filter) = filters.path {
match path_filter {
PathFilter::Exact(path) => {
conditions.push(
"EXISTS (SELECT 1 FROM document_paths dp WHERE dp.document_id = d.id AND dp.path = ?)".into()
);
params.push(Box::new(path.clone()));
}
PathFilter::Prefix(prefix) => {
// IMPORTANT: Must use ESCAPE clause for backslash escaping to work in SQLite LIKE
conditions.push(
"EXISTS (SELECT 1 FROM document_paths dp WHERE dp.document_id = d.id AND dp.path LIKE ? ESCAPE '\\')".into()
);
// Escape the escape character itself, then LIKE wildcards, then add trailing %
let like_pattern = format!(
"{}%",
prefix.replace('\\', "\\\\").replace('%', "\\%").replace('_', "\\_")
);
params.push(Box::new(like_pattern));
}
}
}
let where_clause = if conditions.is_empty() {
String::new()
} else {
format!("AND {}", conditions.join(" AND "))
};
let limit = filters.clamp_limit();
// SQL using JSON1 for ordered ID passing
// json_each() returns rows with `key` (0-indexed position) and `value` (the ID)
let sql = format!(
r#"
SELECT d.id
FROM json_each(?) AS j
JOIN documents d ON d.id = j.value
WHERE 1=1 {}
ORDER BY j.key
LIMIT ?
"#,
where_clause
);
params.push(Box::new(limit as i64));
let mut stmt = conn.prepare(&sql)?;
let params_refs: Vec<&dyn rusqlite::ToSql> = params.iter().map(|p| p.as_ref()).collect();
let results = stmt
.query_map(params_refs.as_slice(), |row| row.get(0))?
.collect::<std::result::Result<Vec<i64>, _>>()?;
Ok(results)
}
Supported filters:
| Filter | SQL Column | Notes |
|---|---|---|
| `--type` | `source_type` | `issue`, `mr`, `discussion` |
| `--author` | `author_username` | Exact match |
| `--project` | `project_id` | Resolve path to ID |
| `--after` | `created_at` | >= date (ms epoch) |
| `--label` | `document_labels` | JOIN, multiple = AND |
| `--path` | `document_paths` | JOIN, trailing `/` = prefix |
| `--limit` | N/A | Default 20, max 100 |
Acceptance Criteria:
- Each filter correctly restricts results
- Multiple `--label` flags use AND logic
- Path prefix vs exact match works correctly
- Filters compose (all applied together)
- Ranking order preserved after filtering (ORDER BY position)
- Limit clamped to valid range [1, 100]
- Default limit is 20 when not specified
- JSON1 `json_each()` correctly expands document IDs
3.4 CLI: gi search --mode=lexical
File: src/cli/commands/search.rs
//! Search command - find documents using lexical, semantic, or hybrid search.
use console::style;
use serde::Serialize;
use crate::core::error::Result;
use crate::core::time::ms_to_iso;
use crate::search::{SearchFilters, SearchMode, search_fts, search_vector, rank_rrf, RrfResult};
use crate::Config;
/// Search result for display.
#[derive(Debug, Serialize)]
pub struct SearchResultDisplay {
pub document_id: i64,
pub source_type: String,
pub title: Option<String>,
pub url: Option<String>,
pub project_path: String,
pub author: Option<String>,
pub created_at: String, // ISO format
pub updated_at: String, // ISO format
pub score: f64, // Normalized 0-1
pub snippet: String, // Context around match
pub labels: Vec<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub explain: Option<ExplainData>,
}
/// Ranking explanation for --explain flag.
#[derive(Debug, Serialize)]
pub struct ExplainData {
pub vector_rank: Option<usize>,
pub fts_rank: Option<usize>,
pub rrf_score: f64,
}
/// Search results response.
#[derive(Debug, Serialize)]
pub struct SearchResponse {
pub query: String,
pub mode: String,
pub total_results: usize,
pub results: Vec<SearchResultDisplay>,
#[serde(skip_serializing_if = "Vec::is_empty")]
pub warnings: Vec<String>,
}
/// Run search command.
pub fn run_search(
config: &Config,
query: &str,
mode: SearchMode,
filters: SearchFilters,
explain: bool,
) -> Result<SearchResponse> {
// 1. Parse query and filters
// 2. Execute search based on mode
// 3. Apply post-retrieval filters
// 4. Format and return results
todo!()
}
/// Print human-readable search results.
pub fn print_search_results(response: &SearchResponse, explain: bool) {
println!(
"Found {} results ({} search)\n",
response.total_results,
response.mode
);
for (i, result) in response.results.iter().enumerate() {
let type_prefix = match result.source_type.as_str() {
"merge_request" => "MR",
"issue" => "Issue",
"discussion" => "Discussion",
_ => &result.source_type,
};
let title = result.title.as_deref().unwrap_or("(untitled)");
println!(
"[{}] {} - {} ({})",
i + 1,
style(type_prefix).cyan(),
title,
format!("{:.2}", result.score)
);
if explain {
if let Some(exp) = &result.explain {
let vec_str = exp.vector_rank.map(|r| format!("#{}", r)).unwrap_or_else(|| "-".into());
let fts_str = exp.fts_rank.map(|r| format!("#{}", r)).unwrap_or_else(|| "-".into());
println!(
" Vector: {}, FTS: {}, RRF: {:.4}",
vec_str, fts_str, exp.rrf_score
);
}
}
if let Some(author) = &result.author {
println!(
" @{} · {} · {}",
author, &result.created_at[..10], result.project_path
);
}
println!(" \"{}...\"", &result.snippet);
if let Some(url) = &result.url {
println!(" {}", style(url).dim());
}
println!();
}
}
/// Print JSON search results for robot mode.
pub fn print_search_results_json(response: &SearchResponse, elapsed_ms: u64) {
let output = serde_json::json!({
"ok": true,
"data": response,
"meta": {
"elapsed_ms": elapsed_ms
}
});
println!("{}", serde_json::to_string_pretty(&output).unwrap());
}
CLI integration in src/cli/mod.rs:
/// Search subcommand arguments.
#[derive(Args)]
pub struct SearchArgs {
/// Search query
query: String,
/// Search mode
#[arg(long, default_value = "hybrid")]
mode: String, // "hybrid" | "lexical" | "semantic"
/// Filter by source type
#[arg(long, value_name = "TYPE")]
r#type: Option<String>,
/// Filter by author username
#[arg(long)]
author: Option<String>,
/// Filter by project path
#[arg(long)]
project: Option<String>,
/// Filter by creation date (after)
#[arg(long)]
after: Option<String>,
/// Filter by label (can specify multiple)
#[arg(long, action = clap::ArgAction::Append)]
label: Vec<String>,
/// Filter by file path
#[arg(long)]
path: Option<String>,
/// Maximum results
#[arg(long, default_value = "20")]
limit: usize,
/// Show ranking breakdown
#[arg(long)]
explain: bool,
/// FTS query mode: "safe" (default) or "raw"
/// - safe: Escapes special chars but preserves `*` for prefix queries
/// - raw: Pass FTS5 MATCH syntax through unchanged (advanced)
#[arg(long, default_value = "safe")]
fts_mode: String, // "safe" | "raw"
}
Acceptance Criteria:
- Works without Ollama running
- All filters functional
- Human-readable output with snippets
- Semantic-only results get fallback snippets from content_text
- JSON output matches schema
- Empty results show helpful message
- "No data indexed" message if documents table empty
- `--fts-mode=safe` (default) preserves prefix `*` while escaping special chars
- `--fts-mode=raw` passes FTS5 MATCH syntax through unchanged
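A minimal sketch of what safe-mode sanitization could look like (function name and exact escaping policy are assumptions, not the final implementation): each whitespace-separated term is quoted so FTS5 operators read literally, while a trailing `*` stays outside the quotes so FTS5 still treats it as a prefix query.

```rust
/// Hypothetical safe-mode sanitizer for FTS5 MATCH input.
/// Quoting each term neutralizes operators (AND, OR, NOT, NEAR, parens);
/// a trailing `*` is re-attached outside the quotes for prefix matching.
fn sanitize_fts_query(input: &str) -> String {
    input
        .split_whitespace()
        .map(|term| {
            let (body, prefix) = match term.strip_suffix('*') {
                Some(stripped) if !stripped.is_empty() => (stripped, "*"),
                _ => (term, ""),
            };
            // Internal quotes are doubled per FTS5 string-escaping rules.
            format!("\"{}\"{}", body.replace('"', "\"\""), prefix)
        })
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    println!("{}", sanitize_fts_query("auth* NEAR login"));
    // prints "auth"* "NEAR" "login"
}
```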
Phase 4: Embedding Pipeline
4.1 Embedding Module Structure
New module: src/embedding/
src/embedding/
├── mod.rs # Module exports
├── ollama.rs # Ollama API client
├── pipeline.rs # Batch embedding orchestration
└── change_detector.rs # Detect documents needing re-embedding
File: src/embedding/mod.rs
//! Embedding generation and storage.
//!
//! Uses Ollama for embedding generation and sqlite-vec for storage.
mod change_detector;
mod ollama;
mod pipeline;
pub use change_detector::detect_embedding_changes;
pub use ollama::{OllamaClient, OllamaConfig, check_ollama_health};
pub use pipeline::{embed_documents, EmbedResult};
4.2 Ollama Client
File: src/embedding/ollama.rs
use reqwest::Client;
use serde::{Deserialize, Serialize};
use crate::core::error::{GiError, Result};
/// Ollama client configuration.
#[derive(Debug, Clone)]
pub struct OllamaConfig {
pub base_url: String, // "http://localhost:11434"
pub model: String, // "nomic-embed-text"
pub timeout_secs: u64, // Request timeout
}
impl Default for OllamaConfig {
fn default() -> Self {
Self {
base_url: "http://localhost:11434".into(),
model: "nomic-embed-text".into(),
timeout_secs: 60,
}
}
}
/// Ollama API client.
pub struct OllamaClient {
client: Client,
config: OllamaConfig,
}
/// Batch embed request.
#[derive(Serialize)]
struct EmbedRequest {
model: String,
input: Vec<String>,
}
/// Batch embed response.
#[derive(Deserialize)]
struct EmbedResponse {
model: String,
embeddings: Vec<Vec<f32>>,
}
/// Model info from /api/tags.
#[derive(Deserialize)]
struct TagsResponse {
models: Vec<ModelInfo>,
}
#[derive(Deserialize)]
struct ModelInfo {
name: String,
}
impl OllamaClient {
pub fn new(config: OllamaConfig) -> Self {
let client = Client::builder()
.timeout(std::time::Duration::from_secs(config.timeout_secs))
.build()
.expect("Failed to create HTTP client");
Self { client, config }
}
/// Check if Ollama is available and model is loaded.
pub async fn health_check(&self) -> Result<()> {
let url = format!("{}/api/tags", self.config.base_url);
let response = self.client.get(&url).send().await.map_err(|e| {
GiError::OllamaUnavailable {
base_url: self.config.base_url.clone(),
source: Some(e),
}
})?;
let tags: TagsResponse = response.json().await?;
let model_available = tags.models.iter().any(|m| m.name.starts_with(&self.config.model));
if !model_available {
return Err(GiError::OllamaModelNotFound {
model: self.config.model.clone(),
});
}
Ok(())
}
/// Generate embeddings for a batch of texts.
///
/// Returns 768-dimensional vectors for each input text.
pub async fn embed_batch(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
let url = format!("{}/api/embed", self.config.base_url);
let request = EmbedRequest {
model: self.config.model.clone(),
input: texts,
};
let response = self.client
.post(&url)
.json(&request)
.send()
.await
.map_err(|e| GiError::OllamaUnavailable {
base_url: self.config.base_url.clone(),
source: Some(e),
})?;
if !response.status().is_success() {
let status = response.status();
let body = response.text().await.unwrap_or_default();
return Err(GiError::EmbeddingFailed {
document_id: 0, // Batch failure
reason: format!("HTTP {}: {}", status, body),
});
}
let embed_response: EmbedResponse = response.json().await?;
Ok(embed_response.embeddings)
}
}
/// Quick health check without full client.
pub async fn check_ollama_health(base_url: &str) -> bool {
let client = Client::new();
client
.get(format!("{}/api/tags", base_url))
.send()
.await
.is_ok()
}
Endpoints:
| Endpoint | Purpose |
|---|---|
| `GET /api/tags` | Health check, verify model available |
| `POST /api/embed` | Batch embedding (preferred) |
Acceptance Criteria:
- Health check detects Ollama availability
- Batch embedding works with up to 32 texts
- Clear error messages for common failures
4.3 Error Handling Extensions
File: src/core/error.rs (extend existing)
Add to ErrorCode:
pub enum ErrorCode {
// ... existing variants ...
InvalidEnumValue,
OllamaUnavailable,
OllamaModelNotFound,
EmbeddingFailed,
}
impl ErrorCode {
pub fn exit_code(&self) -> i32 {
match self {
// ... existing mappings ...
Self::InvalidEnumValue => 13,
Self::OllamaUnavailable => 14,
Self::OllamaModelNotFound => 15,
Self::EmbeddingFailed => 16,
}
}
}
impl std::fmt::Display for ErrorCode {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let code = match self {
// ... existing mappings ...
Self::InvalidEnumValue => "INVALID_ENUM_VALUE",
Self::OllamaUnavailable => "OLLAMA_UNAVAILABLE",
Self::OllamaModelNotFound => "OLLAMA_MODEL_NOT_FOUND",
Self::EmbeddingFailed => "EMBEDDING_FAILED",
};
write!(f, "{code}")
}
}
Add to GiError:
pub enum GiError {
// ... existing variants ...
#[error("Cannot connect to Ollama at {base_url}. Is it running?")]
OllamaUnavailable {
base_url: String,
#[source]
source: Option<reqwest::Error>,
},
#[error("Ollama model '{model}' not found. Run: ollama pull {model}")]
OllamaModelNotFound { model: String },
#[error("Embedding failed for document {document_id}: {reason}")]
EmbeddingFailed { document_id: i64, reason: String },
}
impl GiError {
pub fn code(&self) -> ErrorCode {
match self {
// ... existing mappings ...
Self::OllamaUnavailable { .. } => ErrorCode::OllamaUnavailable,
Self::OllamaModelNotFound { .. } => ErrorCode::OllamaModelNotFound,
Self::EmbeddingFailed { .. } => ErrorCode::EmbeddingFailed,
}
}
pub fn suggestion(&self) -> Option<&'static str> {
match self {
// ... existing mappings ...
Self::OllamaUnavailable { .. } => Some("Start Ollama: ollama serve"),
            Self::OllamaModelNotFound { .. } => Some("Pull the model: ollama pull nomic-embed-text"),
Self::EmbeddingFailed { .. } => Some("Check Ollama logs or retry with 'gi embed --retry-failed'"),
}
}
}
4.4 Embedding Pipeline
File: src/embedding/pipeline.rs
use rusqlite::Connection;
use crate::core::error::Result;
use crate::embedding::OllamaClient;
/// Batch size for embedding requests.
const BATCH_SIZE: usize = 32;
/// SQLite page size for paging through pending documents.
const DB_PAGE_SIZE: usize = 500;
/// Expected embedding dimensions for nomic-embed-text model.
/// IMPORTANT: Validates against this to prevent silent corruption.
const EXPECTED_DIMS: usize = 768;
/// Which documents to embed.
#[derive(Debug, Clone, Copy)]
pub enum EmbedSelection {
/// New or changed documents (default).
Pending,
/// Only previously failed documents.
RetryFailed,
}
/// Result of embedding run.
#[derive(Debug, Default)]
pub struct EmbedResult {
pub embedded: usize,
pub failed: usize,
pub skipped: usize,
}
/// Embed documents that need embedding.
///
/// Process:
/// 1. Count documents with missing or stale embeddings (content_hash mismatch)
/// 2. Page through them in deterministic id order
/// 3. Embed in batches of BATCH_SIZE with bounded concurrency
/// 4. Store vectors and metadata, or record errors for later retry
pub async fn embed_documents(
conn: &Connection,
client: &OllamaClient,
selection: EmbedSelection,
concurrency: usize,
progress_callback: Option<Box<dyn Fn(usize, usize)>>,
) -> Result<EmbedResult> {
use futures::stream::{FuturesUnordered, StreamExt};
let mut result = EmbedResult::default();
let total_pending = count_pending_documents(conn, selection)?;
if total_pending == 0 {
return Ok(result);
}
// Page through pending documents to avoid loading all into memory
loop {
let pending = find_pending_documents(conn, DB_PAGE_SIZE, selection)?;
if pending.is_empty() {
break;
}
// Launch concurrent HTTP requests, collect results
let mut futures = FuturesUnordered::new();
for batch in pending.chunks(BATCH_SIZE) {
let texts: Vec<String> = batch.iter().map(|d| d.content.clone()).collect();
let batch_meta: Vec<(i64, String)> = batch
.iter()
.map(|d| (d.id, d.content_hash.clone()))
.collect();
futures.push(async move {
let embed_result = client.embed_batch(texts).await;
(batch_meta, embed_result)
});
// Cap in-flight requests
if futures.len() >= concurrency {
if let Some((meta, res)) = futures.next().await {
collect_writes(conn, &meta, res, &mut result)?;
}
}
}
// Drain remaining futures
while let Some((meta, res)) = futures.next().await {
collect_writes(conn, &meta, res, &mut result)?;
}
if let Some(ref cb) = progress_callback {
cb(result.embedded + result.failed, total_pending);
}
}
Ok(result)
}
/// Collect embedding results and write to DB (sequential, on main thread).
///
/// IMPORTANT: Validates embedding dimensions to prevent silent corruption.
/// If model returns wrong dimensions (e.g., different model configured),
/// the document is marked as failed rather than storing corrupt data.
fn collect_writes(
conn: &Connection,
batch_meta: &[(i64, String)],
embed_result: Result<Vec<Vec<f32>>>,
result: &mut EmbedResult,
) -> Result<()> {
    // Connection::transaction() requires &mut; unchecked_transaction works with &Connection.
    let tx = conn.unchecked_transaction()?;
match embed_result {
Ok(embeddings) => {
for ((doc_id, hash), embedding) in batch_meta.iter().zip(embeddings.iter()) {
// Validate dimensions to prevent silent corruption
if embedding.len() != EXPECTED_DIMS {
record_embedding_error(
&tx,
*doc_id,
hash,
&format!(
"embedding dimension mismatch: got {}, expected {}",
embedding.len(),
EXPECTED_DIMS
),
)?;
result.failed += 1;
continue;
}
store_embedding(&tx, *doc_id, embedding, hash)?;
result.embedded += 1;
}
}
Err(e) => {
for (doc_id, hash) in batch_meta {
record_embedding_error(&tx, *doc_id, hash, &e.to_string())?;
result.failed += 1;
}
}
}
tx.commit()?;
Ok(())
}
struct PendingDocument {
id: i64,
content: String,
content_hash: String,
}
/// Count total pending documents (for progress reporting).
fn count_pending_documents(conn: &Connection, selection: EmbedSelection) -> Result<usize> {
let sql = match selection {
EmbedSelection::Pending =>
"SELECT COUNT(*)
FROM documents d
LEFT JOIN embedding_metadata em ON d.id = em.document_id
WHERE em.document_id IS NULL
OR em.content_hash != d.content_hash",
EmbedSelection::RetryFailed =>
"SELECT COUNT(*)
FROM documents d
JOIN embedding_metadata em ON d.id = em.document_id
WHERE em.last_error IS NOT NULL",
};
let count: usize = conn.query_row(sql, [], |row| row.get(0))?;
Ok(count)
}
/// Find pending documents for embedding.
///
/// IMPORTANT: Uses deterministic ORDER BY d.id to ensure consistent
/// paging behavior. Without ordering, SQLite may return rows in
/// different orders across calls, causing missed or duplicate documents.
fn find_pending_documents(
conn: &Connection,
limit: usize,
selection: EmbedSelection,
) -> Result<Vec<PendingDocument>> {
let sql = match selection {
EmbedSelection::Pending =>
"SELECT d.id, d.content_text, d.content_hash
FROM documents d
LEFT JOIN embedding_metadata em ON d.id = em.document_id
WHERE em.document_id IS NULL
OR em.content_hash != d.content_hash
ORDER BY d.id
LIMIT ?",
EmbedSelection::RetryFailed =>
"SELECT d.id, d.content_text, d.content_hash
FROM documents d
JOIN embedding_metadata em ON d.id = em.document_id
WHERE em.last_error IS NOT NULL
ORDER BY d.id
LIMIT ?",
};
let mut stmt = conn.prepare(sql)?;
let docs = stmt
.query_map([limit], |row| {
Ok(PendingDocument {
id: row.get(0)?,
content: row.get(1)?,
content_hash: row.get(2)?,
})
})?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(docs)
}
fn store_embedding(
tx: &rusqlite::Transaction,
document_id: i64,
embedding: &[f32],
content_hash: &str,
) -> Result<()> {
// Convert embedding to bytes for sqlite-vec
// sqlite-vec expects raw little-endian bytes, not the array directly
let embedding_bytes: Vec<u8> = embedding
.iter()
.flat_map(|f| f.to_le_bytes())
.collect();
// Store in sqlite-vec (rowid = document_id)
tx.execute(
"INSERT OR REPLACE INTO embeddings(rowid, embedding) VALUES (?, ?)",
rusqlite::params![document_id, embedding_bytes],
)?;
// Update metadata
let now = crate::core::time::now_ms();
tx.execute(
"INSERT OR REPLACE INTO embedding_metadata
(document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
VALUES (?, 'nomic-embed-text', 768, ?, ?, NULL, 0, ?)",
rusqlite::params![document_id, content_hash, now, now],
)?;
Ok(())
}
fn record_embedding_error(
tx: &rusqlite::Transaction,
document_id: i64,
content_hash: &str,
error: &str,
) -> Result<()> {
let now = crate::core::time::now_ms();
tx.execute(
"INSERT INTO embedding_metadata
(document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
VALUES (?, 'nomic-embed-text', 768, ?, ?, ?, 1, ?)
ON CONFLICT(document_id) DO UPDATE SET
last_error = excluded.last_error,
attempt_count = attempt_count + 1,
last_attempt_at = excluded.last_attempt_at",
rusqlite::params![document_id, content_hash, now, error, now],
)?;
Ok(())
}
Acceptance Criteria:
- New documents get embedded
- Changed documents (hash mismatch) get re-embedded
- Unchanged documents skipped
- Failures recorded in `embedding_metadata.last_error`
- Failures record actual content_hash (not empty string)
- Writes batched in transactions for performance
- Concurrency parameter respected
- Progress reported during embedding
- Deterministic `ORDER BY d.id` ensures consistent paging
- `EmbedSelection` parameter controls pending vs retry-failed mode
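The byte conversion used by `store_embedding` (and again at query time in `search_vector`) must round-trip exactly, since sqlite-vec stores vectors as raw little-endian f32 blobs. A small self-contained check:

```rust
/// Encode an embedding as the raw little-endian blob sqlite-vec expects.
fn embedding_to_bytes(embedding: &[f32]) -> Vec<u8> {
    embedding.iter().flat_map(|f| f.to_le_bytes()).collect()
}

/// Decode a stored blob back into f32 components.
fn bytes_to_embedding(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let v = vec![0.25_f32, -2.5, 3.125];
    let blob = embedding_to_bytes(&v);
    assert_eq!(blob.len(), 4 * v.len()); // 4 bytes per f32
    assert_eq!(bytes_to_embedding(&blob), v); // lossless round-trip
}
```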
4.5 CLI: gi embed
File: src/cli/commands/embed.rs
//! Embed command - generate embeddings for documents.
use indicatif::{ProgressBar, ProgressStyle};
use serde::Serialize;
use crate::core::error::Result;
use crate::embedding::{embed_documents, EmbedResult, OllamaClient, OllamaConfig};
use crate::Config;
/// Run embedding command.
pub async fn run_embed(
config: &Config,
retry_failed: bool,
) -> Result<EmbedResult> {
use crate::core::db::open_database;
use crate::embedding::pipeline::EmbedSelection;
let ollama_config = OllamaConfig {
base_url: config.embedding.base_url.clone(),
model: config.embedding.model.clone(),
timeout_secs: 120,
};
let client = OllamaClient::new(ollama_config);
// Health check
client.health_check().await?;
// Open database connection
let conn = open_database(config)?;
// Determine selection mode
let selection = if retry_failed {
EmbedSelection::RetryFailed
} else {
EmbedSelection::Pending
};
// Run embedding
let result = embed_documents(
&conn,
&client,
selection,
config.embedding.concurrency as usize,
None,
).await?;
Ok(result)
}
/// Print human-readable output.
pub fn print_embed(result: &EmbedResult, elapsed_secs: u64) {
println!("Embedding complete:");
println!(" Embedded: {:>6} documents", result.embedded);
println!(" Failed: {:>6} documents", result.failed);
println!(" Skipped: {:>6} documents", result.skipped);
println!(" Elapsed: {}m {}s", elapsed_secs / 60, elapsed_secs % 60);
}
/// Print JSON output for robot mode.
pub fn print_embed_json(result: &EmbedResult, elapsed_ms: u64) {
let output = serde_json::json!({
"ok": true,
"data": {
"embedded": result.embedded,
"failed": result.failed,
"skipped": result.skipped
},
"meta": {
"elapsed_ms": elapsed_ms
}
});
println!("{}", serde_json::to_string_pretty(&output).unwrap());
}
CLI integration:
/// Embed subcommand arguments.
#[derive(Args)]
pub struct EmbedArgs {
/// Retry only previously failed documents
#[arg(long)]
retry_failed: bool,
}
Acceptance Criteria:
- Embeds documents without embeddings
- Re-embeds documents with changed hash
- `--retry-failed` only processes failed documents
- Progress bar with count
- Clear error if Ollama unavailable
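The `attempt_count` and `last_attempt_at` columns in `embedding_metadata` give `--retry-failed` the data needed for the exponential backoff called out in the design decisions. One possible schedule, with base and cap as assumptions rather than settled values:

```rust
/// Hypothetical backoff schedule: 2^attempts minutes, capped at 60 minutes.
/// Retry-failed would skip documents whose backoff window has not elapsed.
fn backoff_mins(attempt_count: u32) -> u64 {
    let exp = attempt_count.min(6); // 2^6 = 64 already exceeds the cap
    (1u64 << exp).min(60)
}

fn main() {
    assert_eq!(backoff_mins(0), 1);  // first retry after 1 minute
    assert_eq!(backoff_mins(3), 8);  // doubles each attempt
    assert_eq!(backoff_mins(10), 60); // capped, prevents hot-loop retries
}
```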
4.6 CLI: gi stats
File: src/cli/commands/stats.rs
//! Stats command - display document and embedding statistics.
use rusqlite::Connection;
use serde::Serialize;
use crate::core::error::Result;
use crate::Config;
/// Document statistics.
#[derive(Debug, Serialize)]
pub struct Stats {
pub documents: DocumentStats,
pub embeddings: EmbeddingStats,
pub fts: FtsStats,
pub queues: QueueStats,
}
#[derive(Debug, Serialize)]
pub struct DocumentStats {
pub issues: usize,
pub mrs: usize,
pub discussions: usize,
pub total: usize,
pub truncated: usize,
}
#[derive(Debug, Serialize)]
pub struct EmbeddingStats {
pub embedded: usize,
pub pending: usize,
pub failed: usize,
pub coverage_pct: f64,
}
#[derive(Debug, Serialize)]
pub struct FtsStats {
pub indexed: usize,
}
/// Queue statistics for observability.
///
/// Exposes internal queue depths so operators can detect backlogs
/// and failing items that need manual intervention.
#[derive(Debug, Serialize)]
pub struct QueueStats {
/// Items in dirty_sources queue (pending document regeneration)
pub dirty_sources: usize,
/// Items in dirty_sources with last_error set (failing regeneration)
pub dirty_sources_failed: usize,
/// Items in pending_discussion_fetches queue
pub pending_discussion_fetches: usize,
/// Items in pending_discussion_fetches with last_error set
pub pending_discussion_fetches_failed: usize,
}
/// Integrity check result.
#[derive(Debug, Serialize)]
pub struct IntegrityCheck {
pub documents_count: usize,
pub fts_count: usize,
pub embeddings_count: usize,
pub metadata_count: usize,
pub orphaned_embeddings: usize,
pub hash_mismatches: usize,
pub ok: bool,
}
/// Run stats command.
pub fn run_stats(config: &Config) -> Result<Stats> {
// Query counts from database
todo!()
}
/// Run integrity check (--check flag).
///
/// Verifies:
/// - documents count == documents_fts count
/// - embeddings.rowid all exist in documents.id
/// - embedding_metadata.content_hash == documents.content_hash
pub fn run_integrity_check(config: &Config) -> Result<IntegrityCheck> {
// 1. Count documents
// 2. Count FTS entries
// 3. Find orphaned embeddings (no matching document)
// 4. Find hash mismatches between embedding_metadata and documents
// 5. Return check results
todo!()
}
/// Repair result from --repair flag.
#[derive(Debug, Serialize)]
pub struct RepairResult {
pub orphaned_embeddings_deleted: usize,
pub stale_embeddings_cleared: usize,
pub missing_fts_repopulated: usize,
}
/// Repair issues found by integrity check (--repair flag).
///
/// Fixes:
/// - Deletes orphaned embeddings (embedding_metadata rows with no matching document)
/// - Clears stale embedding_metadata (hash mismatch) so they get re-embedded
/// - Repopulates FTS for documents missing from documents_fts
pub fn run_repair(config: &Config) -> Result<RepairResult> {
    let conn = open_database(config)?;
// Delete orphaned embeddings (no matching document)
let orphaned_deleted = conn.execute(
"DELETE FROM embedding_metadata
WHERE document_id NOT IN (SELECT id FROM documents)",
[],
)?;
// Also delete from embeddings virtual table (sqlite-vec)
conn.execute(
"DELETE FROM embeddings
WHERE rowid NOT IN (SELECT id FROM documents)",
[],
)?;
// Clear stale embedding_metadata (hash mismatch) - will be re-embedded
let stale_cleared = conn.execute(
"DELETE FROM embedding_metadata
WHERE (document_id, content_hash) NOT IN (
SELECT id, content_hash FROM documents
)",
[],
)?;
// Repopulate FTS for missing documents
let fts_repopulated = conn.execute(
"INSERT INTO documents_fts(rowid, title, content_text)
SELECT id, COALESCE(title, ''), content_text
FROM documents
WHERE id NOT IN (SELECT rowid FROM documents_fts)",
[],
)?;
Ok(RepairResult {
orphaned_embeddings_deleted: orphaned_deleted,
stale_embeddings_cleared: stale_cleared,
missing_fts_repopulated: fts_repopulated,
})
}
/// Print human-readable stats.
pub fn print_stats(stats: &Stats) {
println!("Document Statistics:");
println!(" Issues: {:>6} documents", stats.documents.issues);
println!(" MRs: {:>6} documents", stats.documents.mrs);
println!(" Discussions: {:>6} documents", stats.documents.discussions);
println!(" Total: {:>6} documents", stats.documents.total);
if stats.documents.truncated > 0 {
println!(" Truncated: {:>6}", stats.documents.truncated);
}
println!();
println!("Embedding Coverage:");
println!(" Embedded: {:>6} ({:.1}%)", stats.embeddings.embedded, stats.embeddings.coverage_pct);
println!(" Pending: {:>6}", stats.embeddings.pending);
println!(" Failed: {:>6}", stats.embeddings.failed);
println!();
println!("FTS Index:");
println!(" Indexed: {:>6} documents", stats.fts.indexed);
println!();
println!("Queue Depths:");
println!(" Dirty sources: {:>6} ({} failed)",
stats.queues.dirty_sources,
stats.queues.dirty_sources_failed
);
println!(" Discussion fetches:{:>6} ({} failed)",
stats.queues.pending_discussion_fetches,
stats.queues.pending_discussion_fetches_failed
);
}
/// Print integrity check results.
pub fn print_integrity_check(check: &IntegrityCheck) {
println!("Integrity Check:");
println!(" Documents: {:>6}", check.documents_count);
println!(" FTS entries: {:>6}", check.fts_count);
println!(" Embeddings: {:>6}", check.embeddings_count);
println!(" Metadata: {:>6}", check.metadata_count);
if check.orphaned_embeddings > 0 {
println!(" Orphaned embeddings: {:>6} (WARN)", check.orphaned_embeddings);
}
if check.hash_mismatches > 0 {
println!(" Hash mismatches: {:>6} (WARN)", check.hash_mismatches);
}
println!();
println!(" Status: {}", if check.ok { "OK" } else { "ISSUES FOUND" });
}
/// Print JSON stats for robot mode.
pub fn print_stats_json(stats: &Stats) {
let output = serde_json::json!({
"ok": true,
"data": stats
});
println!("{}", serde_json::to_string_pretty(&output).unwrap());
}
/// Print repair results.
pub fn print_repair_result(result: &RepairResult) {
    println!("Repair Results:");
    println!("  Orphaned embeddings deleted: {}", result.orphaned_embeddings_deleted);
    println!("  Stale embeddings cleared:    {}", result.stale_embeddings_cleared);
    println!("  Missing FTS repopulated:     {}", result.missing_fts_repopulated);
    println!();
    let total = result.orphaned_embeddings_deleted
        + result.stale_embeddings_cleared
        + result.missing_fts_repopulated;
    if total == 0 {
        println!("  No issues found to repair.");
    } else {
        println!("  Fixed {} issues.", total);
    }
}
CLI integration:
/// Stats subcommand arguments.
#[derive(Args)]
pub struct StatsArgs {
/// Run integrity checks (document/FTS/embedding consistency)
#[arg(long)]
check: bool,
/// Repair issues found by --check (deletes orphaned embeddings, clears stale metadata)
#[arg(long, requires = "check")]
repair: bool,
}
Acceptance Criteria:
- Shows document counts by type
- Shows embedding coverage
- Shows FTS index count
- Identifies truncated documents
- Shows queue depths (dirty_sources, pending_discussion_fetches)
- Shows failed item counts for each queue
- `--check` verifies document/FTS/embedding consistency
- `--repair` fixes orphaned embeddings, stale metadata, missing FTS entries
- JSON output for scripting
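`coverage_pct` is the one derived field in `EmbeddingStats`; a sketch of the computation (hypothetical helper name), guarding the empty-index case so a fresh database reports 0% rather than dividing by zero:

```rust
/// Embedding coverage as a percentage of total documents; 0.0 when none exist.
fn coverage_pct(embedded: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        embedded as f64 * 100.0 / total as f64
    }
}

fn main() {
    assert_eq!(coverage_pct(0, 0), 0.0); // no divide-by-zero on empty DB
    assert!((coverage_pct(75, 100) - 75.0).abs() < 1e-9);
}
```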
Phase 5: Hybrid Search
5.1 Vector Search Function
File: src/search/vector.rs
use rusqlite::Connection;
use crate::core::error::Result;
/// Vector search result.
#[derive(Debug, Clone)]
pub struct VectorResult {
pub document_id: i64,
pub distance: f64, // Lower = more similar
}
/// Search documents using vector similarity.
///
/// Uses sqlite-vec for efficient vector search.
/// Returns document IDs sorted by distance (lower = better match).
///
/// IMPORTANT: sqlite-vec KNN queries require:
/// - k parameter for number of results
/// - embedding passed as raw little-endian bytes
pub fn search_vector(
conn: &Connection,
query_embedding: &[f32],
limit: usize,
) -> Result<Vec<VectorResult>> {
// Convert embedding to bytes for sqlite-vec
let embedding_bytes: Vec<u8> = query_embedding
.iter()
.flat_map(|f| f.to_le_bytes())
.collect();
let mut stmt = conn.prepare(
"SELECT rowid, distance
FROM embeddings
WHERE embedding MATCH ? AND k = ?
ORDER BY distance
LIMIT ?"
)?;
let results = stmt
.query_map(rusqlite::params![embedding_bytes, limit, limit], |row| {
Ok(VectorResult {
document_id: row.get(0)?,
distance: row.get(1)?,
})
})?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(results)
}
Acceptance Criteria:
- Returns document IDs with distances
- Lower distance = better match
- Works with 768-dim vectors
- Uses k parameter for KNN query
- Embedding passed as bytes
5.2 RRF Ranking
File: src/search/rrf.rs
use std::collections::HashMap;
/// RRF ranking constant.
const RRF_K: f64 = 60.0;
/// RRF-ranked result.
#[derive(Debug, Clone)]
pub struct RrfResult {
pub document_id: i64,
pub rrf_score: f64, // Raw RRF score
pub normalized_score: f64, // Normalized to 0-1
pub vector_rank: Option<usize>,
pub fts_rank: Option<usize>,
}
/// Rank documents using Reciprocal Rank Fusion.
///
/// Algorithm:
/// RRF_score(d) = Σ 1 / (k + rank_i(d))
///
/// Where:
/// - k = 60 (tunable constant)
/// - rank_i(d) = rank of document d in retriever i (1-indexed)
/// - Sum over all retrievers where document appears
pub fn rank_rrf(
vector_results: &[(i64, f64)], // (doc_id, distance)
fts_results: &[(i64, f64)], // (doc_id, bm25_score)
) -> Vec<RrfResult> {
let mut scores: HashMap<i64, (f64, Option<usize>, Option<usize>)> = HashMap::new();
// Add vector results (1-indexed ranks)
for (rank, (doc_id, _)) in vector_results.iter().enumerate() {
let rrf_contribution = 1.0 / (RRF_K + (rank + 1) as f64);
let entry = scores.entry(*doc_id).or_insert((0.0, None, None));
entry.0 += rrf_contribution;
entry.1 = Some(rank + 1);
}
// Add FTS results (1-indexed ranks)
for (rank, (doc_id, _)) in fts_results.iter().enumerate() {
let rrf_contribution = 1.0 / (RRF_K + (rank + 1) as f64);
let entry = scores.entry(*doc_id).or_insert((0.0, None, None));
entry.0 += rrf_contribution;
entry.2 = Some(rank + 1);
}
// Convert to results and sort by RRF score descending
let mut results: Vec<_> = scores
.into_iter()
.map(|(doc_id, (rrf_score, vector_rank, fts_rank))| {
RrfResult {
document_id: doc_id,
rrf_score,
normalized_score: 0.0, // Will be set below
vector_rank,
fts_rank,
}
})
.collect();
    results.sort_by(|a, b| b.rrf_score.total_cmp(&a.rrf_score));
// Normalize scores to 0-1
if let Some(max_score) = results.first().map(|r| r.rrf_score) {
for result in &mut results {
result.normalized_score = result.rrf_score / max_score;
}
}
results
}
Acceptance Criteria:
- Documents in both lists score higher
- Documents in one list still included
- Normalized score = rrfScore / max(rrfScore)
- Raw RRF score available in `--explain` output
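The numbers behind these criteria, worked by hand: with k = 60, a document ranked #1 by vector and #3 by FTS outscores one ranked #2 by FTS alone, and the top score normalizes to 1.0.

```rust
const RRF_K: f64 = 60.0;

/// RRF contribution of a single retriever for a 1-indexed rank.
fn rrf(rank: usize) -> f64 {
    1.0 / (RRF_K + rank as f64)
}

fn main() {
    let doc_a = rrf(1) + rrf(3); // in both lists: 1/61 + 1/63
    let doc_b = rrf(2);          // FTS only:     1/62
    assert!(doc_a > doc_b);      // fusion rewards agreement across retrievers
    let normalized_a = doc_a / doc_a.max(doc_b); // top score normalizes to 1.0
    assert!((normalized_a - 1.0).abs() < f64::EPSILON);
}
```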
5.3 Adaptive Recall
File: src/search/hybrid.rs
use rusqlite::Connection;
use crate::core::error::Result;
use crate::embedding::OllamaClient;
// SearchMode is defined in this module, so it is not re-imported here.
use crate::search::{SearchFilters, search_fts, search_vector, rank_rrf, RrfResult, FtsQueryMode};
/// Minimum base recall for unfiltered search.
const BASE_RECALL_MIN: usize = 50;
/// Minimum recall when filters are applied.
const FILTERED_RECALL_MIN: usize = 200;
/// Maximum recall to prevent excessive resource usage.
const RECALL_CAP: usize = 1500;
/// Search mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchMode {
Hybrid, // Vector + FTS with RRF
Lexical, // FTS only
Semantic, // Vector only
}
impl SearchMode {
pub fn from_str(s: &str) -> Option<Self> {
match s.to_lowercase().as_str() {
"hybrid" => Some(Self::Hybrid),
"lexical" | "fts" => Some(Self::Lexical),
"semantic" | "vector" => Some(Self::Semantic),
_ => None,
}
}
pub fn as_str(&self) -> &'static str {
match self {
Self::Hybrid => "hybrid",
Self::Lexical => "lexical",
Self::Semantic => "semantic",
}
}
}
/// Hybrid search result.
#[derive(Debug)]
pub struct HybridResult {
pub document_id: i64,
pub score: f64,
pub vector_rank: Option<usize>,
pub fts_rank: Option<usize>,
pub rrf_score: f64,
}
/// Execute hybrid search.
///
/// Adaptive recall: expands topK proportionally to requested limit and filter
/// restrictiveness to prevent "no results" when relevant docs would be filtered out.
///
/// Formula:
/// - Unfiltered: max(50, limit * 10), capped at 1500
/// - Filtered: max(200, limit * 50), capped at 1500
///
/// IMPORTANT: All modes use RRF consistently to ensure rank fields
/// are populated correctly for --explain output.
pub async fn search_hybrid(
conn: &Connection,
client: Option<&OllamaClient>,
ollama_base_url: Option<&str>, // For actionable error messages
query: &str,
mode: SearchMode,
filters: &SearchFilters,
fts_mode: FtsQueryMode,
) -> Result<(Vec<HybridResult>, Vec<String>)> {
let mut warnings: Vec<String> = Vec::new();
// Adaptive recall: proportional to requested limit and filter count
let requested = filters.clamp_limit();
let top_k = if filters.has_any_filter() {
(requested * 50).max(FILTERED_RECALL_MIN).min(RECALL_CAP)
} else {
(requested * 10).max(BASE_RECALL_MIN).min(RECALL_CAP)
};
match mode {
SearchMode::Lexical => {
// FTS only - use RRF with empty vector results for consistent ranking
let fts_results = search_fts(conn, query, top_k, fts_mode)?;
let fts_tuples: Vec<_> = fts_results.iter().map(|r| (r.document_id, r.rank)).collect();
let ranked = rank_rrf(&[], &fts_tuples);
let results = ranked
.into_iter()
.map(|r| HybridResult {
document_id: r.document_id,
score: r.normalized_score,
vector_rank: r.vector_rank,
fts_rank: r.fts_rank,
rrf_score: r.rrf_score,
})
.collect();
Ok((results, warnings))
}
SearchMode::Semantic => {
// Vector only - requires client
let client = client.ok_or_else(|| crate::core::error::GiError::OllamaUnavailable {
base_url: ollama_base_url.unwrap_or("http://localhost:11434").into(),
source: None,
})?;
let query_embedding = client.embed_batch(vec![query.to_string()]).await?;
let embedding = query_embedding.into_iter().next().unwrap();
let vec_results = search_vector(conn, &embedding, top_k)?;
// Use RRF with empty FTS results for consistent ranking
let vec_tuples: Vec<_> = vec_results.iter().map(|r| (r.document_id, r.distance)).collect();
let ranked = rank_rrf(&vec_tuples, &[]);
let results = ranked
.into_iter()
.map(|r| HybridResult {
document_id: r.document_id,
score: r.normalized_score,
vector_rank: r.vector_rank,
fts_rank: r.fts_rank,
rrf_score: r.rrf_score,
})
.collect();
Ok((results, warnings))
}
SearchMode::Hybrid => {
// Both retrievers with RRF fusion
let fts_results = search_fts(conn, query, top_k, fts_mode)?;
// Attempt vector search with graceful degradation on any failure
let vec_results = match client {
Some(client) => {
// Try to embed query; gracefully degrade on transient failures
match client.embed_batch(vec![query.to_string()]).await {
Ok(embeddings) => {
let embedding = embeddings.into_iter().next().unwrap();
search_vector(conn, &embedding, top_k)?
}
Err(e) => {
// Transient failure (network, timeout, rate limit, etc.)
// Log and fall back to FTS-only rather than failing the search
tracing::warn!("Vector search failed, falling back to lexical: {}", e);
warnings.push(format!(
"Vector search unavailable ({}), using lexical search only",
e
));
Vec::new()
}
}
}
None => {
// No client configured
warnings.push("Embedding service unavailable, using lexical search only".into());
Vec::new()
}
};
// RRF fusion
let vec_tuples: Vec<_> = vec_results.iter().map(|r| (r.document_id, r.distance)).collect();
let fts_tuples: Vec<_> = fts_results.iter().map(|r| (r.document_id, r.rank)).collect();
let ranked = rank_rrf(&vec_tuples, &fts_tuples);
let results = ranked
.into_iter()
.map(|r| HybridResult {
document_id: r.document_id,
score: r.normalized_score,
vector_rank: r.vector_rank,
fts_rank: r.fts_rank,
rrf_score: r.rrf_score,
})
.collect();
Ok((results, warnings))
}
}
}
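The fusion step above delegates to `rank_rrf` (defined in `src/search/rrf.rs`, outside this section). As a self-contained sketch of the idea — assuming the common RRF constant k=60; `rrf_fuse` and its signature are illustrative, not the real API:

```rust
use std::collections::HashMap;

/// Minimal RRF sketch: each document scores the sum of 1/(k + rank) over
/// the lists it appears in, so docs found by both retrievers rise.
fn rrf_fuse(vector_ids: &[i64], fts_ids: &[i64], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in [vector_ids, fts_ids] {
        for (rank, id) in list.iter().enumerate() {
            // Ranks are 1-based in the RRF formula
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(i64, f64)> = scores.into_iter().collect();
    // Sort descending by fused score
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Doc 7 is ranked by both retrievers, so it outranks single-list docs.
    let fused = rrf_fuse(&[7, 3, 9], &[5, 7, 2], 60.0);
    assert_eq!(fused[0].0, 7);
}
```

This is also why RRF avoids score normalization: only ranks enter the formula, so vector distances and FTS bm25 scores never need to share a scale.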
Acceptance Criteria:
- Unfiltered search uses topK=max(50, limit*10), capped at 1500
- Filtered search uses topK=max(200, limit*50), capped at 1500
- Final results still limited by `--limit`
- Adaptive recall prevents "no results" under heavy filtering
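The recall formula can be isolated as a pure function; `adaptive_top_k` below is a hypothetical helper mirroring the constants and arithmetic from `search_hybrid`, not existing code:

```rust
/// Pure sketch of the adaptive-recall formula used by `search_hybrid`.
fn adaptive_top_k(limit: usize, filtered: bool) -> usize {
    const BASE_RECALL_MIN: usize = 50;
    const FILTERED_RECALL_MIN: usize = 200;
    const RECALL_CAP: usize = 1500;
    if filtered {
        (limit * 50).max(FILTERED_RECALL_MIN).min(RECALL_CAP)
    } else {
        (limit * 10).max(BASE_RECALL_MIN).min(RECALL_CAP)
    }
}

fn main() {
    assert_eq!(adaptive_top_k(3, false), 50);    // floor applies
    assert_eq!(adaptive_top_k(20, false), 200);  // limit * 10
    assert_eq!(adaptive_top_k(10, true), 500);   // limit * 50
    assert_eq!(adaptive_top_k(100, true), 1500); // cap applies
}
```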
5.4 Graceful Degradation
When Ollama unavailable during hybrid/semantic search:
- Log warning: "Embedding service unavailable, using lexical search only"
- Fall back to FTS-only search
- Include warning in response
Acceptance Criteria:
- Default mode is hybrid
- `--mode=lexical` works without Ollama
- `--mode=semantic` requires Ollama
- Graceful degradation when Ollama down
- `--explain` shows rank breakdown
- All Phase 3 filters work in hybrid mode
Phase 6: Sync Orchestration
6.1 Dirty Source Tracking
File: src/ingestion/dirty_tracker.rs
use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;
use crate::documents::SourceType;
/// Maximum dirty sources to process per sync run.
const MAX_DIRTY_SOURCES_PER_RUN: usize = 500;
/// Mark a source as dirty (needs document regeneration).
///
/// Called during entity upsert operations.
/// Uses INSERT OR IGNORE to avoid duplicates.
pub fn mark_dirty(
conn: &Connection,
source_type: SourceType,
source_id: i64,
) -> Result<()> {
conn.execute(
"INSERT OR IGNORE INTO dirty_sources (source_type, source_id, queued_at)
VALUES (?, ?, ?)",
rusqlite::params![source_type.as_str(), source_id, now_ms()],
)?;
Ok(())
}
/// Get dirty sources ready for processing.
///
/// Uses `next_attempt_at` for efficient, index-friendly backoff queries.
/// Items with NULL `next_attempt_at` are ready immediately (first attempt).
/// Items with `next_attempt_at <= now` have waited long enough after failure.
///
/// Benefits over SQL bitshift calculation:
/// - No overflow risk from large attempt_count values
/// - Index-friendly: `WHERE next_attempt_at <= ?`
/// - Jitter can be added in Rust when computing next_attempt_at
///
/// This prevents hot-loop retries when a source consistently fails
/// to generate a document (e.g., malformed data, missing references).
pub fn get_dirty_sources(conn: &Connection) -> Result<Vec<(SourceType, i64)>> {
let now = now_ms();
let mut stmt = conn.prepare(
"SELECT source_type, source_id
FROM dirty_sources
WHERE next_attempt_at IS NULL OR next_attempt_at <= ?
ORDER BY attempt_count ASC, queued_at ASC
LIMIT ?"
)?;
let results = stmt
.query_map(rusqlite::params![now, MAX_DIRTY_SOURCES_PER_RUN], |row| {
let type_str: String = row.get(0)?;
let source_type = match type_str.as_str() {
"issue" => SourceType::Issue,
"merge_request" => SourceType::MergeRequest,
"discussion" => SourceType::Discussion,
other => return Err(rusqlite::Error::FromSqlConversionFailure(
0,
rusqlite::types::Type::Text,
Box::new(std::io::Error::new(
std::io::ErrorKind::InvalidData,
format!("invalid source_type: {other}"),
)),
)),
};
Ok((source_type, row.get(1)?))
})?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(results)
}
/// Clear dirty source after processing.
pub fn clear_dirty(
conn: &Connection,
source_type: SourceType,
source_id: i64,
) -> Result<()> {
conn.execute(
"DELETE FROM dirty_sources WHERE source_type = ? AND source_id = ?",
rusqlite::params![source_type.as_str(), source_id],
)?;
Ok(())
}
Acceptance Criteria:
- Upserted entities added to dirty_sources
- Duplicates ignored
- Queue cleared after document regeneration
- Processing bounded per run (max 500)
- Exponential backoff uses `next_attempt_at` (index-friendly, no overflow)
- Backoff computed with jitter to prevent thundering herd
- Failed items prioritized lower than fresh items (ORDER BY attempt_count ASC)
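The queries in this module assume a `dirty_sources` table roughly shaped as follows. This is a hypothetical sketch — the authoritative DDL lives in `migrations/`, and column details may differ; `pending_discussion_fetches` (next section) mirrors it, keyed on `(project_id, noteable_type, noteable_iid)`:

```sql
-- Hypothetical sketch of the dirty-source queue; actual migration may differ.
CREATE TABLE dirty_sources (
    source_type     TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
    source_id       INTEGER NOT NULL,
    queued_at       INTEGER NOT NULL,
    attempt_count   INTEGER NOT NULL DEFAULT 0,
    last_attempt_at INTEGER,          -- NULL until first failure
    last_error      TEXT,
    next_attempt_at INTEGER,          -- NULL means ready immediately
    PRIMARY KEY (source_type, source_id)
);

-- Supports the readiness scan:
--   WHERE next_attempt_at IS NULL OR next_attempt_at <= ?
CREATE INDEX idx_dirty_next_attempt ON dirty_sources (next_attempt_at);
```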
6.2 Pending Discussion Queue
File: src/ingestion/discussion_queue.rs
use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;
/// Noteable type for discussion fetching.
#[derive(Debug, Clone, Copy)]
pub enum NoteableType {
Issue,
MergeRequest,
}
impl NoteableType {
pub fn as_str(&self) -> &'static str {
match self {
Self::Issue => "Issue",
Self::MergeRequest => "MergeRequest",
}
}
}
/// Pending discussion fetch entry.
pub struct PendingFetch {
pub project_id: i64,
pub noteable_type: NoteableType,
pub noteable_iid: i64,
pub attempt_count: i64,
}
/// Queue a discussion fetch for an entity.
pub fn queue_discussion_fetch(
conn: &Connection,
project_id: i64,
noteable_type: NoteableType,
noteable_iid: i64,
) -> Result<()> {
conn.execute(
"INSERT OR REPLACE INTO pending_discussion_fetches
(project_id, noteable_type, noteable_iid, queued_at, attempt_count, last_attempt_at, last_error)
VALUES (?, ?, ?, ?, 0, NULL, NULL)",
rusqlite::params![project_id, noteable_type.as_str(), noteable_iid, now_ms()],
)?;
Ok(())
}
/// Get pending fetches ready for processing.
///
/// Uses `next_attempt_at` for efficient, index-friendly backoff queries.
/// Items with NULL `next_attempt_at` are ready immediately (first attempt).
/// Items with `next_attempt_at <= now` have waited long enough after failure.
///
/// Benefits over SQL bitshift calculation:
/// - No overflow risk from large attempt_count values
/// - Index-friendly: `WHERE next_attempt_at <= ?`
/// - Jitter can be added in Rust when computing next_attempt_at
///
/// Limited to `max_items` to bound API calls per sync run.
pub fn get_pending_fetches(conn: &Connection, max_items: usize) -> Result<Vec<PendingFetch>> {
let now = now_ms();
let mut stmt = conn.prepare(
"SELECT project_id, noteable_type, noteable_iid, attempt_count
FROM pending_discussion_fetches
WHERE next_attempt_at IS NULL OR next_attempt_at <= ?
ORDER BY attempt_count ASC, queued_at ASC
LIMIT ?"
)?;
let results = stmt
.query_map(rusqlite::params![now, max_items], |row| {
let type_str: String = row.get(1)?;
let noteable_type = if type_str == "Issue" {
NoteableType::Issue
} else {
NoteableType::MergeRequest
};
Ok(PendingFetch {
project_id: row.get(0)?,
noteable_type,
noteable_iid: row.get(2)?,
attempt_count: row.get(3)?,
})
})?
.collect::<std::result::Result<Vec<_>, _>>()?;
Ok(results)
}
/// Mark fetch as successful and remove from queue.
pub fn complete_fetch(
conn: &Connection,
project_id: i64,
noteable_type: NoteableType,
noteable_iid: i64,
) -> Result<()> {
conn.execute(
"DELETE FROM pending_discussion_fetches
WHERE project_id = ? AND noteable_type = ? AND noteable_iid = ?",
rusqlite::params![project_id, noteable_type.as_str(), noteable_iid],
)?;
Ok(())
}
/// Record fetch failure and compute next retry time.
///
/// Computes `next_attempt_at` using exponential backoff with jitter:
/// - Base delay: 1000ms * 2^attempt_count
/// - Cap: 1 hour (3600000ms)
/// - Jitter: ±10% to prevent thundering herd
pub fn record_fetch_error(
conn: &Connection,
project_id: i64,
noteable_type: NoteableType,
noteable_iid: i64,
error: &str,
current_attempt: i64,
) -> Result<()> {
let now = now_ms();
let next_attempt = compute_next_attempt_at(now, current_attempt + 1);
conn.execute(
"UPDATE pending_discussion_fetches
SET attempt_count = attempt_count + 1,
last_attempt_at = ?,
last_error = ?,
next_attempt_at = ?
WHERE project_id = ? AND noteable_type = ? AND noteable_iid = ?",
rusqlite::params![now, error, next_attempt, project_id, noteable_type.as_str(), noteable_iid],
)?;
Ok(())
}
/// Compute next_attempt_at with exponential backoff and jitter.
///
/// Formula: now + min(3600000, 1000 * 2^attempt_count) * (0.9 to 1.1)
/// - Capped at 1 hour to prevent runaway delays
/// - ±10% jitter prevents synchronized retries after outages
pub fn compute_next_attempt_at(now: i64, attempt_count: i64) -> i64 {
use rand::Rng;
// Cap attempt_count to prevent overflow (2^30 > 1 hour anyway)
let capped_attempts = attempt_count.min(30) as u32;
let base_delay_ms = 1000_i64.saturating_mul(1 << capped_attempts);
let capped_delay_ms = base_delay_ms.min(3_600_000); // 1 hour cap
// Add ±10% jitter
let jitter_factor = rand::thread_rng().gen_range(0.9..=1.1);
let delay_with_jitter = (capped_delay_ms as f64 * jitter_factor) as i64;
now + delay_with_jitter
}
Acceptance Criteria:
- Updated entities queued for discussion fetch
- Success removes from queue
- Failure increments attempt_count and sets next_attempt_at
- Processing bounded per run (max 100)
- Exponential backoff uses `next_attempt_at` (index-friendly, no overflow)
- Backoff computed with jitter to prevent thundering herd
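Dropping the jitter term leaves a deterministic core that is easy to unit-test. `backoff_delay_ms` is an illustrative refactoring of `compute_next_attempt_at`, not a function in the codebase:

```rust
/// Deterministic core of the backoff schedule (jitter omitted).
fn backoff_delay_ms(attempt_count: i64) -> i64 {
    // Cap the exponent: 2^30 seconds already exceeds the 1-hour cap,
    // and larger shifts would overflow.
    let capped = attempt_count.min(30) as u32;
    1000_i64.saturating_mul(1 << capped).min(3_600_000)
}

fn main() {
    assert_eq!(backoff_delay_ms(0), 1_000);      // first retry after ~1s
    assert_eq!(backoff_delay_ms(1), 2_000);
    assert_eq!(backoff_delay_ms(5), 32_000);
    assert_eq!(backoff_delay_ms(12), 3_600_000); // 2^12 s > 1h: cap applies
    assert_eq!(backoff_delay_ms(40), 3_600_000); // exponent cap prevents overflow
}
```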
6.3 Document Regenerator
File: src/documents/regenerator.rs
use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;
use crate::documents::{
    extract_issue_document, extract_mr_document, extract_discussion_document,
    DocumentData, SourceType,
};
use crate::ingestion::dirty_tracker::{get_dirty_sources, clear_dirty};
/// Result of regeneration run.
#[derive(Debug, Default)]
pub struct RegenerateResult {
pub regenerated: usize,
pub unchanged: usize,
pub errored: usize,
}
/// Regenerate documents from dirty queue.
///
/// Process:
/// 1. Query dirty_sources ordered by queued_at
/// 2. For each: regenerate document, compute new hash
/// 3. ALWAYS upsert document (labels/paths may change even if content_hash unchanged)
/// 4. Track whether content_hash changed (for stats)
/// 5. Delete from dirty_sources (or record error on failure)
pub fn regenerate_dirty_documents(conn: &Connection) -> Result<RegenerateResult> {
let dirty = get_dirty_sources(conn)?;
let mut result = RegenerateResult::default();
for (source_type, source_id) in &dirty {
match regenerate_one(conn, *source_type, *source_id) {
Ok(changed) => {
if changed {
result.regenerated += 1;
} else {
result.unchanged += 1;
}
clear_dirty(conn, *source_type, *source_id)?;
}
Err(e) => {
// Fail-soft: record error but continue processing remaining items
record_dirty_error(conn, *source_type, *source_id, &e.to_string())?;
result.errored += 1;
}
}
}
Ok(result)
}
/// Regenerate a single document. Returns true if content_hash changed.
///
/// If the source entity has been deleted, the corresponding document
/// is also deleted (cascade cleans up labels, paths, embeddings).
fn regenerate_one(
conn: &Connection,
source_type: SourceType,
source_id: i64,
) -> Result<bool> {
// Extractors return Option: None means source entity was deleted
let doc = match source_type {
SourceType::Issue => extract_issue_document(conn, source_id)?,
SourceType::MergeRequest => extract_mr_document(conn, source_id)?,
SourceType::Discussion => extract_discussion_document(conn, source_id)?,
};
let Some(doc) = doc else {
// Source was deleted — remove the document (cascade handles FTS/embeddings)
delete_document(conn, source_type, source_id)?;
return Ok(true);
};
let existing_hash = get_existing_hash(conn, source_type, source_id)?;
let changed = existing_hash.as_ref() != Some(&doc.content_hash);
// Always upsert: labels/paths can change independently of content_hash
upsert_document(conn, &doc)?;
Ok(changed)
}
/// Delete a document by source identity (cascade handles FTS trigger, labels, paths, embeddings).
fn delete_document(
conn: &Connection,
source_type: SourceType,
source_id: i64,
) -> Result<()> {
conn.execute(
"DELETE FROM documents WHERE source_type = ? AND source_id = ?",
rusqlite::params![source_type.as_str(), source_id],
)?;
Ok(())
}
/// Record a regeneration error on a dirty source for retry.
///
/// Sets `next_attempt_at` via the shared exponential-backoff helper so
/// persistently failing sources don't hot-loop on every sync run
/// (matching the acceptance criteria for dirty-source backoff).
fn record_dirty_error(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
    error: &str,
) -> Result<()> {
    let now = now_ms();
    // Read the current attempt_count to size the backoff window
    let attempts: i64 = conn.query_row(
        "SELECT attempt_count FROM dirty_sources WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
        |row| row.get(0),
    )?;
    let next_attempt =
        crate::ingestion::discussion_queue::compute_next_attempt_at(now, attempts + 1);
    conn.execute(
        "UPDATE dirty_sources
         SET attempt_count = attempt_count + 1,
             last_attempt_at = ?,
             last_error = ?,
             next_attempt_at = ?
         WHERE source_type = ? AND source_id = ?",
        rusqlite::params![now, error, next_attempt, source_type.as_str(), source_id],
    )?;
    Ok(())
}
/// Get existing content hash for a document, if it exists.
///
/// IMPORTANT: Uses `optional()` to distinguish between:
/// - No row found -> Ok(None)
/// - Row found -> Ok(Some(hash))
/// - DB error -> Err(...)
///
/// Using `.ok()` would hide real DB errors (disk I/O, corruption, etc.)
/// which should propagate up for proper error handling.
fn get_existing_hash(
conn: &Connection,
source_type: SourceType,
source_id: i64,
) -> Result<Option<String>> {
use rusqlite::OptionalExtension;
let mut stmt = conn.prepare(
"SELECT content_hash FROM documents WHERE source_type = ? AND source_id = ?"
)?;
let hash: Option<String> = stmt
.query_row(rusqlite::params![source_type.as_str(), source_id], |row| row.get(0))
.optional()?;
Ok(hash)
}
fn upsert_document(conn: &Connection, doc: &DocumentData) -> Result<()> {
use rusqlite::OptionalExtension;
// Check existing hashes before upserting (for write optimization)
let existing: Option<(i64, String, String)> = conn
.query_row(
"SELECT id, labels_hash, paths_hash FROM documents
WHERE source_type = ? AND source_id = ?",
rusqlite::params![doc.source_type.as_str(), doc.source_id],
|row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)),
)
.optional()?;
// Upsert main document (includes labels_hash, paths_hash)
conn.execute(
"INSERT INTO documents
(source_type, source_id, project_id, author_username, label_names,
labels_hash, paths_hash,
created_at, updated_at, url, title, content_text, content_hash,
is_truncated, truncated_reason)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
ON CONFLICT(source_type, source_id) DO UPDATE SET
author_username = excluded.author_username,
label_names = excluded.label_names,
labels_hash = excluded.labels_hash,
paths_hash = excluded.paths_hash,
updated_at = excluded.updated_at,
url = excluded.url,
title = excluded.title,
content_text = excluded.content_text,
content_hash = excluded.content_hash,
is_truncated = excluded.is_truncated,
truncated_reason = excluded.truncated_reason",
rusqlite::params![
doc.source_type.as_str(),
doc.source_id,
doc.project_id,
doc.author_username,
serde_json::to_string(&doc.labels)?,
doc.labels_hash,
doc.paths_hash,
doc.created_at,
doc.updated_at,
doc.url,
doc.title,
doc.content_text,
doc.content_hash,
doc.is_truncated,
doc.truncated_reason,
],
)?;
// Get document ID (either existing or newly inserted)
let doc_id = match existing {
Some((id, _, _)) => id,
None => get_document_id(conn, doc.source_type, doc.source_id)?,
};
// Only update labels if hash changed (reduces write amplification)
let labels_changed = match &existing {
Some((_, old_hash, _)) => old_hash != &doc.labels_hash,
None => true, // New document, must insert
};
if labels_changed {
conn.execute(
"DELETE FROM document_labels WHERE document_id = ?",
[doc_id],
)?;
for label in &doc.labels {
conn.execute(
"INSERT INTO document_labels (document_id, label_name) VALUES (?, ?)",
rusqlite::params![doc_id, label],
)?;
}
}
// Only update paths if hash changed (reduces write amplification)
let paths_changed = match &existing {
Some((_, _, old_hash)) => old_hash != &doc.paths_hash,
None => true, // New document, must insert
};
if paths_changed {
conn.execute(
"DELETE FROM document_paths WHERE document_id = ?",
[doc_id],
)?;
for path in &doc.paths {
conn.execute(
"INSERT INTO document_paths (document_id, path) VALUES (?, ?)",
rusqlite::params![doc_id, path],
)?;
}
}
Ok(())
}
fn get_document_id(
conn: &Connection,
source_type: SourceType,
source_id: i64,
) -> Result<i64> {
let id: i64 = conn.query_row(
"SELECT id FROM documents WHERE source_type = ? AND source_id = ?",
rusqlite::params![source_type.as_str(), source_id],
|row| row.get(0),
)?;
Ok(id)
}
Acceptance Criteria:
- Dirty sources get documents regenerated
- Content-hash comparison limits re-embedding to changed documents; labels/paths writes gated by their own hashes
- Deleted source entities remove their documents (cascade cleans up FTS, labels, paths, embeddings)
- FTS triggers fire on document update
- Queue cleared after processing; failures recorded for backoff retry
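Change detection hinges on `content_hash` being deterministic across runs. The real extractor presumably uses a cryptographic hash; this FNV-1a toy merely illustrates the required property (note that std's `DefaultHasher` would not qualify, since its seed varies per process):

```rust
/// Illustrative stable hash (FNV-1a, hex-encoded) — a stand-in for whatever
/// the document extractor actually uses. Determinism is the point: the same
/// content_text must hash identically on every sync run.
fn content_hash(text: &str) -> String {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in text.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    format!("{h:016x}")
}

fn main() {
    // Same content -> same hash; any edit changes it.
    assert_eq!(content_hash("title\nbody"), content_hash("title\nbody"));
    assert_ne!(content_hash("title\nbody"), content_hash("title\nbody!"));
}
```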
6.4 CLI: gi sync
File: src/cli/commands/sync.rs
//! Sync command - orchestrate full sync pipeline.
use serde::Serialize;
use crate::core::error::Result;
use crate::Config;
/// Sync result summary.
#[derive(Debug, Serialize)]
pub struct SyncResult {
pub issues_updated: usize,
pub mrs_updated: usize,
pub discussions_fetched: usize,
pub documents_regenerated: usize,
pub documents_embedded: usize,
}
/// Sync options.
#[derive(Debug, Default)]
pub struct SyncOptions {
pub full: bool, // Reset cursors, fetch everything
pub force: bool, // Override stale lock
pub no_embed: bool, // Skip embedding step
pub no_docs: bool, // Skip document regeneration
}
/// Run sync orchestration.
///
/// Steps:
/// 1. Acquire app lock with heartbeat
/// 2. Ingest delta (issues, MRs) based on cursors
/// 3. Process pending_discussion_fetches queue (bounded)
/// 4. Apply rolling backfill window (configurable, default 14 days)
/// 5. Regenerate documents from dirty_sources
/// 6. Embed documents with changed content_hash
/// 7. Release lock, record sync_run
pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResult> {
// Implementation uses existing ingestion orchestrator
// and new document/embedding pipelines
todo!()
}
/// Print human-readable sync output.
pub fn print_sync(result: &SyncResult, elapsed_secs: u64) {
println!("Sync complete:");
println!(" Issues updated: {:>6}", result.issues_updated);
println!(" MRs updated: {:>6}", result.mrs_updated);
println!(" Discussions fetched: {:>6}", result.discussions_fetched);
println!(" Documents regenerated: {:>6}", result.documents_regenerated);
println!(" Documents embedded: {:>6}", result.documents_embedded);
println!(" Elapsed: {}m {}s", elapsed_secs / 60, elapsed_secs % 60);
}
/// Print JSON output for robot mode.
pub fn print_sync_json(result: &SyncResult, elapsed_ms: u64) {
let output = serde_json::json!({
"ok": true,
"data": result,
"meta": {
"elapsed_ms": elapsed_ms
}
});
println!("{}", serde_json::to_string_pretty(&output).unwrap());
}
CLI integration:
/// Sync subcommand arguments.
#[derive(Args)]
pub struct SyncArgs {
/// Reset cursors, fetch everything
#[arg(long)]
full: bool,
/// Override stale lock
#[arg(long)]
force: bool,
/// Skip embedding step
#[arg(long)]
no_embed: bool,
/// Skip document regeneration
#[arg(long)]
no_docs: bool,
}
Acceptance Criteria:
- Orchestrates full sync pipeline
- Respects app lock
- `--full` resets cursors
- `--no-embed` skips embedding
- `--no-docs` skips document regeneration
- Progress reporting in human mode
- JSON summary in robot mode
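The flag interactions can be viewed as a pure planning function; step names below echo the `run_sync` docstring and are illustrative only (`planned_steps` is not real code):

```rust
/// Pure sketch of which pipeline steps run for a given flag combination.
fn planned_steps(no_docs: bool, no_embed: bool) -> Vec<&'static str> {
    let mut steps = vec!["ingest_delta", "process_discussion_queue", "rolling_backfill"];
    if !no_docs {
        steps.push("regenerate_documents");
    }
    if !no_embed {
        steps.push("embed_changed_documents");
    }
    steps.push("record_sync_run");
    steps
}

fn main() {
    // --no-embed still regenerates documents but skips embedding
    let plan = planned_steps(false, true);
    assert!(plan.contains(&"regenerate_documents"));
    assert!(!plan.contains(&"embed_changed_documents"));
}
```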
Testing Strategy
Unit Tests
| Module | Test File | Coverage |
|---|---|---|
| Document extractor | `src/documents/extractor.rs` (mod tests) | Issue/MR/discussion extraction, consistent headers |
| Truncation | `src/documents/truncation.rs` (mod tests) | All edge cases |
| RRF ranking | `src/search/rrf.rs` (mod tests) | Score computation, merging |
| Content hash | `src/documents/extractor.rs` (mod tests) | Deterministic hashing |
| FTS query sanitization | `src/search/fts.rs` (mod tests) | `to_fts_query()` edge cases: `-`, `"`, `:`, `*`, `C++` |
| SourceType parsing | `src/documents/extractor.rs` (mod tests) | `parse()` accepts aliases: `mr`, `mrs`, `issue`, etc. |
| SearchFilters | `src/search/filters.rs` (mod tests) | `has_any_filter()`, `clamp_limit()` |
| Backoff logic | `src/ingestion/dirty_tracker.rs` (mod tests) | Exponential backoff query timing |
Integration Tests
| Feature | Test File | Coverage |
|---|---|---|
| FTS search | `tests/fts_search.rs` | Stemming, empty results |
| Embedding storage | `tests/embedding.rs` | sqlite-vec operations |
| Hybrid search | `tests/hybrid_search.rs` | Combined retrieval |
| Sync orchestration | `tests/sync.rs` | Full pipeline |
Golden Query Suite
File: tests/fixtures/golden_queries.json
[
{
"query": "authentication redesign",
"expected_urls": [".../-/issues/234", ".../-/merge_requests/847"],
"min_results": 1,
"max_rank": 10
}
]
Each query must have at least one expected URL in top 10 results.
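The pass criterion ("at least one expected URL within the top N") can be captured in a small predicate; `golden_passes` is a hypothetical helper sketching the check, not existing test code:

```rust
/// A golden query passes when any expected URL appears within the top
/// `max_rank` search results.
fn golden_passes(results: &[&str], expected: &[&str], max_rank: usize) -> bool {
    results.iter().take(max_rank).any(|url| expected.contains(url))
}

fn main() {
    let results = ["https://x/-/issues/234", "https://x/-/issues/9"];
    assert!(golden_passes(&results, &["https://x/-/issues/234"], 10));
    // Expected URL exists but not surfaced: fails
    assert!(!golden_passes(&results, &["https://x/-/merge_requests/1"], 10));
}
```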
CLI Smoke Tests
| Command | Expected | Pass Criteria |
|---|---|---|
| `gi generate-docs` | Progress, count | Completes, count > 0 |
| `gi generate-docs` (re-run) | 0 regenerated | Hash comparison works |
| `gi embed` | Progress, count | Completes, count matches docs |
| `gi embed` (re-run) | 0 embedded | Skips unchanged |
| `gi embed --retry-failed` | Processes failed | Only failed docs processed |
| `gi stats` | Coverage stats | Shows 100% after embed |
| `gi stats` | Queue depths | Shows dirty_sources and pending_discussion_fetches counts |
| `gi search "auth" --mode=lexical` | Results | Works without Ollama |
| `gi search "auth"` | Hybrid results | Vector + FTS combined |
| `gi search "auth"` (Ollama down) | FTS results + warning | Graceful degradation, warning in response |
| `gi search "auth" --explain` | Rank breakdown | Shows vector/FTS/RRF |
| `gi search "auth" --type=mr` | Filtered results | Only MRs |
| `gi search "auth" --type=mrs` | Filtered results | Alias works |
| `gi search "auth" --label=bug` | Filtered results | Only labeled docs |
| `gi search "-DWITH_SSL"` | Results | Leading dash doesn't cause FTS error |
| `gi search 'C++'` | Results | Special chars in query work |
| `gi search "nonexistent123"` | No results | Graceful empty state |
| `gi sync` | Full pipeline | All steps complete |
| `gi sync --no-embed` | Skip embedding | Docs generated, not embedded |
Data Integrity Checks
- `documents` count = issues + MRs + discussions
- `documents_fts` count = `documents` count
- `embeddings` count = `documents` count (after full embed)
- `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- All `document_labels` reference valid documents
- All `document_paths` reference valid documents
- No orphaned embeddings (embeddings.rowid without matching documents.id)
- Discussion documents exclude system notes
- Discussion documents include parent title
- All `dirty_sources` entries reference existing source entities
- All `pending_discussion_fetches` entries reference existing projects
- `attempt_count` >= 0 for all queue entries (never negative)
- `last_attempt_at` is NULL when `attempt_count` = 0
Success Criteria
Checkpoint 3 is complete when:
- **Lexical search works without Ollama**
  - `gi search "query" --mode=lexical` returns relevant results
  - All filters functional
  - FTS5 syntax errors prevented by query sanitization
  - Special characters in queries work correctly (`-DWITH_SSL`, `C++`)
- **Semantic search works with Ollama**
  - `gi embed` completes successfully
  - `gi search "query"` returns semantically relevant results
  - `--explain` shows ranking breakdown
- **Hybrid search combines both**
  - Documents appearing in both retrievers rank higher
  - Graceful degradation when Ollama unavailable (falls back to FTS)
  - Transient embed failures don't fail the entire search
  - Warning message included in response on degradation
- **Incremental sync is efficient**
  - `gi sync` only processes changed entities
  - Re-embedding only happens for changed documents
  - Progress visible during long syncs
  - Queue backoff prevents hot-loop retries on persistent failures
- **Data integrity maintained**
  - All counts match between tables
  - No orphaned records
  - Hashes consistent
  - `get_existing_hash()` properly distinguishes "not found" from DB errors
- **Observability**
  - `gi stats` shows queue depths and failed item counts
  - Failed items visible for operator intervention
  - Deterministic ordering ensures consistent paging
- **Tests pass**
  - Unit tests for core algorithms (including FTS sanitization, backoff)
  - Integration tests for pipelines
  - Golden queries return expected results