gitlore/docs/prd/checkpoint-3.md
Taylor Eernisse f4dba386c9 docs: Restructure checkpoint-3 PRD with gated milestones
Reorganizes the Search & Sync MVP plan into three independently
verifiable gates (A: Lexical MVP, B: Hybrid MVP, C: Sync MVP)
to reduce integration risk. Each gate has explicit deliverables,
acceptance criteria, and can ship on its own.

Expands the specification with additional detail on document
generation, search API surface, sync orchestration, and
integrity repair paths. Removes the outdated rename note since
the project is now fully migrated to gitlore/lore naming.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 08:42:39 -05:00

Checkpoint 3: Search & Sync MVP

Status: Planning
Prerequisite: Checkpoints 0, 1, 2 complete (issues, MRs, discussions ingested)
Goal: Deliver working semantic + lexical hybrid search with efficient incremental sync

This checkpoint consolidates SPEC.md checkpoints 3A, 3B, 4, and 5 into a unified implementation plan. The work is structured for parallel agent execution where dependencies allow.

All code integrates with existing gitlore infrastructure:

  • Error handling via GiError and ErrorCode in src/core/error.rs
  • CLI patterns matching src/cli/commands/*.rs (run functions, JSON/human output)
  • Database via rusqlite::Connection with migrations in migrations/
  • Config via src/core/config.rs (EmbeddingConfig already defined)
  • Robot mode JSON with {"ok": true, "data": {...}} pattern

Executive Summary (Gated Milestones)

This checkpoint ships in three gates to reduce integration risk. Each gate is independently verifiable and shippable:

Gate A (Lexical MVP): documents + FTS + filters + lore search --mode=lexical + lore stats
Gate B (Hybrid MVP): embeddings + vector + RRF fusion + graceful degradation
Gate C (Sync MVP): lore sync orchestration + queues/backoff + integrity check/repair

Deliverables:

Gate A

  1. Document generation from issues/MRs/discussions with FTS5 indexing
  2. Lexical search + filters + snippets + lore stats

Gate B

  3. Ollama-powered embedding pipeline with sqlite-vec storage
  4. Hybrid search (RRF-ranked vector + lexical) with rich filtering + graceful degradation

Gate C

  5. Orchestrated lore sync command with incremental doc regen + re-embedding
  6. Integrity checks + repair paths for FTS/embeddings consistency

Key Design Decisions:

  • Documents are the search unit (not raw entities)
  • FTS5 works standalone when Ollama unavailable (graceful degradation)
  • sqlite-vec rowid = documents.id for simple joins
  • RRF ranking avoids score normalization complexity
  • Queue-based discussion fetching isolates failures
  • FTS5 query sanitization prevents syntax errors from user input
  • Exponential backoff on all queues prevents hot-loop retries
  • Transient embed failures trigger graceful degradation (not hard errors)
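The backoff behavior named in the last two bullets can be sketched as a pure function. This is a sketch only; the 1-minute base and 1-hour cap are illustrative assumptions, not values this PRD mandates:

```rust
/// Illustrative backoff constants (assumptions, not spec values).
const BASE_DELAY_MS: i64 = 60_000;    // 1 minute
const MAX_DELAY_MS: i64 = 3_600_000;  // 1 hour cap

/// Compute the next retry time for a queue item after a failed attempt.
/// Doubling with a hard cap prevents hot-loop retries while keeping
/// recovery latency bounded.
fn next_attempt_at(now_ms: i64, attempt_count: u32) -> i64 {
    // Clamp the exponent so the shift can never overflow i64.
    let shift = attempt_count.min(20);
    let delay = (BASE_DELAY_MS << shift).min(MAX_DELAY_MS);
    now_ms + delay
}
```

The result would be written to the queue row's next_attempt_at column, which the index from Phase 1 makes cheap to scan.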

Phase 1: Schema Foundation

1.1 Documents Schema (Migration 007)

File: migrations/007_documents.sql

-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
  source_id INTEGER NOT NULL,    -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT,          -- for discussions: first note author
  label_names TEXT,              -- JSON array (display/debug only)
  created_at INTEGER,            -- ms epoch UTC
  updated_at INTEGER,            -- ms epoch UTC
  url TEXT,
  title TEXT,                    -- null for discussions
  content_text TEXT NOT NULL,    -- canonical text for embedding/search
  content_hash TEXT NOT NULL,    -- SHA-256 for change detection
  labels_hash TEXT NOT NULL DEFAULT '',  -- SHA-256 over sorted labels (write optimization)
  paths_hash TEXT NOT NULL DEFAULT '',   -- SHA-256 over sorted paths (write optimization)
  is_truncated INTEGER NOT NULL DEFAULT 0,
  truncated_reason TEXT CHECK (
    truncated_reason IN ('token_limit_middle_drop','single_note_oversized','first_last_oversized')
    OR truncated_reason IS NULL
  ),
  UNIQUE(source_type, source_id)
);

CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
CREATE INDEX idx_documents_hash ON documents(content_hash);

-- Fast label filtering (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
) WITHOUT ROWID;
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- Fast path filtering (DiffNote file paths)
CREATE TABLE document_paths (
  document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  path TEXT NOT NULL,
  PRIMARY KEY(document_id, path)
) WITHOUT ROWID;
CREATE INDEX idx_document_paths_path ON document_paths(path);

-- Queue for incremental document regeneration (with retry tracking)
-- Uses next_attempt_at for index-friendly backoff queries
CREATE TABLE dirty_sources (
  source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
  source_id INTEGER NOT NULL,
  queued_at INTEGER NOT NULL,    -- ms epoch UTC
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER,
  last_error TEXT,
  next_attempt_at INTEGER,       -- ms epoch UTC; NULL means ready immediately
  PRIMARY KEY(source_type, source_id)
);
CREATE INDEX idx_dirty_sources_next_attempt ON dirty_sources(next_attempt_at);

-- Resumable queue for dependent discussion fetching
-- Uses next_attempt_at for index-friendly backoff queries
CREATE TABLE pending_discussion_fetches (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  noteable_type TEXT NOT NULL,            -- 'Issue' | 'MergeRequest'
  noteable_iid INTEGER NOT NULL,
  queued_at INTEGER NOT NULL,             -- ms epoch UTC
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER,
  last_error TEXT,
  next_attempt_at INTEGER,                -- ms epoch UTC; NULL means ready immediately
  PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_next_attempt ON pending_discussion_fetches(next_attempt_at);

Acceptance Criteria:

  • Migration applies cleanly on fresh DB
  • Migration applies cleanly after CP2 schema
  • All foreign keys enforced
  • Indexes created
  • labels_hash and paths_hash columns present for write optimization
  • next_attempt_at indexed for efficient backoff queries
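The NULL-means-ready semantics of next_attempt_at reduce to a small predicate; a sketch mirroring the SQL condition `next_attempt_at IS NULL OR next_attempt_at <= :now`:

```rust
/// A queue row is ready to process when it has no scheduled retry time
/// (never failed, or reset) or its scheduled time has passed.
fn is_ready(next_attempt_at: Option<i64>, now_ms: i64) -> bool {
    match next_attempt_at {
        None => true,
        Some(t) => t <= now_ms,
    }
}
```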

1.2 FTS5 Index (Migration 008)

File: migrations/008_fts5.sql

-- Full-text search with porter stemmer and prefix indexes for type-ahead
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61',
  prefix='2 3 4'
);

-- Keep FTS in sync via triggers.
-- IMPORTANT: COALESCE(title, '') ensures FTS5 external-content table never
-- receives NULL values, which can cause inconsistencies with delete operations.
-- FTS5 delete requires exact match of original values; NULL != NULL in SQL,
-- so a NULL title on insert would make the delete trigger fail silently.
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, COALESCE(new.title, ''), new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, COALESCE(old.title, ''), old.content_text);
END;

-- Only rebuild FTS when searchable text actually changes (not metadata-only updates)
CREATE TRIGGER documents_au AFTER UPDATE ON documents
WHEN old.title IS NOT new.title OR old.content_text != new.content_text
BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, COALESCE(old.title, ''), old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, COALESCE(new.title, ''), new.content_text);
END;

Acceptance Criteria:

  • documents_fts created as virtual table
  • Triggers fire on insert/update/delete
  • Update trigger only fires when title or content_text changes (not metadata-only updates)
  • FTS row count matches documents count after bulk insert
  • Prefix search works for type-ahead UX

1.3 Embeddings Schema (Migration 009)

File: migrations/009_embeddings.sql

-- NOTE: sqlite-vec vec0 virtual tables cannot participate in FK cascades.
-- We must use an explicit trigger to delete orphan embeddings when documents
-- are deleted. See documents_embeddings_ad trigger below.

-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
  embedding float[768]
);

-- Embedding provenance + change detection
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
  model TEXT NOT NULL,           -- 'nomic-embed-text'
  dims INTEGER NOT NULL,         -- 768
  content_hash TEXT NOT NULL,    -- copied from documents.content_hash
  created_at INTEGER NOT NULL,   -- ms epoch UTC
  last_error TEXT,               -- error message from last failed attempt
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER        -- ms epoch UTC
);

CREATE INDEX idx_embedding_metadata_errors
  ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
CREATE INDEX idx_embedding_metadata_hash ON embedding_metadata(content_hash);

-- CRITICAL: Delete orphan embeddings when documents are deleted.
-- vec0 virtual tables don't support FK ON DELETE CASCADE, so we need this trigger.
-- embedding_metadata has ON DELETE CASCADE, so only vec0 needs explicit cleanup
CREATE TRIGGER documents_embeddings_ad AFTER DELETE ON documents BEGIN
  DELETE FROM embeddings WHERE rowid = old.id;
END;

Acceptance Criteria:

  • embeddings vec0 table created
  • embedding_metadata tracks provenance
  • Error tracking fields present for retry logic
  • Orphan cleanup trigger fires on document deletion

Dependencies:

  • Requires sqlite-vec extension loaded at runtime
  • Extension loading already happens in src/core/db.rs
  • Migration runner must load sqlite-vec before applying migrations (including on fresh DB)

Phase 2: Document Generation

2.1 Document Module Structure

New module: src/documents/

src/documents/
├── mod.rs           # Module exports
├── extractor.rs     # Document extraction from entities
├── truncation.rs    # Note-boundary aware truncation
└── regenerator.rs   # Dirty source processing

File: src/documents/mod.rs

//! Document generation and management.
//!
//! Extracts searchable documents from issues, MRs, and discussions.

mod extractor;
mod regenerator;
mod truncation;

pub use extractor::{
    extract_discussion_document, extract_issue_document, extract_mr_document,
    DocumentData, SourceType,
};
// Note: extract_*_document() return Result<Option<DocumentData>>
// None means the source entity was deleted from the database
pub use regenerator::regenerate_dirty_documents;
pub use truncation::{truncate_content, TruncationResult};

Update src/lib.rs:

pub mod documents;  // Add to existing modules

2.2 Document Types

File: src/documents/extractor.rs

use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};

/// Source type for documents.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum SourceType {
    Issue,
    MergeRequest,
    Discussion,
}

impl SourceType {
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Issue => "issue",
            Self::MergeRequest => "merge_request",
            Self::Discussion => "discussion",
        }
    }

    /// Parse from CLI input, accepting common aliases.
    ///
    /// Accepts: "issue", "mr", "merge_request", "discussion"
    pub fn parse(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "issue" | "issues" => Some(Self::Issue),
            "mr" | "mrs" | "merge_request" | "merge_requests" => Some(Self::MergeRequest),
            "discussion" | "discussions" => Some(Self::Discussion),
            _ => None,
        }
    }
}

impl std::fmt::Display for SourceType {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.as_str())
    }
}

/// Generated document ready for storage.
#[derive(Debug, Clone)]
pub struct DocumentData {
    pub source_type: SourceType,
    pub source_id: i64,
    pub project_id: i64,
    pub author_username: Option<String>,
    pub labels: Vec<String>,
    pub paths: Vec<String>,  // DiffNote file paths
    pub labels_hash: String, // SHA-256 over sorted labels (write optimization)
    pub paths_hash: String,  // SHA-256 over sorted paths (write optimization)
    pub created_at: i64,
    pub updated_at: i64,
    pub url: Option<String>,
    pub title: Option<String>,
    pub content_text: String,
    pub content_hash: String,
    pub is_truncated: bool,
    pub truncated_reason: Option<String>,
}

/// Compute SHA-256 hash of content.
pub fn compute_content_hash(content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(content.as_bytes());
    format!("{:x}", hasher.finalize())
}

/// Compute SHA-256 hash over a sorted list of strings.
/// Used for labels_hash and paths_hash to detect changes efficiently.
pub fn compute_list_hash(items: &[String]) -> String {
    let mut sorted = items.to_vec();
    sorted.sort();
    let joined = sorted.join("\n");
    compute_content_hash(&joined)
}
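To illustrate why compute_list_hash sorts before joining: the hash must depend only on the set of labels/paths, not their order. A sketch of the same canonicalization using std's DefaultHasher as a stand-in for SHA-256, so it needs no external crate:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for compute_list_hash: identical canonicalization (sort, then
/// join with '\n'), but hashed with DefaultHasher instead of SHA-256.
/// The property demonstrated: label/path ORDER does not affect the hash,
/// only the set of values does.
fn list_hash_demo(items: &[&str]) -> u64 {
    let mut sorted: Vec<&str> = items.to_vec();
    sorted.sort();
    let joined = sorted.join("\n");
    let mut h = DefaultHasher::new();
    joined.hash(&mut h);
    h.finish()
}
```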

Document Formats:

All document types use a consistent header format for better search relevance and context:

Source       content_text
Issue        Structured header + description (see below)
MR           Structured header + description (see below)
Discussion   Full thread with header (see below)

Issue Document Format:

[[Issue]] #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234
Labels: ["bug", "auth"]
State: opened
Author: @johndoe

--- Description ---

We need to modernize our authentication system...

MR Document Format:

[[MergeRequest]] !456: Implement JWT authentication
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/merge_requests/456
Labels: ["feature", "auth"]
State: opened
Author: @johndoe
Source: feature/jwt-auth -> main

--- Description ---

This MR implements JWT-based authentication as discussed in #234...

Discussion Document Format:

[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"]

--- Thread ---

@johndoe (2024-03-15):
I think we should move to JWT-based auth...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?

Acceptance Criteria:

  • Issue document: structured header with [[Issue]] prefix, project, URL, labels, state, author, then description
  • MR document: structured header with [[MergeRequest]] prefix, project, URL, labels, state, author, branches, then description
  • Discussion document: includes parent type+title, project, URL, labels, files, then thread
  • System notes (is_system=1) excluded from discussion content
  • DiffNote file paths extracted to paths vector
  • Labels extracted to labels vector
  • SHA-256 hash computed from content_text
  • Headers use consistent separator lines (--- Description ---, --- Thread ---)

2.3 Truncation Logic

File: src/documents/truncation.rs

/// Maximum content length (~8,000 tokens at 4 chars/token estimate).
pub const MAX_CONTENT_CHARS: usize = 32_000;

/// Truncation result with metadata.
#[derive(Debug, Clone)]
pub struct TruncationResult {
    pub content: String,
    pub is_truncated: bool,
    pub reason: Option<TruncationReason>,
}

/// Reason for truncation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruncationReason {
    TokenLimitMiddleDrop,
    SingleNoteOversized,
    FirstLastOversized,
}

impl TruncationReason {
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::TokenLimitMiddleDrop => "token_limit_middle_drop",
            Self::SingleNoteOversized => "single_note_oversized",
            Self::FirstLastOversized => "first_last_oversized",
        }
    }
}

/// Truncate content at note boundaries.
///
/// Rules:
/// - Max content: 32,000 characters
/// - Truncate at NOTE boundaries (never mid-note)
/// - Preserve first N notes and last M notes
/// - Drop from middle, insert marker
pub fn truncate_content(notes: &[NoteContent], max_chars: usize) -> TruncationResult {
    // Implementation handles edge cases per table below
    todo!()
}

/// Note content for truncation.
pub struct NoteContent {
    pub author: String,
    pub date: String,
    pub body: String,
}

Edge Cases:

Scenario                     Handling
Single note > 32,000 chars   Truncate at char boundary, append [truncated], reason = single_note_oversized
First + last note > 32,000   Keep only first note (truncated if needed), reason = first_last_oversized
Only one note                Truncate at char boundary if needed
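A minimal sketch of the middle-drop strategy behind truncate_content, under simplifying assumptions: it budgets raw note bodies only (ignoring the author/date header overhead the real formatter adds) and covers only the happy path, while the edge cases above need the extra handling listed:

```rust
/// Keep notes from the front and back of a thread, dropping from the
/// middle, until the kept bodies fit the character budget. Notes are
/// never cut mid-content; they are kept whole or dropped whole.
fn middle_drop(notes: &[&str], max_chars: usize) -> (Vec<String>, usize) {
    let total: usize = notes.iter().map(|n| n.len()).sum();
    if total <= max_chars {
        return (notes.iter().map(|n| n.to_string()).collect(), 0);
    }
    let mut lo = 0;           // next index to take from the front
    let mut hi = notes.len(); // one past the last taken from the back
    let mut used = 0;
    let mut front: Vec<String> = Vec::new();
    let mut back: Vec<String> = Vec::new();
    let mut take_front = true;
    // Greedily alternate front/back while the budget holds.
    while lo < hi {
        let idx = if take_front { lo } else { hi - 1 };
        let len = notes[idx].len();
        if used + len > max_chars {
            break;
        }
        used += len;
        if take_front {
            front.push(notes[idx].to_string());
            lo += 1;
        } else {
            back.push(notes[idx].to_string());
            hi -= 1;
        }
        take_front = !take_front;
    }
    let omitted = hi - lo;
    back.reverse();
    let mut kept = front;
    if omitted > 0 {
        kept.push(format!("[... {} notes omitted for length ...]", omitted));
    }
    kept.extend(back);
    (kept, omitted)
}
```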

Acceptance Criteria:

  • Notes never cut mid-content
  • First and last notes preserved when possible
  • Truncation marker \n\n[... N notes omitted for length ...]\n\n inserted
  • Metadata fields set correctly
  • Edge cases handled per table above

2.4 CLI: lore generate-docs (Incremental by Default)

File: src/cli/commands/generate_docs.rs

//! Generate documents command - create searchable documents from entities.
//!
//! By default, runs incrementally (processes only dirty_sources queue).
//! Use --full to regenerate all documents from scratch.

use rusqlite::Connection;
use serde::Serialize;

use crate::core::error::Result;
use crate::documents::{DocumentData, SourceType};
use crate::Config;

/// Result of document generation.
#[derive(Debug, Serialize)]
pub struct GenerateDocsResult {
    pub issues: usize,
    pub mrs: usize,
    pub discussions: usize,
    pub total: usize,
    pub truncated: usize,
    pub skipped: usize,  // Unchanged documents
}

/// Chunk size for --full mode dirty queue seeding.
/// Balances throughput against WAL file growth and memory pressure.
const FULL_MODE_CHUNK_SIZE: usize = 2000;

/// Run document generation (incremental by default).
///
/// IMPORTANT: Both modes use the same regenerator codepath to avoid
/// logic divergence in label/path hashing, deletion semantics, and
/// write-optimization behavior. The only difference is how dirty_sources
/// gets populated.
///
/// Incremental mode (default):
/// - Processes only items already in dirty_sources queue
/// - Fast for routine syncs
///
/// Full mode (--full):
/// - Seeds dirty_sources with ALL source entities in chunks
/// - Drains through the same regenerator pipeline
/// - Uses keyset pagination (WHERE id > last_id) to avoid OFFSET degradation
/// - Final FTS optimize after all chunks complete
/// - Use when schema changes or after migration
pub fn run_generate_docs(
    config: &Config,
    full: bool,
    project_filter: Option<&str>,
) -> Result<GenerateDocsResult> {
    let mut conn = open_db(config)?;  // mut: rusqlite transactions take &mut self

    if full {
        // Full mode: seed dirty_sources with all source entities, then drain.
        // Uses keyset pagination to avoid O(n²) OFFSET degradation on large tables.
        //
        // Seeding is chunked to bound WAL growth:
        // 1. For each source type (issues, MRs, discussions):
        //    a. Query next chunk WHERE id > last_id ORDER BY id LIMIT chunk_size
        //    b. INSERT OR IGNORE each into dirty_sources
        //    c. Advance last_id = chunk.last().id
        //    d. Loop until chunk is empty
        // 2. Drain dirty_sources through regenerator (same as incremental)
        // 3. Final FTS optimize (not full rebuild — triggers handle consistency)
        //
        // Benefits of unified codepath:
        // - No divergence in label/path hash behavior
        // - No divergence in deletion semantics
        // - No divergence in write-optimization logic (labels_hash, paths_hash)
        // - FTS triggers fire identically in both modes

        // Seed issues
        let mut last_id: i64 = 0;
        loop {
            let chunk = query_issue_ids_after(&conn, project_filter, FULL_MODE_CHUNK_SIZE, last_id)?;
            if chunk.is_empty() { break; }
            let tx = conn.transaction()?;
            for id in &chunk {
                mark_dirty(&tx, SourceType::Issue, *id)?;
            }
            tx.commit()?;
            last_id = *chunk.last().unwrap();
        }

        // Similar keyset-paginated seeding for MRs and discussions...

        // Report: seeding complete, now regenerating
    }

    // Both modes: drain dirty_sources through the regenerator
    let regen = regenerate_dirty_documents(&conn)?;

    if full {
        // FTS optimize after bulk operations (compacts index segments)
        conn.execute(
            "INSERT INTO documents_fts(documents_fts) VALUES('optimize')",
            [],
        )?;
    }

    // Map regen -> GenerateDocsResult stats
    todo!()
}

/// Print human-readable output.
pub fn print_generate_docs(result: &GenerateDocsResult) {
    println!("Document generation complete:");
    println!("  Issues:      {:>6} documents", result.issues);
    println!("  MRs:         {:>6} documents", result.mrs);
    println!("  Discussions: {:>6} documents", result.discussions);
    println!("  ─────────────────────");
    println!("  Total:       {:>6} documents", result.total);
    if result.truncated > 0 {
        println!("  Truncated:   {:>6}", result.truncated);
    }
    if result.skipped > 0 {
        println!("  Skipped:     {:>6} (unchanged)", result.skipped);
    }
}

/// Print JSON output for robot mode.
pub fn print_generate_docs_json(result: &GenerateDocsResult) {
    let output = serde_json::json!({
        "ok": true,
        "data": result
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration in src/cli/mod.rs:

/// Generate-docs subcommand arguments.
#[derive(Args)]
pub struct GenerateDocsArgs {
    /// Regenerate ALL documents (not just dirty queue)
    #[arg(long)]
    full: bool,

    /// Only generate for specific project
    #[arg(long)]
    project: Option<String>,
}

Acceptance Criteria:

  • Creates document for each issue
  • Creates document for each MR
  • Creates document for each discussion
  • Default mode processes dirty_sources queue only (incremental)
  • --full regenerates all documents from scratch
  • --full uses chunked transactions (2k docs/tx) to bound WAL growth
  • Final FTS optimize after all chunks complete (not a full rebuild; triggers keep FTS consistent)
  • Progress bar in human mode (via indicatif)
  • JSON output in robot mode

Phase 3: Search

3.1 Search Module Structure

New module: src/search/

src/search/
├── mod.rs       # Module exports
├── fts.rs       # FTS5 search
├── vector.rs    # Vector search (sqlite-vec)
├── hybrid.rs    # Combined hybrid search
├── rrf.rs       # Reciprocal Rank Fusion ranking
└── filters.rs   # Filter parsing and application

File: src/search/mod.rs

//! Search functionality for documents.
//!
//! Supports lexical (FTS5), semantic (vector), and hybrid search.

mod filters;
mod fts;
mod hybrid;
mod rrf;
mod vector;

pub use filters::{SearchFilters, PathFilter, apply_filters};
pub use fts::{search_fts, to_fts_query, FtsResult, FtsQueryMode, generate_fallback_snippet, get_result_snippet};
pub use hybrid::{search_hybrid, HybridResult, SearchMode};
pub use rrf::{rank_rrf, RrfResult};
pub use vector::{search_vector, VectorResult};
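rrf.rs is declared above but not specified in this section; a minimal sketch of rank_rrf under the common k = 60 convention (an assumption here, not a spec value):

```rust
use std::collections::HashMap;

/// Fuse two ranked lists of document IDs with Reciprocal Rank Fusion:
/// score(d) = sum over lists of 1 / (k + rank_in_list), ranks starting at 1.
/// RRF needs only ranks, which is why it sidesteps normalizing BM25
/// scores against vector distances.
fn rank_rrf(lexical: &[i64], vector: &[i64], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in [lexical, vector] {
        for (i, doc_id) in list.iter().enumerate() {
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut ranked: Vec<(i64, f64)> = scores.into_iter().collect();
    // Descending score; tie-break on id for deterministic output.
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    ranked
}
```

A document appearing in both lists accumulates two reciprocal-rank terms, so overlap between lexical and vector hits naturally rises to the top.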

3.2 FTS5 Search Function

File: src/search/fts.rs

use rusqlite::Connection;
use crate::core::error::Result;

/// FTS search result.
#[derive(Debug, Clone)]
pub struct FtsResult {
    pub document_id: i64,
    pub rank: f64,     // BM25 score (lower = better match)
    pub snippet: String, // Context snippet around match
}

/// Generate fallback snippet for semantic-only results.
///
/// When FTS snippets aren't available (semantic-only mode), this generates
/// a context snippet by truncating the document content. Useful for displaying
/// search results without FTS hits.
///
/// Args:
///   content_text: Full document content
///   max_chars: Maximum snippet length (default 200)
///
/// Returns a truncated string with ellipsis if truncated.
pub fn generate_fallback_snippet(content_text: &str, max_chars: usize) -> String {
    let trimmed = content_text.trim();
    if trimmed.len() <= max_chars {
        return trimmed.to_string();
    }

    // Back off to a UTF-8 char boundary so slicing never panics on
    // multi-byte content (user text is not guaranteed ASCII).
    let mut boundary = max_chars;
    while !trimmed.is_char_boundary(boundary) {
        boundary -= 1;
    }

    // Prefer a word boundary near the limit to avoid cutting mid-word
    let truncation_point = trimmed[..boundary]
        .rfind(|c: char| c.is_whitespace())
        .unwrap_or(boundary);

    format!("{}...", &trimmed[..truncation_point])
}

/// Get snippet for search result, preferring FTS when available.
///
/// Priority:
/// 1. FTS snippet (if document matched FTS query)
/// 2. Fallback: truncated content_text
pub fn get_result_snippet(
    fts_snippet: Option<&str>,
    content_text: &str,
) -> String {
    match fts_snippet {
        Some(snippet) if !snippet.is_empty() => snippet.to_string(),
        _ => generate_fallback_snippet(content_text, 200),
    }
}

/// FTS query parsing mode.
#[derive(Debug, Clone, Copy, Default)]
pub enum FtsQueryMode {
    /// Safe parsing (default): escapes dangerous syntax but preserves
    /// trailing `*` for obvious prefix queries (type-ahead UX).
    #[default]
    Safe,
    /// Raw mode: passes user MATCH syntax through unchanged.
    /// Use with caution - invalid syntax will cause FTS5 errors.
    Raw,
}

/// Convert user query to FTS5-safe MATCH expression.
///
/// FTS5 MATCH syntax has special characters that cause errors if passed raw:
/// - `-` (NOT operator)
/// - `"` (phrase quotes)
/// - `:` (column filter)
/// - `*` (prefix)
/// - `AND`, `OR`, `NOT` (operators)
///
/// Strategy for Safe mode:
/// - Wrap each whitespace-delimited token in double quotes
/// - Escape internal quotes by doubling them
/// - PRESERVE trailing `*` for simple prefix queries (alphanumeric tokens)
/// - This forces FTS5 to treat tokens as literals while allowing type-ahead
///
/// Raw mode passes the query through unchanged for power users who want
/// full FTS5 syntax (phrase queries, column scopes, boolean operators).
///
/// Examples (Safe mode):
/// - "auth error" -> `"auth" "error"` (implicit AND)
/// - "auth*" -> `"auth"*` (prefix preserved!)
/// - "jwt_token*" -> `"jwt_token"*` (prefix preserved!)
/// - "C++" -> `"C++"` (special chars preserved, no prefix)
/// - "don't panic" -> `"don't" "panic"` (apostrophe preserved)
/// - "-DWITH_SSL" -> `"-DWITH_SSL"` (leading dash neutralized)
pub fn to_fts_query(raw: &str, mode: FtsQueryMode) -> String {
    if matches!(mode, FtsQueryMode::Raw) {
        return raw.trim().to_string();
    }
    
    raw.split_whitespace()
        .map(|token| {
            let t = token.trim();
            if t.is_empty() {
                return "\"\"".to_string();
            }
            
            // Detect simple prefix queries: alphanumeric/underscore followed by *
            // e.g., "auth*", "jwt_token*", "user123*"
            let is_prefix = t.ends_with('*')
                && t.len() > 1
                && t[..t.len() - 1]
                    .chars()
                    .all(|c| c.is_ascii_alphanumeric() || c == '_');
            
            // Escape internal double quotes by doubling them
            let escaped = t.replace('"', "\"\"");
            
            if is_prefix {
                // Strip trailing *, quote the core, then re-add *
                let core = &escaped[..escaped.len() - 1];
                format!("\"{}\"*", core)
            } else {
                format!("\"{}\"", escaped)
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Search documents using FTS5.
///
/// Returns matching document IDs with BM25 rank scores and snippets.
/// Lower rank values indicate better matches.
/// Uses bm25() explicitly (not the `rank` alias) and snippet() for context.
///
/// IMPORTANT: User input is sanitized via `to_fts_query()` to prevent
/// FTS5 syntax errors from special characters while preserving prefix search.
pub fn search_fts(
    conn: &Connection,
    query: &str,
    limit: usize,
    mode: FtsQueryMode,
) -> Result<Vec<FtsResult>> {
    if query.trim().is_empty() {
        return Ok(Vec::new());
    }

    let safe_query = to_fts_query(query, mode);

    let mut stmt = conn.prepare(
        "SELECT rowid,
                bm25(documents_fts),
                snippet(documents_fts, 1, '<mark>', '</mark>', '...', 64)
         FROM documents_fts
         WHERE documents_fts MATCH ?
         ORDER BY bm25(documents_fts)
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![safe_query, limit as i64], |row| {
            Ok(FtsResult {
                document_id: row.get(0)?,
                rank: row.get(1)?,
                snippet: row.get(2)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

Acceptance Criteria:

  • Returns matching document IDs with BM25 rank
  • Porter stemming works (search/searching match)
  • Prefix search works (type-ahead UX): auth* returns results starting with "auth"
  • Empty query returns empty results
  • Nonsense query returns empty results
  • Special characters in query don't cause FTS5 syntax errors (-, ", :, *)
  • Query "-DWITH_SSL" returns results (not treated as NOT operator)
  • Query C++ returns results (special chars preserved)
  • Safe mode preserves trailing * on alphanumeric tokens
  • Raw mode (--fts-mode=raw) passes query through unchanged

3.3 Search Filters

File: src/search/filters.rs

use rusqlite::Connection;
use crate::core::error::Result;
use crate::documents::SourceType;

/// Maximum allowed limit for search results.
const MAX_SEARCH_LIMIT: usize = 100;

/// Default limit for search results.
const DEFAULT_SEARCH_LIMIT: usize = 20;

/// Search filters applied post-retrieval.
#[derive(Debug, Clone, Default)]
pub struct SearchFilters {
    pub source_type: Option<SourceType>,
    pub author: Option<String>,
    pub project_id: Option<i64>,
    pub after: Option<i64>,           // ms epoch (created_at >=)
    pub updated_after: Option<i64>,   // ms epoch (updated_at >=)
    pub labels: Vec<String>,          // AND logic
    pub path: Option<PathFilter>,
    pub limit: usize,                 // Default 20, max 100
}

impl SearchFilters {
    /// Check if any filter is set (used for adaptive recall).
    pub fn has_any_filter(&self) -> bool {
        self.source_type.is_some()
            || self.author.is_some()
            || self.project_id.is_some()
            || self.after.is_some()
            || self.updated_after.is_some()
            || !self.labels.is_empty()
            || self.path.is_some()
    }

    /// Clamp limit to valid range [1, MAX_SEARCH_LIMIT].
    pub fn clamp_limit(&self) -> usize {
        if self.limit == 0 {
            DEFAULT_SEARCH_LIMIT
        } else {
            self.limit.min(MAX_SEARCH_LIMIT)
        }
    }
}

/// Path filter with prefix or exact match.
#[derive(Debug, Clone)]
pub enum PathFilter {
    Prefix(String),  // Trailing `/` -> LIKE 'path/%'
    Exact(String),   // No trailing `/` -> = 'path'
}

impl PathFilter {
    pub fn from_str(s: &str) -> Self {
        if s.ends_with('/') {
            Self::Prefix(s.to_string())
        } else {
            Self::Exact(s.to_string())
        }
    }
}

/// Apply filters to document IDs, returning filtered set.
///
/// IMPORTANT: Preserves ranking order from input document_ids.
/// Filters must not reorder results - maintain the RRF/search ranking.
///
/// Uses JSON1 extension for efficient ordered ID passing:
/// - Passes document_ids as JSON array: `[1,2,3,...]`
/// - Uses `json_each()` to expand into rows with `key` as position
/// - JOINs with documents table and applies filters
/// - Orders by original position to preserve ranking
pub fn apply_filters(
    conn: &Connection,
    document_ids: &[i64],
    filters: &SearchFilters,
) -> Result<Vec<i64>> {
    if document_ids.is_empty() {
        return Ok(Vec::new());
    }

    // Build JSON array of document IDs
    let ids_json = serde_json::to_string(document_ids)?;

    // Build dynamic WHERE clauses
    let mut conditions: Vec<String> = Vec::new();
    let mut params: Vec<Box<dyn rusqlite::ToSql>> = Vec::new();

    // Always bind the JSON array first
    params.push(Box::new(ids_json));

    if let Some(ref source_type) = filters.source_type {
        conditions.push("d.source_type = ?".into());
        params.push(Box::new(source_type.as_str().to_string()));
    }

    if let Some(ref author) = filters.author {
        conditions.push("d.author_username = ?".into());
        params.push(Box::new(author.clone()));
    }

    if let Some(project_id) = filters.project_id {
        conditions.push("d.project_id = ?".into());
        params.push(Box::new(project_id));
    }

    if let Some(after) = filters.after {
        conditions.push("d.created_at >= ?".into());
        params.push(Box::new(after));
    }

    if let Some(updated_after) = filters.updated_after {
        conditions.push("d.updated_at >= ?".into());
        params.push(Box::new(updated_after));
    }

    // Labels: AND logic - all labels must be present
    for label in &filters.labels {
        conditions.push(
            "EXISTS (SELECT 1 FROM document_labels dl WHERE dl.document_id = d.id AND dl.label_name = ?)".into()
        );
        params.push(Box::new(label.clone()));
    }

    // Path filter
    if let Some(ref path_filter) = filters.path {
        match path_filter {
            PathFilter::Exact(path) => {
                conditions.push(
                    "EXISTS (SELECT 1 FROM document_paths dp WHERE dp.document_id = d.id AND dp.path = ?)".into()
                );
                params.push(Box::new(path.clone()));
            }
            PathFilter::Prefix(prefix) => {
                // IMPORTANT: Must use ESCAPE clause for backslash escaping to work in SQLite LIKE
                conditions.push(
                    "EXISTS (SELECT 1 FROM document_paths dp WHERE dp.document_id = d.id AND dp.path LIKE ? ESCAPE '\\')".into()
                );
                // Escape LIKE wildcards and add trailing %.
                // Backslash must be escaped first so it can serve as the
                // ESCAPE character without corrupting the other escapes.
                let like_pattern = format!(
                    "{}%",
                    prefix
                        .replace('\\', "\\\\")
                        .replace('%', "\\%")
                        .replace('_', "\\_")
                );
                params.push(Box::new(like_pattern));
            }
        }
    }

    let where_clause = if conditions.is_empty() {
        String::new()
    } else {
        format!("AND {}", conditions.join(" AND "))
    };

    let limit = filters.clamp_limit();

    // SQL using JSON1 for ordered ID passing
    // json_each() returns rows with `key` (0-indexed position) and `value` (the ID)
    let sql = format!(
        r#"
        SELECT d.id
        FROM json_each(?) AS j
        JOIN documents d ON d.id = j.value
        WHERE 1=1 {}
        ORDER BY j.key
        LIMIT ?
        "#,
        where_clause
    );

    params.push(Box::new(limit as i64));

    let mut stmt = conn.prepare(&sql)?;
    let params_refs: Vec<&dyn rusqlite::ToSql> = params.iter().map(|p| p.as_ref()).collect();

    let results = stmt
        .query_map(params_refs.as_slice(), |row| row.get(0))?
        .collect::<std::result::Result<Vec<i64>, _>>()?;

    Ok(results)
}
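The prefix branch's wildcard escaping can be exercised on its own. A minimal sketch (the helper name is illustrative, not part of the module above); note that the backslash must be escaped before `%` and `_` so it can double as the `ESCAPE` character:

```rust
/// Build a LIKE pattern for prefix matching: escape `\`, `%`, `_`
/// (backslash first), then append the trailing `%` wildcard.
fn like_prefix_pattern(prefix: &str) -> String {
    let escaped = prefix
        .replace('\\', "\\\\")
        .replace('%', "\\%")
        .replace('_', "\\_");
    format!("{escaped}%")
}
```

Paired with `LIKE ? ESCAPE '\'`, a stored path then matches only when it literally begins with the user's prefix.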

Supported filters:

| Filter            | SQL Column        | Notes                                 |
|-------------------|-------------------|---------------------------------------|
| `--type`          | `source_type`     | `issue`, `mr`, `discussion`           |
| `--author`        | `author_username` | Exact match                           |
| `--project`       | `project_id`      | Resolve path to ID                    |
| `--after`         | `created_at >=`   | date (ms epoch)                       |
| `--updated-after` | `updated_at >=`   | date (ms epoch), common triage filter |
| `--label`         | `document_labels` | JOIN, multiple = AND                  |
| `--path`          | `document_paths`  | JOIN, trailing `/` = prefix           |
| `--limit`         | N/A               | Default 20, max 100                   |

Acceptance Criteria:

  • Each filter correctly restricts results
  • Multiple --label flags use AND logic
  • Path prefix vs exact match works correctly
  • --updated-after filters on updated_at (not created_at)
  • Filters compose (all applied together)
  • Ranking order preserved after filtering (ORDER BY position)
  • Limit clamped to valid range [1, 100]
  • Default limit is 20 when not specified
  • JSON1 json_each() correctly expands document IDs
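The order-preservation contract above is easiest to see in miniature: filtering must behave like a stable retain over the ranked IDs (the `json_each` + `ORDER BY j.key` pattern), followed by the limit clamp. The constants mirror the module; the helper itself is illustrative:

```rust
const MAX_SEARCH_LIMIT: usize = 100;
const DEFAULT_SEARCH_LIMIT: usize = 20;

/// Stable filter over ranked document IDs: relative order survives,
/// then the clamped limit truncates the tail.
fn filter_ranked(ranked_ids: &[i64], keep: impl Fn(i64) -> bool, limit: usize) -> Vec<i64> {
    let limit = if limit == 0 {
        DEFAULT_SEARCH_LIMIT
    } else {
        limit.min(MAX_SEARCH_LIMIT)
    };
    ranked_ids
        .iter()
        .copied()
        .filter(|&id| keep(id))
        .take(limit)
        .collect()
}
```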

3.4 CLI: lore search --mode=lexical

File: src/cli/commands/search.rs

//! Search command - find documents using lexical, semantic, or hybrid search.

use console::style;
use rusqlite::Connection;
use serde::Serialize;

use crate::core::error::Result;
use crate::core::time::ms_to_iso;
use crate::search::{SearchFilters, SearchMode, search_fts, search_vector, rank_rrf, RrfResult};
use crate::Config;

/// Search result for display.
#[derive(Debug, Serialize)]
pub struct SearchResultDisplay {
    pub document_id: i64,
    pub source_type: String,
    pub title: Option<String>,
    pub url: Option<String>,
    pub project_path: String,
    pub author: Option<String>,
    pub created_at: String,  // ISO format
    pub updated_at: String,  // ISO format
    pub score: f64,          // Normalized 0-1
    pub snippet: String,     // Context around match
    pub labels: Vec<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub explain: Option<ExplainData>,
}

/// Ranking explanation for --explain flag.
#[derive(Debug, Serialize)]
pub struct ExplainData {
    pub vector_rank: Option<usize>,
    pub fts_rank: Option<usize>,
    pub rrf_score: f64,
}

/// Search results response.
#[derive(Debug, Serialize)]
pub struct SearchResponse {
    pub query: String,
    pub mode: String,
    pub total_results: usize,
    pub results: Vec<SearchResultDisplay>,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub warnings: Vec<String>,
}

/// Run search command.
pub fn run_search(
    config: &Config,
    query: &str,
    mode: SearchMode,
    filters: SearchFilters,
    explain: bool,
) -> Result<SearchResponse> {
    // 1. Parse query and filters
    // 2. Execute search based on mode -> ranked doc_ids (+ explain ranks)
    // 3. Apply post-retrieval filters preserving ranking order
    // 4. HYDRATE in one DB round-trip (see hydration query below):
    //    - documents fields (title, url, created_at, updated_at, content_text)
    //    - project_path via JOIN projects
    //    - labels aggregated via json_group_array
    //    - paths aggregated via json_group_array (optional)
    // 5. Attach snippet:
    //    - prefer FTS snippet when doc hit FTS
    //    - fallback: truncated content_text via generate_fallback_snippet()
    // 6. For --mode=semantic with 0% embedding coverage:
    //    return early with actionable error message (distinct from "Ollama down")
    todo!()
}

/// Hydration query: fetch all display fields for ranked doc IDs in a single round-trip.
///
/// Uses json_each(?) to preserve ranking order from the search pipeline.
/// Aggregates labels and paths inline to avoid N+1 queries.
///
/// ```sql
/// SELECT d.id, d.source_type, d.title, d.url, d.author_username,
///        d.created_at, d.updated_at, d.content_text,
///        p.path AS project_path,
///        (SELECT json_group_array(dl.label_name)
///         FROM document_labels dl WHERE dl.document_id = d.id) AS labels,
///        (SELECT json_group_array(dp.path)
///         FROM document_paths dp WHERE dp.document_id = d.id) AS paths
/// FROM json_each(?) AS j
/// JOIN documents d ON d.id = j.value
/// JOIN projects p ON p.id = d.project_id
/// ORDER BY j.key
/// ```
///
/// This single query replaces what would otherwise be:
/// - 1 query per document for metadata
/// - 1 query per document for labels
/// - 1 query per document for paths
/// For 20 results, that's 60 queries reduced to 1.
fn hydrate_results(
    conn: &Connection,
    doc_ids: &[i64],
) -> Result<Vec<HydratedDocument>> {
    todo!()
}

/// Print human-readable search results.
pub fn print_search_results(response: &SearchResponse, explain: bool) {
    println!(
        "Found {} results ({} search)\n",
        response.total_results,
        response.mode
    );

    for (i, result) in response.results.iter().enumerate() {
        let type_prefix = match result.source_type.as_str() {
            "mr" | "merge_request" => "MR",
            "issue" => "Issue",
            "discussion" => "Discussion",
            _ => &result.source_type,
        };

        let title = result.title.as_deref().unwrap_or("(untitled)");
        println!(
            "[{}] {} - {} ({})",
            i + 1,
            style(type_prefix).cyan(),
            title,
            format!("{:.2}", result.score)
        );

        if explain {
            if let Some(exp) = &result.explain {
                let vec_str = exp.vector_rank.map(|r| format!("#{}", r)).unwrap_or_else(|| "-".into());
                let fts_str = exp.fts_rank.map(|r| format!("#{}", r)).unwrap_or_else(|| "-".into());
                println!(
                    "    Vector: {}, FTS: {}, RRF: {:.4}",
                    vec_str, fts_str, exp.rrf_score
                );
            }
        }

        if let Some(author) = &result.author {
            println!(
                "    @{} · {} · {}",
                author, &result.created_at[..10], result.project_path
            );
        }

        println!("    \"{}...\"", &result.snippet);

        if let Some(url) = &result.url {
            println!("    {}", style(url).dim());
        }
        println!();
    }
}

/// Print JSON search results for robot mode.
pub fn print_search_results_json(response: &SearchResponse, elapsed_ms: u64) {
    let output = serde_json::json!({
        "ok": true,
        "data": response,
        "meta": {
            "elapsed_ms": elapsed_ms
        }
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}
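Step 5's `generate_fallback_snippet()` is referenced but not specified; one plausible sketch is a character-boundary-safe truncation of `content_text` (the length and trimming policy here are assumptions):

```rust
/// Truncate content to at most `max_chars` characters, cutting on a
/// character boundary (never mid-codepoint) and trimming trailing space.
fn generate_fallback_snippet(content: &str, max_chars: usize) -> String {
    content
        .chars()
        .take(max_chars)
        .collect::<String>()
        .trim_end()
        .to_string()
}
```

Slicing bytes (`&content[..n]`) would panic on multi-byte UTF-8; iterating `chars()` avoids that.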

CLI integration in src/cli/mod.rs:

/// Search subcommand arguments.
#[derive(Args)]
pub struct SearchArgs {
    /// Search query
    query: String,

    /// Search mode
    #[arg(long, default_value = "hybrid")]
    mode: String,  // "hybrid" | "lexical" | "semantic"

    /// Filter by source type
    #[arg(long, value_name = "TYPE")]
    r#type: Option<String>,

    /// Filter by author username
    #[arg(long)]
    author: Option<String>,

    /// Filter by project path
    #[arg(long)]
    project: Option<String>,

    /// Filter by creation date (after)
    #[arg(long)]
    after: Option<String>,

    /// Filter by updated date (recently active items)
    #[arg(long)]
    updated_after: Option<String>,

    /// Filter by label (can specify multiple)
    #[arg(long, action = clap::ArgAction::Append)]
    label: Vec<String>,

    /// Filter by file path
    #[arg(long)]
    path: Option<String>,

    /// Maximum results
    #[arg(long, default_value = "20")]
    limit: usize,

    /// Show ranking breakdown
    #[arg(long)]
    explain: bool,

    /// FTS query mode: "safe" (default) or "raw"
    /// - safe: Escapes special chars but preserves `*` for prefix queries
    /// - raw: Pass FTS5 MATCH syntax through unchanged (advanced)
    #[arg(long, default_value = "safe")]
    fts_mode: String,  // "safe" | "raw"
}
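`--after` and `--updated-after` arrive as date strings but `SearchFilters` carries millisecond epochs. A dependency-free conversion sketch using the standard days-from-civil algorithm (a real implementation would more likely reach for the `chrono` or `time` crate):

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date
/// (Howard Hinnant's days_from_civil algorithm).
fn days_from_civil(y: i64, m: i64, d: i64) -> i64 {
    let y = if m <= 2 { y - 1 } else { y };
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400;
    let doy = (153 * (if m > 2 { m - 3 } else { m + 9 }) + 2) / 5 + d - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    era * 146097 + doe - 719468
}

/// Parse "YYYY-MM-DD" into a UTC millisecond epoch at start of day.
fn date_to_epoch_ms(s: &str) -> Option<i64> {
    let mut parts = s.splitn(3, '-');
    let y: i64 = parts.next()?.parse().ok()?;
    let m: i64 = parts.next()?.parse().ok()?;
    let d: i64 = parts.next()?.parse().ok()?;
    if !(1..=12).contains(&m) || !(1..=31).contains(&d) {
        return None;
    }
    Some(days_from_civil(y, m, d) * 86_400_000)
}
```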

Acceptance Criteria:

  • Works without Ollama running
  • All filters functional (including --updated-after)
  • Human-readable output with snippets
  • Semantic-only results get fallback snippets from content_text
  • Results hydrated in single DB round-trip (no N+1 queries)
  • JSON output matches schema
  • Empty results show helpful message
  • "No data indexed" message if documents table empty
  • --mode=semantic with 0% embedding coverage returns actionable error (distinct from "Ollama unavailable" — tells user to run lore embed first)
  • --fts-mode=safe (default) preserves prefix * while escaping special chars
  • --fts-mode=raw passes FTS5 MATCH syntax through unchanged
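The safe mode can be sketched as per-token quoting that keeps a trailing `*` outside the quotes: FTS5 accepts `"tok"*` as a prefix query, and embedded double quotes are doubled. The helper name is illustrative:

```rust
/// Quote each whitespace-separated token for FTS5 MATCH so punctuation is
/// treated literally, keeping a trailing `*` outside the quotes as the
/// prefix operator. Embedded double quotes are doubled, per FTS5 rules.
fn sanitize_fts_query(query: &str) -> String {
    query
        .split_whitespace()
        .map(|tok| {
            let (base, prefix) = match tok.strip_suffix('*') {
                Some(b) if !b.is_empty() => (b, true),
                _ => (tok, false),
            };
            let quoted = format!("\"{}\"", base.replace('"', "\"\""));
            if prefix {
                format!("{quoted}*")
            } else {
                quoted
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Raw mode simply skips this step and passes the query string to MATCH unchanged.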

Phase 4: Embedding Pipeline

4.1 Embedding Module Structure

New module: src/embedding/

src/embedding/
├── mod.rs              # Module exports
├── ollama.rs           # Ollama API client
├── pipeline.rs         # Batch embedding orchestration
└── change_detector.rs  # Detect documents needing re-embedding

File: src/embedding/mod.rs

//! Embedding generation and storage.
//!
//! Uses Ollama for embedding generation and sqlite-vec for storage.

mod change_detector;
mod ollama;
mod pipeline;

pub use change_detector::detect_embedding_changes;
pub use ollama::{OllamaClient, OllamaConfig, check_ollama_health};
pub use pipeline::{embed_documents, EmbedResult, EmbedSelection};

4.2 Ollama Client

File: src/embedding/ollama.rs

use reqwest::Client;
use serde::{Deserialize, Serialize};

use crate::core::error::{GiError, Result};

/// Ollama client configuration.
#[derive(Debug, Clone)]
pub struct OllamaConfig {
    pub base_url: String,      // "http://localhost:11434"
    pub model: String,         // "nomic-embed-text"
    pub timeout_secs: u64,     // Request timeout
}

impl Default for OllamaConfig {
    fn default() -> Self {
        Self {
            base_url: "http://localhost:11434".into(),
            model: "nomic-embed-text".into(),
            timeout_secs: 60,
        }
    }
}

/// Ollama API client.
pub struct OllamaClient {
    client: Client,
    config: OllamaConfig,
}

/// Batch embed request.
#[derive(Serialize)]
struct EmbedRequest {
    model: String,
    input: Vec<String>,
}

/// Batch embed response.
#[derive(Deserialize)]
struct EmbedResponse {
    model: String,
    embeddings: Vec<Vec<f32>>,
}

/// Model info from /api/tags.
#[derive(Deserialize)]
struct TagsResponse {
    models: Vec<ModelInfo>,
}

#[derive(Deserialize)]
struct ModelInfo {
    name: String,
}

impl OllamaClient {
    pub fn new(config: OllamaConfig) -> Self {
        let client = Client::builder()
            .timeout(std::time::Duration::from_secs(config.timeout_secs))
            .build()
            .expect("Failed to create HTTP client");

        Self { client, config }
    }

    /// Check if Ollama is available and model is loaded.
    pub async fn health_check(&self) -> Result<()> {
        let url = format!("{}/api/tags", self.config.base_url);

        let response = self.client.get(&url).send().await.map_err(|e| {
            GiError::OllamaUnavailable {
                base_url: self.config.base_url.clone(),
                source: Some(e),
            }
        })?;

        let tags: TagsResponse = response.json().await?;

        let model_available = tags.models.iter().any(|m| m.name.starts_with(&self.config.model));

        if !model_available {
            return Err(GiError::OllamaModelNotFound {
                model: self.config.model.clone(),
            });
        }

        Ok(())
    }

    /// Generate embeddings for a batch of texts.
    ///
    /// Returns 768-dimensional vectors for each input text.
    pub async fn embed_batch(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
        let url = format!("{}/api/embed", self.config.base_url);

        let request = EmbedRequest {
            model: self.config.model.clone(),
            input: texts,
        };

        let response = self.client
            .post(&url)
            .json(&request)
            .send()
            .await
            .map_err(|e| GiError::OllamaUnavailable {
                base_url: self.config.base_url.clone(),
                source: Some(e),
            })?;

        if !response.status().is_success() {
            let status = response.status();
            let body = response.text().await.unwrap_or_default();
            return Err(GiError::EmbeddingFailed {
                document_id: 0, // Batch failure
                reason: format!("HTTP {}: {}", status, body),
            });
        }

        let embed_response: EmbedResponse = response.json().await?;
        Ok(embed_response.embeddings)
    }
}

/// Quick health check without full client.
pub async fn check_ollama_health(base_url: &str) -> bool {
    let client = Client::new();
    client
        .get(format!("{}/api/tags", base_url))
        .send()
        .await
        .is_ok()
}
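`/api/tags` reports fully qualified names such as `nomic-embed-text:latest`, which is why `health_check` uses a prefix match. A slightly stricter variant (illustrative, not part of the client above) avoids false positives on similarly named models:

```rust
/// Match a configured model against an installed tag: exact name, or the
/// name followed by a `:tag` suffix (e.g. `nomic-embed-text:latest`).
fn model_matches(installed_tag: &str, configured: &str) -> bool {
    installed_tag == configured
        || installed_tag
            .strip_prefix(configured)
            .is_some_and(|rest| rest.starts_with(':'))
}
```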

Endpoints:

| Endpoint          | Purpose                              |
|-------------------|--------------------------------------|
| `GET /api/tags`   | Health check, verify model available |
| `POST /api/embed` | Batch embedding (preferred)          |

Acceptance Criteria:

  • Health check detects Ollama availability
  • Batch embedding works with up to 32 texts
  • Clear error messages for common failures

4.3 Error Handling Extensions

File: src/core/error.rs (extend existing)

Add to ErrorCode:

pub enum ErrorCode {
    // ... existing variants ...
    InvalidEnumValue,
    OllamaUnavailable,
    OllamaModelNotFound,
    EmbeddingFailed,
}

impl ErrorCode {
    pub fn exit_code(&self) -> i32 {
        match self {
            // ... existing mappings ...
            Self::InvalidEnumValue => 13,
            Self::OllamaUnavailable => 14,
            Self::OllamaModelNotFound => 15,
            Self::EmbeddingFailed => 16,
        }
    }
}

impl std::fmt::Display for ErrorCode {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let code = match self {
            // ... existing mappings ...
            Self::InvalidEnumValue => "INVALID_ENUM_VALUE",
            Self::OllamaUnavailable => "OLLAMA_UNAVAILABLE",
            Self::OllamaModelNotFound => "OLLAMA_MODEL_NOT_FOUND",
            Self::EmbeddingFailed => "EMBEDDING_FAILED",
        };
        write!(f, "{code}")
    }
}

Add to GiError:

pub enum GiError {
    // ... existing variants ...

    #[error("Cannot connect to Ollama at {base_url}. Is it running?")]
    OllamaUnavailable {
        base_url: String,
        #[source]
        source: Option<reqwest::Error>,
    },

    #[error("Ollama model '{model}' not found. Run: ollama pull {model}")]
    OllamaModelNotFound { model: String },

    #[error("Embedding failed for document {document_id}: {reason}")]
    EmbeddingFailed { document_id: i64, reason: String },
}

impl GiError {
    pub fn code(&self) -> ErrorCode {
        match self {
            // ... existing mappings ...
            Self::OllamaUnavailable { .. } => ErrorCode::OllamaUnavailable,
            Self::OllamaModelNotFound { .. } => ErrorCode::OllamaModelNotFound,
            Self::EmbeddingFailed { .. } => ErrorCode::EmbeddingFailed,
        }
    }

    pub fn suggestion(&self) -> Option<&'static str> {
        match self {
            // ... existing mappings ...
            Self::OllamaUnavailable { .. } => Some("Start Ollama: ollama serve"),
            // `{ .. }`: the model name can't be interpolated into a &'static str,
            // so the suggestion names the default model directly.
            Self::OllamaModelNotFound { .. } => Some("Pull the model: ollama pull nomic-embed-text"),
            Self::EmbeddingFailed { .. } => Some("Check Ollama logs or retry with 'lore embed --retry-failed'"),
        }
    }
}
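With these variants wired into `GiError`, a failed robot-mode invocation would render in the existing `{"ok": ...}` envelope roughly like this (the exact field layout beyond `ok` and `error.code` is an assumption based on the established pattern):

```json
{
  "ok": false,
  "error": {
    "code": "OLLAMA_UNAVAILABLE",
    "message": "Cannot connect to Ollama at http://localhost:11434. Is it running?",
    "suggestion": "Start Ollama: ollama serve"
  }
}
```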

4.4 Embedding Pipeline

File: src/embedding/pipeline.rs

use rusqlite::Connection;

use crate::core::error::Result;
use crate::embedding::OllamaClient;

/// Batch size for embedding requests.
const BATCH_SIZE: usize = 32;

/// SQLite page size for paging through pending documents.
/// Uses keyset paging (id > last_id) to avoid rescanning previously-processed rows.
const DB_PAGE_SIZE: usize = 500;

/// Expected embedding dimensions for nomic-embed-text model.
/// IMPORTANT: Validates against this to prevent silent corruption.
const EXPECTED_DIMS: usize = 768;

/// Which documents to embed.
#[derive(Debug, Clone, Copy)]
pub enum EmbedSelection {
    /// New or changed documents (default).
    Pending,
    /// Only previously failed documents.
    RetryFailed,
}

/// Result of embedding run.
#[derive(Debug, Default)]
pub struct EmbedResult {
    pub embedded: usize,
    pub failed: usize,
    pub skipped: usize,
}

/// Embed documents that need embedding.
///
/// Process:
/// 1. Select documents needing embeddings:
///    - Pending: missing embedding_metadata row OR content_hash mismatch
///    - RetryFailed: embedding_metadata.last_error IS NOT NULL
/// 2. Page through candidates using keyset pagination (id > last_id)
///    to avoid rescanning already-processed rows
/// 3. Batch texts -> Ollama `/api/embed` with concurrent HTTP requests
/// 4. Write embeddings + embedding_metadata in per-batch transactions
/// 5. Failed batches record `last_error` in embedding_metadata
///    (excluded from Pending selection; retried via RetryFailed)
/// 6. Progress reported as (embedded + failed) vs total_pending
pub async fn embed_documents(
    conn: &Connection,
    client: &OllamaClient,
    selection: EmbedSelection,
    concurrency: usize,
    progress_callback: Option<Box<dyn Fn(usize, usize)>>,
) -> Result<EmbedResult> {
    use futures::stream::{FuturesUnordered, StreamExt};

    let mut result = EmbedResult::default();
    let total_pending = count_pending_documents(conn, selection)?;

    if total_pending == 0 {
        return Ok(result);
    }

    // Page through pending documents using keyset pagination to avoid
    // both memory pressure and OFFSET performance degradation.
    let mut last_id: i64 = 0;
    loop {
        let pending = find_pending_documents(conn, DB_PAGE_SIZE, last_id, selection)?;
        if pending.is_empty() {
            break;
        }

        // Launch concurrent HTTP requests, collect results
        let mut futures = FuturesUnordered::new();

        for batch in pending.chunks(BATCH_SIZE) {
            let texts: Vec<String> = batch.iter().map(|d| d.content.clone()).collect();
            let batch_meta: Vec<(i64, String)> = batch
                .iter()
                .map(|d| (d.id, d.content_hash.clone()))
                .collect();

            futures.push(async move {
                let embed_result = client.embed_batch(texts).await;
                (batch_meta, embed_result)
            });

            // Cap in-flight requests
            if futures.len() >= concurrency {
                if let Some((meta, res)) = futures.next().await {
                    collect_writes(conn, &meta, res, &mut result)?;
                }
            }
        }

        // Drain remaining futures
        while let Some((meta, res)) = futures.next().await {
            collect_writes(conn, &meta, res, &mut result)?;
        }

        // Advance keyset cursor for next page
        if let Some(last) = pending.last() {
            last_id = last.id;
        }

        if let Some(ref cb) = progress_callback {
            cb(result.embedded + result.failed, total_pending);
        }
    }

    Ok(result)
}

/// Collect embedding results and write to DB (sequential, on main thread).
///
/// IMPORTANT: Validates embedding dimensions to prevent silent corruption.
/// If model returns wrong dimensions (e.g., different model configured),
/// the document is marked as failed rather than storing corrupt data.
fn collect_writes(
    conn: &Connection,
    batch_meta: &[(i64, String)],
    embed_result: Result<Vec<Vec<f32>>>,
    result: &mut EmbedResult,
) -> Result<()> {
    // A checked transaction needs `&mut Connection`; with the shared `&Connection`
    // held here, use `unchecked_transaction` (writes are sequential on this thread).
    let tx = conn.unchecked_transaction()?;
    match embed_result {
        Ok(embeddings) => {
            for ((doc_id, hash), embedding) in batch_meta.iter().zip(embeddings.iter()) {
                // Validate dimensions to prevent silent corruption
                if embedding.len() != EXPECTED_DIMS {
                    record_embedding_error(
                        &tx,
                        *doc_id,
                        hash,
                        &format!(
                            "embedding dimension mismatch: got {}, expected {}",
                            embedding.len(),
                            EXPECTED_DIMS
                        ),
                    )?;
                    result.failed += 1;
                    continue;
                }
                store_embedding(&tx, *doc_id, embedding, hash)?;
                result.embedded += 1;
            }
        }
        Err(e) => {
            for (doc_id, hash) in batch_meta {
                record_embedding_error(&tx, *doc_id, hash, &e.to_string())?;
                result.failed += 1;
            }
        }
    }
    tx.commit()?;
    Ok(())
}

struct PendingDocument {
    id: i64,
    content: String,
    content_hash: String,
}

/// Count total pending documents (for progress reporting).
fn count_pending_documents(conn: &Connection, selection: EmbedSelection) -> Result<usize> {
    let sql = match selection {
        EmbedSelection::Pending =>
            "SELECT COUNT(*)
             FROM documents d
             LEFT JOIN embedding_metadata em ON d.id = em.document_id
             WHERE em.document_id IS NULL
                OR em.content_hash != d.content_hash",
        EmbedSelection::RetryFailed =>
            "SELECT COUNT(*)
             FROM documents d
             JOIN embedding_metadata em ON d.id = em.document_id
             WHERE em.last_error IS NOT NULL",
    };
    let count: usize = conn.query_row(sql, [], |row| row.get(0))?;
    Ok(count)
}

/// Find pending documents for embedding using keyset pagination.
///
/// IMPORTANT: Uses keyset pagination (d.id > last_id) instead of OFFSET.
/// OFFSET degrades O(n²) on large result sets because SQLite must scan
/// and discard all rows before the offset. Keyset pagination is O(1) per page
/// since the index seek goes directly to the starting row.
fn find_pending_documents(
    conn: &Connection,
    limit: usize,
    last_id: i64,
    selection: EmbedSelection,
) -> Result<Vec<PendingDocument>> {
    let sql = match selection {
        EmbedSelection::Pending =>
            "SELECT d.id, d.content_text, d.content_hash
             FROM documents d
             LEFT JOIN embedding_metadata em ON d.id = em.document_id
             WHERE (em.document_id IS NULL
                OR em.content_hash != d.content_hash)
               AND d.id > ?
             ORDER BY d.id
             LIMIT ?",
        EmbedSelection::RetryFailed =>
            "SELECT d.id, d.content_text, d.content_hash
             FROM documents d
             JOIN embedding_metadata em ON d.id = em.document_id
             WHERE em.last_error IS NOT NULL
               AND d.id > ?
             ORDER BY d.id
             LIMIT ?",
    };
    let mut stmt = conn.prepare(sql)?;

    let docs = stmt
        .query_map(rusqlite::params![last_id, limit as i64], |row| {
            Ok(PendingDocument {
                id: row.get(0)?,
                content: row.get(1)?,
                content_hash: row.get(2)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(docs)
}

fn store_embedding(
    tx: &rusqlite::Transaction,
    document_id: i64,
    embedding: &[f32],
    content_hash: &str,
) -> Result<()> {
    // Convert embedding to bytes for sqlite-vec
    // sqlite-vec expects raw little-endian bytes, not the array directly
    let embedding_bytes: Vec<u8> = embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Store in sqlite-vec (rowid = document_id)
    tx.execute(
        "INSERT OR REPLACE INTO embeddings(rowid, embedding) VALUES (?, ?)",
        rusqlite::params![document_id, embedding_bytes],
    )?;

    // Update metadata
    let now = crate::core::time::now_ms();
    tx.execute(
        "INSERT OR REPLACE INTO embedding_metadata
         (document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
         VALUES (?, 'nomic-embed-text', 768, ?, ?, NULL, 0, ?)",
        rusqlite::params![document_id, content_hash, now, now],
    )?;

    Ok(())
}

fn record_embedding_error(
    tx: &rusqlite::Transaction,
    document_id: i64,
    content_hash: &str,
    error: &str,
) -> Result<()> {
    let now = crate::core::time::now_ms();
    tx.execute(
        "INSERT INTO embedding_metadata
         (document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
         VALUES (?, 'nomic-embed-text', 768, ?, ?, ?, 1, ?)
         ON CONFLICT(document_id) DO UPDATE SET
           last_error = excluded.last_error,
           attempt_count = attempt_count + 1,
           last_attempt_at = excluded.last_attempt_at",
        rusqlite::params![document_id, content_hash, now, error, now],
    )?;

    Ok(())
}
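The byte layout `store_embedding` writes can be checked in isolation: sqlite-vec stores a float32 vector as the concatenated little-endian bytes of its components, so a 768-dim embedding is exactly 3072 bytes. A round-trip sketch:

```rust
/// Encode a vector as the concatenated little-endian bytes of each f32,
/// the layout sqlite-vec expects for float32 vector columns.
fn vec_to_le_bytes(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

/// Decode the same layout back into f32 components.
fn le_bytes_to_vec(b: &[u8]) -> Vec<f32> {
    b.chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```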

Acceptance Criteria:

  • New documents get embedded
  • Changed documents (hash mismatch) get re-embedded
  • Unchanged documents skipped
  • Failures recorded in embedding_metadata.last_error
  • Failures record actual content_hash (not empty string)
  • Writes batched in transactions for performance
  • Concurrency parameter respected
  • Progress reported during embedding
  • Deterministic ORDER BY d.id ensures consistent paging
  • EmbedSelection parameter controls pending vs retry-failed mode
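The keyset loop is easy to validate against an in-memory ID set: each page is "IDs strictly greater than the cursor, ascending, capped at the page size", and every row is visited exactly once (sketch; helper names illustrative):

```rust
/// One keyset page: ids strictly greater than `last_id`, ascending,
/// capped at `limit` — the in-memory analogue of the SQL above.
fn next_page(ids: &[i64], last_id: i64, limit: usize) -> Vec<i64> {
    let mut page: Vec<i64> = ids.iter().copied().filter(|&id| id > last_id).collect();
    page.sort_unstable();
    page.truncate(limit);
    page
}

/// Drive the cursor to exhaustion, collecting every visited id.
fn visit_all(ids: &[i64], page_size: usize) -> Vec<i64> {
    let mut seen = Vec::new();
    let mut last_id = 0; // matches the pipeline's starting cursor
    loop {
        let page = next_page(ids, last_id, page_size);
        if page.is_empty() {
            break;
        }
        last_id = *page.last().unwrap();
        seen.extend(page);
    }
    seen
}
```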

4.5 CLI: lore embed

File: src/cli/commands/embed.rs

//! Embed command - generate embeddings for documents.

use indicatif::{ProgressBar, ProgressStyle};
use serde::Serialize;

use crate::core::error::Result;
use crate::embedding::{embed_documents, EmbedResult, OllamaClient, OllamaConfig};
use crate::Config;

/// Run embedding command.
pub async fn run_embed(
    config: &Config,
    retry_failed: bool,
) -> Result<EmbedResult> {
    use crate::core::db::open_database;
    use crate::embedding::EmbedSelection;

    let ollama_config = OllamaConfig {
        base_url: config.embedding.base_url.clone(),
        model: config.embedding.model.clone(),
        timeout_secs: 120,
    };

    let client = OllamaClient::new(ollama_config);

    // Health check
    client.health_check().await?;

    // Open database connection
    let conn = open_database(config)?;

    // Determine selection mode
    let selection = if retry_failed {
        EmbedSelection::RetryFailed
    } else {
        EmbedSelection::Pending
    };

    // Run embedding
    let result = embed_documents(
        &conn,
        &client,
        selection,
        config.embedding.concurrency as usize,
        None,
    ).await?;

    Ok(result)
}

/// Print human-readable output.
pub fn print_embed(result: &EmbedResult, elapsed_secs: u64) {
    println!("Embedding complete:");
    println!("  Embedded: {:>6} documents", result.embedded);
    println!("  Failed:   {:>6} documents", result.failed);
    println!("  Skipped:  {:>6} documents", result.skipped);
    println!("  Elapsed: {}m {}s", elapsed_secs / 60, elapsed_secs % 60);
}

/// Print JSON output for robot mode.
pub fn print_embed_json(result: &EmbedResult, elapsed_ms: u64) {
    let output = serde_json::json!({
        "ok": true,
        "data": {
            "embedded": result.embedded,
            "failed": result.failed,
            "skipped": result.skipped
        },
        "meta": {
            "elapsed_ms": elapsed_ms
        }
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration:

/// Embed subcommand arguments.
#[derive(Args)]
pub struct EmbedArgs {
    /// Retry only previously failed documents
    #[arg(long)]
    retry_failed: bool,
}

Acceptance Criteria:

  • Embeds documents without embeddings
  • Re-embeds documents with changed hash
  • --retry-failed only processes failed documents
  • Progress bar with count
  • Clear error if Ollama unavailable

4.6 CLI: lore stats

File: src/cli/commands/stats.rs

//! Stats command - display document and embedding statistics.

use rusqlite::Connection;
use serde::Serialize;

use crate::core::error::Result;
use crate::Config;

/// Document statistics.
#[derive(Debug, Serialize)]
pub struct Stats {
    pub documents: DocumentStats,
    pub embeddings: EmbeddingStats,
    pub fts: FtsStats,
    pub queues: QueueStats,
}

#[derive(Debug, Serialize)]
pub struct DocumentStats {
    pub issues: usize,
    pub mrs: usize,
    pub discussions: usize,
    pub total: usize,
    pub truncated: usize,
}

#[derive(Debug, Serialize)]
pub struct EmbeddingStats {
    pub embedded: usize,
    pub pending: usize,
    pub failed: usize,
    pub coverage_pct: f64,
}

#[derive(Debug, Serialize)]
pub struct FtsStats {
    pub indexed: usize,
}

/// Queue statistics for observability.
///
/// Exposes internal queue depths so operators can detect backlogs
/// and failing items that need manual intervention.
#[derive(Debug, Serialize)]
pub struct QueueStats {
    /// Items in dirty_sources queue (pending document regeneration)
    pub dirty_sources: usize,
    /// Items in dirty_sources with last_error set (failing regeneration)
    pub dirty_sources_failed: usize,
    /// Items in pending_discussion_fetches queue
    pub pending_discussion_fetches: usize,
    /// Items in pending_discussion_fetches with last_error set
    pub pending_discussion_fetches_failed: usize,
}

/// Integrity check result.
#[derive(Debug, Serialize)]
pub struct IntegrityCheck {
    pub documents_count: usize,
    pub fts_count: usize,
    pub embeddings_count: usize,
    pub metadata_count: usize,
    pub orphaned_embeddings: usize,
    pub hash_mismatches: usize,
    pub ok: bool,
}

/// Run stats command.
pub fn run_stats(config: &Config) -> Result<Stats> {
    // Query counts from database
    todo!()
}

/// Run integrity check (--check flag).
///
/// Verifies:
/// - documents count == documents_fts count
/// - embeddings.rowid all exist in documents.id
/// - embedding_metadata.content_hash == documents.content_hash
pub fn run_integrity_check(config: &Config) -> Result<IntegrityCheck> {
    // 1. Count documents
    // 2. Count FTS entries
    // 3. Find orphaned embeddings (no matching document)
    // 4. Find hash mismatches between embedding_metadata and documents
    // 5. Return check results
    todo!()
}

/// Repair result from --repair flag.
#[derive(Debug, Serialize)]
pub struct RepairResult {
    pub orphaned_embeddings_deleted: usize,
    pub stale_embeddings_cleared: usize,
    pub missing_fts_repopulated: usize,
}

/// Repair issues found by integrity check (--repair flag).
///
/// Fixes:
/// - Deletes orphaned embeddings (embedding_metadata rows with no matching document)
/// - Clears stale embedding_metadata (hash mismatch) so they get re-embedded
/// - Rebuilds FTS index from scratch (correct-by-construction)
///
/// NOTE: FTS repair uses `rebuild` rather than partial row insertion.
/// With `content='documents'` (external-content FTS), partial repopulation
/// via INSERT of missing rows is fragile — if the external content table
/// and FTS content diverge in any way, partial fixes can leave the index
/// in an inconsistent state. A full rebuild is slower but guaranteed correct.
pub fn run_repair(config: &Config) -> Result<RepairResult> {
    let conn = open_db(config)?;

    // Delete orphaned embeddings (no matching document)
    let orphaned_deleted = conn.execute(
        "DELETE FROM embedding_metadata
         WHERE document_id NOT IN (SELECT id FROM documents)",
        [],
    )?;

    // Also delete from embeddings virtual table (sqlite-vec)
    conn.execute(
        "DELETE FROM embeddings
         WHERE rowid NOT IN (SELECT id FROM documents)",
        [],
    )?;

    // Clear stale embedding_metadata (hash mismatch) - will be re-embedded
    let stale_cleared = conn.execute(
        "DELETE FROM embedding_metadata
         WHERE (document_id, content_hash) NOT IN (
             SELECT id, content_hash FROM documents
         )",
        [],
    )?;

    // Rebuild FTS index from scratch — correct-by-construction.
    // This re-reads all rows from the external content table (documents)
    // and rebuilds the index. Slower than partial fix but guaranteed consistent.
    conn.execute(
        "INSERT INTO documents_fts(documents_fts) VALUES('rebuild')",
        [],
    )?;
    let fts_rebuilt = 1; // rebuild is all-or-nothing

    Ok(RepairResult {
        orphaned_embeddings_deleted: orphaned_deleted,
        stale_embeddings_cleared: stale_cleared,
        missing_fts_repopulated: fts_rebuilt,
    })
}

/// Print human-readable stats.
pub fn print_stats(stats: &Stats) {
    println!("Document Statistics:");
    println!("  Issues:      {:>6} documents", stats.documents.issues);
    println!("  MRs:         {:>6} documents", stats.documents.mrs);
    println!("  Discussions: {:>6} documents", stats.documents.discussions);
    println!("  Total:       {:>6} documents", stats.documents.total);
    if stats.documents.truncated > 0 {
        println!("  Truncated:   {:>6}", stats.documents.truncated);
    }
    println!();
    println!("Embedding Coverage:");
    println!("  Embedded: {:>6} ({:.1}%)", stats.embeddings.embedded, stats.embeddings.coverage_pct);
    println!("  Pending:  {:>6}", stats.embeddings.pending);
    println!("  Failed:   {:>6}", stats.embeddings.failed);
    println!();
    println!("FTS Index:");
    println!("  Indexed:  {:>6} documents", stats.fts.indexed);
    println!();
    println!("Queue Depths:");
    println!("  Dirty sources:     {:>6} ({} failed)",
        stats.queues.dirty_sources,
        stats.queues.dirty_sources_failed
    );
    println!("  Discussion fetches:{:>6} ({} failed)",
        stats.queues.pending_discussion_fetches,
        stats.queues.pending_discussion_fetches_failed
    );
}

/// Print integrity check results.
pub fn print_integrity_check(check: &IntegrityCheck) {
    println!("Integrity Check:");
    println!("  Documents:      {:>6}", check.documents_count);
    println!("  FTS entries:    {:>6}", check.fts_count);
    println!("  Embeddings:     {:>6}", check.embeddings_count);
    println!("  Metadata:       {:>6}", check.metadata_count);
    if check.orphaned_embeddings > 0 {
        println!("  Orphaned embeddings: {:>6} (WARN)", check.orphaned_embeddings);
    }
    if check.hash_mismatches > 0 {
        println!("  Hash mismatches:     {:>6} (WARN)", check.hash_mismatches);
    }
    println!();
    println!("  Status: {}", if check.ok { "OK" } else { "ISSUES FOUND" });
}

/// Print JSON stats for robot mode.
pub fn print_stats_json(stats: &Stats) {
    let output = serde_json::json!({
        "ok": true,
        "data": stats
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

/// Print repair results.
pub fn print_repair_result(result: &RepairResult) {
    println!("Repair Results:");
    println!("  Orphaned embeddings deleted: {}", result.orphaned_embeddings_deleted);
    println!("  Stale embeddings cleared:    {}", result.stale_embeddings_cleared);
    println!("  Missing FTS repopulated:     {}", result.missing_fts_repopulated);
    println!();
    let total = result.orphaned_embeddings_deleted
        + result.stale_embeddings_cleared
        + result.missing_fts_repopulated;
    if total == 0 {
        println!("  No issues found to repair.");
    } else {
        println!("  Fixed {} issues.", total);
    }
}


CLI integration:

/// Stats subcommand arguments.
#[derive(Args)]
pub struct StatsArgs {
    /// Run integrity checks (document/FTS/embedding consistency)
    #[arg(long)]
    check: bool,

    /// Repair issues found by --check (deletes orphaned embeddings, clears stale metadata)
    #[arg(long, requires = "check")]
    repair: bool,
}

Acceptance Criteria:

  • Shows document counts by type
  • Shows embedding coverage
  • Shows FTS index count
  • Identifies truncated documents
  • Shows queue depths (dirty_sources, pending_discussion_fetches)
  • Shows failed item counts for each queue
  • --check verifies document/FTS/embedding consistency
  • --repair fixes orphaned embeddings, stale metadata, missing FTS entries
  • JSON output for scripting
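Since run_stats is left as a todo!() above, one detail worth pinning down is the coverage_pct computation, which must guard against division by zero on an empty corpus. A minimal sketch (coverage_pct as a free helper is illustrative, not part of the spec surface):

```rust
/// Illustrative helper: embedding coverage as a percentage.
/// Returns 0.0 for an empty corpus instead of NaN from 0/0.
fn coverage_pct(embedded: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        embedded as f64 / total as f64 * 100.0
    }
}
```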

5.1 Vector Search Function

File: src/search/vector.rs

use rusqlite::Connection;
use crate::core::error::Result;

/// Vector search result.
#[derive(Debug, Clone)]
pub struct VectorResult {
    pub document_id: i64,
    pub distance: f64,  // Lower = more similar
}

/// Search documents using vector similarity.
///
/// Uses sqlite-vec for efficient vector search.
/// Returns document IDs sorted by distance (lower = better match).
///
/// IMPORTANT: sqlite-vec KNN queries require:
/// - k parameter for number of results
/// - embedding passed as raw little-endian bytes
pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    // Convert embedding to bytes for sqlite-vec
    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    let mut stmt = conn.prepare(
        "SELECT rowid, distance
         FROM embeddings
         WHERE embedding MATCH ? AND k = ?
         ORDER BY distance
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![embedding_bytes, limit, limit], |row| {
            Ok(VectorResult {
                document_id: row.get(0)?,
                distance: row.get(1)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

Acceptance Criteria:

  • Returns document IDs with distances
  • Lower distance = better match
  • Works with 768-dim vectors
  • Uses k parameter for KNN query
  • Embedding passed as bytes
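The byte layout is the easy thing to get wrong here: sqlite-vec expects the query embedding as raw little-endian f32 bytes. A small sketch of the encode step and its inverse (helper names are illustrative; the search code above only needs the encode direction):

```rust
/// Encode an f32 embedding in the little-endian byte layout sqlite-vec expects.
fn embedding_to_le_bytes(embedding: &[f32]) -> Vec<u8> {
    embedding.iter().flat_map(|f| f.to_le_bytes()).collect()
}

/// Decode back, 4 bytes per component (shown for illustration/testing).
fn embedding_from_le_bytes(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```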

5.2 RRF Ranking

File: src/search/rrf.rs

use std::collections::HashMap;

/// RRF ranking constant.
const RRF_K: f64 = 60.0;

/// RRF-ranked result.
#[derive(Debug, Clone)]
pub struct RrfResult {
    pub document_id: i64,
    pub rrf_score: f64,         // Raw RRF score
    pub normalized_score: f64,  // Normalized to 0-1
    pub vector_rank: Option<usize>,
    pub fts_rank: Option<usize>,
}

/// Rank documents using Reciprocal Rank Fusion.
///
/// Algorithm:
/// RRF_score(d) = Σ 1 / (k + rank_i(d))
///
/// Where:
/// - k = 60 (tunable constant)
/// - rank_i(d) = rank of document d in retriever i (1-indexed)
/// - Sum over all retrievers where document appears
pub fn rank_rrf(
    vector_results: &[(i64, f64)],  // (doc_id, distance)
    fts_results: &[(i64, f64)],     // (doc_id, bm25_score)
) -> Vec<RrfResult> {
    let mut scores: HashMap<i64, (f64, Option<usize>, Option<usize>)> = HashMap::new();

    // Add vector results (1-indexed ranks)
    for (rank, (doc_id, _)) in vector_results.iter().enumerate() {
        let rrf_contribution = 1.0 / (RRF_K + (rank + 1) as f64);
        let entry = scores.entry(*doc_id).or_insert((0.0, None, None));
        entry.0 += rrf_contribution;
        entry.1 = Some(rank + 1);
    }

    // Add FTS results (1-indexed ranks)
    for (rank, (doc_id, _)) in fts_results.iter().enumerate() {
        let rrf_contribution = 1.0 / (RRF_K + (rank + 1) as f64);
        let entry = scores.entry(*doc_id).or_insert((0.0, None, None));
        entry.0 += rrf_contribution;
        entry.2 = Some(rank + 1);
    }

    // Convert to results and sort by RRF score descending
    let mut results: Vec<_> = scores
        .into_iter()
        .map(|(doc_id, (rrf_score, vector_rank, fts_rank))| {
            RrfResult {
                document_id: doc_id,
                rrf_score,
                normalized_score: 0.0, // Will be set below
                vector_rank,
                fts_rank,
            }
        })
        .collect();

    results.sort_by(|a, b| b.rrf_score.partial_cmp(&a.rrf_score).unwrap());

    // Normalize scores to 0-1
    if let Some(max_score) = results.first().map(|r| r.rrf_score) {
        for result in &mut results {
            result.normalized_score = result.rrf_score / max_score;
        }
    }

    results
}

Acceptance Criteria:

  • Documents in both lists score higher
  • Documents in one list still included
  • Normalized score = rrfScore / max(rrfScore)
  • Raw RRF score available in --explain output
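To make the formula concrete, here is a self-contained sketch of just the scoring core (document IDs only, no rank bookkeeping): a document ranked 2nd by vector and 1st by FTS scores 1/62 + 1/61 ≈ 0.0325, beating a document that appears only once at rank 1 (1/61 ≈ 0.0164).

```rust
use std::collections::HashMap;

const RRF_K: f64 = 60.0;

/// Simplified RRF over two ranked ID lists (ranks are 1-indexed).
/// Each list contributes 1 / (k + rank) for every document it contains.
fn rrf_scores(vector_ids: &[i64], fts_ids: &[i64]) -> HashMap<i64, f64> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for (rank, id) in vector_ids.iter().enumerate() {
        *scores.entry(*id).or_insert(0.0) += 1.0 / (RRF_K + (rank + 1) as f64);
    }
    for (rank, id) in fts_ids.iter().enumerate() {
        *scores.entry(*id).or_insert(0.0) += 1.0 / (RRF_K + (rank + 1) as f64);
    }
    scores
}
```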

5.3 Adaptive Recall

File: src/search/hybrid.rs

use rusqlite::Connection;

use crate::core::error::Result;
use crate::embedding::OllamaClient;
use crate::search::{SearchFilters, search_fts, search_vector, rank_rrf, FtsQueryMode};

/// Minimum base recall for unfiltered search.
const BASE_RECALL_MIN: usize = 50;

/// Minimum recall when filters are applied.
const FILTERED_RECALL_MIN: usize = 200;

/// Maximum recall to prevent excessive resource usage.
const RECALL_CAP: usize = 1500;

/// Search mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchMode {
    Hybrid,    // Vector + FTS with RRF
    Lexical,   // FTS only
    Semantic,  // Vector only
}

impl SearchMode {
    pub fn from_str(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "hybrid" => Some(Self::Hybrid),
            "lexical" | "fts" => Some(Self::Lexical),
            "semantic" | "vector" => Some(Self::Semantic),
            _ => None,
        }
    }

    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Hybrid => "hybrid",
            Self::Lexical => "lexical",
            Self::Semantic => "semantic",
        }
    }
}

/// Hybrid search result.
#[derive(Debug)]
pub struct HybridResult {
    pub document_id: i64,
    pub score: f64,
    pub vector_rank: Option<usize>,
    pub fts_rank: Option<usize>,
    pub rrf_score: f64,
}

/// Execute hybrid search.
///
/// Adaptive recall: expands topK proportionally to requested limit and filter
/// restrictiveness to prevent "no results" when relevant docs would be filtered out.
///
/// Formula:
/// - Unfiltered: max(50, limit * 10), capped at 1500
/// - Filtered: max(200, limit * 50), capped at 1500
///
/// IMPORTANT: All modes use RRF consistently to ensure rank fields
/// are populated correctly for --explain output.
pub async fn search_hybrid(
    conn: &Connection,
    client: Option<&OllamaClient>,
    ollama_base_url: Option<&str>,  // For actionable error messages
    query: &str,
    mode: SearchMode,
    filters: &SearchFilters,
    fts_mode: FtsQueryMode,
) -> Result<(Vec<HybridResult>, Vec<String>)> {
    let mut warnings: Vec<String> = Vec::new();

    // Adaptive recall: proportional to requested limit and filter count
    let requested = filters.clamp_limit();
    let top_k = if filters.has_any_filter() {
        (requested * 50).max(FILTERED_RECALL_MIN).min(RECALL_CAP)
    } else {
        (requested * 10).max(BASE_RECALL_MIN).min(RECALL_CAP)
    };

    match mode {
        SearchMode::Lexical => {
            // FTS only - use RRF with empty vector results for consistent ranking
            let fts_results = search_fts(conn, query, top_k, fts_mode)?;

            let fts_tuples: Vec<_> = fts_results.iter().map(|r| (r.document_id, r.rank)).collect();
            let ranked = rank_rrf(&[], &fts_tuples);

            let results = ranked
                .into_iter()
                .map(|r| HybridResult {
                    document_id: r.document_id,
                    score: r.normalized_score,
                    vector_rank: r.vector_rank,
                    fts_rank: r.fts_rank,
                    rrf_score: r.rrf_score,
                })
                .collect();
            Ok((results, warnings))
        }
        SearchMode::Semantic => {
            // Vector only - requires client
            let client = client.ok_or_else(|| crate::core::error::GiError::OllamaUnavailable {
                base_url: ollama_base_url.unwrap_or("http://localhost:11434").into(),
                source: None,
            })?;

            let query_embedding = client.embed_batch(vec![query.to_string()]).await?;
            let embedding = query_embedding.into_iter().next().unwrap();

            let vec_results = search_vector(conn, &embedding, top_k)?;

            // Use RRF with empty FTS results for consistent ranking
            let vec_tuples: Vec<_> = vec_results.iter().map(|r| (r.document_id, r.distance)).collect();
            let ranked = rank_rrf(&vec_tuples, &[]);

            let results = ranked
                .into_iter()
                .map(|r| HybridResult {
                    document_id: r.document_id,
                    score: r.normalized_score,
                    vector_rank: r.vector_rank,
                    fts_rank: r.fts_rank,
                    rrf_score: r.rrf_score,
                })
                .collect();
            Ok((results, warnings))
        }
        SearchMode::Hybrid => {
            // Both retrievers with RRF fusion
            let fts_results = search_fts(conn, query, top_k, fts_mode)?;

            // Attempt vector search with graceful degradation on any failure
            let vec_results = match client {
                Some(client) => {
                    // Try to embed query; gracefully degrade on transient failures
                    match client.embed_batch(vec![query.to_string()]).await {
                        Ok(embeddings) => {
                            let embedding = embeddings.into_iter().next().unwrap();
                            search_vector(conn, &embedding, top_k)?
                        }
                        Err(e) => {
                            // Transient failure (network, timeout, rate limit, etc.)
                            // Log and fall back to FTS-only rather than failing the search
                            tracing::warn!("Vector search failed, falling back to lexical: {}", e);
                            warnings.push(format!(
                                "Vector search unavailable ({}), using lexical search only",
                                e
                            ));
                            Vec::new()
                        }
                    }
                }
                None => {
                    // No client configured
                    warnings.push("Embedding service unavailable, using lexical search only".into());
                    Vec::new()
                }
            };

            // RRF fusion
            let vec_tuples: Vec<_> = vec_results.iter().map(|r| (r.document_id, r.distance)).collect();
            let fts_tuples: Vec<_> = fts_results.iter().map(|r| (r.document_id, r.rank)).collect();

            let ranked = rank_rrf(&vec_tuples, &fts_tuples);

            let results = ranked
                .into_iter()
                .map(|r| HybridResult {
                    document_id: r.document_id,
                    score: r.normalized_score,
                    vector_rank: r.vector_rank,
                    fts_rank: r.fts_rank,
                    rrf_score: r.rrf_score,
                })
                .collect();
            Ok((results, warnings))
        }
    }
}

Acceptance Criteria:

  • Unfiltered search uses topK=max(50, limit*10), capped at 1500
  • Filtered search uses topK=max(200, limit*50), capped at 1500
  • Final results still limited by --limit
  • Adaptive recall prevents "no results" under heavy filtering
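The recall formula reduces to a small pure function; sketched here with the constants defined above so the acceptance values can be checked directly:

```rust
const BASE_RECALL_MIN: usize = 50;
const FILTERED_RECALL_MIN: usize = 200;
const RECALL_CAP: usize = 1500;

/// Candidates each retriever fetches before filtering and fusion:
/// - Unfiltered: max(50, limit * 10), capped at 1500
/// - Filtered:   max(200, limit * 50), capped at 1500
fn adaptive_top_k(requested_limit: usize, filtered: bool) -> usize {
    if filtered {
        (requested_limit * 50).max(FILTERED_RECALL_MIN).min(RECALL_CAP)
    } else {
        (requested_limit * 10).max(BASE_RECALL_MIN).min(RECALL_CAP)
    }
}
```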

5.4 Graceful Degradation

When Ollama unavailable during hybrid/semantic search:

  1. Log warning: "Embedding service unavailable, using lexical search only"
  2. Fall back to FTS-only search
  3. Include warning in response

Acceptance Criteria:

  • Default mode is hybrid
  • --mode=lexical works without Ollama
  • --mode=semantic requires Ollama
  • Graceful degradation when Ollama down
  • --explain shows rank breakdown
  • All Phase 3 filters work in hybrid mode

Phase 6: Sync Orchestration

6.1 Dirty Source Tracking

File: src/ingestion/dirty_tracker.rs

use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;
use crate::documents::SourceType;

/// Maximum dirty sources to process per sync run.
const MAX_DIRTY_SOURCES_PER_RUN: usize = 500;

/// Mark a source as dirty (needs document regeneration).
///
/// Called during entity upsert operations.
/// Uses INSERT OR IGNORE to avoid duplicates.
pub fn mark_dirty(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<()> {
    conn.execute(
        "INSERT OR IGNORE INTO dirty_sources (source_type, source_id, queued_at)
         VALUES (?, ?, ?)",
        rusqlite::params![source_type.as_str(), source_id, now_ms()],
    )?;
    Ok(())
}

/// Get dirty sources ready for processing.
///
/// Uses `next_attempt_at` for efficient, index-friendly backoff queries.
/// Items with NULL `next_attempt_at` are ready immediately (first attempt).
/// Items with `next_attempt_at <= now` have waited long enough after failure.
///
/// Benefits over SQL bitshift calculation:
/// - No overflow risk from large attempt_count values
/// - Index-friendly: `WHERE next_attempt_at <= ?`
/// - Jitter can be added in Rust when computing next_attempt_at
///
/// This prevents hot-loop retries when a source consistently fails
/// to generate a document (e.g., malformed data, missing references).
pub fn get_dirty_sources(conn: &Connection) -> Result<Vec<(SourceType, i64)>> {
    let now = now_ms();

    let mut stmt = conn.prepare(
        "SELECT source_type, source_id
         FROM dirty_sources
         WHERE next_attempt_at IS NULL OR next_attempt_at <= ?
         ORDER BY attempt_count ASC, queued_at ASC
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![now, MAX_DIRTY_SOURCES_PER_RUN], |row| {
            let type_str: String = row.get(0)?;
            let source_type = match type_str.as_str() {
                "issue" => SourceType::Issue,
                "merge_request" => SourceType::MergeRequest,
                "discussion" => SourceType::Discussion,
                other => return Err(rusqlite::Error::FromSqlConversionFailure(
                    0,
                    rusqlite::types::Type::Text,
                    Box::new(std::io::Error::new(
                        std::io::ErrorKind::InvalidData,
                        format!("invalid source_type: {other}"),
                    )),
                )),
            };
            Ok((source_type, row.get(1)?))
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

/// Clear dirty source after processing.
pub fn clear_dirty(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<()> {
    conn.execute(
        "DELETE FROM dirty_sources WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
    )?;
    Ok(())
}

Acceptance Criteria:

  • Upserted entities added to dirty_sources
  • Duplicates ignored
  • Queue cleared after document regeneration
  • Processing bounded per run (max 500)
  • Exponential backoff uses next_attempt_at (index-friendly, no overflow)
  • Backoff computed with jitter to prevent thundering herd
  • Failed items prioritized lower than fresh items (ORDER BY attempt_count ASC)

6.2 Pending Discussion Queue

File: src/ingestion/discussion_queue.rs

use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;

/// Noteable type for discussion fetching.
#[derive(Debug, Clone, Copy)]
pub enum NoteableType {
    Issue,
    MergeRequest,
}

impl NoteableType {
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Issue => "Issue",
            Self::MergeRequest => "MergeRequest",
        }
    }
}

/// Pending discussion fetch entry.
pub struct PendingFetch {
    pub project_id: i64,
    pub noteable_type: NoteableType,
    pub noteable_iid: i64,
    pub attempt_count: i64,
}

/// Queue a discussion fetch for an entity.
pub fn queue_discussion_fetch(
    conn: &Connection,
    project_id: i64,
    noteable_type: NoteableType,
    noteable_iid: i64,
) -> Result<()> {
    conn.execute(
        "INSERT OR REPLACE INTO pending_discussion_fetches
         (project_id, noteable_type, noteable_iid, queued_at, attempt_count, last_attempt_at, last_error)
         VALUES (?, ?, ?, ?, 0, NULL, NULL)",
        rusqlite::params![project_id, noteable_type.as_str(), noteable_iid, now_ms()],
    )?;
    Ok(())
}

/// Get pending fetches ready for processing.
///
/// Uses `next_attempt_at` for efficient, index-friendly backoff queries.
/// Items with NULL `next_attempt_at` are ready immediately (first attempt).
/// Items with `next_attempt_at <= now` have waited long enough after failure.
///
/// Benefits over SQL bitshift calculation:
/// - No overflow risk from large attempt_count values
/// - Index-friendly: `WHERE next_attempt_at <= ?`
/// - Jitter can be added in Rust when computing next_attempt_at
///
/// Limited to `max_items` to bound API calls per sync run.
pub fn get_pending_fetches(conn: &Connection, max_items: usize) -> Result<Vec<PendingFetch>> {
    let now = now_ms();

    let mut stmt = conn.prepare(
        "SELECT project_id, noteable_type, noteable_iid, attempt_count
         FROM pending_discussion_fetches
         WHERE next_attempt_at IS NULL OR next_attempt_at <= ?
         ORDER BY attempt_count ASC, queued_at ASC
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![now, max_items], |row| {
            let type_str: String = row.get(1)?;
            let noteable_type = if type_str == "Issue" {
                NoteableType::Issue
            } else {
                NoteableType::MergeRequest
            };
            Ok(PendingFetch {
                project_id: row.get(0)?,
                noteable_type,
                noteable_iid: row.get(2)?,
                attempt_count: row.get(3)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

/// Mark fetch as successful and remove from queue.
pub fn complete_fetch(
    conn: &Connection,
    project_id: i64,
    noteable_type: NoteableType,
    noteable_iid: i64,
) -> Result<()> {
    conn.execute(
        "DELETE FROM pending_discussion_fetches
         WHERE project_id = ? AND noteable_type = ? AND noteable_iid = ?",
        rusqlite::params![project_id, noteable_type.as_str(), noteable_iid],
    )?;
    Ok(())
}

/// Record fetch failure and compute next retry time.
///
/// Computes `next_attempt_at` using exponential backoff with jitter:
/// - Base delay: 1000ms * 2^attempt_count
/// - Cap: 1 hour (3600000ms)
/// - Jitter: ±10% to prevent thundering herd
pub fn record_fetch_error(
    conn: &Connection,
    project_id: i64,
    noteable_type: NoteableType,
    noteable_iid: i64,
    error: &str,
    current_attempt: i64,
) -> Result<()> {
    let now = now_ms();
    let next_attempt = crate::core::backoff::compute_next_attempt_at(now, current_attempt + 1);

    conn.execute(
        "UPDATE pending_discussion_fetches
         SET attempt_count = attempt_count + 1,
             last_attempt_at = ?,
             last_error = ?,
             next_attempt_at = ?
         WHERE project_id = ? AND noteable_type = ? AND noteable_iid = ?",
        rusqlite::params![now, error, next_attempt, project_id, noteable_type.as_str(), noteable_iid],
    )?;
    Ok(())
}

// NOTE: Backoff computation uses the shared utility in `src/core/backoff.rs`.
// See Phase 6.X below for the shared implementation.

Acceptance Criteria:

  • Updated entities queued for discussion fetch
  • Success removes from queue
  • Failure increments attempt_count and sets next_attempt_at
  • Processing bounded per run (max 100)
  • Exponential backoff uses next_attempt_at (index-friendly, no overflow)
  • Backoff computed with jitter to prevent thundering herd

6.X Shared Backoff Utility

File: src/core/backoff.rs

Single implementation of exponential backoff with jitter, used by both dirty_sources and pending_discussion_fetches queues. Living in src/core/ because it's a cross-cutting concern used by multiple modules.

use rand::Rng;

/// Compute next_attempt_at with exponential backoff and jitter.
///
/// Formula: now + min(3600000, 1000 * 2^attempt_count) * (0.9 to 1.1)
/// - Capped at 1 hour to prevent runaway delays
/// - ±10% jitter prevents synchronized retries after outages
///
/// Used by:
/// - `dirty_sources` retry scheduling (document regeneration failures)
/// - `pending_discussion_fetches` retry scheduling (API fetch failures)
///
/// Having one implementation prevents subtle divergence between queues
/// (e.g., different caps or jitter ranges).
pub fn compute_next_attempt_at(now: i64, attempt_count: i64) -> i64 {
    // Clamp the shift to prevent overflow; the delay hits the 1-hour cap by attempt 12 anyway
    let capped_attempts = attempt_count.min(30) as u32;
    let base_delay_ms = 1000_i64.saturating_mul(1 << capped_attempts);
    let capped_delay_ms = base_delay_ms.min(3_600_000); // 1 hour cap

    // Add ±10% jitter
    let jitter_factor = rand::thread_rng().gen_range(0.9..=1.1);
    let delay_with_jitter = (capped_delay_ms as f64 * jitter_factor) as i64;

    now + delay_with_jitter
}

Update src/core/mod.rs:

pub mod backoff;  // Add to existing modules

Acceptance Criteria:

  • Single implementation shared by both queue retry paths
  • Cap at 1 hour prevents runaway delays
  • Jitter prevents thundering herd after outage recovery
  • Unit tests verify backoff curve and cap behavior
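The jittered result is random, but the underlying curve is deterministic; factoring the jitter out is what makes the cap behavior unit-testable. A sketch of that deterministic core:

```rust
/// Deterministic core of the backoff schedule: the pre-jitter delay,
/// min(1 hour, 1000ms * 2^attempt_count), with the shift clamped to
/// avoid overflow for large attempt counts.
fn capped_delay_ms(attempt_count: i64) -> i64 {
    let capped_attempts = attempt_count.clamp(0, 30) as u32;
    let base = 1000_i64.saturating_mul(1_i64 << capped_attempts);
    base.min(3_600_000)
}
```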

6.3 Document Regenerator

File: src/documents/regenerator.rs

use rusqlite::Connection;

use crate::core::error::Result;
use crate::core::time::now_ms;
use crate::documents::{
    extract_issue_document, extract_mr_document, extract_discussion_document,
    DocumentData, SourceType,
};
use crate::ingestion::dirty_tracker::{get_dirty_sources, clear_dirty};

/// Result of regeneration run.
#[derive(Debug, Default)]
pub struct RegenerateResult {
    pub regenerated: usize,
    pub unchanged: usize,
    pub errored: usize,
}

/// Regenerate documents from dirty queue.
///
/// Process:
/// 1. Query dirty_sources ordered by queued_at
/// 2. For each: regenerate document, compute new hash
/// 3. ALWAYS upsert document (labels/paths may change even if content_hash unchanged)
/// 4. Track whether content_hash changed (for stats)
/// 5. Delete from dirty_sources (or record error on failure)
pub fn regenerate_dirty_documents(conn: &Connection) -> Result<RegenerateResult> {
    let dirty = get_dirty_sources(conn)?;
    let mut result = RegenerateResult::default();

    for (source_type, source_id) in &dirty {
        match regenerate_one(conn, *source_type, *source_id) {
            Ok(changed) => {
                if changed {
                    result.regenerated += 1;
                } else {
                    result.unchanged += 1;
                }
                clear_dirty(conn, *source_type, *source_id)?;
            }
            Err(e) => {
                // Fail-soft: record error but continue processing remaining items
                record_dirty_error(conn, *source_type, *source_id, &e.to_string())?;
                result.errored += 1;
            }
        }
    }

    Ok(result)
}

/// Regenerate a single document. Returns true if content_hash changed.
///
/// If the source entity has been deleted, the corresponding document
/// is also deleted (cascade cleans up labels, paths, embeddings).
fn regenerate_one(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<bool> {
    // Extractors return Option: None means source entity was deleted
    let doc = match source_type {
        SourceType::Issue => extract_issue_document(conn, source_id)?,
        SourceType::MergeRequest => extract_mr_document(conn, source_id)?,
        SourceType::Discussion => extract_discussion_document(conn, source_id)?,
    };

    let Some(doc) = doc else {
        // Source was deleted — remove the document (cascade handles FTS/embeddings)
        delete_document(conn, source_type, source_id)?;
        return Ok(true);
    };

    let existing_hash = get_existing_hash(conn, source_type, source_id)?;
    let changed = existing_hash.as_ref() != Some(&doc.content_hash);

    // Always upsert: labels/paths can change independently of content_hash
    upsert_document(conn, &doc)?;

    Ok(changed)
}

/// Delete a document by source identity (cascade handles FTS trigger, labels, paths, embeddings).
fn delete_document(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<()> {
    conn.execute(
        "DELETE FROM documents WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
    )?;
    Ok(())
}

/// Record a regeneration error on a dirty source for retry.
///
/// IMPORTANT: Sets `next_attempt_at` using the shared backoff utility.
/// Without this, failed items would retry every run (hot-loop), defeating
/// the backoff design documented in the schema.
fn record_dirty_error(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
    error: &str,
) -> Result<()> {
    let now = now_ms();

    // Read current attempt_count from DB to compute backoff
    let attempt_count: i64 = conn.query_row(
        "SELECT attempt_count FROM dirty_sources WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
        |row| row.get(0),
    )?;

    // Use shared backoff utility (same as pending_discussion_fetches)
    let next_attempt_at = crate::core::backoff::compute_next_attempt_at(now, attempt_count + 1);

    conn.execute(
        "UPDATE dirty_sources
         SET attempt_count = attempt_count + 1,
             last_attempt_at = ?,
             last_error = ?,
             next_attempt_at = ?
         WHERE source_type = ? AND source_id = ?",
        rusqlite::params![now, error, next_attempt_at, source_type.as_str(), source_id],
    )?;
    Ok(())
}

/// Get existing content hash for a document, if it exists.
///
/// IMPORTANT: Uses `optional()` to distinguish between:
/// - No row found -> Ok(None)
/// - Row found -> Ok(Some(hash))
/// - DB error -> Err(...)
///
/// Using `.ok()` would hide real DB errors (disk I/O, corruption, etc.)
/// which should propagate up for proper error handling.
fn get_existing_hash(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<Option<String>> {
    use rusqlite::OptionalExtension;

    let mut stmt = conn.prepare(
        "SELECT content_hash FROM documents WHERE source_type = ? AND source_id = ?"
    )?;

    let hash: Option<String> = stmt
        .query_row(rusqlite::params![source_type.as_str(), source_id], |row| row.get(0))
        .optional()?;

    Ok(hash)
}

fn upsert_document(conn: &Connection, doc: &DocumentData) -> Result<()> {
    use rusqlite::OptionalExtension;

    // Check existing hashes before upserting (for write optimization)
    let existing: Option<(i64, String, String)> = conn
        .query_row(
            "SELECT id, labels_hash, paths_hash FROM documents
             WHERE source_type = ? AND source_id = ?",
            rusqlite::params![doc.source_type.as_str(), doc.source_id],
            |row| Ok((row.get(0)?, row.get(1)?, row.get(2)?)),
        )
        .optional()?;

    // Upsert main document (includes labels_hash, paths_hash)
    conn.execute(
        "INSERT INTO documents
         (source_type, source_id, project_id, author_username, label_names,
          labels_hash, paths_hash,
          created_at, updated_at, url, title, content_text, content_hash,
          is_truncated, truncated_reason)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
         ON CONFLICT(source_type, source_id) DO UPDATE SET
           author_username = excluded.author_username,
           label_names = excluded.label_names,
           labels_hash = excluded.labels_hash,
           paths_hash = excluded.paths_hash,
           updated_at = excluded.updated_at,
           url = excluded.url,
           title = excluded.title,
           content_text = excluded.content_text,
           content_hash = excluded.content_hash,
           is_truncated = excluded.is_truncated,
           truncated_reason = excluded.truncated_reason",
        rusqlite::params![
            doc.source_type.as_str(),
            doc.source_id,
            doc.project_id,
            doc.author_username,
            serde_json::to_string(&doc.labels)?,
            doc.labels_hash,
            doc.paths_hash,
            doc.created_at,
            doc.updated_at,
            doc.url,
            doc.title,
            doc.content_text,
            doc.content_hash,
            doc.is_truncated,
            doc.truncated_reason,
        ],
    )?;

    // Get document ID (either existing or newly inserted)
    let doc_id = match existing {
        Some((id, _, _)) => id,
        None => get_document_id(conn, doc.source_type, doc.source_id)?,
    };

    // Only update labels if hash changed (reduces write amplification)
    let labels_changed = match &existing {
        Some((_, old_hash, _)) => old_hash != &doc.labels_hash,
        None => true, // New document, must insert
    };
    if labels_changed {
        conn.execute(
            "DELETE FROM document_labels WHERE document_id = ?",
            [doc_id],
        )?;
        for label in &doc.labels {
            conn.execute(
                "INSERT INTO document_labels (document_id, label_name) VALUES (?, ?)",
                rusqlite::params![doc_id, label],
            )?;
        }
    }

    // Only update paths if hash changed (reduces write amplification)
    let paths_changed = match &existing {
        Some((_, _, old_hash)) => old_hash != &doc.paths_hash,
        None => true, // New document, must insert
    };
    if paths_changed {
        conn.execute(
            "DELETE FROM document_paths WHERE document_id = ?",
            [doc_id],
        )?;
        for path in &doc.paths {
            conn.execute(
                "INSERT INTO document_paths (document_id, path) VALUES (?, ?)",
                rusqlite::params![doc_id, path],
            )?;
        }
    }

    Ok(())
}

fn get_document_id(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<i64> {
    let id: i64 = conn.query_row(
        "SELECT id FROM documents WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
        |row| row.get(0),
    )?;
    Ok(id)
}
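
The shared backoff utility referenced by `record_dirty_error` lives in `src/core/backoff.rs`. A minimal sketch of its curve follows; the base delay, cap, and exact doubling schedule are assumptions, and the jitter mentioned in the unit-test table is omitted here so the sketch stays deterministic:

```rust
// Sketch of the shared backoff utility (src/core/backoff.rs).
// Constants and curve are illustrative assumptions, not the real values.

const BASE_DELAY_MS: i64 = 60_000; // 1 minute after the first failure
const MAX_DELAY_MS: i64 = 86_400_000; // cap at 24 hours

/// Earliest time (epoch ms) at which a failed item becomes due again.
/// `attempt_count` counts attempts made so far, including the failing one.
pub fn compute_next_attempt_at(now_ms: i64, attempt_count: i64) -> i64 {
    // Delay doubles per attempt: 1m, 2m, 4m, ... capped at MAX_DELAY_MS.
    let exp = (attempt_count - 1).clamp(0, 30) as u32;
    let delay = BASE_DELAY_MS.saturating_mul(1_i64 << exp).min(MAX_DELAY_MS);
    now_ms + delay
}
```

Because both `dirty_sources` and `pending_discussion_fetches` call the same function, the two queues degrade identically under repeated failures.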

Acceptance Criteria:

  • Dirty sources get documents regenerated
  • Hash comparison prevents unnecessary updates
  • FTS triggers fire on document update
  • Queue cleared after processing

6.4 CLI: lore sync

File: src/cli/commands/sync.rs

//! Sync command - orchestrate full sync pipeline.

use serde::Serialize;

use crate::core::error::Result;
use crate::Config;

/// Sync result summary.
#[derive(Debug, Serialize)]
pub struct SyncResult {
    pub issues_updated: usize,
    pub mrs_updated: usize,
    pub discussions_fetched: usize,
    pub documents_regenerated: usize,
    pub documents_embedded: usize,
}

/// Sync options.
#[derive(Debug, Default)]
pub struct SyncOptions {
    pub full: bool,       // Reset cursors, fetch everything
    pub force: bool,      // Override stale lock
    pub no_embed: bool,   // Skip embedding step
    pub no_docs: bool,    // Skip document regeneration
}

/// Run sync orchestration.
///
/// Steps:
/// 1. Acquire app lock with heartbeat
/// 2. Ingest delta (issues, MRs) based on cursors
/// 3. Process pending_discussion_fetches queue (bounded)
/// 4. Apply rolling backfill window (configurable, default 14 days)
/// 5. Regenerate documents from dirty_sources
/// 6. Embed documents with changed content_hash
/// 7. Release lock, record sync_run
pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResult> {
    // Implementation uses existing ingestion orchestrator
    // and new document/embedding pipelines
    todo!()
}

/// Print human-readable sync output.
pub fn print_sync(result: &SyncResult, elapsed_secs: u64) {
    println!("Sync complete:");
    println!("  Issues updated:        {:>6}", result.issues_updated);
    println!("  MRs updated:           {:>6}", result.mrs_updated);
    println!("  Discussions fetched:   {:>6}", result.discussions_fetched);
    println!("  Documents regenerated: {:>6}", result.documents_regenerated);
    println!("  Documents embedded:    {:>6}", result.documents_embedded);
    println!("  Elapsed: {}m {}s", elapsed_secs / 60, elapsed_secs % 60);
}

/// Print JSON output for robot mode.
pub fn print_sync_json(result: &SyncResult, elapsed_ms: u64) {
    let output = serde_json::json!({
        "ok": true,
        "data": result,
        "meta": {
            "elapsed_ms": elapsed_ms
        }
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration:

/// Sync subcommand arguments.
#[derive(Args)]
pub struct SyncArgs {
    /// Reset cursors, fetch everything
    #[arg(long)]
    full: bool,

    /// Override stale lock
    #[arg(long)]
    force: bool,

    /// Skip embedding step
    #[arg(long)]
    no_embed: bool,

    /// Skip document regeneration
    #[arg(long)]
    no_docs: bool,
}

Acceptance Criteria:

  • Orchestrates full sync pipeline
  • Respects app lock
  • --full resets cursors
  • --no-embed skips embedding
  • --no-docs skips document regeneration
  • Progress reporting in human mode
  • JSON summary in robot mode
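
The flag semantics above can be sketched as a pure step planner. The step names and the planner function are illustrative, not the real orchestrator:

```rust
#[derive(Debug, Default)]
pub struct SyncOptions {
    pub full: bool,
    pub force: bool,
    pub no_embed: bool,
    pub no_docs: bool,
}

/// Illustrative sketch: derive the pipeline steps a sync run would
/// execute from its flags, mirroring the 7 steps in run_sync's docs.
pub fn plan_steps(opts: &SyncOptions) -> Vec<&'static str> {
    let mut steps = Vec::new();
    if opts.full {
        steps.push("reset-cursors"); // --full refetches everything
    }
    steps.push("ingest-delta");
    steps.push("process-discussion-queue");
    steps.push("rolling-backfill");
    if !opts.no_docs {
        steps.push("regenerate-documents"); // skipped by --no-docs
    }
    if !opts.no_embed {
        steps.push("embed-changed"); // skipped by --no-embed
    }
    steps.push("record-sync-run");
    steps
}
```

Separating planning from execution like this also makes the flag combinations trivially unit-testable without touching the network or the database.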

Testing Strategy

Unit Tests

| Module | Test File | Coverage |
| --- | --- | --- |
| Document extractor | `src/documents/extractor.rs` (mod tests) | Issue/MR/discussion extraction, consistent headers |
| Truncation | `src/documents/truncation.rs` (mod tests) | All edge cases |
| RRF ranking | `src/search/rrf.rs` (mod tests) | Score computation, merging |
| Content hash | `src/documents/extractor.rs` (mod tests) | Deterministic hashing |
| FTS query sanitization | `src/search/fts.rs` (mod tests) | `to_fts_query()` edge cases: `-`, `"`, `:`, `*`, `C++` |
| SourceType parsing | `src/documents/extractor.rs` (mod tests) | `parse()` accepts aliases: `mr`, `mrs`, `issue`, etc. |
| SearchFilters | `src/search/filters.rs` (mod tests) | `has_any_filter()`, `clamp_limit()` |
| Backoff logic | `src/core/backoff.rs` (mod tests) | Shared exponential backoff curve, cap, jitter |
| Hydration | `src/cli/commands/search.rs` (mod tests) | Single round-trip, label/path aggregation |
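
The sanitizer exercised by the FTS tests can be sketched as follows. The quoting strategy is an assumption: FTS5 treats a double-quoted string as a literal token, with embedded quotes doubled, so wrapping every token neutralizes operator characters:

```rust
/// Sketch of to_fts_query(): wrap each whitespace-separated token in
/// FTS5 string quotes so operator characters (-, :, *, ") are matched
/// literally instead of being parsed as query syntax.
fn to_fts_query(input: &str) -> String {
    input
        .split_whitespace()
        .map(|tok| format!("\"{}\"", tok.replace('"', "\"\"")))
        .collect::<Vec<_>>()
        .join(" ")
}
```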

Integration Tests

| Feature | Test File | Coverage |
| --- | --- | --- |
| FTS search | `tests/fts_search.rs` | Stemming, empty results |
| Embedding storage | `tests/embedding.rs` | sqlite-vec operations |
| Hybrid search | `tests/hybrid_search.rs` | Combined retrieval |
| Sync orchestration | `tests/sync.rs` | Full pipeline |

Golden Query Suite

File: tests/fixtures/golden_queries.json

[
  {
    "query": "authentication redesign",
    "expected_urls": [".../-/issues/234", ".../-/merge_requests/847"],
    "min_results": 1,
    "max_rank": 10
  }
]

Each query must surface at least one of its expected URLs within the top 10 results.
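
A checker for this rule might look like the following. The struct fields mirror the fixture keys; the function itself is illustrative:

```rust
/// One entry from tests/fixtures/golden_queries.json.
pub struct GoldenQuery {
    pub expected_urls: Vec<String>,
    pub min_results: usize,
    pub max_rank: usize,
}

/// Illustrative check: the query passes when enough results came back
/// and at least one expected URL appears within the top `max_rank`.
pub fn golden_query_passes(q: &GoldenQuery, ranked_urls: &[String]) -> bool {
    ranked_urls.len() >= q.min_results
        && ranked_urls
            .iter()
            .take(q.max_rank)
            .any(|url| q.expected_urls.iter().any(|e| e == url))
}
```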


CLI Smoke Tests

| Command | Expected | Pass Criteria |
| --- | --- | --- |
| `lore generate-docs` | Progress, count | Completes, count > 0 |
| `lore generate-docs` (re-run) | 0 regenerated | Hash comparison works |
| `lore embed` | Progress, count | Completes, count matches docs |
| `lore embed` (re-run) | 0 embedded | Skips unchanged |
| `lore embed --retry-failed` | Processes failed | Only failed docs processed |
| `lore stats` | Coverage stats | Shows 100% after embed |
| `lore stats` | Queue depths | Shows `dirty_sources` and `pending_discussion_fetches` counts |
| `lore search "auth" --mode=lexical` | Results | Works without Ollama |
| `lore search "auth"` | Hybrid results | Vector + FTS combined |
| `lore search "auth"` (Ollama down) | FTS results + warning | Graceful degradation, warning in response |
| `lore search "auth" --explain` | Rank breakdown | Shows vector/FTS/RRF |
| `lore search "auth" --type=mr` | Filtered results | Only MRs |
| `lore search "auth" --type=mrs` | Filtered results | Alias works |
| `lore search "auth" --label=bug` | Filtered results | Only labeled docs |
| `lore search "-DWITH_SSL"` | Results | Leading dash doesn't cause FTS error |
| `lore search 'C++'` | Results | Special chars in query work |
| `lore search "auth" --updated-after 2024-01-01` | Filtered results | Only recently updated docs |
| `lore search "nonexistent123"` | No results | Graceful empty state |
| `lore search "auth" --mode=semantic` (no embeddings) | Actionable error | Tells user to run `lore embed` first |
| `lore sync` | Full pipeline | All steps complete |
| `lore sync --no-embed` | Skip embedding | Docs generated, not embedded |
| `lore generate-docs --full` | Progress, count | Keyset pagination completes without OFFSET degradation |
| `lore stats --check --repair` | Repair results | FTS rebuilt, orphans cleaned |

Data Integrity Checks

  • documents count = issues + MRs + discussions
  • documents_fts count = documents count
  • embeddings count = documents count (after full embed)
  • embedding_metadata.content_hash = documents.content_hash for all rows
  • All document_labels reference valid documents
  • All document_paths reference valid documents
  • No orphaned embeddings (embeddings.rowid without matching documents.id)
  • Discussion documents exclude system notes
  • Discussion documents include parent title
  • All dirty_sources entries reference existing source entities
  • All pending_discussion_fetches entries reference existing projects
  • attempt_count >= 0 for all queue entries (never negative)
  • last_attempt_at is NULL when attempt_count = 0
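
The last two queue invariants combine into one predicate. A sketch over the columns named above (the `Option` mapping for the nullable column is an assumption):

```rust
/// Sketch: encode the queue-entry invariants listed above.
/// attempt_count must be non-negative, and last_attempt_at must be
/// NULL exactly until the first attempt is recorded.
fn queue_entry_ok(attempt_count: i64, last_attempt_at: Option<i64>) -> bool {
    attempt_count >= 0 && (attempt_count > 0 || last_attempt_at.is_none())
}
```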

Success Criteria

Checkpoint 3 is complete when all three gates pass:

Gate A: Lexical MVP

  1. Lexical search works without Ollama

    • lore search "query" --mode=lexical returns relevant results
    • All filters functional (including --updated-after)
    • FTS5 syntax errors prevented by query sanitization
    • Special characters in queries work correctly (-DWITH_SSL, C++)
    • Search results hydrated in single DB round-trip (no N+1)
  2. Document generation is correct

    • Full and incremental modes use the same regenerator codepath
    • --full uses keyset pagination (no OFFSET degradation)
    • FTS triggers use COALESCE for NULL-safe operation

Gate B: Hybrid MVP

  1. Semantic search works with Ollama

    • lore embed completes successfully
    • lore search "query" returns semantically relevant results
    • --explain shows ranking breakdown
    • --mode=semantic with 0% embedding coverage returns actionable error
  2. Hybrid search combines both

    • Documents appearing in both retrievers rank higher
    • Graceful degradation when Ollama unavailable (falls back to FTS)
    • Transient embed failures don't fail the entire search
    • Warning message included in response on degradation
    • Embedding pipeline uses keyset pagination for consistent paging
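
The "appears in both retrievers ranks higher" property is exactly what Reciprocal Rank Fusion provides. A minimal sketch (k = 60 is the conventional RRF constant and an assumption here, as is the function shape):

```rust
use std::collections::HashMap;

/// Minimal RRF sketch: each ranked list contributes 1 / (k + rank) per
/// document, so documents found by both the vector and FTS retrievers
/// accumulate score from both lists and rise to the top.
fn rrf_fuse(vector_ids: &[i64], fts_ids: &[i64], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in [vector_ids, fts_ids] {
        for (i, &doc_id) in list.iter().enumerate() {
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(i64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken arbitrarily in this sketch.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```

Note that graceful degradation falls out naturally: when the vector list is empty (Ollama down), fusion reduces to the FTS ranking alone.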

Gate C: Sync MVP

  1. Incremental sync is efficient

    • lore sync only processes changed entities
    • Re-embedding only happens for changed documents
    • Progress visible during long syncs
    • Queue backoff actually prevents hot-loop retries (both queues set next_attempt_at)
    • Shared backoff utility ensures consistent behavior across queues
  2. Data integrity maintained

    • All counts match between tables
    • No orphaned records
    • Hashes consistent
    • get_existing_hash() properly distinguishes "not found" from DB errors
    • --repair uses FTS rebuild for correct-by-construction repair
  3. Observability

    • lore stats shows queue depths and failed item counts
    • Failed items visible for operator intervention
    • Deterministic ordering ensures consistent paging
  4. Tests pass

    • Unit tests for core algorithms (including FTS sanitization, shared backoff, hydration)
    • Integration tests for pipelines
    • Golden queries return expected results