gitlore/docs/prd/checkpoint-3.md

Checkpoint 3: Search & Sync MVP

Status: Planning
Prerequisite: Checkpoints 0, 1, 2 complete (issues, MRs, discussions ingested)
Goal: Deliver working semantic + lexical hybrid search with efficient incremental sync

This checkpoint consolidates SPEC.md checkpoints 3A, 3B, 4, and 5 into a unified implementation plan. The work is structured for parallel agent execution where dependencies allow.

All code integrates with existing gitlab-inbox infrastructure:

  • Error handling via GiError and ErrorCode in src/core/error.rs
  • CLI patterns matching src/cli/commands/*.rs (run functions, JSON/human output)
  • Database via rusqlite::Connection with migrations in migrations/
  • Config via src/core/config.rs (EmbeddingConfig already defined)
  • Robot mode JSON with {"ok": true, "data": {...}} pattern

Executive Summary

Deliverables:

  1. Document generation from issues/MRs/discussions with FTS5 indexing
  2. Ollama-powered embedding pipeline with sqlite-vec storage
  3. Hybrid search (RRF-ranked vector + lexical) with rich filtering
  4. Orchestrated gi sync command with incremental re-embedding

Key Design Decisions:

  • Documents are the search unit (not raw entities)
  • FTS5 works standalone when Ollama unavailable (graceful degradation)
  • sqlite-vec rowid = documents.id for simple joins
  • RRF ranking avoids score normalization complexity
  • Queue-based discussion fetching isolates failures

Phase 1: Schema Foundation

1.1 Documents Schema (Migration 007)

File: migrations/007_documents.sql

-- Unified searchable documents (derived from issues/MRs/discussions)
CREATE TABLE documents (
  id INTEGER PRIMARY KEY,
  source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
  source_id INTEGER NOT NULL,    -- local DB id in the source table
  project_id INTEGER NOT NULL REFERENCES projects(id),
  author_username TEXT,          -- for discussions: first note author
  label_names TEXT,              -- JSON array (display/debug only)
  created_at INTEGER,            -- ms epoch UTC
  updated_at INTEGER,            -- ms epoch UTC
  url TEXT,
  title TEXT,                    -- null for discussions
  content_text TEXT NOT NULL,    -- canonical text for embedding/search
  content_hash TEXT NOT NULL,    -- SHA-256 for change detection
  is_truncated INTEGER NOT NULL DEFAULT 0,
  truncated_reason TEXT CHECK (
    truncated_reason IN ('token_limit_middle_drop','single_note_oversized','first_last_oversized')
    OR truncated_reason IS NULL
  ),
  UNIQUE(source_type, source_id)
);

CREATE INDEX idx_documents_project_updated ON documents(project_id, updated_at);
CREATE INDEX idx_documents_author ON documents(author_username);
CREATE INDEX idx_documents_source ON documents(source_type, source_id);
CREATE INDEX idx_documents_hash ON documents(content_hash);

-- Fast label filtering (indexed exact-match)
CREATE TABLE document_labels (
  document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  label_name TEXT NOT NULL,
  PRIMARY KEY(document_id, label_name)
) WITHOUT ROWID;
CREATE INDEX idx_document_labels_label ON document_labels(label_name);

-- Fast path filtering (DiffNote file paths)
CREATE TABLE document_paths (
  document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  path TEXT NOT NULL,
  PRIMARY KEY(document_id, path)
) WITHOUT ROWID;
CREATE INDEX idx_document_paths_path ON document_paths(path);

-- Queue for incremental document regeneration (with retry tracking)
CREATE TABLE dirty_sources (
  source_type TEXT NOT NULL CHECK (source_type IN ('issue','merge_request','discussion')),
  source_id INTEGER NOT NULL,
  queued_at INTEGER NOT NULL,    -- ms epoch UTC
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER,
  last_error TEXT,
  PRIMARY KEY(source_type, source_id)
);
CREATE INDEX idx_dirty_sources_retry
  ON dirty_sources(attempt_count, last_attempt_at)
  WHERE last_error IS NOT NULL;

-- Resumable queue for dependent discussion fetching
CREATE TABLE pending_discussion_fetches (
  project_id INTEGER NOT NULL REFERENCES projects(id),
  noteable_type TEXT NOT NULL,            -- 'Issue' | 'MergeRequest'
  noteable_iid INTEGER NOT NULL,
  queued_at INTEGER NOT NULL,             -- ms epoch UTC
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER,
  last_error TEXT,
  PRIMARY KEY(project_id, noteable_type, noteable_iid)
);
CREATE INDEX idx_pending_discussions_retry
  ON pending_discussion_fetches(attempt_count, last_attempt_at)
  WHERE last_error IS NOT NULL;

Acceptance Criteria:

  • Migration applies cleanly on fresh DB
  • Migration applies cleanly after CP2 schema
  • All foreign keys enforced
  • Indexes created

1.2 FTS5 Index (Migration 008)

File: migrations/008_fts5.sql

-- Full-text search with porter stemmer and prefix indexes for type-ahead
CREATE VIRTUAL TABLE documents_fts USING fts5(
  title,
  content_text,
  content='documents',
  content_rowid='id',
  tokenize='porter unicode61',
  prefix='2 3 4'
);

-- Keep FTS in sync via triggers
CREATE TRIGGER documents_ai AFTER INSERT ON documents BEGIN
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

CREATE TRIGGER documents_ad AFTER DELETE ON documents BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
END;

-- Only rebuild FTS when searchable text actually changes (not metadata-only updates)
CREATE TRIGGER documents_au AFTER UPDATE ON documents
WHEN old.title IS NOT new.title OR old.content_text != new.content_text
BEGIN
  INSERT INTO documents_fts(documents_fts, rowid, title, content_text)
  VALUES('delete', old.id, old.title, old.content_text);
  INSERT INTO documents_fts(rowid, title, content_text)
  VALUES (new.id, new.title, new.content_text);
END;

Acceptance Criteria:

  • documents_fts created as virtual table
  • Triggers fire on insert/update/delete
  • Update trigger only fires when title or content_text changes (not metadata-only updates)
  • FTS row count matches documents count after bulk insert
  • Prefix search works for type-ahead UX

1.3 Embeddings Schema (Migration 009)

File: migrations/009_embeddings.sql

-- NOTE: sqlite-vec vec0 virtual tables cannot participate in FK cascades.
-- We must use an explicit trigger to delete orphan embeddings when documents
-- are deleted. See documents_embeddings_ad trigger below.

-- sqlite-vec virtual table for vector search
-- Storage rule: embeddings.rowid = documents.id
CREATE VIRTUAL TABLE embeddings USING vec0(
  embedding float[768]
);

-- Embedding provenance + change detection
CREATE TABLE embedding_metadata (
  document_id INTEGER PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
  model TEXT NOT NULL,           -- 'nomic-embed-text'
  dims INTEGER NOT NULL,         -- 768
  content_hash TEXT NOT NULL,    -- copied from documents.content_hash
  created_at INTEGER NOT NULL,   -- ms epoch UTC
  last_error TEXT,               -- error message from last failed attempt
  attempt_count INTEGER NOT NULL DEFAULT 0,
  last_attempt_at INTEGER        -- ms epoch UTC
);

CREATE INDEX idx_embedding_metadata_errors
  ON embedding_metadata(last_error) WHERE last_error IS NOT NULL;
CREATE INDEX idx_embedding_metadata_hash ON embedding_metadata(content_hash);

-- CRITICAL: Delete orphan embeddings when documents are deleted.
-- vec0 virtual tables don't support FK ON DELETE CASCADE, so we need this trigger.
-- embedding_metadata has ON DELETE CASCADE, so only vec0 needs explicit cleanup
CREATE TRIGGER documents_embeddings_ad AFTER DELETE ON documents BEGIN
  DELETE FROM embeddings WHERE rowid = old.id;
END;

Acceptance Criteria:

  • embeddings vec0 table created
  • embedding_metadata tracks provenance
  • Error tracking fields present for retry logic
  • Orphan cleanup trigger fires on document deletion
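sqlite-vec accepts vectors either as JSON text or as compact blobs of contiguous little-endian f32 values. A minimal sketch of the blob form (the helper name is illustrative, not from the codebase):

```rust
/// Convert an embedding to the compact blob format sqlite-vec's vec0
/// tables accept: 4 bytes per dimension, little-endian f32, no header.
/// (Assumption: the blob binding is used rather than the JSON text form.)
fn embedding_to_blob(embedding: &[f32]) -> Vec<u8> {
    let mut blob = Vec::with_capacity(embedding.len() * 4);
    for value in embedding {
        blob.extend_from_slice(&value.to_le_bytes());
    }
    blob
}
```

With the `embeddings.rowid = documents.id` storage rule, inserts become a single bind of `documents.id` plus this blob.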

Dependencies:

  • Requires sqlite-vec extension loaded at runtime
  • Extension loading already happens in src/core/db.rs
  • Migration runner must load sqlite-vec before applying migrations (including on fresh DB)

Phase 2: Document Generation

2.1 Document Module Structure

New module: src/documents/

src/documents/
├── mod.rs           # Module exports
├── extractor.rs     # Document extraction from entities
├── truncation.rs    # Note-boundary aware truncation
└── regenerator.rs   # Dirty source processing

File: src/documents/mod.rs

//! Document generation and management.
//!
//! Extracts searchable documents from issues, MRs, and discussions.

mod extractor;
mod regenerator;
mod truncation;

pub use extractor::{
    extract_discussion_document, extract_issue_document, extract_mr_document,
    DocumentData, SourceType,
};
// Note: extract_*_document() return Result<Option<DocumentData>>
// None means the source entity was deleted from the database
pub use regenerator::regenerate_dirty_documents;
pub use truncation::{truncate_content, TruncationResult};

Update src/lib.rs:

pub mod documents;  // Add to existing modules

2.2 Document Types

File: src/documents/extractor.rs

use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};

/// Source type for documents.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum SourceType {
    Issue,
    MergeRequest,
    Discussion,
}

impl SourceType {
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Issue => "issue",
            Self::MergeRequest => "merge_request",
            Self::Discussion => "discussion",
        }
    }
}

impl std::fmt::Display for SourceType {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.as_str())
    }
}

/// Generated document ready for storage.
#[derive(Debug, Clone)]
pub struct DocumentData {
    pub source_type: SourceType,
    pub source_id: i64,
    pub project_id: i64,
    pub author_username: Option<String>,
    pub labels: Vec<String>,
    pub paths: Vec<String>,  // DiffNote file paths
    pub created_at: i64,
    pub updated_at: i64,
    pub url: Option<String>,
    pub title: Option<String>,
    pub content_text: String,
    pub content_hash: String,
    pub is_truncated: bool,
    pub truncated_reason: Option<String>,
}

/// Compute SHA-256 hash of content.
pub fn compute_content_hash(content: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(content.as_bytes());
    format!("{:x}", hasher.finalize())
}

Document Formats:

| Source | content_text |
| --- | --- |
| Issue | {title}\n\n{description} |
| MR | {title}\n\n{description} |
| Discussion | Full thread with header (see below) |

Discussion Document Format:

[[Discussion]] Issue #234: Authentication redesign
Project: group/project-one
URL: https://gitlab.example.com/group/project-one/-/issues/234#note_12345
Labels: ["bug", "auth"]
Files: ["src/auth/login.ts"]

--- Thread ---

@johndoe (2024-03-15):
I think we should move to JWT-based auth...

@janedoe (2024-03-15):
Agreed. What about refresh token strategy?
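The format above can be assembled with ordinary string formatting. A sketch, assuming a hypothetical note struct (field and function names are illustrative, not from extractor.rs):

```rust
/// A note in a discussion thread (mirrors the format shown above).
struct Note {
    author: String,
    date: String, // YYYY-MM-DD
    body: String,
}

/// Build the canonical content_text for a discussion document.
/// The header fields and [[Discussion]] marker follow the format above.
fn build_discussion_text(
    parent: &str, // e.g. "Issue #234: Authentication redesign"
    project: &str,
    url: &str,
    labels: &[&str],
    files: &[&str],
    notes: &[Note],
) -> String {
    let mut out = String::new();
    out.push_str(&format!("[[Discussion]] {}\n", parent));
    out.push_str(&format!("Project: {}\n", project));
    out.push_str(&format!("URL: {}\n", url));
    out.push_str(&format!("Labels: {:?}\n", labels));
    out.push_str(&format!("Files: {:?}\n\n", files));
    out.push_str("--- Thread ---\n");
    for note in notes {
        // System notes (is_system=1) are assumed to be filtered out upstream.
        out.push_str(&format!("\n@{} ({}):\n{}\n", note.author, note.date, note.body));
    }
    out
}
```

The resulting string feeds both compute_content_hash and the truncation pass below.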

Acceptance Criteria:

  • Issue document: title + description concatenated
  • MR document: title + description concatenated
  • Discussion document: includes parent title, project, URL, labels, files, thread
  • System notes (is_system=1) excluded from discussion content
  • DiffNote file paths extracted to paths vector
  • Labels extracted to labels vector
  • SHA-256 hash computed from content_text

2.3 Truncation Logic

File: src/documents/truncation.rs

/// Maximum content length (~8,000 tokens at 4 chars/token estimate).
pub const MAX_CONTENT_CHARS: usize = 32_000;

/// Truncation result with metadata.
#[derive(Debug, Clone)]
pub struct TruncationResult {
    pub content: String,
    pub is_truncated: bool,
    pub reason: Option<TruncationReason>,
}

/// Reason for truncation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TruncationReason {
    TokenLimitMiddleDrop,
    SingleNoteOversized,
    FirstLastOversized,
}

impl TruncationReason {
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::TokenLimitMiddleDrop => "token_limit_middle_drop",
            Self::SingleNoteOversized => "single_note_oversized",
            Self::FirstLastOversized => "first_last_oversized",
        }
    }
}

/// Truncate content at note boundaries.
///
/// Rules:
/// - Max content: 32,000 characters
/// - Truncate at NOTE boundaries (never mid-note)
/// - Preserve first N notes and last M notes
/// - Drop from middle, insert marker
pub fn truncate_content(notes: &[NoteContent], max_chars: usize) -> TruncationResult {
    // Implementation handles edge cases per table below
    todo!()
}

/// Note content for truncation.
pub struct NoteContent {
    pub author: String,
    pub date: String,
    pub body: String,
}

Edge Cases:

| Scenario | Handling |
| --- | --- |
| Single note > 32,000 chars | Truncate at char boundary, append [truncated], reason = single_note_oversized |
| First + last note > 32,000 chars | Keep only first note (truncated if needed), reason = first_last_oversized |
| Only one note | Truncate at char boundary if needed |

Acceptance Criteria:

  • Notes never cut mid-content
  • First and last notes preserved when possible
  • Truncation marker \n\n[... N notes omitted for length ...]\n\n inserted
  • Metadata fields set correctly
  • Edge cases handled per table above
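The middle-drop rule can be sketched as a shrinking prefix plus the final note: keep the first k notes and the last note, reducing k until the rendered text fits. A simplified illustration (it omits the oversized-note edge cases from the table above; names are illustrative):

```rust
/// Sketch of note-boundary truncation: keep a prefix of notes plus the
/// final note, dropping middle notes until the result fits max_chars.
/// Simplified relative to the real rules (no single-note char truncation).
fn truncate_notes(notes: &[String], max_chars: usize) -> String {
    let full = notes.join("\n\n");
    if full.len() <= max_chars || notes.len() < 3 {
        return full;
    }
    // Try keeping the first k notes plus the last note; shrink k until it fits.
    for k in (1..notes.len() - 1).rev() {
        let omitted = notes.len() - 1 - k;
        let marker = format!("[... {} notes omitted for length ...]", omitted);
        let mut parts: Vec<&str> = notes[..k].iter().map(String::as_str).collect();
        parts.push(&marker);
        parts.push(notes[notes.len() - 1].as_str());
        let candidate = parts.join("\n\n");
        if candidate.len() <= max_chars {
            return candidate;
        }
    }
    // Fallback: even first + last does not fit; keep only the first note
    // (the real implementation also char-truncates it per the table above).
    notes[0].clone()
}
```

Because notes are joined with `\n\n`, the marker lands on its own paragraph, matching the `\n\n[... N notes omitted for length ...]\n\n` format in the acceptance criteria.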

2.4 CLI: gi generate-docs (Incremental by Default)

File: src/cli/commands/generate_docs.rs

//! Generate documents command - create searchable documents from entities.
//!
//! By default, runs incrementally (processes only dirty_sources queue).
//! Use --full to regenerate all documents from scratch.

use rusqlite::Connection;
use serde::Serialize;

use crate::core::error::Result;
use crate::documents::{DocumentData, SourceType};
use crate::Config;

/// Result of document generation.
#[derive(Debug, Serialize)]
pub struct GenerateDocsResult {
    pub issues: usize,
    pub mrs: usize,
    pub discussions: usize,
    pub total: usize,
    pub truncated: usize,
    pub skipped: usize,  // Unchanged documents
}

/// Run document generation (incremental by default).
///
/// Incremental mode (default):
/// - Processes only items in dirty_sources queue
/// - Fast for routine syncs
///
/// Full mode (--full):
/// - Regenerates ALL documents from scratch
/// - Use when schema changes or after migration
pub fn run_generate_docs(
    config: &Config,
    full: bool,
    project_filter: Option<&str>,
) -> Result<GenerateDocsResult> {
    if full {
        // Full mode: regenerate everything inside a single transaction
        // 1. BEGIN IMMEDIATE transaction
        // 2. Query all issues, MRs, discussions
        // 3. For each: generate document, compute hash
        // 4. Upsert into `documents` table (FTS triggers auto-fire)
        // 5. Populate `document_labels` and `document_paths`
        // 6. Rebuild FTS: INSERT INTO documents_fts(documents_fts) VALUES('rebuild')
        // 7. COMMIT
        // 8. Return counts
        //
        // The FTS rebuild at step 6 ensures the index is consistent
        // after bulk operations. Wrapping in a transaction avoids
        // partial state if the process is interrupted.
    } else {
        // Incremental mode: process dirty_sources only
        // 1. Query dirty_sources (bounded by LIMIT)
        // 2. Regenerate only those documents
        // 3. Clear from dirty_sources after processing
    }
    todo!()
}

/// Print human-readable output.
pub fn print_generate_docs(result: &GenerateDocsResult) {
    println!("Document generation complete:");
    println!("  Issues:      {:>6} documents", result.issues);
    println!("  MRs:         {:>6} documents", result.mrs);
    println!("  Discussions: {:>6} documents", result.discussions);
    println!("  ─────────────────────");
    println!("  Total:       {:>6} documents", result.total);
    if result.truncated > 0 {
        println!("  Truncated:   {:>6}", result.truncated);
    }
    if result.skipped > 0 {
        println!("  Skipped:     {:>6} (unchanged)", result.skipped);
    }
}

/// Print JSON output for robot mode.
pub fn print_generate_docs_json(result: &GenerateDocsResult) {
    let output = serde_json::json!({
        "ok": true,
        "data": result
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration in src/cli/mod.rs:

/// Generate-docs subcommand arguments.
#[derive(Args)]
pub struct GenerateDocsArgs {
    /// Regenerate ALL documents (not just dirty queue)
    #[arg(long)]
    full: bool,

    /// Only generate for specific project
    #[arg(long)]
    project: Option<String>,
}

Acceptance Criteria:

  • Creates document for each issue
  • Creates document for each MR
  • Creates document for each discussion
  • Default mode processes dirty_sources queue only (incremental)
  • --full regenerates all documents from scratch
  • Progress bar in human mode (via indicatif)
  • JSON output in robot mode

Phase 3: Search

3.1 Search Module Structure

New module: src/search/

src/search/
├── mod.rs       # Module exports
├── fts.rs       # FTS5 search
├── vector.rs    # Vector search (sqlite-vec)
├── hybrid.rs    # Combined hybrid search
├── rrf.rs       # RRF ranking algorithm
└── filters.rs   # Filter parsing and application

File: src/search/mod.rs

//! Search functionality for documents.
//!
//! Supports lexical (FTS5), semantic (vector), and hybrid search.

mod filters;
mod fts;
mod hybrid;
mod rrf;
mod vector;

pub use filters::{apply_filters, PathFilter, SearchFilters};
pub use fts::{search_fts, FtsResult};
pub use hybrid::{search_hybrid, HybridResult, SearchMode};
pub use rrf::{rank_rrf, RrfResult};
pub use vector::{search_vector, VectorResult};
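The RRF combination implemented in rrf.rs can be sketched as follows: each ranked list contributes 1/(k + rank) per document, with k = 60 as the conventional constant. The exact signature is illustrative:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: combine ranked result lists without score
/// normalization. Each list contributes 1 / (k + rank) per document.
fn rank_rrf(lists: &[Vec<i64>], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in lists {
        for (rank, doc_id) in list.iter().enumerate() {
            // enumerate() is 0-based; use rank + 1 as the 1-based position
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut ranked: Vec<(i64, f64)> = scores.into_iter().collect();
    // Sort by descending score; break ties by document id for determinism
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    ranked
}
```

Because only ranks matter, BM25 scores (lower is better) and vector distances never need to be normalized onto a common scale, which is the design decision called out in the executive summary.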

3.2 FTS5 Search Function

File: src/search/fts.rs

use rusqlite::Connection;
use crate::core::error::Result;

/// FTS search result.
#[derive(Debug, Clone)]
pub struct FtsResult {
    pub document_id: i64,
    pub rank: f64,     // BM25 score (lower = better match)
    pub snippet: String, // Context snippet around match
}

/// Search documents using FTS5.
///
/// Returns matching document IDs with BM25 rank scores and snippets.
/// Lower rank values indicate better matches.
/// Uses bm25() explicitly (not the `rank` alias) and snippet() for context.
pub fn search_fts(
    conn: &Connection,
    query: &str,
    limit: usize,
) -> Result<Vec<FtsResult>> {
    if query.trim().is_empty() {
        return Ok(Vec::new());
    }

    let mut stmt = conn.prepare(
        "SELECT rowid,
                bm25(documents_fts),
                snippet(documents_fts, 1, '<mark>', '</mark>', '...', 64)
         FROM documents_fts
         WHERE documents_fts MATCH ?
         ORDER BY bm25(documents_fts)
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![query, limit as i64], |row| {
            Ok(FtsResult {
                document_id: row.get(0)?,
                rank: row.get(1)?,
                snippet: row.get(2)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

Acceptance Criteria:

  • Returns matching document IDs with BM25 rank
  • Porter stemming works (search/searching match)
  • Prefix search works (type-ahead UX)
  • Empty query returns empty results
  • Nonsense query returns empty results
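One way to make arbitrary user input safe for MATCH is to escape it before querying, since raw characters like `-` or `:` are otherwise parsed as FTS5 query syntax and raise errors. A hypothetical helper (the spec above does not mandate this approach):

```rust
/// Escape a raw user query for FTS5 by quoting each whitespace-separated
/// token as a string literal (embedded double quotes are doubled).
/// Prevents characters like '-' or ':' from being parsed as FTS5 operators.
fn escape_fts_query(raw: &str) -> String {
    raw.split_whitespace()
        .map(|token| format!("\"{}\"", token.replace('"', "\"\"")))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Note that quoting tokens also disables operator and prefix syntax; for type-ahead, a real implementation would append `*` outside the closing quote of the final token (`"auth"*` is valid FTS5).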

3.3 Search Filters

File: src/search/filters.rs

use rusqlite::Connection;
use crate::core::error::Result;
use crate::documents::SourceType;

/// Search filters applied post-retrieval.
#[derive(Debug, Clone, Default)]
pub struct SearchFilters {
    pub source_type: Option<SourceType>,
    pub author: Option<String>,
    pub project_id: Option<i64>,
    pub after: Option<i64>,        // ms epoch
    pub labels: Vec<String>,       // AND logic
    pub path: Option<PathFilter>,
    pub limit: usize,              // Default 20, max 100
}

/// Path filter with prefix or exact match.
#[derive(Debug, Clone)]
pub enum PathFilter {
    Prefix(String),  // Trailing `/` -> LIKE 'path/%'
    Exact(String),   // No trailing `/` -> = 'path'
}

impl PathFilter {
    pub fn from_str(s: &str) -> Self {
        if s.ends_with('/') {
            Self::Prefix(s.to_string())
        } else {
            Self::Exact(s.to_string())
        }
    }
}

/// Apply filters to document IDs, returning filtered set.
///
/// IMPORTANT: Preserves ranking order from input document_ids.
/// Filters must not reorder results - maintain the RRF/search ranking.
pub fn apply_filters(
    conn: &Connection,
    document_ids: &[i64],
    filters: &SearchFilters,
) -> Result<Vec<i64>> {
    // Build dynamic WHERE clause based on filters
    // Multiple --label flags use AND logic
    // Path prefix vs exact match per PathFilter variant
    //
    // Implementation strategy to preserve ranking order:
    // 1. Accept document_ids as ordered list
    // 2. Build CTE with position
    // 3. JOIN with filters
    // 4. ORDER BY original position
    //
    // Example SQL pattern:
    // ```sql
    // WITH ranked_docs(doc_id, pos) AS (
    //   SELECT column1, ROW_NUMBER() OVER() as pos
    //   FROM (VALUES (?),(?),(?),...) 
    // )
    // SELECT d.id
    // FROM documents d
    // JOIN ranked_docs rd ON d.id = rd.doc_id
    // WHERE d.source_type = ?
    //   AND EXISTS (
    //     SELECT 1 FROM document_labels dl
    //     WHERE dl.document_id = d.id AND dl.label_name = ?
    //   )
    // ORDER BY rd.pos
    // LIMIT ?
    // ```
    todo!()
}

Supported filters:

| Filter | SQL Column | Notes |
| --- | --- | --- |
| --type | source_type | issue, mr, discussion |
| --author | author_username | Exact match |
| --project | project_id | Resolve path to ID |
| --after | created_at | >= date (ms epoch) |
| --label | document_labels | JOIN, multiple = AND |
| --path | document_paths | JOIN, trailing / = prefix |
| --limit | N/A | Default 20, max 100 |
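The filter table above maps to per-filter predicates that are ANDed together. A sketch of the clause/parameter assembly (struct and function names are illustrative; the real apply_filters additionally preserves ranking order via the CTE pattern shown earlier):

```rust
/// Accumulated WHERE fragments and their bind parameters, in order.
struct FilterSql {
    clauses: Vec<String>,
    params: Vec<String>,
}

fn build_filter_sql(
    source_type: Option<&str>,
    labels: &[&str],    // AND logic: one EXISTS subquery per label
    path: Option<&str>, // trailing '/' selects prefix matching
) -> FilterSql {
    let mut clauses = Vec::new();
    let mut params = Vec::new();
    if let Some(t) = source_type {
        clauses.push("d.source_type = ?".to_string());
        params.push(t.to_string());
    }
    for label in labels {
        clauses.push(
            "EXISTS (SELECT 1 FROM document_labels dl \
             WHERE dl.document_id = d.id AND dl.label_name = ?)".to_string(),
        );
        params.push(label.to_string());
    }
    if let Some(p) = path {
        if p.ends_with('/') {
            clauses.push(
                "EXISTS (SELECT 1 FROM document_paths dp \
                 WHERE dp.document_id = d.id AND dp.path LIKE ?)".to_string(),
            );
            params.push(format!("{}%", p));
        } else {
            clauses.push(
                "EXISTS (SELECT 1 FROM document_paths dp \
                 WHERE dp.document_id = d.id AND dp.path = ?)".to_string(),
            );
            params.push(p.to_string());
        }
    }
    FilterSql { clauses, params }
}
```

Joining `clauses` with " AND " yields the dynamic WHERE body; using EXISTS per label is what gives `--label a --label b` its AND semantics.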

Acceptance Criteria:

  • Each filter correctly restricts results
  • Multiple --label flags use AND logic
  • Path prefix vs exact match works correctly
  • Filters compose (all applied together)
  • Ranking order preserved after filtering

3.4 CLI: gi search --mode=lexical

File: src/cli/commands/search.rs

//! Search command - find documents using lexical, semantic, or hybrid search.

use console::style;
use serde::Serialize;

use crate::core::error::Result;
use crate::core::time::ms_to_iso;
use crate::search::{SearchFilters, SearchMode, search_hybrid, HybridResult};
use crate::Config;

/// Search result for display.
#[derive(Debug, Serialize)]
pub struct SearchResultDisplay {
    pub document_id: i64,
    pub source_type: String,
    pub title: Option<String>,
    pub url: Option<String>,
    pub project_path: String,
    pub author: Option<String>,
    pub created_at: String,  // ISO format
    pub updated_at: String,  // ISO format
    pub score: f64,          // Normalized 0-1
    pub snippet: String,     // Context around match
    pub labels: Vec<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub explain: Option<ExplainData>,
}

/// Ranking explanation for --explain flag.
#[derive(Debug, Serialize)]
pub struct ExplainData {
    pub vector_rank: Option<usize>,
    pub fts_rank: Option<usize>,
    pub rrf_score: f64,
}

/// Search results response.
#[derive(Debug, Serialize)]
pub struct SearchResponse {
    pub query: String,
    pub mode: String,
    pub total_results: usize,
    pub results: Vec<SearchResultDisplay>,
    #[serde(skip_serializing_if = "Vec::is_empty")]
    pub warnings: Vec<String>,
}

/// Run search command.
pub fn run_search(
    config: &Config,
    query: &str,
    mode: SearchMode,
    filters: SearchFilters,
    explain: bool,
) -> Result<SearchResponse> {
    // 1. Parse query and filters
    // 2. Execute search based on mode
    // 3. Apply post-retrieval filters
    // 4. Format and return results
    todo!()
}

/// Print human-readable search results.
pub fn print_search_results(response: &SearchResponse, explain: bool) {
    println!(
        "Found {} results ({} search)\n",
        response.total_results,
        response.mode
    );

    for (i, result) in response.results.iter().enumerate() {
        let type_prefix = match result.source_type.as_str() {
            "merge_request" => "MR",
            "issue" => "Issue",
            "discussion" => "Discussion",
            _ => &result.source_type,
        };

        let title = result.title.as_deref().unwrap_or("(untitled)");
        println!(
            "[{}] {} - {} ({})",
            i + 1,
            style(type_prefix).cyan(),
            title,
            format!("{:.2}", result.score)
        );

        if explain {
            if let Some(exp) = &result.explain {
                let vec_str = exp.vector_rank.map(|r| format!("#{}", r)).unwrap_or_else(|| "-".into());
                let fts_str = exp.fts_rank.map(|r| format!("#{}", r)).unwrap_or_else(|| "-".into());
                println!(
                    "    Vector: {}, FTS: {}, RRF: {:.4}",
                    vec_str, fts_str, exp.rrf_score
                );
            }
        }

        if let Some(author) = &result.author {
            println!(
                "    @{} · {} · {}",
                author, &result.created_at[..10], result.project_path
            );
        }

        println!("    \"{}...\"", &result.snippet);

        if let Some(url) = &result.url {
            println!("    {}", style(url).dim());
        }
        println!();
    }
}

/// Print JSON search results for robot mode.
pub fn print_search_results_json(response: &SearchResponse, elapsed_ms: u64) {
    let output = serde_json::json!({
        "ok": true,
        "data": response,
        "meta": {
            "elapsed_ms": elapsed_ms
        }
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration in src/cli/mod.rs:

/// Search subcommand arguments.
#[derive(Args)]
pub struct SearchArgs {
    /// Search query
    query: String,

    /// Search mode
    #[arg(long, default_value = "hybrid")]
    mode: String,  // "hybrid" | "lexical" | "semantic"

    /// Filter by source type
    #[arg(long, value_name = "TYPE")]
    r#type: Option<String>,

    /// Filter by author username
    #[arg(long)]
    author: Option<String>,

    /// Filter by project path
    #[arg(long)]
    project: Option<String>,

    /// Filter by creation date (after)
    #[arg(long)]
    after: Option<String>,

    /// Filter by label (can specify multiple)
    #[arg(long, action = clap::ArgAction::Append)]
    label: Vec<String>,

    /// Filter by file path
    #[arg(long)]
    path: Option<String>,

    /// Maximum results
    #[arg(long, default_value = "20")]
    limit: usize,

    /// Show ranking breakdown
    #[arg(long)]
    explain: bool,
}

Acceptance Criteria:

  • Works without Ollama running
  • All filters functional
  • Human-readable output with snippets
  • JSON output matches schema
  • Empty results show helpful message
  • "No data indexed" message if documents table empty

Phase 4: Embedding Pipeline

4.1 Embedding Module Structure

New module: src/embedding/

src/embedding/
├── mod.rs              # Module exports
├── ollama.rs           # Ollama API client
├── pipeline.rs         # Batch embedding orchestration
└── change_detector.rs  # Detect documents needing re-embedding

File: src/embedding/mod.rs

//! Embedding generation and storage.
//!
//! Uses Ollama for embedding generation and sqlite-vec for storage.

mod change_detector;
mod ollama;
mod pipeline;

pub use change_detector::detect_embedding_changes;
pub use ollama::{OllamaClient, OllamaConfig, check_ollama_health};
pub use pipeline::{embed_documents, EmbedResult};

4.2 Ollama Client

File: src/embedding/ollama.rs

use reqwest::Client;
use serde::{Deserialize, Serialize};

use crate::core::error::{GiError, Result};

/// Ollama client configuration.
#[derive(Debug, Clone)]
pub struct OllamaConfig {
    pub base_url: String,      // "http://localhost:11434"
    pub model: String,         // "nomic-embed-text"
    pub timeout_secs: u64,     // Request timeout
}

impl Default for OllamaConfig {
    fn default() -> Self {
        Self {
            base_url: "http://localhost:11434".into(),
            model: "nomic-embed-text".into(),
            timeout_secs: 60,
        }
    }
}

/// Ollama API client.
pub struct OllamaClient {
    client: Client,
    config: OllamaConfig,
}

/// Batch embed request.
#[derive(Serialize)]
struct EmbedRequest {
    model: String,
    input: Vec<String>,
}

/// Batch embed response.
#[derive(Deserialize)]
struct EmbedResponse {
    model: String,
    embeddings: Vec<Vec<f32>>,
}

/// Model info from /api/tags.
#[derive(Deserialize)]
struct TagsResponse {
    models: Vec<ModelInfo>,
}

#[derive(Deserialize)]
struct ModelInfo {
    name: String,
}

impl OllamaClient {
    pub fn new(config: OllamaConfig) -> Self {
        let client = Client::builder()
            .timeout(std::time::Duration::from_secs(config.timeout_secs))
            .build()
            .expect("Failed to create HTTP client");

        Self { client, config }
    }

    /// Check if Ollama is available and model is loaded.
    pub async fn health_check(&self) -> Result<()> {
        let url = format!("{}/api/tags", self.config.base_url);

        let response = self.client.get(&url).send().await.map_err(|e| {
            GiError::OllamaUnavailable {
                base_url: self.config.base_url.clone(),
                source: Some(e),
            }
        })?;

        let tags: TagsResponse = response.json().await?;

        let model_available = tags.models.iter().any(|m| m.name.starts_with(&self.config.model));

        if !model_available {
            return Err(GiError::OllamaModelNotFound {
                model: self.config.model.clone(),
            });
        }

        Ok(())
    }

    /// Generate embeddings for a batch of texts.
    ///
    /// Returns 768-dimensional vectors for each input text.
    pub async fn embed_batch(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
        let url = format!("{}/api/embed", self.config.base_url);

        let request = EmbedRequest {
            model: self.config.model.clone(),
            input: texts,
        };

        let response = self.client
            .post(&url)
            .json(&request)
            .send()
            .await
            .map_err(|e| GiError::OllamaUnavailable {
                base_url: self.config.base_url.clone(),
                source: Some(e),
            })?;

        if !response.status().is_success() {
            let status = response.status();
            let body = response.text().await.unwrap_or_default();
            return Err(GiError::EmbeddingFailed {
                document_id: 0, // Batch failure
                reason: format!("HTTP {}: {}", status, body),
            });
        }

        let embed_response: EmbedResponse = response.json().await?;
        Ok(embed_response.embeddings)
    }
}

/// Quick health check without full client.
pub async fn check_ollama_health(base_url: &str) -> bool {
    let client = Client::new();
    client
        .get(format!("{}/api/tags", base_url))
        .send()
        .await
        .is_ok()
}

Endpoints:

| Endpoint | Purpose |
| --- | --- |
| GET /api/tags | Health check, verify model available |
| POST /api/embed | Batch embedding (preferred) |

Acceptance Criteria:

  • Health check detects Ollama availability
  • Batch embedding works with up to 32 texts
  • Clear error messages for common failures
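The 32-text batch limit implies splitting pending documents into aligned id/text batches, since /api/embed returns one vector per input in request order. A small sketch (names illustrative):

```rust
/// Split (document_id, content_text) pairs into embed batches of at most
/// batch_size texts, keeping ids aligned with the returned embedding order.
fn into_batches(docs: &[(i64, String)], batch_size: usize) -> Vec<(Vec<i64>, Vec<String>)> {
    docs.chunks(batch_size)
        .map(|chunk| {
            let ids = chunk.iter().map(|(id, _)| *id).collect();
            let texts = chunk.iter().map(|(_, t)| t.clone()).collect();
            (ids, texts)
        })
        .collect()
}
```

Each (ids, texts) pair feeds one embed_batch call; zipping ids with the returned vectors gives the rows for the embeddings table.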

4.3 Error Handling Extensions

File: src/core/error.rs (extend existing)

Add to ErrorCode:

pub enum ErrorCode {
    // ... existing variants ...
    InvalidEnumValue,
    OllamaUnavailable,
    OllamaModelNotFound,
    EmbeddingFailed,
}

impl ErrorCode {
    pub fn exit_code(&self) -> i32 {
        match self {
            // ... existing mappings ...
            Self::InvalidEnumValue => 13,
            Self::OllamaUnavailable => 14,
            Self::OllamaModelNotFound => 15,
            Self::EmbeddingFailed => 16,
        }
    }
}

impl std::fmt::Display for ErrorCode {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        let code = match self {
            // ... existing mappings ...
            Self::InvalidEnumValue => "INVALID_ENUM_VALUE",
            Self::OllamaUnavailable => "OLLAMA_UNAVAILABLE",
            Self::OllamaModelNotFound => "OLLAMA_MODEL_NOT_FOUND",
            Self::EmbeddingFailed => "EMBEDDING_FAILED",
        };
        write!(f, "{code}")
    }
}

Add to GiError:

pub enum GiError {
    // ... existing variants ...

    #[error("Cannot connect to Ollama at {base_url}. Is it running?")]
    OllamaUnavailable {
        base_url: String,
        #[source]
        source: Option<reqwest::Error>,
    },

    #[error("Ollama model '{model}' not found. Run: ollama pull {model}")]
    OllamaModelNotFound { model: String },

    #[error("Embedding failed for document {document_id}: {reason}")]
    EmbeddingFailed { document_id: i64, reason: String },
}

impl GiError {
    pub fn code(&self) -> ErrorCode {
        match self {
            // ... existing mappings ...
            Self::OllamaUnavailable { .. } => ErrorCode::OllamaUnavailable,
            Self::OllamaModelNotFound { .. } => ErrorCode::OllamaModelNotFound,
            Self::EmbeddingFailed { .. } => ErrorCode::EmbeddingFailed,
        }
    }

    pub fn suggestion(&self) -> Option<&'static str> {
        match self {
            // ... existing mappings ...
            Self::OllamaUnavailable { .. } => Some("Start Ollama: ollama serve"),
            Self::OllamaModelNotFound { .. } => Some("Pull the model: ollama pull nomic-embed-text"),
            Self::EmbeddingFailed { .. } => Some("Check Ollama logs or retry with 'gi embed --retry-failed'"),
        }
    }
}

4.4 Embedding Pipeline

File: src/embedding/pipeline.rs

use rusqlite::Connection;

use crate::core::error::Result;
use crate::embedding::OllamaClient;

/// Batch size for embedding requests.
const BATCH_SIZE: usize = 32;

/// SQLite page size for paging through pending documents.
const DB_PAGE_SIZE: usize = 500;

/// Which documents to embed.
#[derive(Debug, Clone, Copy)]
pub enum EmbedSelection {
    /// New or changed documents (default).
    Pending,
    /// Only previously failed documents.
    RetryFailed,
}

/// Result of embedding run.
#[derive(Debug, Default)]
pub struct EmbedResult {
    pub embedded: usize,
    pub failed: usize,
    pub skipped: usize,
}

/// Embed documents that need embedding.
///
/// Process:
/// 1. Page through documents needing embedding (DB_PAGE_SIZE at a time)
/// 2. Batch documents (32 per Ollama request)
/// 3. Fire concurrent HTTP requests via FuturesUnordered (capped by concurrency)
/// 4. Collect results and write to SQLite sequentially (rusqlite is !Send)
/// 5. On failure: record error with actual content_hash, continue with next batch
///
/// Architecture note: rusqlite::Connection is !Send, so all DB reads/writes
/// happen on the main thread. Only HTTP calls are concurrent.
pub async fn embed_documents(
    conn: &Connection,
    client: &OllamaClient,
    selection: EmbedSelection,
    concurrency: usize,
    progress_callback: Option<Box<dyn Fn(usize, usize)>>,
) -> Result<EmbedResult> {
    use futures::stream::{FuturesUnordered, StreamExt};

    let mut result = EmbedResult::default();
    let total_pending = count_pending_documents(conn)?;

    if matches!(selection, EmbedSelection::Pending) && total_pending == 0 {
        return Ok(result);
    }

    // Page through pending documents to avoid loading all into memory
    loop {
        let pending = find_pending_documents(conn, DB_PAGE_SIZE, selection)?;
        if pending.is_empty() {
            break;
        }

        // Launch concurrent HTTP requests, collect results
        let mut futures = FuturesUnordered::new();

        for batch in pending.chunks(BATCH_SIZE) {
            let texts: Vec<String> = batch.iter().map(|d| d.content.clone()).collect();
            let batch_meta: Vec<(i64, String)> = batch
                .iter()
                .map(|d| (d.id, d.content_hash.clone()))
                .collect();

            futures.push(async move {
                let embed_result = client.embed_batch(texts).await;
                (batch_meta, embed_result)
            });

            // Cap in-flight requests
            if futures.len() >= concurrency {
                if let Some((meta, res)) = futures.next().await {
                    collect_writes(conn, &meta, res, &mut result)?;
                }
            }
        }

        // Drain remaining futures
        while let Some((meta, res)) = futures.next().await {
            collect_writes(conn, &meta, res, &mut result)?;
        }

        if let Some(ref cb) = progress_callback {
            cb(result.embedded + result.failed, total_pending);
        }

        // RetryFailed: stop after one page; documents that fail again keep
        // last_error set and would otherwise be re-selected forever.
        if matches!(selection, EmbedSelection::RetryFailed) {
            break;
        }
    }

    Ok(result)
}

/// Collect embedding results and write to DB (sequential, on main thread).
fn collect_writes(
    conn: &Connection,
    batch_meta: &[(i64, String)],
    embed_result: Result<Vec<Vec<f32>>>,
    result: &mut EmbedResult,
) -> Result<()> {
    // Connection::transaction requires &mut self; unchecked_transaction works
    // through &Connection and is safe here because all writes are single-threaded.
    let tx = conn.unchecked_transaction()?;
    match embed_result {
        Ok(embeddings) => {
            for ((doc_id, hash), embedding) in batch_meta.iter().zip(embeddings.iter()) {
                store_embedding(&tx, *doc_id, embedding, hash)?;
                result.embedded += 1;
            }
        }
        Err(e) => {
            for (doc_id, hash) in batch_meta {
                record_embedding_error(&tx, *doc_id, hash, &e.to_string())?;
                result.failed += 1;
            }
        }
    }
    tx.commit()?;
    Ok(())
}

struct PendingDocument {
    id: i64,
    content: String,
    content_hash: String,
}

/// Count total pending documents (for progress reporting).
fn count_pending_documents(conn: &Connection) -> Result<usize> {
    let count: usize = conn.query_row(
        "SELECT COUNT(*)
         FROM documents d
         LEFT JOIN embedding_metadata em ON d.id = em.document_id
         WHERE em.document_id IS NULL
            OR em.content_hash != d.content_hash",
        [],
        |row| row.get(0),
    )?;
    Ok(count)
}

fn find_pending_documents(
    conn: &Connection,
    limit: usize,
    selection: EmbedSelection,
) -> Result<Vec<PendingDocument>> {
    let sql = match selection {
        EmbedSelection::Pending =>
            "SELECT d.id, d.content_text, d.content_hash
             FROM documents d
             LEFT JOIN embedding_metadata em ON d.id = em.document_id
             WHERE em.document_id IS NULL
                OR em.content_hash != d.content_hash
             LIMIT ?",
        EmbedSelection::RetryFailed =>
            "SELECT d.id, d.content_text, d.content_hash
             FROM documents d
             JOIN embedding_metadata em ON d.id = em.document_id
             WHERE em.last_error IS NOT NULL
             LIMIT ?",
    };
    let mut stmt = conn.prepare(sql)?;

    let docs = stmt
        .query_map([limit], |row| {
            Ok(PendingDocument {
                id: row.get(0)?,
                content: row.get(1)?,
                content_hash: row.get(2)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(docs)
}

fn store_embedding(
    conn: &Connection,
    document_id: i64,
    embedding: &[f32],
    content_hash: &str,
) -> Result<()> {
    // Convert embedding to bytes for sqlite-vec
    // sqlite-vec expects raw little-endian bytes, not the array directly
    let embedding_bytes: Vec<u8> = embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    // Store in sqlite-vec (rowid = document_id)
    conn.execute(
        "INSERT OR REPLACE INTO embeddings(rowid, embedding) VALUES (?, ?)",
        rusqlite::params![document_id, embedding_bytes],
    )?;

    // Update metadata
    let now = crate::core::time::now_ms();
    conn.execute(
        "INSERT OR REPLACE INTO embedding_metadata
         (document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
         VALUES (?, 'nomic-embed-text', 768, ?, ?, NULL, 0, ?)",
        rusqlite::params![document_id, content_hash, now, now],
    )?;

    Ok(())
}

fn record_embedding_error(
    conn: &Connection,
    document_id: i64,
    content_hash: &str,
    error: &str,
) -> Result<()> {
    let now = crate::core::time::now_ms();
    conn.execute(
        "INSERT INTO embedding_metadata
         (document_id, model, dims, content_hash, created_at, last_error, attempt_count, last_attempt_at)
         VALUES (?, 'nomic-embed-text', 768, ?, ?, ?, 1, ?)
         ON CONFLICT(document_id) DO UPDATE SET
           last_error = excluded.last_error,
           attempt_count = attempt_count + 1,
           last_attempt_at = excluded.last_attempt_at",
        rusqlite::params![document_id, content_hash, now, error, now],
    )?;

    Ok(())
}

Acceptance Criteria:

  • New documents get embedded
  • Changed documents (hash mismatch) get re-embedded
  • Unchanged documents skipped
  • Failures recorded in embedding_metadata.last_error
  • Failures record actual content_hash (not empty string)
  • Writes batched in transactions for performance
  • Concurrency parameter respected
  • Progress reported during embedding
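
The re-embed decision above reduces to one predicate on the stored hash. A minimal std-only sketch; `needs_embedding` is an illustrative helper mirroring the SQL `LEFT JOIN` condition, not part of the codebase:

```rust
/// Mirrors the pending-documents SQL: embed when there is no metadata row
/// (None) or when the recorded hash no longer matches the document's hash.
fn needs_embedding(doc_hash: &str, stored_hash: Option<&str>) -> bool {
    match stored_hash {
        None => true,             // never embedded
        Some(h) => h != doc_hash, // content changed since last embed
    }
}

fn main() {
    assert!(needs_embedding("abc123", None));           // new document
    assert!(needs_embedding("abc123", Some("def456"))); // changed document
    assert!(!needs_embedding("abc123", Some("abc123"))); // unchanged: skip
    println!("hash gating ok");
}
```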

4.5 CLI: gi embed

File: src/cli/commands/embed.rs

//! Embed command - generate embeddings for documents.

use indicatif::{ProgressBar, ProgressStyle};
use rusqlite::Connection;
use serde::Serialize;

use crate::core::error::Result;
use crate::embedding::{embed_documents, EmbedResult, EmbedSelection, OllamaClient, OllamaConfig};
use crate::Config;

/// Run embedding command.
pub async fn run_embed(
    config: &Config,
    conn: &Connection,
    retry_failed: bool,
) -> Result<EmbedResult> {
    let ollama_config = OllamaConfig {
        base_url: config.embedding.base_url.clone(),
        model: config.embedding.model.clone(),
        timeout_secs: 120,
    };

    let client = OllamaClient::new(ollama_config);

    // Health check
    client.health_check().await?;

    // --retry-failed selects only previously failed documents
    let selection = if retry_failed {
        EmbedSelection::RetryFailed
    } else {
        EmbedSelection::Pending
    };

    // Run embedding
    let result = embed_documents(
        conn,
        &client,
        selection,
        config.embedding.concurrency as usize,
        None,
    )
    .await?;

    Ok(result)
}

/// Print human-readable output.
pub fn print_embed(result: &EmbedResult, elapsed_secs: u64) {
    println!("Embedding complete:");
    println!("  Embedded: {:>6} documents", result.embedded);
    println!("  Failed:   {:>6} documents", result.failed);
    println!("  Skipped:  {:>6} documents", result.skipped);
    println!("  Elapsed:  {}m {}s", elapsed_secs / 60, elapsed_secs % 60);
}

/// Print JSON output for robot mode.
pub fn print_embed_json(result: &EmbedResult, elapsed_ms: u64) {
    let output = serde_json::json!({
        "ok": true,
        "data": {
            "embedded": result.embedded,
            "failed": result.failed,
            "skipped": result.skipped
        },
        "meta": {
            "elapsed_ms": elapsed_ms
        }
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration:

/// Embed subcommand arguments.
#[derive(Args)]
pub struct EmbedArgs {
    /// Retry only previously failed documents
    #[arg(long)]
    retry_failed: bool,
}

Acceptance Criteria:

  • Embeds documents without embeddings
  • Re-embeds documents with changed hash
  • --retry-failed only processes failed documents
  • Progress bar with count
  • Clear error if Ollama unavailable

4.6 CLI: gi stats

File: src/cli/commands/stats.rs

//! Stats command - display document and embedding statistics.

use rusqlite::Connection;
use serde::Serialize;

use crate::core::error::Result;
use crate::Config;

/// Document statistics.
#[derive(Debug, Serialize)]
pub struct Stats {
    pub documents: DocumentStats,
    pub embeddings: EmbeddingStats,
    pub fts: FtsStats,
}

#[derive(Debug, Serialize)]
pub struct DocumentStats {
    pub issues: usize,
    pub mrs: usize,
    pub discussions: usize,
    pub total: usize,
    pub truncated: usize,
}

#[derive(Debug, Serialize)]
pub struct EmbeddingStats {
    pub embedded: usize,
    pub pending: usize,
    pub failed: usize,
    pub coverage_pct: f64,
}

#[derive(Debug, Serialize)]
pub struct FtsStats {
    pub indexed: usize,
}

/// Integrity check result.
#[derive(Debug, Serialize)]
pub struct IntegrityCheck {
    pub documents_count: usize,
    pub fts_count: usize,
    pub embeddings_count: usize,
    pub metadata_count: usize,
    pub orphaned_embeddings: usize,
    pub hash_mismatches: usize,
    pub ok: bool,
}

/// Run stats command.
pub fn run_stats(config: &Config) -> Result<Stats> {
    // Query counts from database
    todo!()
}

/// Run integrity check (--check flag).
///
/// Verifies:
/// - documents count == documents_fts count
/// - embeddings.rowid all exist in documents.id
/// - embedding_metadata.content_hash == documents.content_hash
pub fn run_integrity_check(config: &Config) -> Result<IntegrityCheck> {
    // 1. Count documents
    // 2. Count FTS entries
    // 3. Find orphaned embeddings (no matching document)
    // 4. Find hash mismatches between embedding_metadata and documents
    // 5. Return check results
    todo!()
}

/// Print human-readable stats.
pub fn print_stats(stats: &Stats) {
    println!("Document Statistics:");
    println!("  Issues:      {:>6} documents", stats.documents.issues);
    println!("  MRs:         {:>6} documents", stats.documents.mrs);
    println!("  Discussions: {:>6} documents", stats.documents.discussions);
    println!("  Total:       {:>6} documents", stats.documents.total);
    if stats.documents.truncated > 0 {
        println!("  Truncated:   {:>6}", stats.documents.truncated);
    }
    println!();
    println!("Embedding Coverage:");
    println!("  Embedded: {:>6} ({:.1}%)", stats.embeddings.embedded, stats.embeddings.coverage_pct);
    println!("  Pending:  {:>6}", stats.embeddings.pending);
    println!("  Failed:   {:>6}", stats.embeddings.failed);
    println!();
    println!("FTS Index:");
    println!("  Indexed:  {:>6} documents", stats.fts.indexed);
}

/// Print integrity check results.
pub fn print_integrity_check(check: &IntegrityCheck) {
    println!("Integrity Check:");
    println!("  Documents:      {:>6}", check.documents_count);
    println!("  FTS entries:    {:>6}", check.fts_count);
    println!("  Embeddings:     {:>6}", check.embeddings_count);
    println!("  Metadata:       {:>6}", check.metadata_count);
    if check.orphaned_embeddings > 0 {
        println!("  Orphaned embeddings: {:>6} (WARN)", check.orphaned_embeddings);
    }
    if check.hash_mismatches > 0 {
        println!("  Hash mismatches:     {:>6} (WARN)", check.hash_mismatches);
    }
    println!();
    println!("  Status: {}", if check.ok { "OK" } else { "ISSUES FOUND" });
}

/// Print JSON stats for robot mode.
pub fn print_stats_json(stats: &Stats) {
    let output = serde_json::json!({
        "ok": true,
        "data": stats
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration:

/// Stats subcommand arguments.
#[derive(Args)]
pub struct StatsArgs {
    /// Run integrity checks (document/FTS/embedding consistency)
    #[arg(long)]
    check: bool,
}

Acceptance Criteria:

  • Shows document counts by type
  • Shows embedding coverage
  • Shows FTS index count
  • Identifies truncated documents
  • --check verifies document/FTS/embedding consistency
  • JSON output for scripting
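
The `coverage_pct` field is embedded documents over total documents; a sketch with a divide-by-zero guard for an empty database (`coverage_pct` here is an illustrative free function, not the struct field):

```rust
/// Embedding coverage as a percentage; empty DB reports 0% rather than NaN.
fn coverage_pct(embedded: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        embedded as f64 / total as f64 * 100.0
    }
}

fn main() {
    assert_eq!(coverage_pct(0, 0), 0.0);
    assert_eq!(coverage_pct(50, 200), 25.0);
    println!("{:.1}%", coverage_pct(768, 1024));
}
```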

5.1 Vector Search Function

File: src/search/vector.rs

use rusqlite::Connection;
use crate::core::error::Result;

/// Vector search result.
#[derive(Debug, Clone)]
pub struct VectorResult {
    pub document_id: i64,
    pub distance: f64,  // Lower = more similar
}

/// Search documents using vector similarity.
///
/// Uses sqlite-vec for efficient vector search.
/// Returns document IDs sorted by distance (lower = better match).
///
/// IMPORTANT: sqlite-vec KNN queries require:
/// - k parameter for number of results
/// - embedding passed as raw little-endian bytes
pub fn search_vector(
    conn: &Connection,
    query_embedding: &[f32],
    limit: usize,
) -> Result<Vec<VectorResult>> {
    // Convert embedding to bytes for sqlite-vec
    let embedding_bytes: Vec<u8> = query_embedding
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();

    let mut stmt = conn.prepare(
        "SELECT rowid, distance
         FROM embeddings
         WHERE embedding MATCH ? AND k = ?
         ORDER BY distance
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![embedding_bytes, limit, limit], |row| {
            Ok(VectorResult {
                document_id: row.get(0)?,
                distance: row.get(1)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

Acceptance Criteria:

  • Returns document IDs with distances
  • Lower distance = better match
  • Works with 768-dim vectors
  • Uses k parameter for KNN query
  • Embedding passed as bytes
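
The raw little-endian encoding used both when storing embeddings and when querying round-trips cleanly; a std-only sketch (helper names are illustrative):

```rust
/// Encode an f32 slice as the raw little-endian byte blob sqlite-vec expects.
fn to_le_bytes(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

/// Decode back: 4 bytes per f32 dimension.
fn from_le_bytes(b: &[u8]) -> Vec<f32> {
    b.chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

fn main() {
    let embedding = vec![0.25f32, -1.5, 3.0];
    let bytes = to_le_bytes(&embedding);
    assert_eq!(bytes.len(), embedding.len() * 4); // 4 bytes per dimension
    assert_eq!(from_le_bytes(&bytes), embedding); // lossless round-trip
    println!("round-trip ok: {} bytes", bytes.len());
}
```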

5.2 RRF Ranking

File: src/search/rrf.rs

use std::collections::HashMap;

/// RRF ranking constant.
const RRF_K: f64 = 60.0;

/// RRF-ranked result.
#[derive(Debug, Clone)]
pub struct RrfResult {
    pub document_id: i64,
    pub rrf_score: f64,         // Raw RRF score
    pub normalized_score: f64,  // Normalized to 0-1
    pub vector_rank: Option<usize>,
    pub fts_rank: Option<usize>,
}

/// Rank documents using Reciprocal Rank Fusion.
///
/// Algorithm:
/// RRF_score(d) = Σ 1 / (k + rank_i(d))
///
/// Where:
/// - k = 60 (tunable constant)
/// - rank_i(d) = rank of document d in retriever i (1-indexed)
/// - Sum over all retrievers where document appears
pub fn rank_rrf(
    vector_results: &[(i64, f64)],  // (doc_id, distance)
    fts_results: &[(i64, f64)],     // (doc_id, bm25_score)
) -> Vec<RrfResult> {
    let mut scores: HashMap<i64, (f64, Option<usize>, Option<usize>)> = HashMap::new();

    // Add vector results (1-indexed ranks)
    for (rank, (doc_id, _)) in vector_results.iter().enumerate() {
        let rrf_contribution = 1.0 / (RRF_K + (rank + 1) as f64);
        let entry = scores.entry(*doc_id).or_insert((0.0, None, None));
        entry.0 += rrf_contribution;
        entry.1 = Some(rank + 1);
    }

    // Add FTS results (1-indexed ranks)
    for (rank, (doc_id, _)) in fts_results.iter().enumerate() {
        let rrf_contribution = 1.0 / (RRF_K + (rank + 1) as f64);
        let entry = scores.entry(*doc_id).or_insert((0.0, None, None));
        entry.0 += rrf_contribution;
        entry.2 = Some(rank + 1);
    }

    // Convert to results and sort by RRF score descending
    let mut results: Vec<_> = scores
        .into_iter()
        .map(|(doc_id, (rrf_score, vector_rank, fts_rank))| {
            RrfResult {
                document_id: doc_id,
                rrf_score,
                normalized_score: 0.0, // Will be set below
                vector_rank,
                fts_rank,
            }
        })
        .collect();

    results.sort_by(|a, b| b.rrf_score.partial_cmp(&a.rrf_score).unwrap());

    // Normalize scores to 0-1
    if let Some(max_score) = results.first().map(|r| r.rrf_score) {
        for result in &mut results {
            result.normalized_score = result.rrf_score / max_score;
        }
    }

    results
}

Acceptance Criteria:

  • Documents in both lists score higher
  • Documents in one list still included
  • Normalized score = rrfScore / max(rrfScore)
  • Raw RRF score available in --explain output
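
A worked toy case of the formula (k = 60, 1-indexed ranks) shows why a document appearing in both lists beats one that tops a single list; `rrf` is an illustrative stand-in for the per-retriever contribution:

```rust
const K: f64 = 60.0;

/// Contribution of a single retriever for a 1-indexed rank: 1 / (k + rank).
fn rrf(rank: usize) -> f64 {
    1.0 / (K + rank as f64)
}

fn main() {
    // Doc A: rank 3 in vector results AND rank 2 in FTS results.
    let doc_a = rrf(3) + rrf(2); // 1/63 + 1/62 ≈ 0.0320
    // Doc B: rank 1 in vector results only.
    let doc_b = rrf(1); // 1/61 ≈ 0.0164

    // Appearing in both lists outscores topping a single list.
    assert!(doc_a > doc_b);

    // Normalization divides by the max score, so the top doc scores 1.0.
    let max = doc_a.max(doc_b);
    assert_eq!(doc_a / max, 1.0);
    println!("A = {:.5}, B = {:.5}", doc_a, doc_b);
}
```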

5.3 Adaptive Recall

File: src/search/hybrid.rs

use rusqlite::Connection;

use crate::core::error::Result;
use crate::embedding::OllamaClient;
use crate::search::{SearchFilters, search_fts, search_vector, rank_rrf, RrfResult};

/// Base recall for unfiltered search.
const BASE_RECALL: usize = 50;

/// Expanded recall when filters are applied.
const FILTERED_RECALL: usize = 200;

/// Search mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SearchMode {
    Hybrid,    // Vector + FTS with RRF
    Lexical,   // FTS only
    Semantic,  // Vector only
}

impl SearchMode {
    pub fn from_str(s: &str) -> Option<Self> {
        match s.to_lowercase().as_str() {
            "hybrid" => Some(Self::Hybrid),
            "lexical" | "fts" => Some(Self::Lexical),
            "semantic" | "vector" => Some(Self::Semantic),
            _ => None,
        }
    }

    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Hybrid => "hybrid",
            Self::Lexical => "lexical",
            Self::Semantic => "semantic",
        }
    }
}

/// Hybrid search result.
#[derive(Debug)]
pub struct HybridResult {
    pub document_id: i64,
    pub score: f64,
    pub vector_rank: Option<usize>,
    pub fts_rank: Option<usize>,
    pub rrf_score: f64,
}

/// Execute hybrid search.
///
/// Adaptive recall: expands topK when filters are applied to prevent
/// "no results" when relevant docs exist but would be filtered out.
///
/// IMPORTANT: All modes use RRF consistently to ensure rank fields
/// are populated correctly for --explain output.
pub async fn search_hybrid(
    conn: &Connection,
    client: Option<&OllamaClient>,
    query: &str,
    mode: SearchMode,
    filters: &SearchFilters,
) -> Result<(Vec<HybridResult>, Vec<String>)> {
    let mut warnings: Vec<String> = Vec::new();
    // Determine recall based on filters
    let top_k = if filters.has_any_filter() {
        FILTERED_RECALL
    } else {
        BASE_RECALL
    };

    match mode {
        SearchMode::Lexical => {
            // FTS only - use RRF with empty vector results for consistent ranking
            let fts_results = search_fts(conn, query, top_k)?;

            let fts_tuples: Vec<_> = fts_results.iter().map(|r| (r.document_id, r.rank)).collect();
            let ranked = rank_rrf(&[], &fts_tuples);

            let results = ranked
                .into_iter()
                .map(|r| HybridResult {
                    document_id: r.document_id,
                    score: r.normalized_score,
                    vector_rank: r.vector_rank,
                    fts_rank: r.fts_rank,
                    rrf_score: r.rrf_score,
                })
                .collect();
            Ok((results, warnings))
        }
        SearchMode::Semantic => {
            // Vector only - requires client
            let client = client.ok_or_else(|| crate::core::error::GiError::OllamaUnavailable {
                base_url: "unknown".into(),
                source: None,
            })?;

            let query_embedding = client.embed_batch(vec![query.to_string()]).await?;
            let embedding = query_embedding.into_iter().next().unwrap();

            let vec_results = search_vector(conn, &embedding, top_k)?;

            // Use RRF with empty FTS results for consistent ranking
            let vec_tuples: Vec<_> = vec_results.iter().map(|r| (r.document_id, r.distance)).collect();
            let ranked = rank_rrf(&vec_tuples, &[]);

            let results = ranked
                .into_iter()
                .map(|r| HybridResult {
                    document_id: r.document_id,
                    score: r.normalized_score,
                    vector_rank: r.vector_rank,
                    fts_rank: r.fts_rank,
                    rrf_score: r.rrf_score,
                })
                .collect();
            Ok((results, warnings))
        }
        SearchMode::Hybrid => {
            // Both retrievers with RRF fusion
            let fts_results = search_fts(conn, query, top_k)?;

            let vec_results = if let Some(client) = client {
                let query_embedding = client.embed_batch(vec![query.to_string()]).await?;
                let embedding = query_embedding.into_iter().next().unwrap();
                search_vector(conn, &embedding, top_k)?
            } else {
                // Graceful degradation: use FTS only
                warnings.push("Embedding service unavailable, using lexical search only".into());
                Vec::new()
            };

            // RRF fusion
            let vec_tuples: Vec<_> = vec_results.iter().map(|r| (r.document_id, r.distance)).collect();
            let fts_tuples: Vec<_> = fts_results.iter().map(|r| (r.document_id, r.rank)).collect();

            let ranked = rank_rrf(&vec_tuples, &fts_tuples);

            let results = ranked
                .into_iter()
                .map(|r| HybridResult {
                    document_id: r.document_id,
                    score: r.normalized_score,
                    vector_rank: r.vector_rank,
                    fts_rank: r.fts_rank,
                    rrf_score: r.rrf_score,
                })
                .collect();
            Ok((results, warnings))
        }
    }
}

Acceptance Criteria:

  • Unfiltered search uses topK=50
  • Any filter triggers topK=200
  • Final results still limited by --limit

5.4 Graceful Degradation

When Ollama is unavailable during hybrid or semantic search:

  1. Log warning: "Embedding service unavailable, using lexical search only"
  2. Fall back to FTS-only search
  3. Include warning in response

Acceptance Criteria:

  • Default mode is hybrid
  • --mode=lexical works without Ollama
  • --mode=semantic requires Ollama
  • Graceful degradation when Ollama down
  • --explain shows rank breakdown
  • All Phase 3 filters work in hybrid mode

Phase 6: Sync Orchestration

6.1 Dirty Source Tracking

File: src/ingestion/dirty_tracker.rs

use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;
use crate::documents::SourceType;

/// Maximum dirty sources to process per sync run.
const MAX_DIRTY_SOURCES_PER_RUN: usize = 500;

/// Mark a source as dirty (needs document regeneration).
///
/// Called during entity upsert operations.
/// Uses INSERT OR IGNORE to avoid duplicates.
pub fn mark_dirty(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<()> {
    conn.execute(
        "INSERT OR IGNORE INTO dirty_sources (source_type, source_id, queued_at)
         VALUES (?, ?, ?)",
        rusqlite::params![source_type.as_str(), source_id, now_ms()],
    )?;
    Ok(())
}

/// Get dirty sources ordered by queue time (bounded).
///
/// Limits results to prevent unbounded processing during large syncs.
pub fn get_dirty_sources(conn: &Connection) -> Result<Vec<(SourceType, i64)>> {
    let mut stmt = conn.prepare(
        "SELECT source_type, source_id FROM dirty_sources ORDER BY queued_at LIMIT ?"
    )?;

    let results = stmt
        .query_map([MAX_DIRTY_SOURCES_PER_RUN], |row| {
            let type_str: String = row.get(0)?;
            let source_type = match type_str.as_str() {
                "issue" => SourceType::Issue,
                "merge_request" => SourceType::MergeRequest,
                "discussion" => SourceType::Discussion,
                other => return Err(rusqlite::Error::FromSqlConversionFailure(
                    0,
                    rusqlite::types::Type::Text,
                    Box::new(std::io::Error::new(
                        std::io::ErrorKind::InvalidData,
                        format!("invalid source_type: {other}"),
                    )),
                )),
            };
            Ok((source_type, row.get(1)?))
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

/// Clear dirty source after processing.
pub fn clear_dirty(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<()> {
    conn.execute(
        "DELETE FROM dirty_sources WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
    )?;
    Ok(())
}

Acceptance Criteria:

  • Upserted entities added to dirty_sources
  • Duplicates ignored
  • Queue cleared after document regeneration
  • Processing bounded per run (max 500)
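
The INSERT OR IGNORE semantics behave like set insertion keyed on (source_type, source_id); a minimal in-memory analogue (the `DirtyQueue` type is illustrative, not in the codebase):

```rust
use std::collections::HashSet;

/// In-memory analogue of the dirty_sources table keyed on
/// (source_type, source_id); INSERT OR IGNORE == HashSet::insert.
struct DirtyQueue(HashSet<(String, i64)>);

impl DirtyQueue {
    /// Returns true if the entry was newly queued (false = duplicate ignored).
    fn mark_dirty(&mut self, source_type: &str, source_id: i64) -> bool {
        self.0.insert((source_type.to_string(), source_id))
    }
    fn clear_dirty(&mut self, source_type: &str, source_id: i64) {
        self.0.remove(&(source_type.to_string(), source_id));
    }
}

fn main() {
    let mut q = DirtyQueue(HashSet::new());
    assert!(q.mark_dirty("issue", 42));  // first mark inserts
    assert!(!q.mark_dirty("issue", 42)); // duplicate ignored
    q.clear_dirty("issue", 42);          // cleared after regeneration
    assert!(q.mark_dirty("issue", 42));  // re-dirtied by a later upsert
    println!("queue semantics ok");
}
```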

6.2 Pending Discussion Queue

File: src/ingestion/discussion_queue.rs

use rusqlite::Connection;
use crate::core::error::Result;
use crate::core::time::now_ms;

/// Noteable type for discussion fetching.
#[derive(Debug, Clone, Copy)]
pub enum NoteableType {
    Issue,
    MergeRequest,
}

impl NoteableType {
    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Issue => "Issue",
            Self::MergeRequest => "MergeRequest",
        }
    }
}

/// Pending discussion fetch entry.
pub struct PendingFetch {
    pub project_id: i64,
    pub noteable_type: NoteableType,
    pub noteable_iid: i64,
    pub attempt_count: i64,
}

/// Queue a discussion fetch for an entity.
pub fn queue_discussion_fetch(
    conn: &Connection,
    project_id: i64,
    noteable_type: NoteableType,
    noteable_iid: i64,
) -> Result<()> {
    conn.execute(
        "INSERT OR REPLACE INTO pending_discussion_fetches
         (project_id, noteable_type, noteable_iid, queued_at, attempt_count, last_attempt_at, last_error)
         VALUES (?, ?, ?, ?, 0, NULL, NULL)",
        rusqlite::params![project_id, noteable_type.as_str(), noteable_iid, now_ms()],
    )?;
    Ok(())
}

/// Get pending fetches with exponential backoff.
///
/// Only returns items that have waited long enough based on attempt_count.
/// Backoff formula: min_wait_ms = 1000 * 2^attempt_count (capped at 1 hour)
///
/// Limited to `max_items` to bound API calls per sync run.
pub fn get_pending_fetches(conn: &Connection, max_items: usize) -> Result<Vec<PendingFetch>> {
    let now = now_ms();

    let mut stmt = conn.prepare(
        "SELECT project_id, noteable_type, noteable_iid, attempt_count
         FROM pending_discussion_fetches
         WHERE last_attempt_at IS NULL
            OR (? - last_attempt_at) > MIN(3600000, 1000 * (1 << attempt_count))
         ORDER BY attempt_count ASC, queued_at ASC
         LIMIT ?"
    )?;

    let results = stmt
        .query_map(rusqlite::params![now, max_items], |row| {
            let type_str: String = row.get(1)?;
            let noteable_type = match type_str.as_str() {
                "Issue" => NoteableType::Issue,
                "MergeRequest" => NoteableType::MergeRequest,
                other => return Err(rusqlite::Error::FromSqlConversionFailure(
                    1,
                    rusqlite::types::Type::Text,
                    Box::new(std::io::Error::new(
                        std::io::ErrorKind::InvalidData,
                        format!("invalid noteable_type: {other}"),
                    )),
                )),
            };
            Ok(PendingFetch {
                project_id: row.get(0)?,
                noteable_type,
                noteable_iid: row.get(2)?,
                attempt_count: row.get(3)?,
            })
        })?
        .collect::<std::result::Result<Vec<_>, _>>()?;

    Ok(results)
}

/// Mark fetch as successful and remove from queue.
pub fn complete_fetch(
    conn: &Connection,
    project_id: i64,
    noteable_type: NoteableType,
    noteable_iid: i64,
) -> Result<()> {
    conn.execute(
        "DELETE FROM pending_discussion_fetches
         WHERE project_id = ? AND noteable_type = ? AND noteable_iid = ?",
        rusqlite::params![project_id, noteable_type.as_str(), noteable_iid],
    )?;
    Ok(())
}

/// Record fetch failure.
pub fn record_fetch_error(
    conn: &Connection,
    project_id: i64,
    noteable_type: NoteableType,
    noteable_iid: i64,
    error: &str,
) -> Result<()> {
    conn.execute(
        "UPDATE pending_discussion_fetches
         SET attempt_count = attempt_count + 1,
             last_attempt_at = ?,
             last_error = ?
         WHERE project_id = ? AND noteable_type = ? AND noteable_iid = ?",
        rusqlite::params![now_ms(), error, project_id, noteable_type.as_str(), noteable_iid],
    )?;
    Ok(())
}

Acceptance Criteria:

  • Updated entities queued for discussion fetch
  • Success removes from queue
  • Failure increments attempt_count
  • Processing bounded per run (max 100)
  • Exponential backoff respects attempt_count
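
The backoff formula (min_wait_ms = 1000 * 2^attempt_count, capped at one hour) can be sketched with a saturating shift so large attempt counts cannot overflow; `backoff_ms` is an illustrative helper:

```rust
const HOUR_MS: u64 = 3_600_000;

/// min_wait_ms = 1000 * 2^attempt_count, capped at one hour.
/// checked_shl + saturating_mul avoid overflow for large attempt counts.
fn backoff_ms(attempt_count: u32) -> u64 {
    let factor = 1u64.checked_shl(attempt_count).unwrap_or(u64::MAX);
    factor.saturating_mul(1000).min(HOUR_MS)
}

fn main() {
    assert_eq!(backoff_ms(0), 1_000);    // first retry after 1s
    assert_eq!(backoff_ms(1), 2_000);
    assert_eq!(backoff_ms(5), 32_000);
    assert_eq!(backoff_ms(12), HOUR_MS); // 4096s exceeds cap, clamped to 1h
    assert_eq!(backoff_ms(63), HOUR_MS); // no overflow at extreme counts
    println!("backoff schedule ok");
}
```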

6.3 Document Regenerator

File: src/documents/regenerator.rs

use rusqlite::Connection;

use crate::core::error::Result;
use crate::documents::{
    extract_issue_document, extract_mr_document, extract_discussion_document,
    DocumentData, SourceType,
};
use crate::ingestion::dirty_tracker::{get_dirty_sources, clear_dirty};

/// Result of regeneration run.
#[derive(Debug, Default)]
pub struct RegenerateResult {
    pub regenerated: usize,
    pub unchanged: usize,
    pub errored: usize,
}

/// Regenerate documents from dirty queue.
///
/// Process:
/// 1. Query dirty_sources ordered by queued_at
/// 2. For each: regenerate document, compute new hash
/// 3. ALWAYS upsert document (labels/paths may change even if content_hash unchanged)
/// 4. Track whether content_hash changed (for stats)
/// 5. Delete from dirty_sources (or record error on failure)
pub fn regenerate_dirty_documents(conn: &Connection) -> Result<RegenerateResult> {
    let dirty = get_dirty_sources(conn)?;
    let mut result = RegenerateResult::default();

    for (source_type, source_id) in &dirty {
        match regenerate_one(conn, *source_type, *source_id) {
            Ok(changed) => {
                if changed {
                    result.regenerated += 1;
                } else {
                    result.unchanged += 1;
                }
                clear_dirty(conn, *source_type, *source_id)?;
            }
            Err(e) => {
                // Fail-soft: record error but continue processing remaining items
                record_dirty_error(conn, *source_type, *source_id, &e.to_string())?;
                result.errored += 1;
            }
        }
    }

    Ok(result)
}

/// Regenerate a single document. Returns true if content_hash changed.
///
/// If the source entity has been deleted, the corresponding document
/// is also deleted (cascade cleans up labels, paths, embeddings).
fn regenerate_one(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<bool> {
    // Extractors return Option: None means source entity was deleted
    let doc = match source_type {
        SourceType::Issue => extract_issue_document(conn, source_id)?,
        SourceType::MergeRequest => extract_mr_document(conn, source_id)?,
        SourceType::Discussion => extract_discussion_document(conn, source_id)?,
    };

    let Some(doc) = doc else {
        // Source was deleted — remove the document (cascade handles FTS/embeddings)
        delete_document(conn, source_type, source_id)?;
        return Ok(true);
    };

    let existing_hash = get_existing_hash(conn, source_type, source_id)?;
    let changed = existing_hash.as_ref() != Some(&doc.content_hash);

    // Always upsert: labels/paths can change independently of content_hash
    upsert_document(conn, &doc)?;

    Ok(changed)
}

/// Delete a document by source identity (cascade handles FTS trigger, labels, paths, embeddings).
fn delete_document(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<()> {
    conn.execute(
        "DELETE FROM documents WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
    )?;
    Ok(())
}

/// Record a regeneration error on a dirty source for retry.
fn record_dirty_error(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
    error: &str,
) -> Result<()> {
    conn.execute(
        "UPDATE dirty_sources
         SET attempt_count = attempt_count + 1,
             last_attempt_at = ?,
             last_error = ?
         WHERE source_type = ? AND source_id = ?",
        rusqlite::params![crate::core::time::now_ms(), error, source_type.as_str(), source_id],
    )?;
    Ok(())
}

fn get_existing_hash(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<Option<String>> {
    let mut stmt = conn.prepare(
        "SELECT content_hash FROM documents WHERE source_type = ? AND source_id = ?"
    )?;

    let hash: Option<String> = stmt
        .query_row(rusqlite::params![source_type.as_str(), source_id], |row| row.get(0))
        .ok();

    Ok(hash)
}

fn upsert_document(conn: &Connection, doc: &DocumentData) -> Result<()> {
    // Upsert main document
    conn.execute(
        "INSERT INTO documents
         (source_type, source_id, project_id, author_username, label_names,
          created_at, updated_at, url, title, content_text, content_hash,
          is_truncated, truncated_reason)
         VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
         ON CONFLICT(source_type, source_id) DO UPDATE SET
           author_username = excluded.author_username,
           label_names = excluded.label_names,
           updated_at = excluded.updated_at,
           url = excluded.url,
           title = excluded.title,
           content_text = excluded.content_text,
           content_hash = excluded.content_hash,
           is_truncated = excluded.is_truncated,
           truncated_reason = excluded.truncated_reason",
        rusqlite::params![
            doc.source_type.as_str(),
            doc.source_id,
            doc.project_id,
            doc.author_username,
            serde_json::to_string(&doc.labels)?,
            doc.created_at,
            doc.updated_at,
            doc.url,
            doc.title,
            doc.content_text,
            doc.content_hash,
            doc.is_truncated,
            doc.truncated_reason,
        ],
    )?;

    // Get inserted/updated document ID
    let doc_id = get_document_id(conn, doc.source_type, doc.source_id)?;

    // Update labels
    conn.execute(
        "DELETE FROM document_labels WHERE document_id = ?",
        [doc_id],
    )?;
    for label in &doc.labels {
        conn.execute(
            "INSERT INTO document_labels (document_id, label_name) VALUES (?, ?)",
            rusqlite::params![doc_id, label],
        )?;
    }

    // Update paths
    conn.execute(
        "DELETE FROM document_paths WHERE document_id = ?",
        [doc_id],
    )?;
    for path in &doc.paths {
        conn.execute(
            "INSERT INTO document_paths (document_id, path) VALUES (?, ?)",
            rusqlite::params![doc_id, path],
        )?;
    }

    Ok(())
}

fn get_document_id(
    conn: &Connection,
    source_type: SourceType,
    source_id: i64,
) -> Result<i64> {
    let id: i64 = conn.query_row(
        "SELECT id FROM documents WHERE source_type = ? AND source_id = ?",
        rusqlite::params![source_type.as_str(), source_id],
        |row| row.get(0),
    )?;
    Ok(id)
}

**Acceptance Criteria:**

- Dirty sources get documents regenerated
- Documents are always upserted (labels/paths can change independently); hash comparison only classifies changed vs unchanged so embedding can skip unchanged content
- Deleted sources have their documents removed, with cascade cleaning up FTS, labels, paths, and embeddings
- FTS triggers fire on document update
- Queue cleared after successful processing; failures recorded with attempt_count for retry
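The always-upsert-but-report-changed behavior of `regenerate_one` reduces to a small decision over the stored and freshly computed hashes. A sketch for unit testing; the `Outcome` enum and `classify` name are illustrative, not part of the module:

```rust
/// What the regenerator reports for one dirty source.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Extractor returned None: source entity is gone, document deleted.
    Deleted,
    /// content_hash differs from the stored document; re-embedding required.
    Changed,
    /// Hash matches; document is still upserted (labels/paths may differ),
    /// but embedding can be skipped.
    Unchanged,
}

/// Classify a regeneration from the stored hash and the new one
/// (None when the source entity no longer exists).
fn classify(existing_hash: Option<&str>, new_hash: Option<&str>) -> Outcome {
    match new_hash {
        None => Outcome::Deleted,
        Some(h) if existing_hash == Some(h) => Outcome::Unchanged,
        Some(_) => Outcome::Changed,
    }
}
```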

### 6.4 CLI: `gi sync`

File: src/cli/commands/sync.rs

//! Sync command - orchestrate full sync pipeline.

use serde::Serialize;

use crate::core::error::Result;
use crate::Config;

/// Sync result summary.
#[derive(Debug, Serialize)]
pub struct SyncResult {
    pub issues_updated: usize,
    pub mrs_updated: usize,
    pub discussions_fetched: usize,
    pub documents_regenerated: usize,
    pub documents_embedded: usize,
}

/// Sync options.
#[derive(Debug, Default)]
pub struct SyncOptions {
    pub full: bool,       // Reset cursors, fetch everything
    pub force: bool,      // Override stale lock
    pub no_embed: bool,   // Skip embedding step
    pub no_docs: bool,    // Skip document regeneration
}

/// Run sync orchestration.
///
/// Steps:
/// 1. Acquire app lock with heartbeat
/// 2. Ingest delta (issues, MRs) based on cursors
/// 3. Process pending_discussion_fetches queue (bounded)
/// 4. Apply rolling backfill window (configurable, default 14 days)
/// 5. Regenerate documents from dirty_sources
/// 6. Embed documents with changed content_hash
/// 7. Release lock, record sync_run
pub async fn run_sync(config: &Config, options: SyncOptions) -> Result<SyncResult> {
    // Implementation uses existing ingestion orchestrator
    // and new document/embedding pipelines
    todo!()
}

/// Print human-readable sync output.
pub fn print_sync(result: &SyncResult, elapsed_secs: u64) {
    println!("Sync complete:");
    println!("  Issues updated:        {:>6}", result.issues_updated);
    println!("  MRs updated:           {:>6}", result.mrs_updated);
    println!("  Discussions fetched:   {:>6}", result.discussions_fetched);
    println!("  Documents regenerated: {:>6}", result.documents_regenerated);
    println!("  Documents embedded:    {:>6}", result.documents_embedded);
    println!("  Elapsed: {}m {}s", elapsed_secs / 60, elapsed_secs % 60);
}

/// Print JSON sync output for robot mode.
pub fn print_sync_json(result: &SyncResult, elapsed_ms: u64) {
    let output = serde_json::json!({
        "ok": true,
        "data": result,
        "meta": {
            "elapsed_ms": elapsed_ms
        }
    });
    println!("{}", serde_json::to_string_pretty(&output).unwrap());
}

CLI integration:

/// Sync subcommand arguments.
#[derive(Args)]
pub struct SyncArgs {
    /// Reset cursors, fetch everything
    #[arg(long)]
    full: bool,

    /// Override stale lock
    #[arg(long)]
    force: bool,

    /// Skip embedding step
    #[arg(long)]
    no_embed: bool,

    /// Skip document regeneration
    #[arg(long)]
    no_docs: bool,
}

**Acceptance Criteria:**

- Orchestrates full sync pipeline
- Respects app lock
- `--full` resets cursors
- `--no-embed` skips embedding
- `--no-docs` skips document regeneration
- Progress reporting in human mode
- JSON summary in robot mode
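The flag semantics can be pinned down with a pure step-selection helper, which also gives `run_sync` something unit-testable before the orchestration is filled in. The `Step` enum and `plan_steps` name are illustrative:

```rust
/// Pipeline steps in execution order.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Step {
    Ingest,            // delta ingest of issues/MRs from cursors
    FetchDiscussions,  // drain pending_discussion_fetches (bounded)
    RegenerateDocs,    // rebuild documents from dirty_sources
    Embed,             // embed documents with changed content_hash
}

/// Decide which steps run for a given flag combination.
fn plan_steps(no_docs: bool, no_embed: bool) -> Vec<Step> {
    let mut steps = vec![Step::Ingest, Step::FetchDiscussions];
    if !no_docs {
        steps.push(Step::RegenerateDocs);
    }
    if !no_embed {
        steps.push(Step::Embed);
    }
    steps
}
```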

## Testing Strategy

### Unit Tests

| Module | Test File | Coverage |
|---|---|---|
| Document extractor | `src/documents/extractor.rs` (mod tests) | Issue/MR/discussion extraction |
| Truncation | `src/documents/truncation.rs` (mod tests) | All edge cases |
| RRF ranking | `src/search/rrf.rs` (mod tests) | Score computation, merging |
| Content hash | `src/documents/extractor.rs` (mod tests) | Deterministic hashing |
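The RRF unit under test merges per-retriever ranks without score normalization: each retriever contributes `1/(k + rank)` per document, summed across retrievers. A minimal sketch with the conventional `k = 60`; the `rrf_merge` name is illustrative:

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over any number of ranked document-id lists.
/// `rank` is the 1-based position within each retriever's list.
fn rrf_merge(ranked_lists: &[Vec<i64>], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for list in ranked_lists {
        for (i, doc_id) in list.iter().enumerate() {
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    // Sort descending by fused score.
    let mut merged: Vec<(i64, f64)> = scores.into_iter().collect();
    merged.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    merged
}
```

A document appearing in both the vector and FTS lists accumulates two reciprocal terms, which is exactly why hybrid hits outrank single-retriever hits.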

### Integration Tests

| Feature | Test File | Coverage |
|---|---|---|
| FTS search | `tests/fts_search.rs` | Stemming, empty results |
| Embedding storage | `tests/embedding.rs` | sqlite-vec operations |
| Hybrid search | `tests/hybrid_search.rs` | Combined retrieval |
| Sync orchestration | `tests/sync.rs` | Full pipeline |

### Golden Query Suite

File: tests/fixtures/golden_queries.json

```json
[
  {
    "query": "authentication redesign",
    "expected_urls": [".../-/issues/234", ".../-/merge_requests/847"],
    "min_results": 1,
    "max_rank": 10
  }
]
```

Each query must return at least `min_results` results and surface at least one of its `expected_urls` within its `max_rank` cutoff (top 10 in the example above).
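The pass/fail rule for one fixture entry only needs membership and rank checks. A pure sketch; the struct fields mirror the JSON above, while the `passes` name is illustrative:

```rust
/// One golden query case, mirroring the JSON fixture fields.
struct GoldenQuery {
    query: String,
    expected_urls: Vec<String>,
    min_results: usize,
    max_rank: usize,
}

/// A case passes when enough results came back and at least one expected
/// URL appears within the first `max_rank` results.
fn passes(case: &GoldenQuery, result_urls: &[String]) -> bool {
    if result_urls.len() < case.min_results {
        return false;
    }
    result_urls
        .iter()
        .take(case.max_rank)
        .any(|u| case.expected_urls.iter().any(|e| u == e))
}
```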


### CLI Smoke Tests

| Command | Expected | Pass Criteria |
|---|---|---|
| `gi generate-docs` | Progress, count | Completes, count > 0 |
| `gi generate-docs` (re-run) | 0 regenerated | Hash comparison works |
| `gi embed` | Progress, count | Completes, count matches docs |
| `gi embed` (re-run) | 0 embedded | Skips unchanged |
| `gi stats` | Coverage stats | Shows 100% after embed |
| `gi search "auth" --mode=lexical` | Results | Works without Ollama |
| `gi search "auth"` | Hybrid results | Vector + FTS combined |
| `gi search "auth" --explain` | Rank breakdown | Shows vector/FTS/RRF |
| `gi search "auth" --type=mr` | Filtered results | Only MRs |
| `gi search "auth" --label=bug` | Filtered results | Only labeled docs |
| `gi search "nonexistent123"` | No results | Graceful empty state |
| `gi sync` | Full pipeline | All steps complete |
| `gi sync --no-embed` | Skip embedding | Docs generated, not embedded |

### Data Integrity Checks

- `documents` count = issues + MRs + discussions
- `documents_fts` count = `documents` count
- `embeddings` count = `documents` count (after full embed)
- `embedding_metadata.content_hash` = `documents.content_hash` for all rows
- All `document_labels` rows reference valid documents
- All `document_paths` rows reference valid documents
- No orphaned embeddings (`embeddings.rowid` without matching `documents.id`)
- Discussion documents exclude system notes
- Discussion documents include parent title
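The count-based checks above can be expressed as one pure helper over counts gathered from the database. A sketch for an integrity test; the `Counts` struct and `count_violations` name are illustrative, and the embeddings check assumes a completed full embed:

```rust
/// Snapshot of table counts gathered via SELECT COUNT(*) queries.
struct Counts {
    issues: u64,
    mrs: u64,
    discussions: u64,
    documents: u64,
    documents_fts: u64,
    embeddings: u64,
}

/// Return human-readable violations of the count invariants, empty if clean.
fn count_violations(c: &Counts) -> Vec<String> {
    let mut v = Vec::new();
    let sources = c.issues + c.mrs + c.discussions;
    if c.documents != sources {
        v.push(format!("documents {} != sources {}", c.documents, sources));
    }
    if c.documents_fts != c.documents {
        v.push(format!(
            "documents_fts {} != documents {}",
            c.documents_fts, c.documents
        ));
    }
    if c.embeddings != c.documents {
        v.push(format!(
            "embeddings {} != documents {}",
            c.embeddings, c.documents
        ));
    }
    v
}
```

The remaining checks (orphaned labels/paths/embeddings, hash agreement) are naturally expressed as SQL anti-joins asserted to return zero rows.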

## Success Criteria

Checkpoint 3 is complete when:

1. **Lexical search works without Ollama**
   - `gi search "query" --mode=lexical` returns relevant results
   - All filters functional
2. **Semantic search works with Ollama**
   - `gi embed` completes successfully
   - `gi search "query"` returns semantically relevant results
   - `--explain` shows ranking breakdown
3. **Hybrid search combines both**
   - Documents appearing in both retrievers rank higher
   - Graceful degradation when Ollama unavailable
4. **Incremental sync is efficient**
   - `gi sync` only processes changed entities
   - Re-embedding only happens for changed documents
   - Progress visible during long syncs
5. **Data integrity maintained**
   - All counts match between tables
   - No orphaned records
   - Hashes consistent
6. **Tests pass**
   - Unit tests for core algorithms
   - Integration tests for pipelines
   - Golden queries return expected results