feat(embedding): Add Ollama-powered vector embedding pipeline
Implements the embedding module that generates vector representations of
documents using a local Ollama instance with the nomic-embed-text model.
These embeddings enable semantic (vector) search and the hybrid search mode
that fuses lexical and semantic results via RRF.

Key components:

- embedding::ollama: HTTP client for the Ollama /api/embeddings endpoint.
  Handles connection errors with actionable error messages (OllamaUnavailable,
  OllamaModelNotFound) and validates response dimensions.
- embedding::chunking: Splits long documents into overlapping,
  paragraph-aware chunks for embedding. Uses a configurable max token
  estimate (8192 by default for nomic-embed-text) with 10% overlap to
  preserve cross-chunk context.
- embedding::chunk_ids: Encodes chunk identity as doc_id * 1000 + chunk_index
  for the embeddings table rowid. This lets vector search map results back to
  documents and deduplicate by doc_id efficiently.
- embedding::change_detector: Compares each document's content_hash against
  the stored embedding hash to skip re-embedding unchanged documents, keeping
  incremental embedding runs fast.
- embedding::pipeline: Orchestrates the full embedding flow: detect changed
  documents, chunk them, call Ollama with configurable concurrency (default
  4), and store the results. Supports --retry-failed to re-attempt previously
  failed embeddings.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
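The rowid scheme described for embedding::chunk_ids (doc_id * 1000 + chunk_index) can be sketched as a pair of helpers. This is an illustrative sketch, not the module's actual API; the names `encode_chunk_id` and `decode_chunk_id` and the `i64` signatures are assumptions.

```rust
// Sketch of the chunk rowid encoding from the commit message: each chunk
// gets rowid = doc_id * 1000 + chunk_index, so a vector-search hit maps
// back to its document with one integer division. Names are illustrative.

const CHUNKS_PER_DOC: i64 = 1000; // implies at most 1000 chunks per document

fn encode_chunk_id(doc_id: i64, chunk_index: i64) -> i64 {
    debug_assert!((0..CHUNKS_PER_DOC).contains(&chunk_index));
    doc_id * CHUNKS_PER_DOC + chunk_index
}

fn decode_chunk_id(rowid: i64) -> (i64, i64) {
    // Inverse of encode_chunk_id: recover (doc_id, chunk_index).
    (rowid / CHUNKS_PER_DOC, rowid % CHUNKS_PER_DOC)
}

fn main() {
    let rowid = encode_chunk_id(42, 7);
    assert_eq!(rowid, 42007);
    assert_eq!(decode_chunk_id(rowid), (42, 7));
    println!("rowid {rowid} -> doc 42, chunk 7");
}
```

Dividing the rowid by the chunks-per-document constant is what makes deduplication by doc_id a cheap integer operation at query time.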
 src/embedding/mod.rs | 9 +++++++++ (new file)
@@ -0,0 +1,9 @@
+pub mod change_detector;
+pub mod chunk_ids;
+pub mod chunking;
+pub mod ollama;
+pub mod pipeline;
+
+pub use change_detector::{count_pending_documents, find_pending_documents, PendingDocument};
+pub use chunking::{split_into_chunks, CHUNK_MAX_BYTES, CHUNK_OVERLAP_CHARS};
+pub use pipeline::{embed_documents, EmbedResult};
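The paragraph-aware chunking with ~10% overlap that embedding::chunking exports might look roughly like the sketch below. The real module works against CHUNK_MAX_BYTES / CHUNK_OVERLAP_CHARS and a token estimate; the constants, the character-based budget, and the exact `split_into_chunks` signature here are assumptions for illustration.

```rust
// Illustrative sketch of paragraph-aware chunking with overlap, as described
// in the commit message. Constants and the signature are assumptions; the
// real embedding::chunking module may differ.

const MAX_CHUNK_CHARS: usize = 2000;               // stand-in for the real byte budget
const OVERLAP_CHARS: usize = MAX_CHUNK_CHARS / 10; // ~10% overlap between chunks

fn split_into_chunks(text: &str) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for para in text.split("\n\n") {
        // If adding this paragraph would overflow the budget, flush the chunk
        // and seed the next one with the tail of the previous chunk as overlap.
        if !current.is_empty() && current.len() + para.len() + 2 > MAX_CHUNK_CHARS {
            let mut tail_start = current.len().saturating_sub(OVERLAP_CHARS);
            // Nudge forward to a UTF-8 char boundary so slicing cannot panic.
            while !current.is_char_boundary(tail_start) {
                tail_start += 1;
            }
            let overlap = current[tail_start..].to_string();
            chunks.push(std::mem::take(&mut current));
            current = overlap;
        }
        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(para);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let para = "x".repeat(600);
    let text = vec![para; 6].join("\n\n");
    let chunks = split_into_chunks(&text);
    assert!(chunks.len() > 1);
    assert!(chunks.iter().all(|c| c.len() <= MAX_CHUNK_CHARS));
    println!("split into {} chunks", chunks.len());
}
```

Carrying the tail of each flushed chunk into the next one is what preserves cross-chunk context, at the cost of embedding roughly 10% of the text twice.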