feat(documents): Add document generation pipeline with dirty tracking

Implements the documents module that transforms raw ingested entities
(issues, MRs, discussions) into searchable document blobs stored in the
documents table. This is the foundation for both FTS5 lexical search and
vector embedding.

Key components:

- documents::extractor: Renders entities into structured text documents.
  Issues include title, description, labels, milestone, assignees, and
  threaded discussion summaries. MRs additionally include source/target
  branches, reviewers, and approval status. Discussions are rendered with
  full note threading.

- documents::regenerator: Drains the dirty_queue table to regenerate only
  documents whose source entities changed since last sync. Supports full
  rebuild mode (seeds all entities into dirty queue first) and
  project-scoped regeneration.

- documents::truncation: Safety cap at 2MB per document to prevent
  pathological outliers from degrading FTS or embedding performance.

- ingestion::dirty_tracker: Marks entities as dirty inside the ingestion
  transaction so document regeneration stays consistent with data changes.
  Uses INSERT OR IGNORE to deduplicate.

- ingestion::discussion_queue: Queue-based discussion fetching that
  isolates individual discussion failures from the broader ingestion
  pipeline, preventing a single corrupt discussion from blocking an entire
  project sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
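
For illustration only, here is a minimal std-only sketch of the mark-then-drain flow described for ingestion::dirty_tracker and documents::regenerator. It is not the code in this commit: the `DirtyQueue` and `SourceType` names, the `(source_type, source_id)` key, and the use of an in-memory set in place of the dirty_queue table are all assumptions. The set's uniqueness stands in for the deduplication that INSERT OR IGNORE would give against a unique key.

```rust
use std::collections::BTreeSet;

/// Kinds of source entity that can back a document (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum SourceType {
    Issue,
    MergeRequest,
    Discussion,
}

/// In-memory stand-in for the dirty_queue table (an assumption for this
/// sketch). Set uniqueness plays the role of a unique key on
/// (source_type, source_id).
#[derive(Default)]
struct DirtyQueue {
    entries: BTreeSet<(SourceType, i64)>,
}

impl DirtyQueue {
    /// Mirrors INSERT OR IGNORE: a duplicate key is silently skipped.
    /// Returns true only when a new entry was actually added.
    fn mark_dirty(&mut self, source_type: SourceType, source_id: i64) -> bool {
        self.entries.insert((source_type, source_id))
    }

    /// Removes and returns every queued entry, leaving the queue empty.
    fn drain(&mut self) -> Vec<(SourceType, i64)> {
        std::mem::take(&mut self.entries).into_iter().collect()
    }
}

fn main() {
    let mut queue = DirtyQueue::default();
    queue.mark_dirty(SourceType::Issue, 42);
    queue.mark_dirty(SourceType::Issue, 42); // same entity touched twice in one sync
    queue.mark_dirty(SourceType::MergeRequest, 7);
    queue.mark_dirty(SourceType::Discussion, 3);

    // Each dirty entity is visited once, however many times it was marked.
    for (source_type, source_id) in queue.drain() {
        println!("regenerate {:?} #{}", source_type, source_id);
    }
}
```

In a SQLite-backed version, marking would run inside the same transaction as the entity write, so a rolled-back ingest would leave no stray queue rows.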
1085
src/documents/extractor.rs
Normal file
File diff suppressed because it is too large