
Taylor Eernisse 7d07f95d4c fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures

Reduces CHUNK_MAX_BYTES from 32KB to 6KB and CHUNK_OVERLAP_CHARS from
500 to 200 to stay within nomic-embed-text's 8,192-token context
window. This commit addresses all downstream consequences of that
reduction:

- Config drift detection: find_pending_documents and
  count_pending_documents now take model_name and compare
  chunk_max_bytes, model, and dims against stored metadata. Documents
  embedded with stale config are automatically re-queued.

- Overflow guard: documents producing >= CHUNK_ROWID_MULTIPLIER chunks
  are skipped with a sentinel error recorded in embedding_metadata,
  preventing both rowid collision and infinite re-processing loops.

- Deferred clearing: old embeddings are no longer cleared before
  attempting new ones. clear_document_embeddings is deferred until the
  first successful chunk embedding, so if all chunks fail the document
  retains its previous embeddings rather than losing all data.

- Savepoints: each page of DB writes is wrapped in a SQLite savepoint
  so a crash mid-page rolls back atomically instead of leaving partial
  state (cleared embeddings with no replacements).

- Per-chunk retry on context overflow: when a batch fails with a
  context-length error, each chunk is retried individually so one
  oversized chunk doesn't poison the entire batch.

- Adaptive dedup in vector search: replaces the static 3x over-fetch
  multiplier with a dynamic one based on actual max chunks per document
  (using the new chunk_count column with a fallback COUNT query for
  pre-migration data). Also replaces partial_cmp with total_cmp for
  f64 distance sorting.

- Stores chunk_max_bytes and chunk_count (on sentinel rows) in
  embedding_metadata to support config drift detection and adaptive
  dedup without runtime queries.
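
As a rough illustration of the adaptive dedup and `total_cmp` points above, the sketch below sorts vector hits by distance and keeps the best chunk per document. The `VectorHit` type, the function names, and the over-fetch formula are assumptions made for this example; they are not taken from the commit itself.

```rust
/// Hypothetical search hit: one chunk of a document and its vector distance.
#[derive(Debug, Clone, PartialEq)]
pub struct VectorHit {
    pub document_id: i64,
    pub distance: f64,
}

/// Assumed formula: fetch enough chunk rows that `limit` distinct documents
/// can survive dedup even if the hits cluster in heavily chunked documents.
pub fn overfetch_limit(limit: usize, max_chunks_per_doc: usize) -> usize {
    limit.saturating_mul(max_chunks_per_doc.max(1))
}

/// Keep the closest chunk per document, then return the top `limit` hits.
pub fn dedup_by_document(mut hits: Vec<VectorHit>, limit: usize) -> Vec<VectorHit> {
    // total_cmp defines a total order over all f64 values, including NaN,
    // so the comparator never needs an unwrap the way partial_cmp does.
    hits.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    let mut seen = std::collections::HashSet::new();
    hits.into_iter()
        // insert() returns false for a document id that was already kept.
        .filter(|hit| seen.insert(hit.document_id))
        .take(limit)
        .collect()
}
```

With a static multiplier, a document split into many chunks can crowd out other documents before dedup runs; scaling the fetch size by the observed maximum chunk count is one way to avoid that.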

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-03 09:35:08 -05:00

Name	Last commit	Date
.beads	chore(beads): Update issue tracker with search pipeline beads	2026-01-30 15:47:39 -05:00
docs	docs: Update documentation for search pipeline and Phase A spec	2026-01-30 15:47:33 -05:00
migrations	feat(db): Add migration 010 for chunk config tracking columns	2026-02-03 09:34:48 -05:00
src	fix(embedding): Harden pipeline against chunk overflow, config drift, and partial failures	2026-02-03 09:35:08 -05:00
tests	test: Add test suites for embedding, FTS, hybrid search, and golden queries	2026-01-30 15:47:19 -05:00
.gitignore	Update name to gitlore instead of gitlab-inbox	2026-01-28 15:49:14 -05:00
AGENTS.md	docs: Update exit codes, add config precedence and shell completions	2026-01-30 16:55:02 -05:00
build.rs	build: Add clap_complete, libc dependencies and git hash build script	2026-01-30 16:53:51 -05:00
Cargo.lock	build: Add clap_complete, libc dependencies and git hash build script	2026-01-30 16:53:51 -05:00
Cargo.toml	build: Add clap_complete, libc dependencies and git hash build script	2026-01-30 16:53:51 -05:00
PRD.md	Update name to gitlore instead of gitlab-inbox	2026-01-28 15:49:14 -05:00
README.md	docs: Update exit codes, add config precedence and shell completions	2026-01-30 16:55:02 -05:00
RUST_CLI_TOOLS_BEST_PRACTICES_GUIDE.md	Begin planning phase 3-5 implementation	2026-01-27 22:40:49 -05:00
SPEC-REVISIONS-2.md	Update name to gitlore instead of gitlab-inbox	2026-01-28 15:49:14 -05:00
SPEC-REVISIONS-3.md	Update name to gitlore instead of gitlab-inbox	2026-01-28 15:49:14 -05:00
SPEC-REVISIONS.md	Update name to gitlore instead of gitlab-inbox	2026-01-28 15:49:14 -05:00
SPEC.md	Update name to gitlore instead of gitlab-inbox	2026-01-28 15:49:14 -05:00

README.md

Gitlore

Local GitLab data management with semantic search. Syncs issues, MRs, discussions, and notes from GitLab to a local SQLite database for fast, offline-capable querying, filtering, and hybrid search.

Features

Local-first: All data stored in SQLite for instant queries
Incremental sync: Cursor-based sync only fetches changes since last sync
Full re-sync: Reset cursors and fetch all data from scratch when needed
Multi-project: Track issues and MRs across multiple GitLab projects
Rich filtering: Filter by state, author, assignee, labels, milestone, due date, draft status, reviewer, branches
Hybrid search: Combines FTS5 lexical search with Ollama-powered vector embeddings via Reciprocal Rank Fusion
Raw payload storage: Preserves original GitLab API responses for debugging
Discussion threading: Full support for issue and MR discussions including inline code review comments
Robot mode: Machine-readable JSON output with structured errors and meaningful exit codes
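
To make the hybrid search feature concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document ids (one lexical, one semantic). The function name `rrf_fuse` and the choice of `k` are assumptions for illustration; `k = 60` is a commonly used constant, not necessarily the value Gitlore uses.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document ids with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank) per document, with rank starting at 1.
pub fn rrf_fuse(rankings: &[Vec<i64>], k: f64) -> Vec<(i64, f64)> {
    let mut scores: HashMap<i64, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, doc_id) in ranking.iter().enumerate() {
            let rank = idx as f64 + 1.0;
            *scores.entry(*doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(i64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Because RRF only looks at ranks, it can combine FTS5 scores and vector distances without normalizing them onto a shared scale. A document that appears in both lists accumulates two contributions and tends to rise above documents found by only one method.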

Installation

cargo install --path .

Or build a release binary without installing it:

cargo build --release
./target/release/lore --help

Quick Start

# Initialize configuration (interactive)
lore init

# Verify authentication
lore auth

# Sync everything from GitLab (issues + MRs + docs + embeddings)
lore sync

# List recent issues
lore issues -n 10

# List open merge requests
lore mrs -s opened

# Show issue details
lore issues 123

# Show MR details with discussions
lore mrs 456

# Search across all indexed data
lore search "authentication bug"

# Robot mode (machine-readable JSON)
lore -J issues -n 5 | jq .

Configuration

Configuration is stored in ~/.config/lore/config.json (or $XDG_CONFIG_HOME/lore/config.json when XDG_CONFIG_HOME is set).

Example Configuration

{
  "gitlab": {
    "baseUrl": "https://gitlab.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project" },
    { "path": "other-group/other-project" }
  ],
  "sync": {
    "backfillDays": 14,
    "staleLockMinutes": 10,
    "heartbeatIntervalSeconds": 30,
    "cursorRewindSeconds": 2,
    "primaryConcurrency": 4,
    "dependentConcurrency": 2
  },
  "storage": {
    "compressRawPayloads": true
  },
  "embedding": {
    "provider": "ollama",
    "model": "nomic-embed-text",
    "baseUrl": "http://localhost:11434",
    "concurrency": 4
  }
}

Configuration Options

Section	Field	Default	Description
`gitlab`	`baseUrl`	--	GitLab instance URL (required)
`gitlab`	`tokenEnvVar`	`GITLAB_TOKEN`	Environment variable containing API token
`projects`	`path`	--	Project path (e.g., `group/project`)
`sync`	`backfillDays`	`14`	Days to backfill on initial sync
`sync`	`staleLockMinutes`	`10`	Minutes before sync lock considered stale
`sync`	`heartbeatIntervalSeconds`	`30`	Frequency of lock heartbeat updates
`sync`	`cursorRewindSeconds`	`2`	Seconds to rewind cursor for overlap safety
`sync`	`primaryConcurrency`	`4`	Concurrent GitLab requests for primary resources
`sync`	`dependentConcurrency`	`2`	Concurrent requests for dependent resources
`storage`	`dbPath`	`~/.local/share/lore/lore.db`	Database file path
`storage`	`backupDir`	`~/.local/share/lore/backups`	Backup directory
`storage`	`compressRawPayloads`	`true`	Compress stored API responses with gzip
`embedding`	`provider`	`ollama`	Embedding provider
`embedding`	`model`	`nomic-embed-text`	Model name for embeddings
`embedding`	`baseUrl`	`http://localhost:11434`	Ollama server URL
`embedding`	`concurrency`	`4`	Concurrent embedding requests
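
The two time-based `sync` settings can be read as simple arithmetic. The sketch below shows one plausible interpretation of `cursorRewindSeconds` and `staleLockMinutes`; the function names and the use of Unix-second timestamps are assumptions for illustration, not Gitlore's actual internals.

```rust
/// Start of the next incremental fetch window, in Unix seconds.
/// Rewinding the cursor slightly re-fetches a small overlap so that items
/// updated right at the cursor boundary are not missed.
pub fn fetch_since(cursor_unix_secs: u64, cursor_rewind_seconds: u64) -> u64 {
    cursor_unix_secs.saturating_sub(cursor_rewind_seconds)
}

/// A lock is treated as stale once its last heartbeat is older than the
/// configured threshold (hypothetical check; times are Unix seconds).
pub fn lock_is_stale(now: u64, last_heartbeat: u64, stale_lock_minutes: u64) -> bool {
    now.saturating_sub(last_heartbeat) > stale_lock_minutes.saturating_mul(60)
}
```

With the defaults, a cursor at `1_700_000_000` would resume fetching from `1_699_999_998`, and a lock whose heartbeat is more than 600 seconds old would be eligible for takeover.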

Config File Resolution

The config file is resolved in this order:

1. --config / -c CLI flag
2. LORE_CONFIG_PATH environment variable
3. ~/.config/lore/config.json (XDG default)
4. ./lore.config.json (local fallback for development)
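
The resolution order above can be sketched as a small pure function. The name `resolve_config_path`, its parameters, and the rule that an explicit flag or environment variable wins even when the file is missing are assumptions for this example; the `exists` callback stands in for a filesystem check so the logic can be tested without touching disk.

```rust
use std::path::{Path, PathBuf};

/// Pick the config file to load, following the documented order.
pub fn resolve_config_path(
    cli_flag: Option<&Path>,
    env_override: Option<&Path>,
    xdg_default: &Path,
    local_fallback: &Path,
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    // Steps 1 and 2: explicit choices win outright (assumption: the caller
    // reports a clear error later if the chosen file does not exist).
    if let Some(path) = cli_flag {
        return Some(path.to_path_buf());
    }
    if let Some(path) = env_override {
        return Some(path.to_path_buf());
    }
    // Steps 3 and 4: first candidate that exists on disk.
    for candidate in [xdg_default, local_fallback] {
        if exists(candidate) {
            return Some(candidate.to_path_buf());
        }
    }
    None
}
```

In a real binary, `exists` could simply be `|p| p.exists()`, and a `None` result would correspond to the "Config not found" error.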

GitLab Token

Create a personal access token with read_api scope:

1. Go to GitLab > Settings > Access Tokens
2. Create token with read_api scope
3. Export it: export GITLAB_TOKEN=glpat-xxxxxxxxxxxx

Environment Variables

Variable	Purpose	Required
`GITLAB_TOKEN`	GitLab API authentication token (name configurable via `gitlab.tokenEnvVar`)	Yes
`LORE_CONFIG_PATH`	Override config file location	No
`LORE_ROBOT`	Enable robot mode globally (set to `true` or `1`)	No
`XDG_CONFIG_HOME`	XDG Base Directory for config (fallback: `~/.config`)	No
`XDG_DATA_HOME`	XDG Base Directory for data (fallback: `~/.local/share`)	No
`RUST_LOG`	Logging level filter (e.g., `lore=debug`)	No

Commands

`lore issues`

Query issues from local database, or show a specific issue.

lore issues                           # Recent issues (default 50)
lore issues 123                       # Show issue #123 with discussions
lore issues 123 -p group/repo        # Disambiguate by project
lore issues -n 100                    # More results
lore issues -s opened                 # Only open issues
lore issues -s closed                 # Only closed issues
lore issues -a username               # By author (@ prefix optional)
lore issues -A username               # By assignee (@ prefix optional)
lore issues -l bug                    # By label (AND logic)
lore issues -l bug -l urgent          # Multiple labels
lore issues -m "v1.0"                 # By milestone title
lore issues --since 7d               # Updated in last 7 days
lore issues --since 2w               # Updated in last 2 weeks
lore issues --since 2024-01-01       # Updated since date
lore issues --due-before 2024-12-31  # Due before date
lore issues --has-due                 # Only issues with due dates
lore issues -p group/repo            # Filter by project
lore issues --sort created --asc     # Sort by created date, ascending
lore issues -o                        # Open first result in browser

When listing, output includes: IID, title, state, author, assignee, labels, and update time.

When showing a single issue (e.g., lore issues 123), output includes: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions.
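
The `--since` values shown in the examples (`7d`, `2w`, `2024-01-01`) suggest a small grammar: a relative window in days or weeks, or an absolute date. The parser below is a minimal sketch of that grammar; the `Since` enum and `parse_since` function are hypothetical names, and the date branch performs only basic range checks.

```rust
/// Parsed form of a `--since`-style value (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Since {
    /// Relative window in whole days (`7d` -> 7, `2w` -> 14).
    DaysAgo(u32),
    /// Absolute calendar date as (year, month, day).
    Date(u32, u32, u32),
}

pub fn parse_since(input: &str) -> Option<Since> {
    let s = input.trim();
    if let Some(n) = s.strip_suffix('d') {
        return n.parse::<u32>().ok().map(Since::DaysAgo);
    }
    if let Some(n) = s.strip_suffix('w') {
        return n.parse::<u32>().ok()?.checked_mul(7).map(Since::DaysAgo);
    }
    // Otherwise expect YYYY-MM-DD.
    let mut parts = s.split('-');
    let (y, m, d) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() || y.len() != 4 || m.len() != 2 || d.len() != 2 {
        return None;
    }
    let y: u32 = y.parse().ok()?;
    let m: u32 = m.parse().ok()?;
    let d: u32 = d.parse().ok()?;
    if (1..=12).contains(&m) && (1..=31).contains(&d) {
        Some(Since::Date(y, m, d))
    } else {
        None
    }
}
```

Returning `None` for anything unrecognized lets the CLI layer turn a bad value into a usage error rather than silently ignoring the filter.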

`lore mrs`

Query merge requests from local database, or show a specific MR.

lore mrs                              # Recent MRs (default 50)
lore mrs 456                          # Show MR !456 with discussions
lore mrs 456 -p group/repo           # Disambiguate by project
lore mrs -n 100                       # More results
lore mrs -s opened                    # Only open MRs
lore mrs -s merged                    # Only merged MRs
lore mrs -s closed                    # Only closed MRs
lore mrs -s locked                    # Only locked MRs
lore mrs -s all                       # All states
lore mrs -a username                  # By author (@ prefix optional)
lore mrs -A username                  # By assignee (@ prefix optional)
lore mrs -r username                  # By reviewer (@ prefix optional)
lore mrs -d                           # Only draft/WIP MRs
lore mrs -D                           # Exclude draft MRs
lore mrs --target main               # By target branch
lore mrs --source feature/foo        # By source branch
lore mrs -l needs-review              # By label (AND logic)
lore mrs --since 7d                  # Updated in last 7 days
lore mrs -p group/repo               # Filter by project
lore mrs --sort created --asc        # Sort by created date, ascending
lore mrs -o                           # Open first result in browser

When listing, output includes: IID, title (with [DRAFT] prefix if applicable), state, author, assignee, labels, and update time.

When showing a single MR (e.g., lore mrs 456), output includes: title, description, state, draft status, author, assignees, reviewers, labels, source/target branches, merge status, web URL, and threaded discussions. Inline code review comments (DiffNotes) display file context in the format [src/file.ts:45].

`lore search`

Search across indexed documents using hybrid (lexical + semantic), lexical-only, or semantic-only modes.

lore search "authentication bug"              # Hybrid search (default)
lore search "login flow" --mode lexical       # FTS5 lexical only
lore search "login flow" --mode semantic      # Vector similarity only
lore search "auth" --type issue               # Filter by source type
lore search "auth" --type mr                  # MR documents only
lore search "auth" --type discussion          # Discussion documents only
lore search "deploy" --author username        # Filter by author
lore search "deploy" -p group/repo           # Filter by project
lore search "deploy" --label backend          # Filter by label (AND logic)
lore search "deploy" --path src/             # Filter by file path (trailing / for prefix)
lore search "deploy" --after 7d              # Created after (7d, 2w, or YYYY-MM-DD)
lore search "deploy" --updated-after 2w      # Updated after
lore search "deploy" -n 50                    # Limit results (default 20, max 100)
lore search "deploy" --explain               # Show ranking explanation per result
lore search "deploy" --fts-mode raw          # Raw FTS5 query syntax (advanced)

Search requires that lore generate-docs (or lore sync) has been run at least once. Semantic and hybrid modes also need Ollama running with the configured embedding model.

`lore sync`

Run the full sync pipeline: ingest from GitLab, generate searchable documents, and compute embeddings.

lore sync                    # Full pipeline
lore sync --full             # Reset cursors, fetch everything
lore sync --force            # Override stale lock
lore sync --no-embed         # Skip embedding step
lore sync --no-docs          # Skip document regeneration

`lore ingest`

Sync data from GitLab to local database. Runs only the ingestion step (no doc generation or embeddings).

lore ingest                                    # Ingest everything (issues + MRs)
lore ingest issues                             # Issues only
lore ingest mrs                                # MRs only
lore ingest issues -p group/repo              # Single project
lore ingest --force                            # Override stale lock
lore ingest --full                             # Full re-sync (reset cursors)

The --full flag resets sync cursors and discussion watermarks, then fetches all data from scratch. Useful when:

Assignee data or other fields were missing from earlier syncs
You want to ensure complete data after schema changes
You are troubleshooting sync issues

`lore generate-docs`

Extract searchable documents from ingested issues, MRs, and discussions for the FTS5 index.

lore generate-docs                    # Incremental (dirty items only)
lore generate-docs --full             # Full rebuild
lore generate-docs -p group/repo     # Single project

`lore embed`

Generate vector embeddings for documents via Ollama. Requires Ollama running with the configured embedding model.

lore embed                    # Embed new/changed documents
lore embed --retry-failed     # Retry previously failed embeddings

`lore count`

Count entities in local database.

lore count issues                     # Total issues
lore count mrs                        # Total MRs (with state breakdown)
lore count discussions                # Total discussions
lore count discussions --for issue   # Issue discussions only
lore count discussions --for mr      # MR discussions only
lore count notes                      # Total notes (system vs user breakdown)
lore count notes --for issue         # Issue notes only

`lore stats`

Show document and index statistics, with optional integrity checks.

lore stats                    # Document and index statistics
lore stats --check            # Run integrity checks
lore stats --check --repair   # Repair integrity issues

`lore status`

Show current sync state and watermarks.

lore status

Displays:

Last sync run details (status, timing)
Cursor positions per project and resource type (issues and MRs)
Data summary counts

`lore init`

Initialize configuration and database interactively.

lore init                    # Interactive setup
lore init --force            # Overwrite existing config
lore init --non-interactive  # Fail if prompts needed

`lore auth`

Verify GitLab authentication is working.

lore auth
# Authenticated as @username (Full Name)
# GitLab: https://gitlab.com

`lore doctor`

Check environment health and configuration.

lore doctor

Checks performed:

Config file existence and validity
Database existence and pragmas (WAL mode, foreign keys)
GitLab authentication
Project accessibility
Ollama connectivity (optional)

`lore migrate`

Run pending database migrations.

lore migrate

`lore version`

Show version information.

lore version

Robot Mode

Machine-readable JSON output for scripting and AI agent consumption.

Activation

# Global flag
lore --robot issues -n 5

# JSON shorthand (-J)
lore -J issues -n 5

# Environment variable
LORE_ROBOT=1 lore issues -n 5

# Auto-detection (when stdout is not a TTY)
lore issues -n 5 | jq .
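
The four activation paths above reduce to a single boolean decision. The sketch below is one way to express it; the function names are assumptions for illustration, and only the accepted values `true` and `1` for `LORE_ROBOT` come from the documentation.

```rust
use std::io::IsTerminal;

/// `LORE_ROBOT` enables robot mode when set to `true` or `1`.
pub fn env_enables_robot(value: Option<&str>) -> bool {
    matches!(value, Some("true") | Some("1"))
}

/// Robot mode is on if any activation path applies: a CLI flag
/// (--robot or -J), the environment variable, or a non-TTY stdout.
pub fn robot_mode(cli_flag: bool, env_value: Option<&str>, stdout_is_tty: bool) -> bool {
    cli_flag || env_enables_robot(env_value) || !stdout_is_tty
}

/// Convenience wrapper that reads the real process environment.
pub fn robot_mode_from_process(cli_flag: bool) -> bool {
    let env_value = std::env::var("LORE_ROBOT").ok();
    robot_mode(cli_flag, env_value.as_deref(), std::io::stdout().is_terminal())
}
```

Keeping the decision in a pure function (`robot_mode`) and the environment reads in a thin wrapper makes the rule easy to unit test.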

Response Format

All commands return consistent JSON:

{"ok": true, "data": {...}, "meta": {...}}

Errors return structured JSON to stderr:

{"error": {"code": "CONFIG_NOT_FOUND", "message": "...", "suggestion": "Run 'lore init'"}}

Exit Codes

Code	Meaning
0	Success
1	Internal error / health check failed / not implemented
2	Usage error (invalid flags or arguments)
3	Config invalid
4	Token not set
5	GitLab auth failed
6	Resource not found
7	Rate limited
8	Network error
9	Database locked
10	Database error
11	Migration failed
12	I/O error
13	Transform error
14	Ollama unavailable
15	Ollama model not found
16	Embedding failed
20	Config not found
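
For scripts or agents that wrap `lore`, the exit code table can be mirrored as a lookup. The sketch below copies the meanings from the table; the `is_retryable` classification (rate limited, network error, database locked) is an assumption of this example, not documented behavior.

```rust
/// Human-readable meaning for a documented `lore` exit code.
pub fn exit_code_meaning(code: i32) -> Option<&'static str> {
    Some(match code {
        0 => "Success",
        1 => "Internal error / health check failed / not implemented",
        2 => "Usage error (invalid flags or arguments)",
        3 => "Config invalid",
        4 => "Token not set",
        5 => "GitLab auth failed",
        6 => "Resource not found",
        7 => "Rate limited",
        8 => "Network error",
        9 => "Database locked",
        10 => "Database error",
        11 => "Migration failed",
        12 => "I/O error",
        13 => "Transform error",
        14 => "Ollama unavailable",
        15 => "Ollama model not found",
        16 => "Embedding failed",
        20 => "Config not found",
        // Codes 17-19 and anything else are not documented.
        _ => return None,
    })
}

/// Assumption: these failures are transient and worth retrying with backoff.
pub fn is_retryable(code: i32) -> bool {
    matches!(code, 7 | 8 | 9)
}
```

A wrapper might retry on `is_retryable` codes and surface the `suggestion` field from the JSON error for everything else.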

Configuration Precedence

Settings are resolved in this order (highest to lowest priority):

1. CLI flags (--robot, --config, --color)
2. Environment variables (LORE_ROBOT, GITLAB_TOKEN, LORE_CONFIG_PATH)
3. Config file (~/.config/lore/config.json)
4. Built-in defaults
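
This precedence chain maps naturally onto `Option` chaining. The helper below is a generic sketch (the name `resolve_setting` is hypothetical): each layer supplies `Some(value)` if it sets the option, and the first layer that does wins.

```rust
/// Resolve one setting by precedence: CLI flag, then environment variable,
/// then config file, then the built-in default.
pub fn resolve_setting<T>(cli: Option<T>, env: Option<T>, file: Option<T>, default: T) -> T {
    cli.or(env).or(file).unwrap_or(default)
}
```

For example, `resolve_setting(None, Some(true), Some(false), false)` yields `true`, because the environment layer outranks the config file.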

Global Options

lore -c /path/to/config.json <command>   # Use alternate config
lore --robot <command>                    # Machine-readable JSON
lore -J <command>                         # JSON shorthand

Shell Completions

Generate shell completions for tab-completion support:

# Bash (add to ~/.bashrc)
lore completions bash > ~/.local/share/bash-completion/completions/lore

# Zsh (add to ~/.zshrc: fpath=(~/.zfunc $fpath))
lore completions zsh > ~/.zfunc/_lore

# Fish
lore completions fish > ~/.config/fish/completions/lore.fish

# PowerShell (add to $PROFILE)
lore completions powershell >> $PROFILE

Database Schema

Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:

Table	Purpose
`projects`	Tracked GitLab projects with metadata
`issues`	Issue metadata (title, state, author, due date, milestone)
`merge_requests`	MR metadata (title, state, draft, branches, merge status)
`milestones`	Project milestones with state and due dates
`labels`	Project labels with colors
`issue_labels`	Many-to-many issue-label relationships
`issue_assignees`	Many-to-many issue-assignee relationships
`mr_labels`	Many-to-many MR-label relationships
`mr_assignees`	Many-to-many MR-assignee relationships
`mr_reviewers`	Many-to-many MR-reviewer relationships
`discussions`	Issue/MR discussion threads
`notes`	Individual notes within discussions (with system note flag and DiffNote position data)
`documents`	Extracted searchable text for FTS and embedding
`documents_fts`	FTS5 full-text search index
`embeddings`	Vector embeddings for semantic search
`sync_runs`	Audit trail of sync operations
`sync_cursors`	Cursor positions for incremental sync
`app_locks`	Crash-safe single-flight lock
`raw_payloads`	Compressed original API responses
`schema_version`	Migration version tracking

The database is stored at ~/.local/share/lore/lore.db by default (XDG compliant).
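
The default location follows the XDG Base Directory convention: use `$XDG_DATA_HOME` when it is set and non-empty, otherwise fall back to `~/.local/share`. A minimal sketch, assuming the home directory is passed in by the caller and using a hypothetical function name:

```rust
use std::path::PathBuf;

/// Compute the default database path under the XDG data directory.
pub fn default_db_path(xdg_data_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_data_home {
        // An empty XDG_DATA_HOME is treated the same as an unset one.
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".local").join("share"),
    };
    base.join("lore").join("lore.db")
}
```

An explicit `storage.dbPath` in the config file would take priority over this computed default.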

Development

# Run tests
cargo test

# Run with debug logging
RUST_LOG=lore=debug lore issues

# Run with trace logging
RUST_LOG=lore=trace lore ingest issues

# Check formatting
cargo fmt --check

# Lint
cargo clippy

Tech Stack

Rust (2024 edition)
SQLite via rusqlite (bundled) with FTS5 and sqlite-vec
Ollama for vector embeddings (nomic-embed-text)
clap for CLI parsing
reqwest for HTTP
tokio for async runtime
serde for serialization
tracing for logging
indicatif for progress bars

License

MIT