Taylor Eernisse 20edff4ab1 feat(documents): Add document generation pipeline with dirty tracking

Implements the documents module that transforms raw ingested entities
(issues, MRs, discussions) into searchable document blobs stored in
the documents table. This is the foundation for both FTS5 lexical
search and vector embedding.

Key components:

- documents::extractor: Renders entities into structured text documents.
  Issues include title, description, labels, milestone, assignees, and
  threaded discussion summaries. MRs additionally include source/target
  branches, reviewers, and approval status. Discussions are rendered
  with full note threading.

- documents::regenerator: Drains the dirty_queue table to regenerate
  only documents whose source entities changed since last sync. Supports
  full rebuild mode (seeds all entities into dirty queue first) and
  project-scoped regeneration.

- documents::truncation: Safety cap at 2MB per document to prevent
  pathological outliers from degrading FTS or embedding performance.

- ingestion::dirty_tracker: Marks entities as dirty inside the
  ingestion transaction so document regeneration stays consistent
  with data changes. Uses INSERT OR IGNORE to deduplicate.

- ingestion::discussion_queue: Queue-based discussion fetching that
  isolates individual discussion failures from the broader ingestion
  pipeline, preventing a single corrupt discussion from blocking
  an entire project sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-30 15:46:18 -05:00

.beads

Begin planning phase 3-5 implementation

2026-01-27 22:40:49 -05:00

docs

docs: Restructure checkpoint-3 PRD with gated milestones

2026-01-29 08:42:39 -05:00

migrations

feat(db): Add migrations for documents, FTS5, and embeddings

2026-01-30 15:45:41 -05:00

src

feat(documents): Add document generation pipeline with dirty tracking

2026-01-30 15:46:18 -05:00

tests

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

.gitignore

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

AGENTS.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

Cargo.lock

deps: Add rand crate for randomized backoff and jitter

2026-01-30 15:45:30 -05:00

Cargo.toml

deps: Add rand crate for randomized backoff and jitter

2026-01-30 15:45:30 -05:00

PRD.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

README.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

RUST_CLI_TOOLS_BEST_PRACTICES_GUIDE.md

Begin planning phase 3-5 implementation

2026-01-27 22:40:49 -05:00

SPEC-REVISIONS-2.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

SPEC-REVISIONS-3.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

SPEC-REVISIONS.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

SPEC.md

Update name to gitlore instead of gitlab-inbox

2026-01-28 15:49:14 -05:00

README.md

Gitlore

Local GitLab data management, with semantic search planned. Syncs issues, merge requests (MRs), discussions, and notes from GitLab into a local SQLite database for fast, offline-capable querying and filtering.

Features

Local-first: All data stored in SQLite for instant queries
Incremental sync: Cursor-based sync fetches only the changes made since the last sync
Full re-sync: Reset cursors and fetch all data from scratch when needed
Multi-project: Track issues and MRs across multiple GitLab projects
Rich filtering: Filter by state, author, assignee, labels, milestone, due date, draft status, reviewer, branches
Raw payload storage: Preserves original GitLab API responses for debugging
Discussion threading: Full support for issue and MR discussions including inline code review comments

Installation

cargo install --path .

Or build from source:

cargo build --release
./target/release/lore --help

Quick Start

# Initialize configuration (interactive)
lore init

# Verify authentication
lore auth-test

# Sync issues from GitLab
lore ingest --type issues

# Sync merge requests from GitLab
lore ingest --type mrs

# List recent issues
lore list issues --limit 10

# List open merge requests
lore list mrs --state opened

# Show issue details
lore show issue 123 --project group/repo

# Show MR details with discussions
lore show mr 456 --project group/repo

Configuration

Configuration is stored in ~/.config/lore/config.json (or $XDG_CONFIG_HOME/lore/config.json).

Example Configuration

{
  "gitlab": {
    "baseUrl": "https://gitlab.com",
    "tokenEnvVar": "GITLAB_TOKEN"
  },
  "projects": [
    { "path": "group/project" },
    { "path": "other-group/other-project" }
  ],
  "sync": {
    "backfillDays": 14,
    "staleLockMinutes": 10,
    "heartbeatIntervalSeconds": 30,
    "cursorRewindSeconds": 2,
    "primaryConcurrency": 4,
    "dependentConcurrency": 2
  },
  "storage": {
    "compressRawPayloads": true
  }
}

Configuration Options

Section	Field	Default	Description
`gitlab`	`baseUrl`	—	GitLab instance URL (required)
`gitlab`	`tokenEnvVar`	`GITLAB_TOKEN`	Environment variable containing API token
`projects`	`path`	—	Project path (e.g., `group/project`)
`sync`	`backfillDays`	`14`	Days to backfill on initial sync
`sync`	`staleLockMinutes`	`10`	Minutes before a sync lock is considered stale
`sync`	`heartbeatIntervalSeconds`	`30`	Seconds between lock heartbeat updates
`sync`	`cursorRewindSeconds`	`2`	Seconds to rewind cursor for overlap safety
`sync`	`primaryConcurrency`	`4`	Concurrent GitLab requests for primary resources
`sync`	`dependentConcurrency`	`2`	Concurrent requests for dependent resources
`storage`	`dbPath`	`~/.local/share/lore/lore.db`	Database file path
`storage`	`backupDir`	`~/.local/share/lore/backups`	Backup directory
`storage`	`compressRawPayloads`	`true`	Compress stored API responses with gzip
`embedding`	`provider`	`ollama`	Embedding provider
`embedding`	`model`	`nomic-embed-text`	Model name for embeddings
`embedding`	`baseUrl`	`http://localhost:11434`	Ollama server URL
`embedding`	`concurrency`	`4`	Concurrent embedding requests
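
To make the interaction between `backfillDays` and `cursorRewindSeconds` concrete, here is a minimal sketch of how the lower bound of a fetch window could be computed. The function name, its parameters, and the use of Unix seconds are assumptions for illustration; Gitlore's actual implementation may differ.

```rust
/// Hypothetical helper (not part of Gitlore's documented API): compute the
/// lower bound, in Unix seconds, for the next fetch of updated entities.
///
/// `cursor_secs` is the last stored cursor, or `None` before the first sync.
fn next_fetch_start(
    cursor_secs: Option<u64>,
    now_secs: u64,
    backfill_days: u64,
    rewind_secs: u64,
) -> u64 {
    match cursor_secs {
        // Incremental sync: step back a little so that updates landing
        // right at the cursor boundary are fetched again rather than missed.
        Some(cursor) => cursor.saturating_sub(rewind_secs),
        // Initial sync: reach back `backfill_days` from the current time.
        None => now_secs.saturating_sub(backfill_days * 86_400),
    }
}
```

Because the rewind deliberately re-fetches a small overlap, a design like this relies on writes being idempotent (for example, upserts keyed by entity ID).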

Config File Resolution

The config file is resolved in this order:

1. --config CLI flag
2. LORE_CONFIG_PATH environment variable
3. ~/.config/lore/config.json (XDG default)
4. ./lore.config.json (local fallback for development)
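
The resolution order above can be sketched as a small function. This is an illustration only: the function name and signature are hypothetical, and it assumes that explicit paths (flag and environment variable) are returned without an existence check while the two default locations are used only if the file exists. The `exists` callback is injected so the logic can be exercised without touching the filesystem.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical sketch of the documented lookup order.
fn resolve_config_path(
    cli_flag: Option<&str>,        // value of --config, if given
    env_override: Option<&str>,    // value of LORE_CONFIG_PATH, if set
    xdg_config_home: Option<&str>, // value of XDG_CONFIG_HOME, if set
    home: &str,                    // the user's home directory
    exists: impl Fn(&Path) -> bool,
) -> Option<PathBuf> {
    // 1. An explicit --config flag always wins.
    if let Some(p) = cli_flag {
        return Some(PathBuf::from(p));
    }
    // 2. Then the LORE_CONFIG_PATH environment variable.
    if let Some(p) = env_override {
        return Some(PathBuf::from(p));
    }
    // 3. Then the XDG default location.
    let xdg = xdg_config_home
        .map(PathBuf::from)
        .unwrap_or_else(|| Path::new(home).join(".config"))
        .join("lore")
        .join("config.json");
    if exists(xdg.as_path()) {
        return Some(xdg);
    }
    // 4. Finally the local development fallback.
    let local = PathBuf::from("lore.config.json");
    if exists(local.as_path()) {
        return Some(local);
    }
    None
}
```

In real use the callback would typically be `|p| p.exists()`, and the optional arguments would come from the parsed CLI flags and `std::env::var`.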

GitLab Token

Create a personal access token with read_api scope:

Go to GitLab → Settings → Access Tokens
Create token with read_api scope
Export it: export GITLAB_TOKEN=glpat-xxxxxxxxxxxx

Environment Variables

Variable	Purpose	Required
`GITLAB_TOKEN`	GitLab API authentication token (name configurable via `gitlab.tokenEnvVar`)	Yes
`LORE_CONFIG_PATH`	Override config file location	No
`XDG_CONFIG_HOME`	XDG Base Directory for config (fallback: `~/.config`)	No
`XDG_DATA_HOME`	XDG Base Directory for data (fallback: `~/.local/share`)	No
`RUST_LOG`	Logging level filter (e.g., `lore=debug`)	No
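
As an illustration of the XDG fallbacks in the table, the sketch below derives the default database path from `XDG_DATA_HOME`. The function names are hypothetical; treating an empty variable the same as an unset one follows the XDG Base Directory convention, but whether Gitlore does so is an assumption.

```rust
use std::path::PathBuf;

/// Hypothetical helper: pick an XDG base directory, falling back to
/// `<home>/<fallback>` when the variable is unset or empty.
fn xdg_dir(var_value: Option<&str>, home: &str, fallback: &str) -> PathBuf {
    match var_value {
        Some(v) if !v.is_empty() => PathBuf::from(v),
        _ => PathBuf::from(home).join(fallback),
    }
}

/// Default database location: `<data dir>/lore/lore.db`.
fn default_db_path(xdg_data_home: Option<&str>, home: &str) -> PathBuf {
    xdg_dir(xdg_data_home, home, ".local/share")
        .join("lore")
        .join("lore.db")
}
```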

Commands

`lore init`

Initialize configuration and database interactively.

lore init                    # Interactive setup
lore init --force            # Overwrite existing config
lore init --non-interactive  # Fail if prompts needed

`lore auth-test`

Verify GitLab authentication is working.

lore auth-test
# Authenticated as @username (Full Name)
# GitLab: https://gitlab.com

`lore doctor`

Check environment health and configuration.

lore doctor          # Human-readable output
lore doctor --json   # JSON output for scripting

Checks performed:

Config file existence and validity
Database existence and pragmas (WAL mode, foreign keys)
GitLab authentication
Project accessibility
Ollama connectivity (optional)

`lore ingest`

Sync data from GitLab to local database.

# Issues
lore ingest --type issues                       # Sync all projects
lore ingest --type issues --project group/repo  # Single project
lore ingest --type issues --force               # Override stale lock
lore ingest --type issues --full                # Full re-sync (reset cursors)

# Merge Requests
lore ingest --type mrs                          # Sync all projects
lore ingest --type mrs --project group/repo     # Single project
lore ingest --type mrs --full                   # Full re-sync (reset cursors)

The --full flag resets sync cursors and discussion watermarks, then fetches all data from scratch. This is useful when:

Assignee data or other fields were missing from earlier syncs
You want to ensure complete data after schema changes
You are troubleshooting sync issues

`lore list issues`

Query issues from local database.

lore list issues                              # Recent issues (default 50)
lore list issues --limit 100                  # More results
lore list issues --state opened               # Only open issues
lore list issues --state closed               # Only closed issues
lore list issues --author username            # By author (@ prefix optional)
lore list issues --assignee username          # By assignee (@ prefix optional)
lore list issues --label bug                  # By label (AND logic)
lore list issues --label bug --label urgent   # Multiple labels
lore list issues --milestone "v1.0"           # By milestone title
lore list issues --since 7d                   # Updated in last 7 days
lore list issues --since 2w                   # Updated in last 2 weeks
lore list issues --since 2024-01-01           # Updated since date
lore list issues --due-before 2024-12-31      # Due before date
lore list issues --has-due-date               # Only issues with due dates
lore list issues --project group/repo         # Filter by project
lore list issues --sort created --order asc   # Sort options
lore list issues --open                       # Open first result in browser
lore list issues --json                       # JSON output

Output includes: IID, title, state, author, assignee, labels, and update time.
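
The `--since` examples above accept both relative values (`7d`, `2w`) and absolute dates (`2024-01-01`). A minimal parsing sketch follows. The `Since` type and `parse_since` function are hypothetical, and only the `d` and `w` units shown in the examples are handled; the real CLI may accept more forms.

```rust
/// Hypothetical parsed form of a `--since` value.
#[derive(Debug, PartialEq)]
enum Since {
    /// Relative window, in seconds before "now".
    Relative(u64),
    /// Absolute calendar date as (year, month, day).
    Date(u16, u8, u8),
}

fn parse_since(input: &str) -> Option<Since> {
    // Relative form: an integer followed by a one-letter unit.
    if let Some(unit) = input.chars().last() {
        let secs_per_unit: Option<u64> = match unit {
            'd' => Some(86_400),
            'w' => Some(604_800),
            _ => None,
        };
        if let Some(secs) = secs_per_unit {
            // The unit is a single ASCII byte, so this slice is safe.
            let n: u64 = input[..input.len() - 1].parse().ok()?;
            return Some(Since::Relative(n.checked_mul(secs)?));
        }
    }
    // Absolute form: YYYY-MM-DD, with only basic range checks.
    let mut parts = input.split('-');
    let year: u16 = parts.next()?.parse().ok()?;
    let month: u8 = parts.next()?.parse().ok()?;
    let day: u8 = parts.next()?.parse().ok()?;
    if parts.next().is_some() || !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(Since::Date(year, month, day))
}
```

The date branch does not validate days per month or leap years; a production parser would normally delegate that to a date library.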

`lore list mrs`

Query merge requests from local database.

lore list mrs                                 # Recent MRs (default 50)
lore list mrs --limit 100                     # More results
lore list mrs --state opened                  # Only open MRs
lore list mrs --state merged                  # Only merged MRs
lore list mrs --state closed                  # Only closed MRs
lore list mrs --state locked                  # Only locked MRs
lore list mrs --state all                     # All states
lore list mrs --author username               # By author (@ prefix optional)
lore list mrs --assignee username             # By assignee (@ prefix optional)
lore list mrs --reviewer username             # By reviewer (@ prefix optional)
lore list mrs --draft                         # Only draft/WIP MRs
lore list mrs --no-draft                      # Exclude draft MRs
lore list mrs --target-branch main            # By target branch
lore list mrs --source-branch feature/foo     # By source branch
lore list mrs --label needs-review            # By label (AND logic)
lore list mrs --since 7d                      # Updated in last 7 days
lore list mrs --project group/repo            # Filter by project
lore list mrs --sort created --order asc      # Sort options
lore list mrs --open                          # Open first result in browser
lore list mrs --json                          # JSON output

Output includes: IID, title (with [DRAFT] prefix if applicable), state, author, assignee, labels, and update time.
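
The "AND logic" noted for `--label` means an entity matches only if it carries every requested label. As a sketch of that rule (the function name is hypothetical, and exact, case-sensitive matching is an assumption):

```rust
/// Hypothetical predicate: true when `entity_labels` contains every label
/// in `required`. An empty `required` list matches everything.
fn has_all_labels(entity_labels: &[&str], required: &[&str]) -> bool {
    required
        .iter()
        .all(|wanted| entity_labels.contains(wanted))
}
```

With `--label bug --label urgent`, an item labeled only `bug` would be excluded under this rule, while one labeled `bug`, `urgent`, and `backend` would be included.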

`lore show issue`

Display detailed issue information.

lore show issue 123                      # Show issue #123
lore show issue 123 --project group/repo # Disambiguate if needed

Shows: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions.

`lore show mr`

Display detailed merge request information.

lore show mr 456                         # Show MR !456
lore show mr 456 --project group/repo    # Disambiguate if needed

Shows: title, description, state, draft status, author, assignees, reviewers, labels, source/target branches, merge status, web URL, and threaded discussions. Inline code review comments (DiffNotes) display file context in the format [src/file.ts:45].
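
The file-context format for DiffNotes can be illustrated with a one-function sketch. The function name is hypothetical, and the behavior when no line number is available is an assumption, since only the `[path:line]` form is documented.

```rust
/// Hypothetical formatter for the DiffNote file context shown by `lore show mr`.
fn diff_note_context(path: &str, line: Option<u32>) -> String {
    match line {
        // Documented form: [path:line]
        Some(n) => format!("[{path}:{n}]"),
        // Assumed fallback when the note has no line position.
        None => format!("[{path}]"),
    }
}
```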

`lore count`

Count entities in local database.

lore count issues                    # Total issues
lore count mrs                       # Total MRs (with state breakdown)
lore count discussions               # Total discussions
lore count discussions --type issue  # Issue discussions only
lore count discussions --type mr     # MR discussions only
lore count notes                     # Total notes (shows system vs user breakdown)

`lore sync-status`

Show current sync state and watermarks.

lore sync-status

Displays:

Last sync run details (status, timing)
Cursor positions per project and resource type (issues and MRs)
Data summary counts

`lore migrate`

Run pending database migrations.

lore migrate

Shows current schema version and applies any pending migrations.

`lore version`

Show version information.

lore version

`lore backup`

Create timestamped database backup.

lore backup

Note: Not yet implemented.

`lore reset`

Delete database and reset all state.

lore reset --confirm

Note: Not yet implemented.

Database Schema

Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:

Table	Purpose
`projects`	Tracked GitLab projects with metadata
`issues`	Issue metadata (title, state, author, due date, milestone)
`merge_requests`	MR metadata (title, state, draft, branches, merge status)
`milestones`	Project milestones with state and due dates
`labels`	Project labels with colors
`issue_labels`	Many-to-many issue-label relationships
`issue_assignees`	Many-to-many issue-assignee relationships
`mr_labels`	Many-to-many MR-label relationships
`mr_assignees`	Many-to-many MR-assignee relationships
`mr_reviewers`	Many-to-many MR-reviewer relationships
`discussions`	Issue/MR discussion threads
`notes`	Individual notes within discussions (with system note flag and DiffNote position data)
`sync_runs`	Audit trail of sync operations
`sync_cursors`	Cursor positions for incremental sync
`app_locks`	Crash-safe single-flight lock
`raw_payloads`	Compressed original API responses
`schema_version`	Migration version tracking

The database is stored at ~/.local/share/lore/lore.db by default (XDG compliant).
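
The `app_locks` table, together with the `sync.staleLockMinutes` and `sync.heartbeatIntervalSeconds` options, suggests a heartbeat-based lock: the holder refreshes a timestamp periodically, and a lock whose heartbeat is too old can be treated as abandoned. The check below is a sketch under that assumption; the function name and the use of Unix seconds are hypothetical.

```rust
/// Hypothetical staleness check for a heartbeat-based lock.
/// Returns true when the last heartbeat is older than the stale threshold.
fn lock_is_stale(last_heartbeat_secs: u64, now_secs: u64, stale_lock_minutes: u64) -> bool {
    // saturating_sub guards against clock skew making the age negative.
    let age_secs = now_secs.saturating_sub(last_heartbeat_secs);
    age_secs > stale_lock_minutes * 60
}
```

With the defaults (a heartbeat every 30 seconds, stale after 10 minutes), a live holder would have to miss roughly twenty consecutive heartbeats before its lock looked stale, which leaves room for transient stalls.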

Global Options

lore --config /path/to/config.json <command>  # Use alternate config

Development

# Run tests
cargo test

# Run with debug logging
RUST_LOG=lore=debug lore list issues

# Run with trace logging
RUST_LOG=lore=trace lore ingest --type issues

# Check formatting
cargo fmt --check

# Lint
cargo clippy

Tech Stack

Rust (2024 edition)
SQLite via rusqlite (bundled)
clap for CLI parsing
reqwest for HTTP
tokio for async runtime
serde for serialization
tracing for logging
indicatif for progress bars

Current Status

This is Checkpoint 2 (CP2) of the Gitlore project. Currently implemented:

Issue ingestion with cursor-based incremental sync
Merge request ingestion with cursor-based incremental sync
Discussion and note syncing for issues and MRs
DiffNote support for inline code review comments
Rich filtering and querying for both issues and MRs
Full re-sync capability with watermark reset

Not yet implemented:

Semantic search with embeddings (CP3+)
Backup and reset commands

See SPEC.md for the full project roadmap and architecture.

License

MIT