docs: Update documentation for search pipeline and Phase A spec

- README.md: Add hybrid search and robot mode to feature list. Update
  quick start to use new noun-first CLI syntax (lore issues, lore mrs,
  lore search). Add embedding configuration section. Update command
  examples throughout.

- AGENTS.md: Update robot mode examples to new CLI syntax. Add search,
  sync, stats, and generate-docs commands to the robot mode reference.
  Update flag conventions (-n for limit, -s for state, -J for JSON).

- docs/prd/checkpoint-3.md: Major expansion with gated milestone
  structure (Gate A: lexical, Gate B: hybrid, Gate C: sync). Add
  prerequisite rename note, code sample conventions, chunking strategy
  details, and sqlite-vec rowid encoding scheme. Clarify that Gate A
  requires only SQLite + FTS5 with no sqlite-vec dependency.

- docs/phase-a-spec.md: New detailed specification for Gate A (lexical
  search MVP) covering document schema, FTS5 configuration, dirty
  queue mechanics, CLI interface, and acceptance criteria.

- docs/api-efficiency-findings.md: Analysis of GitLab API pagination
  behavior and efficiency observations from production sync runs.
  Documents the missing x-next-page header issue and heuristic fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Taylor Eernisse
2026-01-30 15:47:33 -05:00
parent d235f2b4dd
commit 9b63671df9
5 changed files with 1902 additions and 377 deletions


@@ -33,38 +33,50 @@ The `lore` CLI has a robot mode optimized for AI agent consumption with structur
```bash
# Explicit flag
lore --robot list issues
lore --robot issues -n 10
# JSON shorthand (-J)
lore -J issues -n 10
# Auto-detection (when stdout is not a TTY)
lore list issues | jq .
lore issues | jq .
# Environment variable
LORE_ROBOT=true lore list issues
LORE_ROBOT=1 lore issues
```
### Robot Mode Commands
```bash
# List issues/MRs with JSON output
lore --robot list issues --limit=10
lore --robot list mrs --state=opened
lore --robot issues -n 10
lore --robot mrs -s opened
# Show detailed entity info
lore --robot issues 123
lore --robot mrs 456 -p group/repo
# Count entities
lore --robot count issues
lore --robot count discussions --type=mr
lore --robot count discussions --for mr
# Show detailed entity info
lore --robot show issue 123
lore --robot show mr 456 --project=group/repo
# Search indexed documents
lore --robot search "authentication bug"
# Check sync status
lore --robot sync-status
lore --robot status
# Run ingestion (quiet, JSON summary)
lore --robot ingest --type=issues
# Run full sync pipeline
lore --robot sync
# Run ingestion only
lore --robot ingest issues
# Check environment health
lore --robot doctor
# Document and index statistics
lore --robot stats
```
### Response Format
@@ -102,8 +114,8 @@ Errors return structured JSON to stderr:
### Best Practices
- Use `lore --robot` for all agent interactions
- Use `lore --robot` or `lore -J` for all agent interactions
- Check exit codes for error handling
- Parse JSON errors from stderr
- Use `--limit` to control response size
- Use `-n` / `--limit` to control response size
- TTY detection handles piped commands automatically

README.md

@@ -1,6 +1,6 @@
# Gitlore
Local GitLab data management with semantic search. Syncs issues, MRs, discussions, and notes from GitLab to a local SQLite database for fast, offline-capable querying and filtering.
Local GitLab data management with semantic search. Syncs issues, MRs, discussions, and notes from GitLab to a local SQLite database for fast, offline-capable querying, filtering, and hybrid search.
## Features
@@ -9,8 +9,10 @@ Local GitLab data management with semantic search. Syncs issues, MRs, discussion
- **Full re-sync**: Reset cursors and fetch all data from scratch when needed
- **Multi-project**: Track issues and MRs across multiple GitLab projects
- **Rich filtering**: Filter by state, author, assignee, labels, milestone, due date, draft status, reviewer, branches
- **Hybrid search**: Combines FTS5 lexical search with Ollama-powered vector embeddings via Reciprocal Rank Fusion
- **Raw payload storage**: Preserves original GitLab API responses for debugging
- **Discussion threading**: Full support for issue and MR discussions including inline code review comments
- **Robot mode**: Machine-readable JSON output with structured errors and meaningful exit codes
## Installation
@@ -32,25 +34,28 @@ cargo build --release
lore init
# Verify authentication
lore auth-test
lore auth
# Sync issues from GitLab
lore ingest --type issues
# Sync merge requests from GitLab
lore ingest --type mrs
# Sync everything from GitLab (issues + MRs + docs + embeddings)
lore sync
# List recent issues
lore list issues --limit 10
lore issues -n 10
# List open merge requests
lore list mrs --state opened
lore mrs -s opened
# Show issue details
lore show issue 123 --project group/repo
lore issues 123
# Show MR details with discussions
lore show mr 456 --project group/repo
lore mrs 456
# Search across all indexed data
lore search "authentication bug"
# Robot mode (machine-readable JSON)
lore -J issues -n 5 | jq .
```
## Configuration
@@ -79,6 +84,12 @@ Configuration is stored in `~/.config/lore/config.json` (or `$XDG_CONFIG_HOME/lo
},
"storage": {
"compressRawPayloads": true
},
"embedding": {
"provider": "ollama",
"model": "nomic-embed-text",
"baseUrl": "http://localhost:11434",
"concurrency": 4
}
}
```
@@ -87,9 +98,9 @@ Configuration is stored in `~/.config/lore/config.json` (or `$XDG_CONFIG_HOME/lo
| Section | Field | Default | Description |
|---------|-------|---------|-------------|
| `gitlab` | `baseUrl` | | GitLab instance URL (required) |
| `gitlab` | `baseUrl` | -- | GitLab instance URL (required) |
| `gitlab` | `tokenEnvVar` | `GITLAB_TOKEN` | Environment variable containing API token |
| `projects` | `path` | | Project path (e.g., `group/project`) |
| `projects` | `path` | -- | Project path (e.g., `group/project`) |
| `sync` | `backfillDays` | `14` | Days to backfill on initial sync |
| `sync` | `staleLockMinutes` | `10` | Minutes before sync lock considered stale |
| `sync` | `heartbeatIntervalSeconds` | `30` | Frequency of lock heartbeat updates |
@@ -107,7 +118,7 @@ Configuration is stored in `~/.config/lore/config.json` (or `$XDG_CONFIG_HOME/lo
### Config File Resolution
The config file is resolved in this order:
1. `--config` CLI flag
1. `--config` / `-c` CLI flag
2. `LORE_CONFIG_PATH` environment variable
3. `~/.config/lore/config.json` (XDG default)
4. `./lore.config.json` (local fallback for development)
@@ -116,7 +127,7 @@ The config file is resolved in this order:
Create a personal access token with `read_api` scope:
1. Go to GitLab Settings Access Tokens
1. Go to GitLab > Settings > Access Tokens
2. Create token with `read_api` scope
3. Export it: `export GITLAB_TOKEN=glpat-xxxxxxxxxxxx`
@@ -126,12 +137,185 @@ Create a personal access token with `read_api` scope:
|----------|---------|----------|
| `GITLAB_TOKEN` | GitLab API authentication token (name configurable via `gitlab.tokenEnvVar`) | Yes |
| `LORE_CONFIG_PATH` | Override config file location | No |
| `LORE_ROBOT` | Enable robot mode globally (set to `true` or `1`) | No |
| `XDG_CONFIG_HOME` | XDG Base Directory for config (fallback: `~/.config`) | No |
| `XDG_DATA_HOME` | XDG Base Directory for data (fallback: `~/.local/share`) | No |
| `RUST_LOG` | Logging level filter (e.g., `lore=debug`) | No |
## Commands
### `lore issues`
Query issues from local database, or show a specific issue.
```bash
lore issues # Recent issues (default 50)
lore issues 123 # Show issue #123 with discussions
lore issues 123 -p group/repo # Disambiguate by project
lore issues -n 100 # More results
lore issues -s opened # Only open issues
lore issues -s closed # Only closed issues
lore issues -a username # By author (@ prefix optional)
lore issues -A username # By assignee (@ prefix optional)
lore issues -l bug # By label (AND logic)
lore issues -l bug -l urgent # Multiple labels
lore issues -m "v1.0" # By milestone title
lore issues --since 7d # Updated in last 7 days
lore issues --since 2w # Updated in last 2 weeks
lore issues --since 2024-01-01 # Updated since date
lore issues --due-before 2024-12-31 # Due before date
lore issues --has-due # Only issues with due dates
lore issues -p group/repo # Filter by project
lore issues --sort created --asc # Sort by created date, ascending
lore issues -o # Open first result in browser
```
When listing, output includes: IID, title, state, author, assignee, labels, and update time.
When showing a single issue (e.g., `lore issues 123`), output includes: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions.
### `lore mrs`
Query merge requests from local database, or show a specific MR.
```bash
lore mrs # Recent MRs (default 50)
lore mrs 456 # Show MR !456 with discussions
lore mrs 456 -p group/repo # Disambiguate by project
lore mrs -n 100 # More results
lore mrs -s opened # Only open MRs
lore mrs -s merged # Only merged MRs
lore mrs -s closed # Only closed MRs
lore mrs -s locked # Only locked MRs
lore mrs -s all # All states
lore mrs -a username # By author (@ prefix optional)
lore mrs -A username # By assignee (@ prefix optional)
lore mrs -r username # By reviewer (@ prefix optional)
lore mrs -d # Only draft/WIP MRs
lore mrs -D # Exclude draft MRs
lore mrs --target main # By target branch
lore mrs --source feature/foo # By source branch
lore mrs -l needs-review # By label (AND logic)
lore mrs --since 7d # Updated in last 7 days
lore mrs -p group/repo # Filter by project
lore mrs --sort created --asc # Sort by created date, ascending
lore mrs -o # Open first result in browser
```
When listing, output includes: IID, title (with [DRAFT] prefix if applicable), state, author, assignee, labels, and update time.
When showing a single MR (e.g., `lore mrs 456`), output includes: title, description, state, draft status, author, assignees, reviewers, labels, source/target branches, merge status, web URL, and threaded discussions. Inline code review comments (DiffNotes) display file context in the format `[src/file.ts:45]`.
### `lore search`
Search across indexed documents using hybrid (lexical + semantic), lexical-only, or semantic-only modes.
```bash
lore search "authentication bug" # Hybrid search (default)
lore search "login flow" --mode lexical # FTS5 lexical only
lore search "login flow" --mode semantic # Vector similarity only
lore search "auth" --type issue # Filter by source type
lore search "auth" --type mr # MR documents only
lore search "auth" --type discussion # Discussion documents only
lore search "deploy" --author username # Filter by author
lore search "deploy" -p group/repo # Filter by project
lore search "deploy" --label backend # Filter by label (AND logic)
lore search "deploy" --path src/ # Filter by file path (trailing / for prefix)
lore search "deploy" --after 7d # Created after (7d, 2w, or YYYY-MM-DD)
lore search "deploy" --updated-after 2w # Updated after
lore search "deploy" -n 50 # Limit results (default 20, max 100)
lore search "deploy" --explain # Show ranking explanation per result
lore search "deploy" --fts-mode raw # Raw FTS5 query syntax (advanced)
```
Requires `lore generate-docs` (or `lore sync`) to have been run at least once. Semantic mode requires Ollama with the configured embedding model.
### `lore sync`
Run the full sync pipeline: ingest from GitLab, generate searchable documents, and compute embeddings.
```bash
lore sync # Full pipeline
lore sync --full # Reset cursors, fetch everything
lore sync --force # Override stale lock
lore sync --no-embed # Skip embedding step
lore sync --no-docs # Skip document regeneration
```
### `lore ingest`
Sync data from GitLab to local database. Runs only the ingestion step (no doc generation or embeddings).
```bash
lore ingest # Ingest everything (issues + MRs)
lore ingest issues # Issues only
lore ingest mrs # MRs only
lore ingest issues -p group/repo # Single project
lore ingest --force # Override stale lock
lore ingest --full # Full re-sync (reset cursors)
```
The `--full` flag resets sync cursors and discussion watermarks, then fetches all data from scratch. Useful when:
- Assignee data or other fields were missing from earlier syncs
- You want to ensure complete data after schema changes
- Troubleshooting sync issues
### `lore generate-docs`
Extract searchable documents from ingested issues, MRs, and discussions for the FTS5 index.
```bash
lore generate-docs # Incremental (dirty items only)
lore generate-docs --full # Full rebuild
lore generate-docs -p group/repo # Single project
```
### `lore embed`
Generate vector embeddings for documents via Ollama. Requires Ollama running with the configured embedding model.
```bash
lore embed # Embed new/changed documents
lore embed --retry-failed # Retry previously failed embeddings
```
### `lore count`
Count entities in local database.
```bash
lore count issues # Total issues
lore count mrs # Total MRs (with state breakdown)
lore count discussions # Total discussions
lore count discussions --for issue # Issue discussions only
lore count discussions --for mr # MR discussions only
lore count notes # Total notes (system vs user breakdown)
lore count notes --for issue # Issue notes only
```
### `lore stats`
Show document and index statistics, with optional integrity checks.
```bash
lore stats # Document and index statistics
lore stats --check # Run integrity checks
lore stats --check --repair # Repair integrity issues
```
### `lore status`
Show current sync state and watermarks.
```bash
lore status
```
Displays:
- Last sync run details (status, timing)
- Cursor positions per project and resource type (issues and MRs)
- Data summary counts
### `lore init`
Initialize configuration and database interactively.
@@ -142,12 +326,12 @@ lore init --force # Overwrite existing config
lore init --non-interactive # Fail if prompts needed
```
### `lore auth-test`
### `lore auth`
Verify GitLab authentication is working.
```bash
lore auth-test
lore auth
# Authenticated as @username (Full Name)
# GitLab: https://gitlab.com
```
@@ -157,8 +341,7 @@ lore auth-test
Check environment health and configuration.
```bash
lore doctor # Human-readable output
lore doctor --json # JSON output for scripting
lore doctor
```
Checks performed:
@@ -168,132 +351,6 @@ Checks performed:
- Project accessibility
- Ollama connectivity (optional)
### `lore ingest`
Sync data from GitLab to local database.
```bash
# Issues
lore ingest --type issues # Sync all projects
lore ingest --type issues --project group/repo # Single project
lore ingest --type issues --force # Override stale lock
lore ingest --type issues --full # Full re-sync (reset cursors)
# Merge Requests
lore ingest --type mrs # Sync all projects
lore ingest --type mrs --project group/repo # Single project
lore ingest --type mrs --full # Full re-sync (reset cursors)
```
The `--full` flag resets sync cursors and discussion watermarks, then fetches all data from scratch. Useful when:
- Assignee data or other fields were missing from earlier syncs
- You want to ensure complete data after schema changes
- Troubleshooting sync issues
### `lore list issues`
Query issues from local database.
```bash
lore list issues # Recent issues (default 50)
lore list issues --limit 100 # More results
lore list issues --state opened # Only open issues
lore list issues --state closed # Only closed issues
lore list issues --author username # By author (@ prefix optional)
lore list issues --assignee username # By assignee (@ prefix optional)
lore list issues --label bug # By label (AND logic)
lore list issues --label bug --label urgent # Multiple labels
lore list issues --milestone "v1.0" # By milestone title
lore list issues --since 7d # Updated in last 7 days
lore list issues --since 2w # Updated in last 2 weeks
lore list issues --since 2024-01-01 # Updated since date
lore list issues --due-before 2024-12-31 # Due before date
lore list issues --has-due-date # Only issues with due dates
lore list issues --project group/repo # Filter by project
lore list issues --sort created --order asc # Sort options
lore list issues --open # Open first result in browser
lore list issues --json # JSON output
```
Output includes: IID, title, state, author, assignee, labels, and update time.
### `lore list mrs`
Query merge requests from local database.
```bash
lore list mrs # Recent MRs (default 50)
lore list mrs --limit 100 # More results
lore list mrs --state opened # Only open MRs
lore list mrs --state merged # Only merged MRs
lore list mrs --state closed # Only closed MRs
lore list mrs --state locked # Only locked MRs
lore list mrs --state all # All states
lore list mrs --author username # By author (@ prefix optional)
lore list mrs --assignee username # By assignee (@ prefix optional)
lore list mrs --reviewer username # By reviewer (@ prefix optional)
lore list mrs --draft # Only draft/WIP MRs
lore list mrs --no-draft # Exclude draft MRs
lore list mrs --target-branch main # By target branch
lore list mrs --source-branch feature/foo # By source branch
lore list mrs --label needs-review # By label (AND logic)
lore list mrs --since 7d # Updated in last 7 days
lore list mrs --project group/repo # Filter by project
lore list mrs --sort created --order asc # Sort options
lore list mrs --open # Open first result in browser
lore list mrs --json # JSON output
```
Output includes: IID, title (with [DRAFT] prefix if applicable), state, author, assignee, labels, and update time.
### `lore show issue`
Display detailed issue information.
```bash
lore show issue 123 # Show issue #123
lore show issue 123 --project group/repo # Disambiguate if needed
```
Shows: title, description, state, author, assignees, labels, milestone, due date, web URL, and threaded discussions.
### `lore show mr`
Display detailed merge request information.
```bash
lore show mr 456 # Show MR !456
lore show mr 456 --project group/repo # Disambiguate if needed
```
Shows: title, description, state, draft status, author, assignees, reviewers, labels, source/target branches, merge status, web URL, and threaded discussions. Inline code review comments (DiffNotes) display file context in the format `[src/file.ts:45]`.
### `lore count`
Count entities in local database.
```bash
lore count issues # Total issues
lore count mrs # Total MRs (with state breakdown)
lore count discussions # Total discussions
lore count discussions --type issue # Issue discussions only
lore count discussions --type mr # MR discussions only
lore count notes # Total notes (shows system vs user breakdown)
```
### `lore sync-status`
Show current sync state and watermarks.
```bash
lore sync-status
```
Displays:
- Last sync run details (status, timing)
- Cursor positions per project and resource type (issues and MRs)
- Data summary counts
### `lore migrate`
Run pending database migrations.
@@ -302,8 +359,6 @@ Run pending database migrations.
lore migrate
```
Shows current schema version and applies any pending migrations.
### `lore version`
Show version information.
@@ -312,26 +367,67 @@ Show version information.
lore version
```
### `lore backup`
## Robot Mode
Create timestamped database backup.
Machine-readable JSON output for scripting and AI agent consumption.
### Activation
```bash
lore backup
# Global flag
lore --robot issues -n 5
# JSON shorthand (-J)
lore -J issues -n 5
# Environment variable
LORE_ROBOT=1 lore issues -n 5
# Auto-detection (when stdout is not a TTY)
lore issues -n 5 | jq .
```
*Note: Not yet implemented.*
### Response Format
### `lore reset`
All commands return consistent JSON:
Delete database and reset all state.
```json
{"ok": true, "data": {...}, "meta": {...}}
```
Errors return structured JSON to stderr:
```json
{"error": {"code": "CONFIG_NOT_FOUND", "message": "...", "suggestion": "Run 'lore init'"}}
```
### Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Internal error |
| 2 | Config not found |
| 3 | Config invalid |
| 4 | Token not set |
| 5 | GitLab auth failed |
| 6 | Resource not found |
| 7 | Rate limited |
| 8 | Network error |
| 9 | Database locked |
| 10 | Database error |
| 11 | Migration failed |
| 12 | I/O error |
| 13 | Transform error |
## Global Options
```bash
lore reset --confirm
lore -c /path/to/config.json <command> # Use alternate config
lore --robot <command> # Machine-readable JSON
lore -J <command> # JSON shorthand
```
*Note: Not yet implemented.*
## Database Schema
Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:
@@ -350,6 +446,9 @@ Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:
| `mr_reviewers` | Many-to-many MR-reviewer relationships |
| `discussions` | Issue/MR discussion threads |
| `notes` | Individual notes within discussions (with system note flag and DiffNote position data) |
| `documents` | Extracted searchable text for FTS and embedding |
| `documents_fts` | FTS5 full-text search index |
| `embeddings` | Vector embeddings for semantic search |
| `sync_runs` | Audit trail of sync operations |
| `sync_cursors` | Cursor positions for incremental sync |
| `app_locks` | Crash-safe single-flight lock |
@@ -358,12 +457,6 @@ Data is stored in SQLite with WAL mode and foreign keys enabled. Main tables:
The database is stored at `~/.local/share/lore/lore.db` by default (XDG compliant).
## Global Options
```bash
lore --config /path/to/config.json <command> # Use alternate config
```
## Development
```bash
@@ -371,10 +464,10 @@ lore --config /path/to/config.json <command> # Use alternate config
cargo test
# Run with debug logging
RUST_LOG=lore=debug lore list issues
RUST_LOG=lore=debug lore issues
# Run with trace logging
RUST_LOG=lore=trace lore ingest --type issues
RUST_LOG=lore=trace lore ingest issues
# Check formatting
cargo fmt --check
@@ -386,7 +479,8 @@ cargo clippy
## Tech Stack
- **Rust** (2024 edition)
- **SQLite** via rusqlite (bundled)
- **SQLite** via rusqlite (bundled) with FTS5 and sqlite-vec
- **Ollama** for vector embeddings (nomic-embed-text)
- **clap** for CLI parsing
- **reqwest** for HTTP
- **tokio** for async runtime
@@ -394,23 +488,6 @@ cargo clippy
- **tracing** for logging
- **indicatif** for progress bars
## Current Status
This is Checkpoint 2 (CP2) of the Gitlore project. Currently implemented:
- Issue ingestion with cursor-based incremental sync
- Merge request ingestion with cursor-based incremental sync
- Discussion and note syncing for issues and MRs
- DiffNote support for inline code review comments
- Rich filtering and querying for both issues and MRs
- Full re-sync capability with watermark reset
Not yet implemented:
- Semantic search with embeddings (CP3+)
- Backup and reset commands
See [SPEC.md](SPEC.md) for the full project roadmap and architecture.
## License
MIT


@@ -0,0 +1,354 @@
# API Efficiency & Observability Findings
> **Status:** Draft - working through items
> **Context:** Audit of gitlore's GitLab API usage, data processing, and observability gaps
> **Interactive reference:** `api-review.html` (root of repo, open in browser)
---
## Checkpoint 3 Alignment
Checkpoint 3 (`docs/prd/checkpoint-3.md`) introduces `lore sync` orchestration, document generation, and search. Several findings here overlap with that work. This section maps the relationship so effort isn't duplicated and so CP3 implementation can absorb the right instrumentation as it's built.
### Direct overlaps (CP3 partially addresses)
| Finding | CP3 coverage | Remaining gap |
|---------|-------------|---------------|
| **P0-1** sync_runs never written | `lore sync` step 7 says "record sync_run". `SyncResult` struct defined with counts. | Only covers the new `lore sync` command. Existing `lore ingest` still won't write sync_runs. Either instrument `lore ingest` separately or have `lore sync` subsume it entirely. |
| **P0-2** No timing | `print_sync` captures wall-clock `elapsed_secs` / `elapsed_ms` in robot mode JSON `meta` envelope. | Wall-clock only. No per-phase, per-API-call, or per-DB-write breakdown. The `SyncResult` struct has counts but no duration fields. |
| **P2-1** Discussion full-refresh | CP3 introduces `pending_discussion_fetches` queue with exponential backoff and bounded processing per sync. Structures the work better. | Same full-refresh strategy per entity. The queue adds retry resilience but doesn't reduce the number of API calls for unchanged discussions. |
### Different scope (complementary, no overlap)
| Finding | Why no overlap |
|---------|---------------|
| **P0-3** metrics_json schema | CP3 doesn't reference the `metrics_json` column. `SyncResult` is printed/returned but not persisted there. |
| **P0-4** Discussion sync telemetry columns | CP3's queue system (`pending_discussion_fetches`) is a replacement architecture. The existing per-MR telemetry columns (`discussions_sync_attempts`, `_last_error`) aren't referenced in CP3. Decide: use CP3's queue table or wire up the existing columns? |
| **P0-5** Progress events lack timing | CP3 lists "Progress visible during long syncs" as an acceptance criterion but doesn't spec timing in events. |
| **P1-\*** Free data capture | CP3 doesn't touch GitLab API response field coverage at all. These are independent. |
| **P2-2** Keyset pagination (GitLab API) | CP3 uses keyset pagination for local SQLite queries (document seeding, embedding pipelines). Completely different from using GitLab API keyset pagination. |
| **P2-3** ETags | Not mentioned in CP3. |
| **P2-4** Labels enrichment | Not mentioned in CP3. |
| **P3-\*** Structural improvements | Not in CP3 scope. |
### Recommendation
CP3's `lore sync` orchestrator is the natural integration point for P0 instrumentation. Rather than retrofitting `lore ingest` separately, the most efficient path is:
1. Build P0 timing instrumentation as a reusable layer (e.g., a `SyncMetrics` struct that accumulates phase timings)
2. Wire it into the CP3 `run_sync` implementation as it's built
3. Have `run_sync` persist the full metrics (counts + timing) to `sync_runs.metrics_json`
4. Decide whether `lore ingest` becomes a thin wrapper around `lore sync --no-docs --no-embed` or stays separate with its own sync_runs recording
This avoids building instrumentation twice and ensures the new sync pipeline is observable from day one.
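A minimal sketch of what the reusable timing layer from step 1 could look like, using only the standard library. The method names (`time_phase`, `record`, `phase_ms`, `total_ms`) and the phase labels are assumptions for illustration; only the `SyncMetrics` name comes from the recommendation above.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical accumulator for per-phase wall-clock time.
/// Not taken from the gitlore codebase; field and method names are assumed.
#[derive(Debug, Default)]
pub struct SyncMetrics {
    phases: BTreeMap<String, Duration>,
}

impl SyncMetrics {
    pub fn new() -> Self {
        Self::default()
    }

    /// Run `work`, add its elapsed time to the named phase, and return its result.
    pub fn time_phase<T>(&mut self, phase: &str, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.record(phase, start.elapsed());
        out
    }

    /// Add an already-measured duration to a phase (handy for async code or tests).
    pub fn record(&mut self, phase: &str, elapsed: Duration) {
        *self.phases.entry(phase.to_string()).or_default() += elapsed;
    }

    /// Accumulated milliseconds for one phase, or 0 if the phase was never recorded.
    pub fn phase_ms(&self, phase: &str) -> u128 {
        self.phases.get(phase).map_or(0, |d| d.as_millis())
    }

    /// Sum of all recorded phases in milliseconds.
    pub fn total_ms(&self) -> u128 {
        self.phases.values().map(|d| d.as_millis()).sum()
    }
}
```

Because repeated calls with the same phase name accumulate, a phase that runs once per project would still roll up into a single number per run.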
### Decision: `lore ingest` goes away
`lore sync` becomes the single command for all data fetching. The first run does a full fetch (equivalent to today's `lore ingest`); subsequent runs are incremental via cursors. `lore ingest` becomes a hidden deprecated alias.
Implications:
- P0 instrumentation only needs to be built in one place (`run_sync`)
- CP3 Gate C owns the sync_runs lifecycle end-to-end
- The existing `lore ingest issues` / `lore ingest mrs` code becomes internal functions called by `run_sync`, not standalone CLI commands
- `lore sync` always syncs everything: issues, MRs, discussions, documents, embeddings (with `--no-embed` / `--no-docs` to opt out of later stages)
---
## Implementation Sequence
### Phase A: Before CP3 (independent, enriches data model)
**Do first.** Migration + struct changes only. No architectural dependency. Gets richer source data into the DB before CP3's document generation pipeline locks in its schema.
1. **P1 batch: free data capture** - All ~11 fields in a single migration: `user_notes_count`, `upvotes`, `downvotes`, `confidential`, `has_conflicts`, `blocking_discussions_resolved`, `merge_commit_sha`, `discussion_locked`, `task_completion_status`, `issue_type`, `issue references`.
2. **P1-10: MR milestones** - Reuse existing issue milestone transformer. Slightly more work, same migration.
### Phase B: During CP3 Gate C (`lore sync`)
**Build instrumentation into the sync orchestrator as it's constructed.** Not a separate effort.
3. **P0-1 + P0-2 + P0-3** - `SyncMetrics` struct accumulating phase timings. `run_sync` writes to `sync_runs` with full `metrics_json` on completion.
4. **P0-4** - Decide: use CP3's `pending_discussion_fetches` queue or existing per-MR telemetry columns. Wire up the winner.
5. **P0-5** - Add `elapsed_ms` to `*Complete` progress event variants.
6. **Deprecate `lore ingest`** - Hidden alias pointing to `lore sync`. Remove from help output.
### Phase C: After CP3 ships, informed by real metrics
**Only pursue items that P0 data proves matter.**
7. **P2-1: Discussion optimization** - Check metrics_json from real runs. If discussion phase is <10% of wall-clock, skip.
8. **P2-2: Keyset pagination** - Check primary fetch timing on largest project. If fast, skip.
9. **P2-4: Labels enrichment** - If label colors are needed for any UI surface.
### Phase D: Future (needs a forcing function)
10. **P3-1: Users table** - When a UI needs display names / avatars.
11. **P2-3: ETags** - Only if P2-1 doesn't sufficiently reduce discussion overhead.
12. **P3-2/3/4: GraphQL, Events API, Webhooks** - Architectural shifts. Only if pull-based sync hits a scaling wall.
---
## Priority 0: Observability (prerequisite for everything else)
We can't evaluate any efficiency question without measurement. Gitlore has no runtime performance instrumentation. The infrastructure for it was scaffolded (sync_runs table, metrics_json column, discussion sync telemetry columns) but never wired up.
### P0-1: sync_runs table is never written to
**Location:** Schema in `migrations/001_initial.sql:25-34`, read in `src/cli/commands/sync_status.rs:69-72`
The table exists and `lore status` reads from it, but no code ever INSERTs or UPDATEs rows. The entire audit trail is empty.
```sql
-- Exists in schema, never populated
CREATE TABLE sync_runs (
id INTEGER PRIMARY KEY,
started_at INTEGER NOT NULL,
heartbeat_at INTEGER NOT NULL,
finished_at INTEGER,
status TEXT NOT NULL, -- 'running' | 'succeeded' | 'failed'
command TEXT NOT NULL,
error TEXT,
metrics_json TEXT -- never written
);
```
**What to do:** Instrument the ingest orchestrator to record sync runs. Each `lore ingest issues` / `lore ingest mrs` invocation should:
- INSERT a row with status='running' at start
- UPDATE with status='succeeded'/'failed' + finished_at on completion
- Populate metrics_json with the IngestProjectResult / IngestMrProjectResult counters
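The lifecycle in the list above can be sketched as an in-memory mirror of one `sync_runs` row. This is an illustration under stated assumptions, not the actual persistence code: `SyncRun`, `RunStatus`, `start`, and `finish` are hypothetical names, the timestamp unit is not specified by the schema, and a real implementation would issue the corresponding INSERT and UPDATE against SQLite.

```rust
/// Mirrors the status values listed in the schema comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RunStatus {
    Running,
    Succeeded,
    Failed,
}

impl RunStatus {
    /// Text stored in the `status` column.
    pub fn as_str(self) -> &'static str {
        match self {
            RunStatus::Running => "running",
            RunStatus::Succeeded => "succeeded",
            RunStatus::Failed => "failed",
        }
    }
}

/// Hypothetical in-memory mirror of one `sync_runs` row.
#[derive(Debug, Clone, PartialEq)]
pub struct SyncRun {
    pub started_at: i64,
    pub heartbeat_at: i64,
    pub finished_at: Option<i64>,
    pub status: RunStatus,
    pub command: String,
    pub error: Option<String>,
    pub metrics_json: Option<String>,
}

impl SyncRun {
    /// Corresponds to the INSERT with status='running' at start.
    pub fn start(command: &str, now: i64) -> Self {
        SyncRun {
            started_at: now,
            heartbeat_at: now,
            finished_at: None,
            status: RunStatus::Running,
            command: command.to_string(),
            error: None,
            metrics_json: None,
        }
    }

    /// Corresponds to the UPDATE on completion.
    /// `Ok` carries the serialized metrics; `Err` carries the error text.
    pub fn finish(&mut self, now: i64, outcome: Result<String, String>) {
        self.finished_at = Some(now);
        match outcome {
            Ok(metrics) => {
                self.status = RunStatus::Succeeded;
                self.metrics_json = Some(metrics);
            }
            Err(message) => {
                self.status = RunStatus::Failed;
                self.error = Some(message);
            }
        }
    }
}
```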
### P0-2: No operation timing anywhere
**Location:** Rate limiter in `src/gitlab/client.rs:20-65`, orchestrator in `src/ingestion/orchestrator.rs`
`Instant::now()` is used only for rate limiter enforcement. No operation durations are measured or logged. We don't know:
- How long a full issue ingest takes
- How long discussion sync takes per entity
- How long individual API requests take (network latency)
- How long database writes take per batch
- How much time rate limiter sleeps add up to per run
- How long pagination takes across pages
**What to do:** Add timing instrumentation at these levels:
| Level | What to time | Where |
|-------|-------------|-------|
| **Run** | Total ingest wall-clock time | orchestrator entry/exit |
| **Phase** | Primary fetch vs discussion sync | orchestrator phase boundaries |
| **API call** | Individual HTTP request round-trip | client.rs request method |
| **DB write** | Transaction duration per batch | ingestion store functions |
| **Rate limiter** | Cumulative sleep time per run | client.rs acquire() |
Store phase-level and run-level timing in `metrics_json`. Log API-call-level timing at debug level.
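For the API-call level, one option is to aggregate per-request durations into a small counter that can feed both debug logging and the phase-level numbers. The sketch below assumes a hypothetical `CallStats` type; nothing here reflects the current contents of `client.rs`.

```rust
use std::time::Duration;

/// Hypothetical aggregate of HTTP round-trip timings for one run or phase.
#[derive(Debug, Default, Clone, Copy)]
pub struct CallStats {
    pub count: u64,
    pub total: Duration,
    pub max: Duration,
}

impl CallStats {
    /// Record one request's measured round-trip time.
    pub fn observe(&mut self, elapsed: Duration) {
        self.count += 1;
        self.total += elapsed;
        self.max = self.max.max(elapsed);
    }

    /// Mean round-trip time in milliseconds, or `None` before any request.
    pub fn mean_ms(&self) -> Option<f64> {
        if self.count == 0 {
            return None;
        }
        Some(self.total.as_secs_f64() * 1000.0 / self.count as f64)
    }
}
```

The same shape would work for cumulative rate limiter sleep: call `observe` with each sleep duration and report `total`.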
### P0-3: metrics_json has no defined schema
**What to do:** Define what goes in there. Strawman based on existing IngestProjectResult fields plus timing:
```json
{
"wall_clock_ms": 14200,
"phases": {
"primary_fetch": {
"duration_ms": 8400,
"api_calls": 12,
"items_fetched": 1143,
"items_upserted": 87,
"pages": 12,
"rate_limit_sleep_ms": 1200
},
"discussion_sync": {
"duration_ms": 5800,
"entities_checked": 87,
"entities_synced": 14,
"entities_skipped": 73,
"api_calls": 22,
"discussions_fetched": 156,
"notes_upserted": 412,
"rate_limit_sleep_ms": 2200
}
},
"db": {
"labels_created": 3,
"raw_payloads_stored": 87,
"raw_payloads_deduped": 42
}
}
```
### P0-4: Discussion sync telemetry columns are dead code
**Location:** `merge_requests` table columns: `discussions_sync_last_attempt_at`, `discussions_sync_attempts`, `discussions_sync_last_error`
These exist in the schema but are never read or written. They were designed for tracking retry behavior on failed discussion syncs.
**What to do:** Wire these up during discussion sync. On attempt: set last_attempt_at and increment attempts. On failure: set last_error. On success: reset attempts to 0. This provides per-entity visibility into discussion sync health.
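The transitions above can be sketched as an in-memory model of the three columns. The struct and method names are hypothetical, and clearing `last_error` on success is an assumption; the text only specifies resetting attempts.

```rust
/// Hypothetical in-memory mirror of the three telemetry columns.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DiscussionSyncState {
    pub last_attempt_at: Option<i64>, // ms epoch
    pub attempts: i64,
    pub last_error: Option<String>,
}

impl DiscussionSyncState {
    /// On attempt: stamp the time and increment the counter.
    pub fn on_attempt(&mut self, now_ms: i64) {
        self.last_attempt_at = Some(now_ms);
        self.attempts += 1;
    }

    /// On failure: record the error; the attempt counter is left as-is.
    pub fn on_failure(&mut self, error: &str) {
        self.last_error = Some(error.to_owned());
    }

    /// On success: reset attempts to 0.
    /// Assumption: also clear the stale error so the row reads as healthy.
    pub fn on_success(&mut self) {
        self.attempts = 0;
        self.last_error = None;
    }
}
```

In the real pipeline each method would presumably map to one UPDATE on `merge_requests` inside the discussion sync transaction.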
### P0-5: Progress events carry no timing
**Location:** `src/ingestion/orchestrator.rs:28-53`
ProgressEvent variants (`IssueFetched`, `DiscussionSynced`, etc.) carry only counts. Adding elapsed_ms to at least `*Complete` variants would give callers (CLI progress bars, robot mode output) real throughput numbers.
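A rough sketch of what a timed `*Complete` variant could give callers. The enum below is a cut-down stand-in: `IssuesFetchComplete` and its field names are assumptions, not the actual variant in `orchestrator.rs`.

```rust
/// Cut-down stand-in for the real `ProgressEvent`.
#[derive(Debug, Clone, PartialEq)]
pub enum ProgressEvent {
    /// Count-only event, as today.
    IssueFetched { count: u64 },
    /// Assumed name and fields for a `*Complete` variant carrying timing.
    IssuesFetchComplete { total: u64, elapsed_ms: u64 },
}

/// Throughput for completed phases; `None` for count-only events
/// or when no time elapsed (avoids dividing by zero).
pub fn items_per_second(event: &ProgressEvent) -> Option<f64> {
    match event {
        ProgressEvent::IssuesFetchComplete { total, elapsed_ms } if *elapsed_ms > 0 => {
            Some(*total as f64 * 1000.0 / *elapsed_ms as f64)
        }
        _ => None,
    }
}
```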
---
## Priority 1: Free data capture (zero API cost)
These fields are already in the API responses gitlore receives. Storing them requires only Rust struct additions and DB column migrations. No additional API calls.
### P1-1: user_notes_count (Issues + MRs)
**API field:** `user_notes_count` (integer)
**Value:** Could short-circuit discussion re-sync. If the count hasn't changed, no notes were added or removed, though existing notes may still have been edited (see the trade-offs under P2-1). Also useful for "most discussed" queries.
**Effort:** Add field to serde struct, add DB column, store during transform.
### P1-2: upvotes / downvotes (Issues + MRs)
**API field:** `upvotes`, `downvotes` (integers)
**Value:** Engagement metrics for triage. "Most upvoted open issues" is a common query.
**Effort:** Same pattern as above.
### P1-3: confidential (Issues)
**API field:** `confidential` (boolean)
**Value:** Security-sensitive filtering. Important to know when exposing issue data.
**Effort:** Low.
### P1-4: has_conflicts (MRs)
**API field:** `has_conflicts` (boolean)
**Value:** Identify MRs needing rebase. Useful for "stale MR" detection.
**Effort:** Low.
### P1-5: blocking_discussions_resolved (MRs)
**API field:** `blocking_discussions_resolved` (boolean)
**Value:** MR readiness indicator without joining the discussions table.
**Effort:** Low.
### P1-6: merge_commit_sha (MRs)
**API field:** `merge_commit_sha` (string, nullable)
**Value:** Trace merged MRs to specific commits in git history.
**Effort:** Low.
### P1-7: discussion_locked (Issues + MRs)
**API field:** `discussion_locked` (boolean)
**Value:** Know if new comments can be added. Useful for robot mode consumers.
**Effort:** Low.
### P1-8: task_completion_status (Issues + MRs)
**API field:** `task_completion_status` (object: `{count, completed_count}`)
**Value:** Track task-list checkbox progress without parsing markdown.
**Effort:** Low. Store as two integer columns or a small JSON blob.
### P1-9: issue_type (Issues)
**API field:** `issue_type` (string: "issue" | "incident" | "test_case")
**Value:** Distinguish issues vs incidents vs test cases for filtering.
**Effort:** Low.
### P1-10: MR milestone (MRs)
**API field:** `milestone` (object, same structure as on issues)
**Current state:** Milestones are fully stored for issues but completely ignored for MRs.
**Value:** "Which MRs are in milestone X?" Currently impossible to query locally.
**Effort:** Medium - reuse existing milestone transformer from issue pipeline.
### P1-11: Issue references (Issues)
**API field:** `references` (object: `{short, relative, full}`)
**Current state:** Stored for MRs (`references_short`, `references_full`), dropped for issues.
**Value:** Cross-project issue references (e.g., `group/project#42`).
**Effort:** Low.
---
## Priority 2: Efficiency improvements (requires measurement from P0 first)
These are potential optimizations. **Do not implement until P0 instrumentation proves they matter.**
### P2-1: Discussion full-refresh strategy
**Current behavior:** When an issue/MR's `updated_at` advances, ALL its discussions are deleted and re-fetched from scratch.
**Potential optimization:** Use `user_notes_count` (P1-1) to detect whether discussions actually changed. Skip re-sync if count is unchanged.
**Why we need P0 first:** The full-refresh may be fast enough. Since we already fetch the data from GitLab, the DELETE+INSERT is just local SQLite I/O. If discussion sync for a typical entity takes <100ms locally, this isn't worth optimizing. We need the per-entity timing from P0-2 to know.
**Trade-offs to consider:**
- Full-refresh catches edited and deleted notes. Incremental would miss those.
- `user_notes_count` doesn't change when notes are edited, only when added/removed.
- Full-refresh is simpler to reason about for consistency.
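The trade-off can be made concrete with a sketch of the two policies side by side. All names are hypothetical, and it assumes that editing a note advances the entity's `updated_at` without changing `user_notes_count`.

```rust
/// What we know about an entity before and after a primary fetch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct EntitySnapshot {
    pub updated_at_ms: i64,
    pub user_notes_count: i64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DiscussionSync {
    Skip,
    FullRefresh,
}

/// Current behavior: any advance of `updated_at` triggers a full refresh.
pub fn current_policy(stored: EntitySnapshot, fetched: EntitySnapshot) -> DiscussionSync {
    if fetched.updated_at_ms > stored.updated_at_ms {
        DiscussionSync::FullRefresh
    } else {
        DiscussionSync::Skip
    }
}

/// Candidate optimization: additionally require the note count to differ.
pub fn count_gated_policy(stored: EntitySnapshot, fetched: EntitySnapshot) -> DiscussionSync {
    let advanced = fetched.updated_at_ms > stored.updated_at_ms;
    let count_changed = fetched.user_notes_count != stored.user_notes_count;
    if advanced && count_changed {
        DiscussionSync::FullRefresh
    } else {
        DiscussionSync::Skip
    }
}
```

For a note edit (timestamp advances, count unchanged) the current policy refreshes while the gated one skips, which is exactly the missed-edit case listed above.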
### P2-2: Keyset pagination
**Current behavior:** Offset-based (`page=N&per_page=100`).
**Alternative:** Keyset pagination (`pagination=keyset`), where the cost of fetching a page stays roughly constant (O(1)) instead of growing with the page offset (O(N)).
**Why we need P0 first:** Only matters for large projects (>10K issues). Most projects will never hit enough pages for this to be measurable. P0 timing of pagination will show if this is a bottleneck.
**Note:** Gitlore already parses `Link` headers for next-page detection, which is the client-side mechanism keyset pagination uses. So partial support exists.
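For reference, a minimal sketch of next-page detection from a `Link` header. This is not gitlore's actual parser; `next_page_url` is a hypothetical helper, and it assumes URLs contain no literal commas, since it splits entries on `,`.

```rust
/// Returns the URL of the `rel="next"` entry in a `Link` header, if any.
/// Assumption: entries are comma-separated and URLs contain no raw commas.
pub fn next_page_url(link_header: &str) -> Option<&str> {
    link_header.split(',').find_map(|entry| {
        // Each entry looks like: <https://...>; rel="next"
        let (target, params) = entry.split_once(';')?;
        let url = target.trim().strip_prefix('<')?.strip_suffix('>')?;
        let is_next = params.split(';').any(|p| {
            let p = p.trim();
            p == "rel=\"next\"" || p == "rel=next"
        });
        is_next.then_some(url)
    })
}
```

With keyset pagination the client simply follows this URL until no `next` entry is present, rather than computing `page=N` itself.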
### P2-3: ETag / conditional requests
**Current behavior:** All requests are unconditional.
**Alternative:** Cache ETags, send `If-None-Match`, get 304s back.
**Why we need P0 first:** The cursor-based sync already avoids re-fetching unchanged data for primary resources. ETags would mainly help with discussion re-fetches where nothing changed. If P2-1 (user_notes_count skip) is implemented, ETags become less valuable.
### P2-4: Labels API enrichment
**Current behavior:** Labels extracted from the `labels[]` string array in issue/MR responses. The `labels` table has `color` and `description` columns that may not be populated.
**Alternative:** Single call to `GET /projects/:id/labels` per project per sync to populate label metadata.
**Cost:** 1 API call per project per sync run.
**Value:** Label colors for UI rendering, descriptions for tooltips.
---
## Priority 3: Structural improvements (future consideration)
### P3-1: Users table
**Current state:** Only `username` stored. Author `name`, `avatar_url`, `web_url`, `state` are in every API response but discarded.
**Proposal:** Create a `users` table, upsert on every encounter. Zero API cost.
**Value:** Richer user display, detect blocked/deactivated users.
### P3-2: GraphQL API for field-precise fetching
**Current state:** REST API returns ~40-50 fields per entity. Gitlore uses ~15-23.
**Alternative:** GraphQL API allows requesting exactly the fields needed.
**Trade-offs:** Different pagination model, potentially less stable API, more complex client code. The bandwidth savings are real but likely minor compared to discussion re-fetch overhead.
### P3-3: Events API for lightweight change detection
**Endpoint:** `GET /projects/:id/events`
**Value:** Lightweight "has anything changed?" check before running full issue/MR sync. Could replace or supplement the cursor-based approach for very active projects.
### P3-4: Webhook-based push sync
**Endpoint:** `POST /projects/:id/hooks` (setup), then receive pushes.
**Value:** Near-real-time sync without polling cost. Eliminates all rate-limit concerns.
**Barrier:** Requires a listener endpoint, which changes the architecture from pull-only CLI to something with a daemon/server component.
---
## Working notes
_Space for recording decisions as we work through items._
### Decisions made
| Item | Decision | Rationale |
|------|----------|-----------|
| `lore ingest` | Remove. `lore sync` is the single entry point. | No reason to separate initial load from incremental updates. First run = full fetch, subsequent = cursor-based delta. |
| CP3 alignment | Build P0 instrumentation into CP3 Gate C, not separately. | Avoids building in two places. `lore sync` owns the full lifecycle. |
| P2 timing | Defer all efficiency optimizations until P0 metrics from real runs are available. | Can't evaluate trade-offs without measurement. |
### Open questions
- What's the typical project size (issue/MR count) for gitlore users? This determines whether keyset pagination (P2-2) matters.
- Is there a plan for a web UI or TUI? That would increase the value of P3-1 (users table) and P2-4 (label colors).

docs/phase-a-spec.md Normal file

@@ -0,0 +1,456 @@
# Phase A: Complete API Field Capture
> **Status:** Draft
> **Guiding principle:** Mirror everything GitLab gives us.
> - **Lossless mirror:** the raw API JSON stored behind `raw_payload_id`. This is the true complete representation of every API response.
> - **Relational projection:** a stable, query-optimized subset of fields we commit to keeping current on every re-sync.
> This preserves maximum context for processing and analysis while avoiding unbounded schema growth.
> **Migration:** 007_complete_field_capture.sql
> **Prerequisite:** None (independent of CP3)
---
## Scope
One migration. Three categories of work:
1. **New columns** on `issues` and `merge_requests` for fields currently dropped by serde or dropped during transform
2. **New serde fields** on `GitLabIssue` and `GitLabMergeRequest` to deserialize currently-silently-dropped JSON fields
3. **Transformer + insert updates** to pass the new fields through to the DB
No new tables. No new API calls. No new endpoints. All data comes from responses we already receive.
---
## Issues: Field Gap Inventory
### Currently stored
id, iid, project_id, title, description, state, author_username, created_at, updated_at, web_url, due_date, milestone_id, milestone_title, raw_payload_id, last_seen_at, discussions_synced_for_updated_at, labels (junction), assignees (junction)
### Currently deserialized but dropped during transform
| API Field | Status | Action |
|-----------|--------|--------|
| `closed_at` | Deserialized in serde struct, but no DB column exists and transformer never populates it | Add column in migration 007, wire up in IssueRow + transform + INSERT |
| `author.id` | Deserialized | Store as `author_id` column |
| `author.name` | Deserialized | Store as `author_name` column |
### Currently silently dropped by serde (not in GitLabIssue struct)
| API Field | Type | DB Column | Notes |
|-----------|------|-----------|-------|
| `issue_type` | Option\<String\> | `issue_type` | Canonical field (lowercase, e.g. "issue"); preferred for DB storage |
| `upvotes` | i64 | `upvotes` | |
| `downvotes` | i64 | `downvotes` | |
| `user_notes_count` | i64 | `user_notes_count` | Useful for discussion sync optimization |
| `merge_requests_count` | i64 | `merge_requests_count` | Count of linked MRs |
| `confidential` | bool | `confidential` | 0/1 |
| `discussion_locked` | bool | `discussion_locked` | 0/1 |
| `weight` | Option\<i64\> | `weight` | Premium/Ultimate, null on Free |
| `time_stats.time_estimate` | i64 | `time_estimate` | Seconds |
| `time_stats.total_time_spent` | i64 | `time_spent` | Seconds |
| `time_stats.human_time_estimate` | Option\<String\> | `human_time_estimate` | e.g. "3h 30m" |
| `time_stats.human_total_time_spent` | Option\<String\> | `human_time_spent` | e.g. "1h 15m" |
| `task_completion_status.count` | i64 | `task_count` | Checkbox total |
| `task_completion_status.completed_count` | i64 | `task_completed_count` | Checkboxes checked |
| `has_tasks` | bool | `has_tasks` | 0/1 |
| `severity` | Option\<String\> | `severity` | Incident severity |
| `closed_by` | Option\<object\> | `closed_by_username` | Who closed it (username only, consistent with author pattern) |
| `imported` | bool | `imported` | 0/1 |
| `imported_from` | Option\<String\> | `imported_from` | Import source |
| `moved_to_id` | Option\<i64\> | `moved_to_id` | Target issue if moved |
| `references.short` | String | `references_short` | e.g. "#42" |
| `references.relative` | String | `references_relative` | e.g. "#42" or "group/proj#42" |
| `references.full` | String | `references_full` | e.g. "group/project#42" |
| `health_status` | Option\<String\> | `health_status` | Ultimate only |
| `type` | Option\<String\> | (transform-only) | Uppercase category (e.g. "ISSUE"); fallback for `issue_type` -- lowercased before storage. Not stored as separate column; raw JSON remains lossless. |
| `epic.id` | Option\<i64\> | `epic_id` | Premium/Ultimate, null on Free |
| `epic.iid` | Option\<i64\> | `epic_iid` | |
| `epic.title` | Option\<String\> | `epic_title` | |
| `epic.url` | Option\<String\> | `epic_url` | |
| `epic.group_id` | Option\<i64\> | `epic_group_id` | |
| `iteration.id` | Option\<i64\> | `iteration_id` | Premium/Ultimate, null on Free |
| `iteration.iid` | Option\<i64\> | `iteration_iid` | |
| `iteration.title` | Option\<String\> | `iteration_title` | |
| `iteration.state` | Option\<i64\> | `iteration_state` | Enum: 1=upcoming, 2=current, 3=closed |
| `iteration.start_date` | Option\<String\> | `iteration_start_date` | ISO date |
| `iteration.due_date` | Option\<String\> | `iteration_due_date` | ISO date |
---
## Merge Requests: Field Gap Inventory
### Currently stored
id, iid, project_id, title, description, state, draft, author_username, source_branch, target_branch, head_sha, references_short, references_full, detailed_merge_status, merge_user_username, created_at, updated_at, merged_at, closed_at, last_seen_at, web_url, raw_payload_id, discussions_synced_for_updated_at, discussions_sync_last_attempt_at, discussions_sync_attempts, discussions_sync_last_error, labels (junction), assignees (junction), reviewers (junction)
### Currently deserialized but dropped during transform
| API Field | Status | Action |
|-----------|--------|--------|
| `author.id` | Deserialized | Store as `author_id` column |
| `author.name` | Deserialized | Store as `author_name` column |
| `work_in_progress` | Used transiently for `draft` fallback | Already handled, no change needed |
| `merge_status` (legacy) | Used transiently for `detailed_merge_status` fallback | Already handled, no change needed |
| `merged_by` | Used transiently for `merge_user` fallback | Already handled, no change needed |
### Currently silently dropped by serde (not in GitLabMergeRequest struct)
| API Field | Type | DB Column | Notes |
|-----------|------|-----------|-------|
| `upvotes` | i64 | `upvotes` | |
| `downvotes` | i64 | `downvotes` | |
| `user_notes_count` | i64 | `user_notes_count` | |
| `source_project_id` | Option\<i64\> | `source_project_id` | Fork source |
| `target_project_id` | Option\<i64\> | `target_project_id` | Fork target |
| `milestone` | Option\<object\> | `milestone_id`, `milestone_title` | Reuse issue milestone pattern |
| `merge_when_pipeline_succeeds` | bool | `merge_when_pipeline_succeeds` | 0/1, auto-merge flag |
| `merge_commit_sha` | Option\<String\> | `merge_commit_sha` | Commit ref after merge |
| `squash_commit_sha` | Option\<String\> | `squash_commit_sha` | Commit ref after squash |
| `discussion_locked` | bool | `discussion_locked` | 0/1 |
| `should_remove_source_branch` | Option\<bool\> | `should_remove_source_branch` | 0/1 |
| `force_remove_source_branch` | Option\<bool\> | `force_remove_source_branch` | 0/1 |
| `squash` | bool | `squash` | 0/1 |
| `squash_on_merge` | bool | `squash_on_merge` | 0/1 |
| `has_conflicts` | bool | `has_conflicts` | 0/1 |
| `blocking_discussions_resolved` | bool | `blocking_discussions_resolved` | 0/1 |
| `time_stats.time_estimate` | i64 | `time_estimate` | Seconds |
| `time_stats.total_time_spent` | i64 | `time_spent` | Seconds |
| `time_stats.human_time_estimate` | Option\<String\> | `human_time_estimate` | |
| `time_stats.human_total_time_spent` | Option\<String\> | `human_time_spent` | |
| `task_completion_status.count` | i64 | `task_count` | |
| `task_completion_status.completed_count` | i64 | `task_completed_count` | |
| `closed_by` | Option\<object\> | `closed_by_username` | |
| `prepared_at` | Option\<String\> | `prepared_at` | ISO datetime in API; store as ms epoch via `iso_to_ms()`, nullable |
| `merge_after` | Option\<String\> | `merge_after` | ISO datetime in API; store as ms epoch via `iso_to_ms()`, nullable (scheduled merge) |
| `imported` | bool | `imported` | 0/1 |
| `imported_from` | Option\<String\> | `imported_from` | |
| `approvals_before_merge` | Option\<i64\> | `approvals_before_merge` | Deprecated, scheduled for removal in GitLab API v5; store best-effort, keep nullable |
| `references.relative` | String | `references_relative` | Currently only short + full stored |
| `confidential` | bool | `confidential` | 0/1 (MRs can be confidential too) |
| `iteration.id` | Option\<i64\> | `iteration_id` | Premium/Ultimate, null on Free |
| `iteration.iid` | Option\<i64\> | `iteration_iid` | |
| `iteration.title` | Option\<String\> | `iteration_title` | |
| `iteration.state` | Option\<i64\> | `iteration_state` | |
| `iteration.start_date` | Option\<String\> | `iteration_start_date` | ISO date |
| `iteration.due_date` | Option\<String\> | `iteration_due_date` | ISO date |
---
## Migration 007: complete_field_capture.sql
```sql
-- Migration 007: Capture all remaining GitLab API response fields.
-- Principle: mirror everything GitLab returns. No field left behind.
-- ============================================================
-- ISSUES: new columns
-- ============================================================
-- Fields currently deserialized but not stored
ALTER TABLE issues ADD COLUMN closed_at INTEGER; -- ms epoch, deserialized but never stored until now
ALTER TABLE issues ADD COLUMN author_id INTEGER; -- GitLab user ID
ALTER TABLE issues ADD COLUMN author_name TEXT; -- Display name
-- Issue metadata
ALTER TABLE issues ADD COLUMN issue_type TEXT; -- 'issue' | 'incident' | 'test_case'
ALTER TABLE issues ADD COLUMN confidential INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN discussion_locked INTEGER NOT NULL DEFAULT 0;
-- Engagement
ALTER TABLE issues ADD COLUMN upvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN downvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN user_notes_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN merge_requests_count INTEGER NOT NULL DEFAULT 0;
-- Time tracking
ALTER TABLE issues ADD COLUMN time_estimate INTEGER NOT NULL DEFAULT 0; -- seconds
ALTER TABLE issues ADD COLUMN time_spent INTEGER NOT NULL DEFAULT 0; -- seconds
ALTER TABLE issues ADD COLUMN human_time_estimate TEXT;
ALTER TABLE issues ADD COLUMN human_time_spent TEXT;
-- Task lists
ALTER TABLE issues ADD COLUMN task_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN task_completed_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN has_tasks INTEGER NOT NULL DEFAULT 0;
-- References (MRs already have short + full)
ALTER TABLE issues ADD COLUMN references_short TEXT; -- e.g. "#42"
ALTER TABLE issues ADD COLUMN references_relative TEXT; -- context-dependent
ALTER TABLE issues ADD COLUMN references_full TEXT; -- e.g. "group/project#42"
-- Close/move tracking
ALTER TABLE issues ADD COLUMN closed_by_username TEXT;
ALTER TABLE issues ADD COLUMN moved_to_id INTEGER;
-- Premium/Ultimate fields (nullable, null on Free tier)
ALTER TABLE issues ADD COLUMN weight INTEGER;
ALTER TABLE issues ADD COLUMN severity TEXT;
ALTER TABLE issues ADD COLUMN health_status TEXT;
-- Import tracking
ALTER TABLE issues ADD COLUMN imported INTEGER NOT NULL DEFAULT 0;
ALTER TABLE issues ADD COLUMN imported_from TEXT;
-- Epic (Premium/Ultimate, null on Free)
ALTER TABLE issues ADD COLUMN epic_id INTEGER;
ALTER TABLE issues ADD COLUMN epic_iid INTEGER;
ALTER TABLE issues ADD COLUMN epic_title TEXT;
ALTER TABLE issues ADD COLUMN epic_url TEXT;
ALTER TABLE issues ADD COLUMN epic_group_id INTEGER;
-- Iteration (Premium/Ultimate, null on Free)
ALTER TABLE issues ADD COLUMN iteration_id INTEGER;
ALTER TABLE issues ADD COLUMN iteration_iid INTEGER;
ALTER TABLE issues ADD COLUMN iteration_title TEXT;
ALTER TABLE issues ADD COLUMN iteration_state INTEGER;
ALTER TABLE issues ADD COLUMN iteration_start_date TEXT;
ALTER TABLE issues ADD COLUMN iteration_due_date TEXT;
-- ============================================================
-- MERGE REQUESTS: new columns
-- ============================================================
-- Author enrichment
ALTER TABLE merge_requests ADD COLUMN author_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN author_name TEXT;
-- Engagement
ALTER TABLE merge_requests ADD COLUMN upvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN downvotes INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN user_notes_count INTEGER NOT NULL DEFAULT 0;
-- Fork tracking
ALTER TABLE merge_requests ADD COLUMN source_project_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN target_project_id INTEGER;
-- Milestone (parity with issues)
ALTER TABLE merge_requests ADD COLUMN milestone_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN milestone_title TEXT;
-- Merge behavior
ALTER TABLE merge_requests ADD COLUMN merge_when_pipeline_succeeds INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN merge_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash_commit_sha TEXT;
ALTER TABLE merge_requests ADD COLUMN squash INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN squash_on_merge INTEGER NOT NULL DEFAULT 0;
-- Merge readiness
ALTER TABLE merge_requests ADD COLUMN has_conflicts INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN blocking_discussions_resolved INTEGER NOT NULL DEFAULT 0;
-- Branch cleanup
ALTER TABLE merge_requests ADD COLUMN should_remove_source_branch INTEGER;
ALTER TABLE merge_requests ADD COLUMN force_remove_source_branch INTEGER;
-- Discussion lock
ALTER TABLE merge_requests ADD COLUMN discussion_locked INTEGER NOT NULL DEFAULT 0;
-- Time tracking
ALTER TABLE merge_requests ADD COLUMN time_estimate INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN time_spent INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN human_time_estimate TEXT;
ALTER TABLE merge_requests ADD COLUMN human_time_spent TEXT;
-- Task lists
ALTER TABLE merge_requests ADD COLUMN task_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN task_completed_count INTEGER NOT NULL DEFAULT 0;
-- Close tracking
ALTER TABLE merge_requests ADD COLUMN closed_by_username TEXT;
-- Scheduling (API returns ISO datetimes; we store ms epoch for consistency)
ALTER TABLE merge_requests ADD COLUMN prepared_at INTEGER; -- ms epoch after iso_to_ms()
ALTER TABLE merge_requests ADD COLUMN merge_after INTEGER; -- ms epoch after iso_to_ms()
-- References (add relative, short + full already exist)
ALTER TABLE merge_requests ADD COLUMN references_relative TEXT;
-- Import tracking
ALTER TABLE merge_requests ADD COLUMN imported INTEGER NOT NULL DEFAULT 0;
ALTER TABLE merge_requests ADD COLUMN imported_from TEXT;
-- Approvals (deprecated API field, nullable) and confidentiality
ALTER TABLE merge_requests ADD COLUMN approvals_before_merge INTEGER;
ALTER TABLE merge_requests ADD COLUMN confidential INTEGER NOT NULL DEFAULT 0;
-- Iteration (Premium/Ultimate, null on Free)
ALTER TABLE merge_requests ADD COLUMN iteration_id INTEGER;
ALTER TABLE merge_requests ADD COLUMN iteration_iid INTEGER;
ALTER TABLE merge_requests ADD COLUMN iteration_title TEXT;
ALTER TABLE merge_requests ADD COLUMN iteration_state INTEGER;
ALTER TABLE merge_requests ADD COLUMN iteration_start_date TEXT;
ALTER TABLE merge_requests ADD COLUMN iteration_due_date TEXT;
-- Record migration version
INSERT INTO schema_version (version, applied_at, description)
VALUES (7, strftime('%s', 'now') * 1000, 'Complete API field capture for issues and merge requests');
```
---
## Serde Struct Changes
### Existing type changes
```
GitLabReferences // Add: relative: Option<String> (with #[serde(default)])
// Existing fields short + full remain unchanged
GitLabIssue // Add #[derive(Default)] for test ergonomics
GitLabMergeRequest // Add #[derive(Default)] for test ergonomics
```
### New helper types needed
```
GitLabTimeStats { time_estimate, total_time_spent, human_time_estimate, human_total_time_spent }
GitLabTaskCompletionStatus { count, completed_count }
// closed_by needs no new type: reuse the existing GitLabAuthor (id, username, name)
GitLabEpic { id, iid, title, url, group_id }
GitLabIteration { id, iid, title, state, start_date, due_date }
```
### GitLabIssue: add fields
```
type_: Option<String>                 // #[serde(rename = "type")] -- fallback-only (uppercase category); `type` is a reserved keyword in Rust, so the field needs another name (`type_` is illustrative)
upvotes: i64 // #[serde(default)]
downvotes: i64 // #[serde(default)]
user_notes_count: i64 // #[serde(default)]
merge_requests_count: i64 // #[serde(default)]
confidential: bool // #[serde(default)]
discussion_locked: bool // #[serde(default)]
weight: Option<i64>
time_stats: Option<GitLabTimeStats>
task_completion_status: Option<GitLabTaskCompletionStatus>
has_tasks: bool // #[serde(default)]
references: Option<GitLabReferences>
closed_by: Option<GitLabAuthor>
severity: Option<String>
health_status: Option<String>
imported: bool // #[serde(default)]
imported_from: Option<String>
moved_to_id: Option<i64>
issue_type: Option<String> // canonical field (lowercase); preferred for DB storage over `type`
epic: Option<GitLabEpic>
iteration: Option<GitLabIteration>
```
### GitLabMergeRequest: add fields
```
upvotes: i64 // #[serde(default)]
downvotes: i64 // #[serde(default)]
user_notes_count: i64 // #[serde(default)]
source_project_id: Option<i64>
target_project_id: Option<i64>
milestone: Option<GitLabMilestone> // reuse existing type
merge_when_pipeline_succeeds: bool // #[serde(default)]
merge_commit_sha: Option<String>
squash_commit_sha: Option<String>
squash: bool // #[serde(default)]
squash_on_merge: bool // #[serde(default)]
has_conflicts: bool // #[serde(default)]
blocking_discussions_resolved: bool // #[serde(default)]
should_remove_source_branch: Option<bool>
force_remove_source_branch: Option<bool>
discussion_locked: bool // #[serde(default)]
time_stats: Option<GitLabTimeStats>
task_completion_status: Option<GitLabTaskCompletionStatus>
closed_by: Option<GitLabAuthor>
prepared_at: Option<String>
merge_after: Option<String>
imported: bool // #[serde(default)]
imported_from: Option<String>
approvals_before_merge: Option<i64>
confidential: bool // #[serde(default)]
iteration: Option<GitLabIteration>
```
---
## Transformer Changes
### IssueRow: add fields
All new fields map 1:1 from the serde struct except:
- `closed_at` -> `iso_to_ms()` conversion (already in serde struct, just not passed through)
- `time_stats` -> flatten to 4 individual fields
- `task_completion_status` -> flatten to 2 individual fields
- `references` -> flatten to 3 individual fields
- `closed_by` -> extract `username` only (consistent with author pattern)
- `author` -> additionally extract `id` and `name` (currently only `username`)
- `issue_type` -> store as-is (canonical, lowercase); fallback to lowercased `type` field if `issue_type` absent
- `epic` -> flatten to 5 individual fields (id, iid, title, url, group_id)
- `iteration` -> flatten to 6 individual fields (id, iid, title, state, start_date, due_date)
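Two of the rules above, sketched with stand-in types: the `issue_type` fallback and the `time_stats` flattening. The struct and function names are hypothetical and the real serde structs carry more fields; absent `time_stats` is assumed to map to the column defaults (0 / NULL) from migration 007.

```rust
/// Stand-in for the deserialized `time_stats` object.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct TimeStats {
    pub time_estimate: i64,
    pub total_time_spent: i64,
    pub human_time_estimate: Option<String>,
    pub human_total_time_spent: Option<String>,
}

/// The four flattened values destined for individual DB columns.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct FlatTimeStats {
    pub time_estimate: i64,
    pub time_spent: i64,
    pub human_time_estimate: Option<String>,
    pub human_time_spent: Option<String>,
}

/// Prefer canonical `issue_type`; otherwise lowercase the `type` field.
pub fn resolve_issue_type(issue_type: Option<&str>, type_field: Option<&str>) -> Option<String> {
    issue_type
        .map(str::to_owned)
        .or_else(|| type_field.map(str::to_lowercase))
}

/// Flatten optional `time_stats`; a missing object yields 0 / None.
pub fn flatten_time_stats(stats: Option<TimeStats>) -> FlatTimeStats {
    let s = stats.unwrap_or_default();
    FlatTimeStats {
        time_estimate: s.time_estimate,
        time_spent: s.total_time_spent,
        human_time_estimate: s.human_time_estimate,
        human_time_spent: s.human_total_time_spent,
    }
}
```

The same flatten-with-defaults shape would apply to `task_completion_status`, `references`, `epic`, and `iteration`.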
### NormalizedMergeRequest: add fields
Same patterns as issues, plus:
- `milestone` -> reuse `upsert_milestone_tx` from issue pipeline, add `milestone_id` + `milestone_title`
- `prepared_at`, `merge_after` -> `iso_to_ms()` conversion (API provides ISO datetimes)
- `source_project_id`, `target_project_id` -> direct pass-through
- `iteration` -> flatten to 6 individual fields (same as issues)
### Insert statement changes
Both `process_issue_in_transaction` and `process_mr_in_transaction` need their INSERT and ON CONFLICT DO UPDATE statements extended with all new columns. The ON CONFLICT clause should update all new fields on re-sync.
**Implementation note (reliability):** Define a single authoritative list of persisted columns per entity and generate/compose both SQL fragments from it:
- INSERT column list + VALUES placeholders
- ON CONFLICT DO UPDATE assignments
This prevents drift where a new field is added to one clause but not the other -- the most likely bug class with 40+ new columns.
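One possible shape for that single-source approach, as a sketch only. `build_upsert_sql` is a hypothetical helper, and the conflict key used in the usage note is an assumption; the actual unique constraint on `issues` is not stated here.

```rust
/// Builds INSERT ... ON CONFLICT DO UPDATE from one column list, so the
/// INSERT columns, placeholders, and UPDATE assignments cannot drift apart.
pub fn build_upsert_sql(table: &str, conflict_cols: &[&str], cols: &[&str]) -> String {
    // Numbered SQLite placeholders: ?1, ?2, ...
    let placeholders: Vec<String> = (1..=cols.len()).map(|i| format!("?{i}")).collect();
    // Every non-key column is refreshed from the incoming row on re-sync.
    let assignments: Vec<String> = cols
        .iter()
        .filter(|c| !conflict_cols.contains(*c))
        .map(|c| format!("{c} = excluded.{c}"))
        .collect();
    format!(
        "INSERT INTO {table} ({}) VALUES ({}) ON CONFLICT({}) DO UPDATE SET {}",
        cols.join(", "),
        placeholders.join(", "),
        conflict_cols.join(", "),
        assignments.join(", "),
    )
}
```

For example, `build_upsert_sql("issues", &["gitlab_id"], &["gitlab_id", "title", "upvotes"])` yields one statement whose three fragments are all derived from the same slice.

```rust
// Hypothetical call; "gitlab_id" as the conflict key is an assumption.
// let sql = build_upsert_sql("issues", &["gitlab_id"], &["gitlab_id", "title", "upvotes"]);
```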
---
## Prerequisite refactors (prep commits before main Phase A work)
### 1. Align issue transformer on `core::time`
The issue transformer (`transformers/issue.rs`) has a local `parse_timestamp()` that duplicates `iso_to_ms_strict()` from `core::time`. The MR transformer already uses the shared module. Before adding Phase A's optional timestamp fields (especially `closed_at` as `Option<String>`), migrate the issue transformer to use `iso_to_ms_strict()` and `iso_to_ms_opt_strict()` from `core::time`. This avoids duplicating the `opt` variant locally and establishes one timestamp parsing path across the codebase.
**Changes:** Replace `parse_timestamp()` calls with `iso_to_ms_strict()`, adapt or remove `TransformError::TimestampParse` (MR transformer uses `String` errors; align on that or on a shared error type).
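The `opt` variant is essentially the strict parser lifted over `Option`. A sketch of that relationship, with the parser passed in as a parameter because the real `iso_to_ms_strict()` signature and error type are not shown here; `parse_opt_strict` is a hypothetical name.

```rust
/// Applies a strict parser to an optional value:
/// `None` stays `Ok(None)`, a present value must parse or the error propagates.
pub fn parse_opt_strict<E>(
    value: Option<&str>,
    parse_strict: impl Fn(&str) -> Result<i64, E>,
) -> Result<Option<i64>, E> {
    // Option<Result<i64, E>> -> Result<Option<i64>, E>
    value.map(parse_strict).transpose()
}
```

This keeps a nullable field such as `closed_at` from silently swallowing a malformed timestamp: absence is fine, garbage is still an error.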
### 2. Extract shared ingestion helpers
`upsert_milestone_tx` (in `ingestion/issues.rs`) and `upsert_label_tx` (duplicated in both `ingestion/issues.rs` and `ingestion/merge_requests.rs`) should be moved to a shared module (e.g., `src/ingestion/shared.rs`). MR ingestion needs `upsert_milestone_tx` for Phase A milestone support, and the label helper is already copy-pasted between files.
**Changes:** Create `src/ingestion/shared.rs`, move `upsert_milestone_tx`, `upsert_label_tx`, and `MilestoneRow` there. Update imports in both issue and MR ingestion modules.
---
## Files touched
| File | Change |
|------|--------|
| `migrations/007_complete_field_capture.sql` | New file |
| `src/gitlab/types.rs` | Add `#[derive(Default)]` to `GitLabIssue` and `GitLabMergeRequest`; add `relative: Option<String>` to `GitLabReferences`; add fields to both structs; add `GitLabTimeStats`, `GitLabTaskCompletionStatus`, `GitLabEpic`, `GitLabIteration` |
| `src/gitlab/transformers/issue.rs` | Remove local `parse_timestamp()`, switch to `core::time`; extend IssueRow, IssueWithMetadata, transform_issue() |
| `src/gitlab/transformers/merge_request.rs` | Extend NormalizedMergeRequest, MergeRequestWithMetadata, transform_merge_request(); extract `references_relative` |
| `src/ingestion/shared.rs` | New file: shared `upsert_milestone_tx`, `upsert_label_tx`, `MilestoneRow` |
| `src/ingestion/issues.rs` | Extend INSERT/UPSERT SQL; import from shared module |
| `src/ingestion/merge_requests.rs` | Extend INSERT/UPSERT SQL; import from shared module; add milestone upsert |
| `src/core/db.rs` | Register migration 007 in `MIGRATIONS` array |
---
## What this does NOT include
- No new API endpoints called
- No new tables (MRs reuse the existing `milestones` table)
- No CLI changes (new fields are stored but not yet surfaced in `lore issues` / `lore mrs` output)
- No changes to discussion/note ingestion (Phase A is issues + MRs only)
- No observability instrumentation (that's Phase B)
---
## Rollout / Backfill Note
After applying Migration 007 and shipping transformer + UPSERT updates, **existing rows will not have the new columns populated** until issues/MRs are reprocessed. Plan on a **one-time full re-sync** (`lore ingest --type issues --full` and `lore ingest --type mrs --full`) to backfill the new fields. Until then, queries on new columns will return NULL/default values for previously-synced entities.
---
## Resolved decisions
| Field | Decision | Rationale |
|-------|----------|-----------|
| `subscribed` | **Excluded** | User-relative field (reflects token holder's subscription state, not an entity property). Changes meaning if the token is rotated to a different user. Not entity data. |
| `_links` | **Excluded** | HATEOAS API navigation metadata, not entity data. Every URL is deterministically constructable from `project_id` + `iid` + GitLab base URL. Note: `closed_as_duplicate_of` inside `_links` contains a real entity reference -- extracting that is deferred to a future phase. |
| `epic` / `iteration` | **Flatten to columns** | Same denormalization pattern as milestones. Epic gets 5 columns (`epic_id`, `epic_iid`, `epic_title`, `epic_url`, `epic_group_id`). Iteration gets 6 columns (`iteration_id`, `iteration_iid`, `iteration_title`, `iteration_state`, `iteration_start_date`, `iteration_due_date`). Both nullable (null on Free tier). |
| `approvals_before_merge` | **Store best-effort** | Deprecated and scheduled for removal in GitLab API v5. Keep as `Option<i64>` / nullable column. Never depend on it for correctness -- it may disappear in a future GitLab release. |
