diff --git a/docs/prd-observability.md b/docs/prd-observability.md new file mode 100644 index 0000000..dc4813b --- /dev/null +++ b/docs/prd-observability.md @@ -0,0 +1,866 @@ +# PRD: Observability Infrastructure for lore CLI + +**Status:** Draft +**Author:** Taylor + Claude +**Date:** 2026-02-04 + +--- + +## 1. Problem Statement + +lore currently has minimal observability. Logging is ephemeral (stderr only), there are no persistent log files, no performance metrics, no structured JSON log output, no verbosity controls beyond `RUST_LOG`, and no way to diagnose issues after the fact. When a sync fails at 3 AM in a cron job, or an embedding run takes 10x longer than usual, there is zero forensic data available. + +### Current State + +| Capability | Status | +|---|---| +| Log destination | stderr only, ephemeral | +| Log persistence | None | +| Structured output | Human-readable fmt only | +| Verbosity control | `RUST_LOG` env var (no CLI flag) | +| Performance metrics | Ad-hoc `Instant::now()` in 2 commands | +| Timing in robot JSON | `elapsed_ms` in search and sync `meta` only | +| Spans / correlation | None | +| Log rotation | None | +| Per-stage timing | None | +| Rate limit / retry visibility | `tracing::warn!` only | +| Error aggregation | None | +| Historical comparison | None | + +### What's Already in Place (to build on) + +- `tracing` (0.1) + `tracing-subscriber` (0.3) with `env-filter` feature +- Registry-based subscriber initialized in `src/main.rs:44-58` with a single `fmt::layer()` using `SuspendingWriter` +- `SuspendingWriter` (`src/cli/progress.rs:25-73`) that coordinates log output with indicatif `MultiProgress` — buffers each log line, calls `MULTI.suspend()` on drop to clear progress bars before writing to stderr +- `IngestDisplay` struct (`src/cli/commands/ingest.rs:65-104`) controlling UI verbosity with three modes: `interactive()` / `silent()` / `progress_only()` +- Robot mode JSON envelope: `{ "ok": true, "data": {...}, "meta": {...} }` — used 
consistently in sync, search, sync-status, and doctor commands +- XDG-compliant data directory at `~/.local/share/lore/` +- `sync_runs` table (migration 001) with schema: `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — **exists but is never written to** (no INSERT anywhere in the codebase; `sync_status.rs` reads from it but always gets zero rows) +- `uuid` crate (v1, v4 feature) already a dependency +- Structured fields used in tracing calls (e.g., `info!(owner = %self.owner, ...)`) +- `EnvFilter` currently hardcoded: `lore=info` + `warn` default directives +- Global CLI flags in `src/cli/mod.rs:9-43`: `--config`, `--robot`, `-J`, `--color`, `--quiet` (all `global = true`) + +--- + +## 2. Goals + +### Primary + +1. **Post-mortem debugging**: Any failed or slow run can be diagnosed after the fact from persistent, structured log files. +2. **Performance visibility**: Every sync/ingest/embed/search operation reports granular stage-level timing, both to the terminal and to persistent storage. +3. **Ergonomic verbosity**: Users and agents control log verbosity through CLI flags (`-v`, `-vv`, `-vvv`) without needing to know `RUST_LOG` syntax. +4. **Machine-parseable logs**: A JSON log mode for piping into log aggregators (jq, Datadog, Loki, etc.). +5. **Agent-friendly metrics**: Robot mode JSON output includes comprehensive timing breakdowns for every command, enabling automated monitoring. + +### Secondary + +6. **Log rotation and retention**: Log files don't grow unbounded; old logs are automatically cleaned up. +7. **Correlation IDs**: Every sync run gets a unique ID that connects log lines, database records, and robot output. +8. **Rate limit and retry transparency**: Every rate-limited request and retry is visible in logs with full context. +9. **Sync history with metrics**: The `sync_runs` table is enriched with per-stage timing, item counts, and error counts so `lore sync-status` becomes a real dashboard. 
+ +### Non-Goals + +- External telemetry export (OpenTelemetry, Prometheus) -- out of scope for v1. +- Real-time log streaming / tailing UI. +- Alerting or notification systems. +- Distributed tracing across multiple lore instances. + +--- + +## 3. Research Foundation + +### 3.1 The Three Pillars of Observability + +Academic and industry consensus (Gholamian & Ward 2021, "A Comprehensive Survey of Logging in Software") identifies three pillars: + +1. **Logs** -- Discrete events with context. The foundation. +2. **Metrics** -- Numerical measurements over time (counters, gauges, histograms). +3. **Traces** -- Causally ordered spans representing operations. + +For a CLI tool (not a long-running service), the mapping is: + +| Pillar | CLI Equivalent | +|---|---| +| Logs | Structured log files per invocation | +| Metrics | Per-stage timing, item counts, error counts stored in DB | +| Traces | Span hierarchy within a single invocation (sync -> ingest issues -> fetch page N -> sync discussions) | + +### 3.2 Structured Logging Best Practices + +From Duan et al. 2025 ("PDLogger: Automated Logging Framework for Practical Software Development") and industry practice: + +- **Always structured**: JSON or key=value, never free-form prose in production logs. +- **Contextual fields propagate**: A sync_run_id set at the top level appears in every downstream log line. +- **Levels have semantic meaning**: + - `ERROR`: Operation failed, requires attention. + - `WARN`: Degraded behavior (rate limited, retry, skip). + - `INFO`: Significant state transitions (stage start/complete, items processed). + - `DEBUG`: Detailed operational data (page boundaries, individual API calls). + - `TRACE`: Wire-level detail (request/response bodies, SQL queries). 
+ +### 3.3 CLI Verbosity Conventions + +From the GNU Coding Standards, POSIX conventions, and modern Rust CLI tools (ripgrep, fd, cargo): + +| Pattern | Meaning | Precedent | +|---|---|---| +| (default) | INFO for app, WARN for deps | cargo, rustc | +| `-q` / `--quiet` | Suppress non-error output | ripgrep, fd, cargo | +| `-v` | DEBUG for app | ripgrep, fd | +| `-vv` | DEBUG for app + deps | cargo | +| `-vvv` | TRACE for everything | cargo, curl | +| `RUST_LOG=...` | Fine-grained override | Universal in Rust | + +The `-v` flag should feel familiar to anyone who has used cargo, curl, or ssh. + +### 3.4 Log File Rotation + +`tracing-appender` (from the tokio-rs/tracing ecosystem) provides: + +- **Daily rotation**: New file per day, named `lore.2026-02-04.log`. +- **Non-blocking writes**: Dedicated writer thread, zero impact on main async runtime. +- **Configurable retention**: Delete files older than N days. + +This is the canonical solution in the Rust tracing ecosystem and requires no custom code. + +### 3.5 Performance Metrics for CLI Tools + +Inspired by hyperfine's approach to benchmarking and cargo's `--timings` flag: + +- Report wall-clock time per stage. +- Report item throughput (items/sec). +- Store historical runs for trend comparison. +- Present timing data in both human-readable and machine-readable formats. + +--- + +## 4. Design + +### 4.1 Architecture Overview + +``` + CLI Invocation + | + +----------+----------+ + | | + Interactive Mode Robot Mode + | | + +---stderr (human fmt) stdout (JSON envelope) + | | | + | progress bars { ok, data, meta: { + | colored output elapsed_ms, + | stages: [...], + | run_id + | }} + | + Log Subscribers (layered) + | + +----+----+--------+ + | | | + stderr file (future: + (fmt) (JSON) OTLP) +``` + +### 4.2 Subscriber Stack + +Replace the current single-layer subscriber with a layered registry. 
Each layer has its own filter: + +``` +registry() + .with(stderr_layer.with_filter(stderr_filter)) // Human-readable, SuspendingWriter, -v controlled + .with(file_layer.with_filter(file_filter)) // JSON, daily rotation, always DEBUG+ +``` + +**stderr layer**: Same `fmt::layer()` as today with `SuspendingWriter`, but level controlled by `-v` flags. When `--log-format json` is passed, this layer switches to `fmt::layer().json()` (same JSON format as file layer, but still routed through `SuspendingWriter` for progress bar coordination). + +**file layer**: Always-on JSON output to `~/.local/share/lore/logs/`, daily rotation via `tracing-appender`. Uses its own `EnvFilter` set to `lore=debug,warn` regardless of `-v` flags, ensuring post-mortem data is always available. The file layer does NOT use `SuspendingWriter` — it writes to a file, not stderr, so progress bar coordination is unnecessary. + +**Filter architecture**: Per-layer filtering (not a single shared `EnvFilter`) is required because the file layer must always be at DEBUG+ while stderr follows `-v`. `tracing-subscriber`'s `Layer::with_filter()` method enables this. + +**`RUST_LOG` override**: When `RUST_LOG` is set, it overrides BOTH layer filters. This is the expert escape hatch. + +**Current subscriber** (`src/main.rs:44-58`): +```rust +tracing_subscriber::registry() + .with( + tracing_subscriber::fmt::layer() + .with_target(false) + .with_writer(lore::cli::progress::SuspendingWriter), + ) + .with( + EnvFilter::from_default_env() + .add_directive("lore=info".parse().unwrap()) + .add_directive("warn".parse().unwrap()), + ) + .init(); +``` + +This will be replaced by the dual-layer setup. The `SuspendingWriter` integration and `with_target(false)` on the stderr layer remain unchanged. + +### 4.3 Verbosity Levels + +#### stderr layer (controlled by `-v` flags) + +| Flags | App Level | Dep Level | Behavior | +|---|---|---|---| +| (none) | INFO | WARN | Default. Stage transitions, summaries. 
| +| `-q` | WARN | ERROR | Errors and warnings only. | +| `-v` | DEBUG | WARN | Detailed app behavior. API pages, skip reasons. | +| `-vv` | DEBUG | INFO | App + dependency detail. HTTP client, SQLite. | +| `-vvv` | TRACE | DEBUG | Everything. Wire-level detail. | +| `RUST_LOG=...` | (overrides all) | (overrides all) | Expert escape hatch. | + +Precedence: `RUST_LOG` > `-v` flags > defaults. This matches cargo's behavior. + +#### file layer (independent of `-v` flags) + +| Condition | App Level | Dep Level | +|---|---|---| +| Always (default) | DEBUG | WARN | +| `RUST_LOG=...` set | (overrides) | (overrides) | + +The file layer always captures DEBUG+ for the `lore` crate and WARN+ for dependencies. This ensures post-mortem data is available even when the user ran with default stderr verbosity. `RUST_LOG` overrides both layers when set. + +#### New CLI flags + +Add to the `Cli` struct (`src/cli/mod.rs`): + +```rust +/// Increase log verbosity (-v, -vv, -vvv) +#[arg(short = 'v', long = "verbose", action = clap::ArgAction::Count, global = true)] +pub verbose: u8, + +/// Log format for stderr output: text (default) or json +#[arg(long = "log-format", global = true, value_parser = ["text", "json"], default_value = "text")] +pub log_format: String, +``` + +The `-v` flag uses `clap::ArgAction::Count` to support `-v`, `-vv`, `-vvv` as a single flag with increasing count. The `--log-format` flag controls whether stderr emits human-readable or JSON-formatted log lines. 
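The precedence table above reduces to a small pure function that maps the flag state to an `EnvFilter` directive string. A minimal std-only sketch (the function name is hypothetical; the caller would still check `RUST_LOG` first and parse the result with `EnvFilter::try_new`):

```rust
/// Map `-q` and the `-v` count to an EnvFilter-style directive string,
/// per the verbosity table: `-q` wins over `-v`, and RUST_LOG (handled
/// by the caller) overrides both.
fn verbosity_directives(quiet: bool, verbose: u8) -> String {
    if quiet {
        // -q: app WARN, deps ERROR
        return "lore=warn,error".to_string();
    }
    match verbose {
        0 => "lore=info,warn",   // default: app INFO, deps WARN
        1 => "lore=debug,warn",  // -v: app DEBUG, deps WARN
        2 => "lore=debug,info",  // -vv: app DEBUG, deps INFO
        _ => "lore=trace,debug", // -vvv and beyond: app TRACE, deps DEBUG
    }
    .to_string()
}

fn main() {
    assert_eq!(verbosity_directives(false, 0), "lore=info,warn");
    assert_eq!(verbosity_directives(true, 3), "lore=warn,error"); // quiet wins
    println!("{}", verbosity_directives(false, 1));
}
```

Keeping this mapping as a pure function makes the `test_verbosity_filter_construction` unit test in Section 6.1 trivial.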
+
+### 4.4 Structured Log File Output
+
+**Location**: `~/.local/share/lore/logs/lore.YYYY-MM-DD.log`
+
+**Format**: One JSON object per line (JSONL), produced by `tracing-subscriber`'s `fmt::layer().json()`:
+
+```json
+{"timestamp":"2026-02-04T14:32:01.123Z","level":"INFO","target":"lore::ingestion","fields":{"message":"Discussion sync complete","project":"group/repo","issues_synced":42,"elapsed_ms":1234},"span":{"name":"ingest_issues","run_id":"a1b2c3"}}
+```
+
+**Rotation**: Daily via `tracing-appender::rolling::daily()`.
+
+**Retention**: Configurable, default 30 days. A `logs.retention_days` config field. Cleanup runs at startup (scan the directory, delete files older than N days).
+
+### 4.5 Tracing Spans
+
+Introduce spans for causal correlation within a single invocation:
+
+```
+sync (run_id=uuid)
+ +-- ingest_issues
+ |    +-- fetch_pages (project="group/repo")
+ |    +-- sync_discussions (project="group/repo")
+ |    +-- fetch_resource_events (project="group/repo")
+ +-- ingest_mrs
+ |    +-- fetch_pages (project="group/repo")
+ |    +-- sync_discussions (project="group/repo")
+ +-- generate_docs
+ +-- embed
+```
+
+Each span records `elapsed_ms` on close. The `run_id` propagates to all child spans and log events, so `jq 'select(.span.run_id == "a1b2c3")' lore.2026-02-04.log` extracts an entire run.
+
+### 4.6 Performance Metrics
+
+#### 4.6.1 Per-Stage Timing
+
+Every command collects a `Vec<StageTiming>`:
+
+```rust
+#[derive(Debug, Clone, Serialize)]
+pub struct StageTiming {
+    pub name: String,                 // "ingest_issues", "fetch_pages", etc.
+    #[serde(skip_serializing_if = "Option::is_none")]
+    pub project: Option<String>,      // Which project, if applicable
+    pub elapsed_ms: u64,
+    pub items_processed: usize,
+    #[serde(skip_serializing_if = "is_zero")]
+    pub items_skipped: usize,
+    #[serde(skip_serializing_if = "is_zero")]
+    pub errors: usize,
+    #[serde(skip_serializing_if = "Vec::is_empty")]
+    pub sub_stages: Vec<StageTiming>, // Nested child stages
+}
+```
+
+**Collection mechanism**: Stage timing is materialized from tracing spans, not plumbed manually through function signatures. Phase 2 adds `#[instrument]` spans to each sync stage. Phase 3 adds a custom `tracing-subscriber` layer that records span enter/exit times and structured fields, then extracts the span tree into a `Vec<StageTiming>` when the root span closes.
+
+This means:
+- No mutable timing collector threaded through `run_ingest` → `fetch_pages` → `sync_discussions`
+- Spans are the single source of truth for timing
+- `StageTiming` is a materialized view of the span tree
+- The custom layer implements `on_close` to capture `elapsed` and `on_record` to capture structured fields like `items_processed`
+
+**Where to define**: `src/core/metrics.rs` (new file — genuinely new functionality that doesn't fit in any existing file)
+
+#### 4.6.2 Robot JSON Meta Enhancement
+
+Currently:
+```json
+{ "ok": true, "data": {...}, "meta": { "elapsed_ms": 1234 } }
+```
+
+Proposed:
+```json
+{
+  "ok": true,
+  "data": { ...
}, + "meta": { + "run_id": "a1b2c3d4", + "elapsed_ms": 45230, + "stages": [ + { + "name": "ingest_issues", + "elapsed_ms": 12340, + "items_processed": 150, + "items_skipped": 30, + "errors": 0, + "sub_stages": [ + { "name": "fetch_pages", "project": "group/repo", "elapsed_ms": 5200, "items_processed": 150 }, + { "name": "sync_discussions", "project": "group/repo", "elapsed_ms": 6800, "items_processed": 42, "items_skipped": 108 } + ] + }, + { + "name": "ingest_mrs", + "elapsed_ms": 18900, + "items_processed": 85, + "items_skipped": 12, + "errors": 1 + }, + { "name": "generate_docs", "elapsed_ms": 8500, "items_processed": 235 }, + { "name": "embed", "elapsed_ms": 5490, "items_processed": 1024 } + ] + } +} +``` + +#### 4.6.3 Sync History Enrichment + +**Prerequisite bug fix**: The `sync_runs` table (migration 001) exists with columns `id`, `started_at`, `heartbeat_at`, `finished_at`, `status`, `command`, `error`, `metrics_json` — but **no code ever writes to it**. The `sync_status.rs` command reads from it but always gets zero rows. This must be fixed before enrichment. + +**Step 1: Wire up sync_runs lifecycle** (prerequisite, in Phase 4) + +Add INSERT/UPDATE calls to the sync and ingest command handlers: + +```rust +// On sync/ingest start: +INSERT INTO sync_runs (started_at, heartbeat_at, status, command) +VALUES (?now_ms, ?now_ms, 'running', ?command_name) +RETURNING id; + +// On sync/ingest success: +UPDATE sync_runs +SET finished_at = ?now_ms, status = 'succeeded', metrics_json = ?metrics +WHERE id = ?run_id; + +// On sync/ingest failure: +UPDATE sync_runs +SET finished_at = ?now_ms, status = 'failed', error = ?error_msg, metrics_json = ?metrics +WHERE id = ?run_id; +``` + +**Where**: Add a `SyncRunRecorder` helper in `src/core/db.rs` or `src/core/sync_run.rs` that encapsulates the INSERT/UPDATE lifecycle. Called from `run_sync()` in `src/cli/commands/sync.rs` and `run_ingest()` in `src/cli/commands/ingest.rs`. 
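The lifecycle above can be modeled without a live database: the recorder only ever issues one INSERT and one of two UPDATEs. A std-only sketch of the statement selection (names hypothetical; the SQL mirrors the snippets above, rewritten with rusqlite-style `?N` positional placeholders — binding is left to the caller):

```rust
/// SQL issued when a sync/ingest run starts. `?1` = now_ms, `?2` = command.
const START_RUN_SQL: &str = "INSERT INTO sync_runs (started_at, heartbeat_at, status, command) \
     VALUES (?1, ?1, 'running', ?2) RETURNING id";

/// Terminal states of a run, mirroring the success/failure UPDATEs above.
enum RunOutcome {
    Succeeded,
    Failed,
}

/// Pick the UPDATE for finishing a run.
fn finish_run_sql(outcome: RunOutcome) -> &'static str {
    match outcome {
        RunOutcome::Succeeded => {
            "UPDATE sync_runs SET finished_at = ?1, status = 'succeeded', metrics_json = ?2 WHERE id = ?3"
        }
        RunOutcome::Failed => {
            "UPDATE sync_runs SET finished_at = ?1, status = 'failed', error = ?2, metrics_json = ?3 WHERE id = ?4"
        }
    }
}

fn main() {
    assert!(START_RUN_SQL.contains("RETURNING id"));
    assert!(finish_run_sql(RunOutcome::Failed).contains("error = ?2"));
    println!("{}", finish_run_sql(RunOutcome::Succeeded));
}
```

Note `RETURNING` requires SQLite ≥ 3.35; older rusqlite setups would use `last_insert_rowid()` instead.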
+ +**Step 2: Schema migration** (migration 014) + +Add dedicated queryable columns alongside the existing `metrics_json`: + +```sql +-- Migration 014: sync_runs enrichment for observability +ALTER TABLE sync_runs ADD COLUMN run_id TEXT; +ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0; +ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0; + +-- Index for correlation queries +CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id); +``` + +The existing `metrics_json` column stores the detailed `Vec` as a JSON array. No need for a separate `stages_json` column. + +**Step 3: Enhanced sync-status display** + +`lore sync-status` (`src/cli/commands/sync_status.rs`) currently shows only the last run. Enhance to show recent runs with metrics: + +``` +Recent sync runs: + Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error + Run d4e5f6 | 2026-02-03 14:30 | 38.1s | 220 items | 0 errors + Run g7h8i9 | 2026-02-02 14:29 | 42.7s | 228 items | 0 errors +``` + +Robot mode (`lore --robot sync-status`): +```json +{ + "ok": true, + "data": { + "runs": [ + { + "run_id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d", + "started_at": "2026-02-04T14:32:01.123Z", + "elapsed_ms": 45230, + "status": "succeeded", + "command": "sync", + "total_items_processed": 235, + "total_errors": 1, + "stages": [...] + } + ], + "cursors": [...], + "summary": {...} + } +} +``` + +The `stages` array is parsed from `metrics_json` and included in the robot output. Interactive mode shows the summary table above; `lore --robot sync-status --run a1b2c3` shows a single run's full stage breakdown. + +#### 4.6.4 Human-Readable Timing + +At the end of `lore sync` (interactive mode), print a timing summary: + +``` +Sync complete in 45.2s + Ingest issues .... 12.3s (150 items, 42 discussions) + Ingest MRs ....... 18.9s (85 items, 1 error) + Generate docs .... 8.5s (235 documents) + Embed ............ 
5.5s (1024 chunks)
+```
+
+Gated behind `display.show_text` so it doesn't appear in progress_only or silent modes.
+
+### 4.7 Rate Limit and Retry Transparency
+
+Currently, rate limits emit a `tracing::warn!`. Enhance to:
+
+- Log at INFO level (not just WARN) with structured fields: `info!(path, attempt, retry_after_secs, "Rate limited, retrying")`.
+- Count total rate-limit hits per run and include them in stage timing.
+- In `-v` mode, show retry progress on stderr: `  Retrying /api/v4/projects/123/issues (429, waiting 2s)`.
+
+### 4.8 Configuration
+
+Add a new `logging` section to `Config` (`src/core/config.rs`):
+
+```rust
+#[derive(Debug, Clone, Deserialize)]
+pub struct LoggingConfig {
+    /// Directory for log files. Default: ~/.local/share/lore/logs/
+    #[serde(default)]
+    pub log_dir: Option<PathBuf>,
+
+    /// Days to retain log files. Default: 30. Set to 0 to disable file logging.
+    #[serde(default = "default_retention_days")]
+    pub retention_days: u32,
+
+    /// Enable JSON log files. Default: true.
+    #[serde(default = "default_true")]
+    pub file_logging: bool,
+}
+
+// Required by `#[serde(default)]` on the `logging` field of `Config`.
+impl Default for LoggingConfig {
+    fn default() -> Self {
+        Self { log_dir: None, retention_days: default_retention_days(), file_logging: true }
+    }
+}
+
+fn default_retention_days() -> u32 { 30 }
+fn default_true() -> bool { true }
+```
+
+Add to the `Config` struct:
+
+```rust
+#[serde(default)]
+pub logging: LoggingConfig,
+```
+
+With `config.json`:
+```json
+{
+  "logging": {
+    "log_dir": null,
+    "retention_days": 30,
+    "file_logging": true
+  }
+}
+```
+
+Defaults are sane, so existing configs continue working with zero changes.
+
+**CLI flags** (added to the `Cli` struct in `src/cli/mod.rs`):
+
+| Flag | Type | Default | Description |
+|---|---|---|---|
+| `-v` / `--verbose` | count (u8) | 0 | Increase stderr log verbosity. Stacks: `-v`, `-vv`, `-vvv`. |
+| `--log-format` | text \| json | text | Stderr log format. `json` emits one JSON object per log line (same schema as the file layer). |
+
+These are global flags (`global = true`), consistent with the existing `--quiet`, `--robot`, etc.
+
+---
+
+## 5.
Implementation Plan + +### Phase 1: Verbosity Flags + Structured File Logging + +**Scope**: CLI flags, dual-layer subscriber, file logging, rotation, retention, `--log-format`. + +**Files touched**: +- `Cargo.toml` — add `tracing-appender` dependency +- `src/cli/mod.rs` — add `-v`/`--verbose` (count) and `--log-format` flags to `Cli` struct +- `src/main.rs` — replace subscriber initialization (lines 44-58) with dual-layer setup +- `src/core/config.rs` — add `LoggingConfig` struct and `logging` field to `Config` +- `src/core/paths.rs` — add `get_log_dir()` helper (XDG data dir + `/logs/`) +- `src/cli/commands/doctor.rs` — add log file location and disk usage check + +**Implementation steps**: +1. Add `-v` / `--verbose` (count, `u8`) and `--log-format` (text|json) flags to `Cli` struct. +2. Add `tracing-appender` dependency to `Cargo.toml`. +3. Add `LoggingConfig` to `Config` with `#[serde(default)]`. +4. Add `get_log_dir()` to `src/core/paths.rs` (mirrors `get_db_path()` pattern). +5. Replace subscriber init in `main.rs`: + - Build `stderr_filter` from `-v` count (or `RUST_LOG` if set). + - Build `file_filter` as `lore=debug,warn` (or `RUST_LOG` if set). + - stderr layer: `fmt::layer().with_writer(SuspendingWriter)` with `stderr_filter`. When `--log-format json`, chain `.json()`. + - file layer: `fmt::layer().json().with_writer(tracing_appender::rolling::daily(log_dir, "lore"))` with `file_filter`. + - Combine via `registry().with(stderr_layer.with_filter(stderr_filter)).with(file_layer.with_filter(file_filter))`. +6. Implement log retention at startup: scan `log_dir`, delete files matching `lore.*.log` pattern older than `retention_days`. Run before subscriber init so deleted files aren't held open. +7. Add log file check to `lore doctor`: report log directory path, number of log files, total disk usage. In robot mode, add a `logging` field to `DoctorChecks` with `log_dir`, `file_count`, `total_bytes`, `oldest_file`. 
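Step 6's retention scan needs no date library: `tracing-appender`'s daily file names embed an ISO date, and ISO dates compare correctly as strings. A minimal std-only sketch of the deletion predicate (function names hypothetical; the caller computes the cutoff date and walks the directory):

```rust
/// Extract the YYYY-MM-DD portion of a rotated log filename, or None if the
/// name doesn't match the `lore.<date>.log` pattern.
fn log_file_date(name: &str) -> Option<&str> {
    let date = name.strip_prefix("lore.")?.strip_suffix(".log")?;
    let bytes = date.as_bytes();
    let well_formed = date.len() == 10
        && bytes[4] == b'-'
        && bytes[7] == b'-'
        && date
            .chars()
            .enumerate()
            .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit());
    well_formed.then_some(date)
}

/// ISO dates sort lexicographically, so "older than the cutoff" is a plain
/// string comparison. Non-matching files (e.g. `other.txt`) are never deleted.
fn should_delete(name: &str, cutoff_date: &str) -> bool {
    matches!(log_file_date(name), Some(d) if d < cutoff_date)
}

fn main() {
    assert!(should_delete("lore.2026-01-01.log", "2026-01-28"));
    assert!(!should_delete("lore.2026-02-04.log", "2026-01-28"));
    assert!(!should_delete("other.txt", "2026-01-28"));
    println!("retention predicate ok");
}
```

Keeping the predicate pure makes `test_log_retention_cleanup` and `test_log_retention_ignores_non_log_files` (Section 6.1) testable without touching the filesystem.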
+
+**New dependencies**: `tracing-appender` (0.2)
+
+**Interaction with `-q`/`--quiet`**: The existing `--quiet` flag suppresses non-error terminal output via `IngestDisplay::silent()`. It should NOT affect file logging (the file layer is always on). When `-q` and `-v` are both passed, `-q` wins for stderr (set the stderr filter to WARN+). The file layer remains at DEBUG+.
+
+**Tests** (see Section 6.1 for details):
+- Unit: `EnvFilter` construction from verbosity count (0→INFO, 1→DEBUG, 2→DEBUG+deps, 3→TRACE)
+- Unit: `RUST_LOG` overrides `-v` flags
+- Unit: `-q` + `-v` interaction (quiet wins)
+- Unit: `LoggingConfig` deserialization with missing/partial/full fields
+- Unit: Log retention deletes old files, preserves recent ones
+- Integration: Subscriber produces JSON lines to a test file
+- Integration: `SuspendingWriter` still works with the dual-layer stack (no garbled output)
+
+### Phase 2: Spans + Correlation IDs
+
+**Scope**: Tracing spans, UUID-based `run_id`, span recording for JSON logs.
+
+**Depends on**: Phase 1 (subscriber must support span recording).
+
+**Files touched**:
+- `src/cli/commands/sync.rs` — add root span with `run_id` field to `run_sync()`
+- `src/cli/commands/ingest.rs` — add `#[instrument]` spans to `run_ingest()` and its stages
+- `src/ingestion/orchestrator.rs` — add spans for `fetch_pages`, `sync_discussions`, `fetch_resource_events`
+- `src/documents/regenerator.rs` — add span for the `generate_docs` stage
+- `src/embedding/pipeline.rs` — add span for the `embed` stage
+- `src/main.rs` — generate `run_id` before calling the command handler, pass as a field
+
+**Implementation steps**:
+1. Generate `run_id` as the first 8 characters of the hyphenated UUIDv4 string — `Uuid::new_v4().to_string()[..8].to_string()` — at command entry in `main.rs`. No new dependency needed — `uuid` v1 with the v4 feature is already in `Cargo.toml`.
+2. Create the root span: `let _root = tracing::info_span!("sync", run_id = %run_id).entered();` (or the equivalent for each command).
+3.
Add `#[instrument(skip_all, fields(stage = "ingest_issues"))]` to ingest stages.
+4. Add `#[instrument(skip_all, fields(project = %project_path))]` to per-project functions.
+5. Ensure the file layer's JSON formatter includes span context. `tracing-subscriber`'s `fmt::layer().json()` includes the current span chain by default when the registry has span storage enabled.
+6. Verify: parse a log file, confirm every line includes `span.run_id`.
+
+**New dependencies**: None (`uuid` already present).
+
+**Tests**:
+- Unit: `run_id` is a valid 8-character hex string
+- Integration: Run a sync-like operation with spans, parse JSON log output, verify every line contains `run_id` in span context
+- Integration: Nested spans produce correct parent-child relationships in JSON output
+
+### Phase 3: Performance Metrics Collection
+
+**Scope**: `StageTiming` struct, span-to-metrics extraction, robot JSON enrichment, timing summary.
+
+**Depends on**: Phase 2 (spans must exist to extract timing from).
+
+**Files touched**:
+- `src/core/metrics.rs` — new file: `StageTiming` struct, `MetricsLayer` (custom tracing layer), span-to-timing extraction
+- `src/cli/commands/sync.rs` — consume `Vec<StageTiming>` from `MetricsLayer`, include in `SyncMeta`
+- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
+- `src/main.rs` — register `MetricsLayer` in the subscriber stack
+
+**Implementation steps**:
+1. Define the `StageTiming` struct with `sub_stages: Vec<StageTiming>` in `src/core/metrics.rs`.
+2. Implement `MetricsLayer` as a custom `tracing_subscriber::Layer`:
+   - `on_new_span`: Record span ID, name, parent, start time.
+   - `on_record`: Capture structured fields (`items_processed`, `items_skipped`, `errors`) recorded via `Span::record()`.
+   - `on_close`: Calculate `elapsed_ms`, build a `StageTiming` entry, attach it to the parent.
+   - Provide `fn extract_timings(&self, run_id: &str) -> Vec<StageTiming>` to materialize the span tree after the root span closes.
+3.
Store the `MetricsLayer` reference (behind an `Arc`) so command handlers can call `extract_timings()` after `run_sync()` completes.
+4. Extend `SyncMeta` and `SyncJsonOutput` to include `run_id: String` and `stages: Vec<StageTiming>`.
+5. Print the human-readable timing summary at the end of interactive sync (gated behind `IngestDisplay::show_text`).
+
+**Span field recording**: Sync stages must declare item-count fields on the span (initially `Empty`) so `MetricsLayer` can capture the values recorded later — `Span::record()` is a no-op for fields not declared when the span is created:
+```rust
+let span = tracing::info_span!(
+    "ingest_issues",
+    items_processed = tracing::field::Empty,
+    items_skipped = tracing::field::Empty,
+);
+let _guard = span.enter();
+// ... do work ...
+span.record("items_processed", count);
+span.record("items_skipped", skipped);
+```
+
+**Tests**:
+- Unit: `StageTiming` serialization matches expected JSON (including nested `sub_stages`)
+- Unit: `MetricsLayer` correctly builds a span tree from synthetic span events
+- Unit: `MetricsLayer` handles spans with no children (leaf stages like `embed`)
+- Unit: `MetricsLayer` handles concurrent spans (multiple projects in parallel)
+- Integration: `lore --robot sync` output includes a `meta.stages` array with correct nesting
+- Integration: Interactive sync prints the timing summary table to stderr
+
+### Phase 4: Sync History Enrichment
+
+**Scope**: Wire up the `sync_runs` INSERT/UPDATE lifecycle, schema migration, enhanced sync-status.
+
+**Depends on**: Phase 3 (needs `Vec<StageTiming>` to store in `metrics_json`).
+
+**Files touched**:
+- `migrations/014_sync_runs_enrichment.sql` — new migration: add `run_id`, `total_items_processed`, `total_errors` columns + index
+- `src/core/sync_run.rs` — new file: `SyncRunRecorder` struct encapsulating INSERT on start, UPDATE on finish
+- `src/cli/commands/sync.rs` — create `SyncRunRecorder` before the pipeline, finalize after
+- `src/cli/commands/ingest.rs` — same pattern for standalone ingest
+- `src/cli/commands/sync_status.rs` — enhance to show recent runs with metrics, parse `metrics_json`
+
+**Implementation steps**:
+1.
Create migration `014_sync_runs_enrichment.sql`:
+   ```sql
+   ALTER TABLE sync_runs ADD COLUMN run_id TEXT;
+   ALTER TABLE sync_runs ADD COLUMN total_items_processed INTEGER DEFAULT 0;
+   ALTER TABLE sync_runs ADD COLUMN total_errors INTEGER DEFAULT 0;
+   CREATE INDEX idx_sync_runs_run_id ON sync_runs(run_id);
+   ```
+   Note: Migration number 014 assumes no other migration is added before this phase. If concurrent work adds a migration 014, renumber accordingly.
+2. Implement `SyncRunRecorder` (holds only the row id; the connection is passed to each call so ownership stays with the caller):
+   ```rust
+   pub struct SyncRunRecorder { id: i64 }
+   impl SyncRunRecorder {
+       pub fn start(conn: &Connection, command: &str, run_id: &str) -> Result<Self>;
+       pub fn succeed(self, conn: &Connection, metrics: &[StageTiming], total_items: usize, total_errors: usize) -> Result<()>;
+       pub fn fail(self, conn: &Connection, error: &str, metrics: Option<&[StageTiming]>) -> Result<()>;
+   }
+   ```
+3. In `run_sync()`: create `SyncRunRecorder::start()` before the pipeline, call `.succeed()` or `.fail()` after.
+4. In `run_ingest()`: same pattern.
+5. Enhance `sync_status.rs`:
+   - Query the last N runs (default 10) instead of just the last 1.
+   - Parse the `metrics_json` column to extract the stage breakdown.
+   - Show `run_id`, duration, item counts, and error counts in both interactive and robot modes.
+   - Add a `--run <run_id>` flag to `sync-status` for a single-run detail view.
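Step 5's interactive table is a straightforward formatting exercise. A std-only sketch of one row (the function name is hypothetical; the layout matches the example table in Section 4.6.3):

```rust
/// Format one sync_runs row for the interactive `lore sync-status` table.
fn format_run_row(run_id: &str, started: &str, elapsed_ms: u64, items: u64, errors: u64) -> String {
    let secs = elapsed_ms as f64 / 1000.0;
    // Pluralize "error(s)" to match the sample output in the PRD.
    let noun = if errors == 1 { "error" } else { "errors" };
    format!("  Run {run_id} | {started} | {secs:.1}s | {items} items | {errors} {noun}")
}

fn main() {
    let row = format_run_row("a1b2c3", "2026-02-04 14:32", 45230, 235, 1);
    assert_eq!(row, "  Run a1b2c3 | 2026-02-04 14:32 | 45.2s | 235 items | 1 error");
    println!("{row}");
}
```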
+ +**Tests**: +- Unit: `SyncRunRecorder::start` inserts a row with status='running' +- Unit: `SyncRunRecorder::succeed` updates status, sets finished_at, writes metrics_json +- Unit: `SyncRunRecorder::fail` updates status, sets error, sets finished_at +- Unit: Migration 014 applies cleanly on top of migration 013 +- Integration: `lore sync` creates a sync_runs row; `lore sync-status` displays it +- Integration: `lore --robot sync-status` JSON includes `runs` array with stage breakdowns +- Integration: Failed sync records error in sync_runs with partial metrics + +### Phase 5: Rate Limit + Retry Instrumentation + +**Scope**: Enhanced logging in GitLab client, retry counters in stage timing. + +**Depends on**: Phase 2 (spans for context), Phase 3 (StageTiming for counters). + +**Files touched**: +- `src/gitlab/client.rs` (or wherever the HTTP client with retry logic lives) — add structured fields to retry/rate-limit log events +- `src/core/metrics.rs` — add `rate_limit_hits` and `retries` fields to `StageTiming` + +**Implementation steps**: +1. Find the retry/rate-limit handling code (likely in the GitLab HTTP client). Add structured tracing fields: + ```rust + info!( + path = %request_path, + attempt = attempt_number, + retry_after_secs = retry_after, + status_code = 429, + "Rate limited, retrying" + ); + ``` +2. Add `rate_limit_hits: usize` and `retries: usize` fields to `StageTiming` (with `#[serde(skip_serializing_if = "is_zero")]`). +3. In `MetricsLayer`, count rate-limit and retry events within each span and include in `StageTiming`. +4. In `-v` mode, the existing stderr layer already shows INFO+ events, so retry activity becomes visible automatically. No additional work needed beyond step 1. 
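Step 3's per-stage counting can be modeled independently of tracing: the layer just accumulates two counters per stage span. A std-only sketch of that accumulation (type and method names hypothetical):

```rust
use std::collections::HashMap;

/// Per-stage retry/rate-limit counters, keyed by stage (span) name.
/// Tuple is (rate_limit_hits, retries).
#[derive(Default)]
struct RetryCounters {
    by_stage: HashMap<String, (usize, usize)>,
}

impl RetryCounters {
    /// Record one retry event. Every event counts as a retry; a 429 status
    /// additionally counts as a rate-limit hit.
    fn record(&mut self, stage: &str, status_code: u16) {
        let entry = self.by_stage.entry(stage.to_string()).or_default();
        if status_code == 429 {
            entry.0 += 1;
        }
        entry.1 += 1;
    }
}

fn main() {
    let mut counters = RetryCounters::default();
    counters.record("ingest_issues", 429); // rate limited
    counters.record("ingest_issues", 503); // transient server error, retried
    assert_eq!(counters.by_stage["ingest_issues"], (1, 2));
    println!("{:?}", counters.by_stage["ingest_issues"]);
}
```

The real `MetricsLayer` would copy these totals into the `rate_limit_hits` / `retries` fields of the enclosing stage's `StageTiming` on span close.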
+ +**Tests**: +- Unit: Rate-limit log events include all required structured fields +- Unit: `StageTiming` serialization includes `rate_limit_hits` and `retries` when non-zero, omits when zero +- Integration: Simulate 429 response, verify log line has `path`, `attempt`, `retry_after_secs` fields +- Integration: After simulated retries, `StageTiming` counts match expected values + +--- + +## 6. Acceptance Criteria + +### 6.1 Phase 1: Verbosity Flags + Structured File Logging + +**Functional criteria**: +- [ ] `lore sync` writes JSON log lines to `~/.local/share/lore/logs/lore.YYYY-MM-DD.log` with zero configuration. +- [ ] `lore -v sync` shows DEBUG-level `lore::*` output on stderr; dependency output stays at WARN. +- [ ] `lore -vv sync` shows DEBUG-level `lore::*` + INFO-level dependency output on stderr. +- [ ] `lore -vvv sync` shows TRACE-level output for everything on stderr. +- [ ] `RUST_LOG=lore::gitlab=trace lore sync` overrides `-v` flags for both stderr and file layers. +- [ ] `lore --log-format json sync` emits JSON-formatted log lines on stderr (same schema as file layer). +- [ ] Log files rotate daily (new file per calendar day). +- [ ] Files matching `lore.*.log` older than `retention_days` are deleted on startup. +- [ ] Existing behavior is unchanged when no new flags are passed (INFO on stderr, human-readable format). +- [ ] `--quiet` suppresses non-error stderr output. `-q` + `-v` together: `-q` wins (stderr at WARN+). +- [ ] `--quiet` does NOT affect file logging (file layer remains at DEBUG+). +- [ ] `lore doctor` reports: log directory path, number of log files, total disk usage in bytes. Robot mode includes a `logging` field in the checks JSON. +- [ ] File layer always logs at DEBUG+ for `lore::*` crate regardless of `-v` flags. + +**Test specifications**: +- `test_verbosity_filter_construction`: Given verbosity count 0/1/2/3, assert the resulting `EnvFilter` matches the expected directives table. 
+- `test_rust_log_overrides_verbose`: Set `RUST_LOG=lore=trace`, pass `-v` (count=1), assert the filter uses TRACE (not DEBUG). +- `test_quiet_overrides_verbose`: Pass `-q` and `-v` together, assert stderr filter is WARN+. +- `test_logging_config_defaults`: Deserialize an empty `{}` JSON as `LoggingConfig`, assert `retention_days=30`, `file_logging=true`, `log_dir=None`. +- `test_logging_config_partial`: Deserialize `{"retention_days": 7}`, assert `file_logging=true` default preserved. +- `test_log_retention_cleanup`: Create temp dir with files named `lore.2026-01-01.log` through `lore.2026-02-04.log`. Run retention with `retention_days=7`. Assert files older than 7 days are deleted, recent files preserved. +- `test_log_retention_ignores_non_log_files`: Create temp dir with `lore.2026-01-01.log` and `other.txt`. Run retention. Assert `other.txt` is NOT deleted. +- `test_json_log_output_format`: Capture file layer output, parse each line as JSON, assert keys: `timestamp`, `level`, `target`, `fields`, `span`. +- `test_suspending_writer_dual_layer`: Run a tracing event with both layers active and a progress bar. Assert no garbled output on stderr (no interleaved progress bar fragments in log lines). + +### 6.2 Phase 2: Spans + Correlation IDs + +**Functional criteria**: +- [ ] Every log line within a sync run includes `run_id` in the JSON span context. +- [ ] `jq 'select(.spans[] | .run_id != null)' lore.2026-02-04.log` extracts all lines from a run. +- [ ] Nested spans produce a chain: log lines inside `fetch_pages` include both the `fetch_pages` span and the parent `ingest_issues` span in their span context. +- [ ] `run_id` is an 8-character hex string (truncated UUIDv4). +- [ ] Spans are visible in `-vv` stderr output as bracketed context. + +**Test specifications**: +- `test_run_id_format`: Generate 100 run_ids, assert each is 8 chars, all hex characters. +- `test_run_id_uniqueness`: Generate 1000 run_ids, assert no duplicates. 
+- `test_span_context_in_json_logs`: Run a mock sync with spans, capture JSON log output, parse and verify each line has `spans` array containing `run_id`. +- `test_nested_span_chain`: Create parent span "sync" with child "ingest_issues" with child "fetch_pages". Emit a log event inside "fetch_pages". Assert the JSON log line's span chain includes all three span names. +- `test_span_elapsed_on_close`: Create a span, sleep 10ms, close it. Verify the close event records `elapsed_ms >= 10`. + +### 6.3 Phase 3: Performance Metrics Collection + +**Functional criteria**: +- [ ] `lore --robot sync` JSON includes `meta.run_id` (string) and `meta.stages` (array). +- [ ] Each stage in `meta.stages` has: `name`, `elapsed_ms`, `items_processed`. +- [ ] Top-level stages (ingest_issues, ingest_mrs, generate_docs, embed) have `sub_stages` arrays. +- [ ] Sub-stages include `project` field when applicable. +- [ ] `lore sync` (interactive) prints a timing summary table on stderr, gated behind `IngestDisplay::show_text`. +- [ ] `lore -q sync` does NOT print the timing summary. +- [ ] Zero-value fields (`items_skipped: 0`, `errors: 0`) are omitted from JSON output. + +**Test specifications**: +- `test_stage_timing_serialization`: Create a `StageTiming` with sub_stages, serialize to JSON, assert structure matches PRD example. +- `test_stage_timing_zero_fields_omitted`: Create `StageTiming` with `errors: 0`, serialize, assert no `errors` key in output. +- `test_metrics_layer_single_span`: Create `MetricsLayer`, enter/exit one span with recorded fields, extract timings, assert one `StageTiming` entry. +- `test_metrics_layer_nested_spans`: Create parent + child spans, extract timings, assert parent has child in `sub_stages`. +- `test_metrics_layer_parallel_spans`: Create two sibling spans (simulating two projects), extract timings, assert both appear as sub_stages of parent. 
+- `test_sync_meta_includes_stages`: Mock a sync pipeline, verify robot JSON output parses correctly with `meta.stages`.
+- `test_timing_summary_format`: Capture stderr during interactive sync, verify timing table format matches PRD example.
+
+### 6.4 Phase 4: Sync History Enrichment
+
+**Functional criteria**:
+- [ ] `lore sync` creates a row in `sync_runs` with status='running' at start, updated to 'succeeded'/'failed' at finish.
+- [ ] `lore ingest issues` also creates a `sync_runs` row.
+- [ ] `sync_runs.run_id` matches the `run_id` in log files and robot JSON.
+- [ ] `sync_runs.metrics_json` contains the serialized `Vec<StageTiming>`.
+- [ ] `sync_runs.total_items_processed` and `total_errors` are populated.
+- [ ] `lore sync-status` shows the last 10 runs with: run_id, timestamp, duration, item count, error count.
+- [ ] `lore --robot sync-status` JSON includes `runs` array with `stages` parsed from `metrics_json`.
+- [ ] Failed syncs record the error message and any partial metrics collected before failure.
+- [ ] Migration 014 applies cleanly and is idempotent (safe to re-run).
+
+**Test specifications**:
+- `test_sync_run_recorder_start`: Call `start()`, query sync_runs, assert one row with status='running'.
+- `test_sync_run_recorder_succeed`: Call `start()` then `succeed()`, assert row has status='succeeded', finished_at set, metrics_json parseable.
+- `test_sync_run_recorder_fail`: Call `start()` then `fail()`, assert row has status='failed', error set.
+- `test_sync_run_recorder_fail_with_partial_metrics`: Call `start()`, collect some metrics, then `fail()`. Assert metrics_json contains partial data.
+- `test_migration_014_applies`: Apply all migrations 001-014 on a fresh DB. Assert `sync_runs` has `run_id`, `total_items_processed`, `total_errors` columns.
+- `test_migration_014_idempotent`: Apply migration 014 twice. Assert no error on second apply.
+- `test_sync_status_shows_runs`: Insert 3 sync_runs rows, run `print_sync_status()`, assert output includes all 3 with correct formatting. +- `test_sync_status_json_includes_stages`: Insert a sync_runs row with metrics_json, run robot-mode sync-status, parse JSON, assert `runs[0].stages` is an array. + +### 6.5 Phase 5: Rate Limit + Retry Instrumentation + +**Functional criteria**: +- [ ] Rate-limit events (HTTP 429) log at INFO with structured fields: `path`, `attempt`, `retry_after_secs`, `status_code`. +- [ ] Retry events (non-429 transient errors) log with: `path`, `attempt`, `error`. +- [ ] `StageTiming` includes `rate_limit_hits` and `retries` counts (omitted when zero). +- [ ] `lore -v sync` shows retry activity on stderr (visible because it's INFO+). +- [ ] Rate limit counts are included in `metrics_json` stored in `sync_runs`. + +**Test specifications**: +- `test_rate_limit_log_fields`: Simulate a 429 response, capture log output, parse JSON, assert fields: `path`, `attempt`, `retry_after_secs`, `status_code`. +- `test_retry_log_fields`: Simulate a transient error + retry, capture log, assert fields: `path`, `attempt`, `error`. +- `test_stage_timing_rate_limit_counts`: Simulate 3 rate-limit hits within a span, extract `StageTiming`, assert `rate_limit_hits == 3`. +- `test_stage_timing_retry_counts`: Simulate 2 retries, extract `StageTiming`, assert `retries == 2`. +- `test_rate_limit_fields_omitted_when_zero`: Create `StageTiming` with zero rate limits, serialize, assert no `rate_limit_hits` key. + +--- + +## 7. Resolved Decisions + +1. **Log format**: Use `tracing-subscriber`'s built-in JSON formatter (`fmt::layer().json()`). Zero custom code, battle-tested, and ecosystem tools (Grafana Loki, Datadog) already parse this format. The schema difference from our robot JSON envelope is cosmetic and not worth the maintenance burden of a custom formatter. + +2. **Span recording**: Always-on. 
lore is I/O-bound (GitLab API + SQLite), so the nanosecond-level overhead of span storage and chain lookup is unmeasurable against our millisecond-scale operations. Conditional recording would add subscriber construction complexity for zero practical benefit.
+
+3. **Log file location**: `~/.local/share/lore/logs/` (XDG data directory). Logs are NOT reproducible — you can generate new logs, but you cannot regenerate the exact diagnostic output from a past run. They are forensic artifacts that users would notice missing, so they belong in data, not cache.
+
+4. **Retention**: In scope for Phase 1. Startup cleanup: scan log directory, delete files matching `lore.*.log` older than `retention_days` (default 30). Simple, no background threads, no external dependencies. Runs before subscriber initialization so deleted file handles aren't held.
+
+5. **Stage timing granularity**: Per-project with nested sub-stages. When one project has 500 MRs and another has 3, knowing which one consumed the time budget is the difference between "sync was slow" and actionable diagnosis. The `StageTiming` struct includes an optional `project` field and a `sub_stages: Vec<StageTiming>` field for nesting.
+
+6. **Stage timing collection mechanism**: Materialized from tracing spans, not plumbed manually. A custom `MetricsLayer` in the subscriber stack records span enter/exit/record events and builds the `StageTiming` tree. This avoids threading a mutable collector through every function signature and makes spans the single source of truth for timing data. Phase 2 adds spans; Phase 3 adds the layer that reads them.
+
+7. **run_id format**: First 8 characters of `Uuid::new_v4().to_string()` (e.g., `"a1b2c3d4"`). The `uuid` crate (v1, v4 feature) is already a dependency. No new crate needed. 8 characters provide ~4 billion unique values — more than sufficient for local CLI invocations.
+
+8. **File log level**: Always DEBUG+ for `lore::*` crate, WARN+ for dependencies, regardless of `-v` flags.
This ensures post-mortem data is always richer than what was shown on stderr. `RUST_LOG` overrides both layers when set.
+
+9. **sync_runs lifecycle**: The table exists (migration 001) but nothing writes to it. Phase 4 wires up the INSERT (on start) / UPDATE (on finish) lifecycle AND adds enrichment columns in a single migration. The existing `metrics_json` column stores the detailed `Vec<StageTiming>` array — no need for a separate `stages_json` column.
+
+10. **JSON stderr via --log-format**: A `--log-format text|json` global flag controls stderr log format. Default is `text` (human-readable). When `json`, stderr uses the same JSON formatter as the file layer, routed through `SuspendingWriter` for progress bar coordination. This enables `lore sync 2>&1 | jq` workflows without reading log files.
+
+---
+
+## 8. Phase Dependency Graph
+
+```
+Phase 1 (Subscriber + Flags)
+       |
+       v
+Phase 2 (Spans + run_id)
+       |
+   +------+------+
+   |             |
+   v             v
+Phase 3       Phase 5
+(Metrics)     (Rate Limit Logging)
+   |             |
+   v             |
+Phase 4          |
+(Sync History) <--+
+```
+
+**Parallelization opportunities**:
+- Phase 1 must complete before anything else.
+- Phase 2 must complete before Phase 3 or Phase 5.
+- Phase 3 and Phase 5 can run in parallel (Phase 5 only needs spans from Phase 2, not MetricsLayer from Phase 3).
+- Phase 4 depends on Phase 3 (needs `Vec<StageTiming>` to store). Phase 5's `rate_limit_hits`/`retries` fields on `StageTiming` can be added to Phase 4's stored data after Phase 5 completes, or Phase 4 can store them as zero initially.
+
+**Agent assignment suggestion**:
+- Agent A: Phase 1 → Phase 2 (sequential, foundational infrastructure)
+- Agent B: Phase 3 (after Phase 2 completes)
+- Agent C: Phase 5 (after Phase 2 completes, parallel with Phase 3)
+- Agent B or D: Phase 4 (after Phase 3 completes)
+
+---
+
+## 9. References
+
+- Gholamian, S. & Ward, P. (2021). "A Comprehensive Survey of Logging in Software." arXiv:2110.12489.
+- Duan, S. et al. (2025).
"PDLogger: Automated Logging Framework for Practical Software Development." arXiv:2507.19951. +- tokio-rs/tracing ecosystem: `tracing`, `tracing-subscriber`, `tracing-appender`. +- GNU Coding Standards: Verbosity and diagnostic output conventions. + Rust CLI Working Group: Recommendations for error reporting and verbosity. diff --git a/gitlore-sync-explorer.html b/gitlore-sync-explorer.html new file mode 100644 index 0000000..c62551a --- /dev/null +++ b/gitlore-sync-explorer.html @@ -0,0 +1,844 @@ + + + + + +Gitlore Sync Pipeline Explorer + + + + + + +
+
+

Full Sync Overview

+ 4 stages +
+ +
+ + +
+
+
+
Stage 1
+
Ingest Issues
+
Fetch issues + discussions + resource events from GitLab API
+
Cursor-based incremental sync.
Sequential discussion fetch.
Queue-based resource events.
+
+
+
+
Stage 2
+
Ingest MRs
+
Fetch merge requests + discussions + resource events
+
Page-based incremental sync.
Parallel prefetch discussions.
Queue-based resource events.
+
+
+
+
Stage 3
+
Generate Docs
+
Regenerate searchable documents for changed entities
+
Driven by dirty_sources table.
Triple-hash skip optimization.
FTS5 index auto-updated.
+
+
+
+
Stage 4
+
Embed
+
Generate vector embeddings via Ollama for semantic search
+
Hash-based change detection.
Chunked, batched API calls.
Non-fatal — graceful if Ollama down.
+
+
+
+
Concurrency Model
+
    +
  • Stages 1 & 2 process projects concurrently via buffer_unordered(primary_concurrency)
  • Each project gets its own SQLite connection; rate limiter is shared
  • Discussions: sequential (issues) or batched parallel prefetch (MRs)
  • Resource events use a persistent job queue with atomic claim + exponential backoff
+
+
+
Sync Flags
+
    +
  • --full — Resets all cursors & watermarks, forces complete re-fetch
  • --no-docs — Skips Stage 3 (document generation)
  • --no-embed — Skips Stage 4 (embedding generation)
  • --force — Overrides stale single-flight lock
  • --project <path> — Sync only one project (fuzzy matching)
+
+
+
Single-Flight Lock
+
    +
  • Table-based lock (AppLock) prevents concurrent syncs
  • Heartbeat keeps the lock alive; stale locks auto-detected
  • Use --force to override a stale lock
+
+
+ + +
+
+
API Call
+
Transform
+
Database
+
Decision
+
Error Path
+
Queue
+
+
+
+
1
+
Fetch Issues Cursor-Based Incremental Sync
+
+
+
GitLab API Call
paginate_issues() with
updated_after = cursor - rewind
+
+
Cursor Filter
updated_at > cursor_ts
OR tie_breaker check
+
+
transform_issue()
GitLab API shape →
local DB row shape
+
+
Transaction
store_payload → upsert →
mark_dirty → relink
+
+
+
+
Update Cursor
Every 100 issues + final
sync_cursors table
+
+
+
+
+
2
+
Discussion Sync Sequential, Watermark-Based
+
+
+
Query Stale Issues
updated_at > COALESCE(
discussions_synced_for_
updated_at, 0)
+
+
Paginate Discussions
Sequential per issue
paginate_issue_discussions()
+
+
Transform
transform_discussion()
transform_notes()
+
+
Write Discussion
store_payload → upsert
DELETE notes → INSERT notes
+
+
+
✓ On Success (all pages fetched)
+
+
Remove Stale
DELETE discussions not
seen in this fetch
+
+
Advance Watermark
discussions_synced_for_
updated_at = updated_at
+
+
✗ On Pagination Error
+
+
Skip Stale Removal
Watermark NOT advanced
Will retry next sync
+
+
+
+
+
+
3
+
Resource Events Queue-Based, Concurrent Fetch
+
+
+
Cleanup Obsolete
DELETE jobs where entity
watermark is current
+
+
Enqueue Jobs
INSERT for entities where
updated_at > watermark
+
+
Claim Jobs
Atomic UPDATE...RETURNING
with lock acquisition
+
+
Fetch Events
3 concurrent: state +
label + milestone
+
+
+
✓ On Success
+
+
Store Events
Transaction: upsert all
3 event types
+
+
Complete + Watermark
DELETE job row
Advance watermark
+
+
✗ Permanent Error (404 / 403)
+
+
Skip Permanently
complete_job + advance
watermark (coalesced)
+
+
↻ Transient Error
+
+
Backoff Retry
fail_job: 30s x 2^(n-1)
capped at 480s
+
+
+
+
+ + +
+
+
API Call
+
Transform
+
Database
+
Diff from Issues
+
Error Path
+
Queue
+
+
+
+
1
+
Fetch MRs Page-Based Incremental Sync
+
+
+
GitLab API Call
fetch_merge_requests_page()
with cursor rewind
Page-based, not streaming
+
+
Cursor Filter
Same logic as issues:
timestamp + tie-breaker
Same as issues
+
+
transform_merge_request()
Maps API shape →
local DB row
+
+
Transaction
store → upsert → dirty →
labels + assignees + reviewers
3 junction tables (not 2)
+
+
+
+
Update Cursor
Per page (not every 100)
Per page boundary
+
+
+
+
+
2
+
MR Discussion Sync Parallel Prefetch + Serial Write
+
+
+
Key Differences from Issue Discussions
+
    +
  • Parallel prefetch — fetches all discussions for a batch concurrently via join_all()
  • Upsert pattern — notes use INSERT...ON CONFLICT (not delete-all + re-insert)
  • Sweep stale — uses last_seen_at timestamp comparison (not set difference)
  • Sync health tracking — records discussions_sync_attempts and last_error
+
+
+
Query Stale MRs
updated_at > COALESCE(
discussions_synced_for_
updated_at, 0)
Same watermark logic
+
+
Batch by Concurrency
dependent_concurrency
MRs per batch
Batched processing
+
+
+
+
Parallel Prefetch
join_all() fetches all
discussions for batch
Parallel (not sequential)
+
+
Transform In-Memory
transform_mr_discussion()
+ diff position notes
+
+
Serial Write
upsert discussion
upsert notes (ON CONFLICT)
Upsert, not delete+insert
+
+
+
✓ On Full Success
+
+
Sweep Stale
DELETE WHERE last_seen_at
< run_seen_at (disc + notes)
last_seen_at sweep
+
+
Advance Watermark
discussions_synced_for_
updated_at = updated_at
+
+
✗ On Failure
+
+
Record Sync Health
Watermark NOT advanced
Tracks attempts + last_error
Health tracking
+
+
+
+
+
+
3
+
Resource Events Same as Issues
+
+
+
Identical to Issue Resource Events
+
    +
  • Same queue-based approach: cleanup → enqueue → claim → fetch → store/fail
  • Same watermark column: resource_events_synced_for_updated_at
  • Same error handling: 404/403 coalesced to empty, transient errors get backoff
  • entity_type = "merge_request" instead of "issue"
+
+
+
+ + +
+
+
Trigger
+
Extract
+
Database
+
Decision
+
Error
+
+
+
+
1
+
Dirty Source Queue Populated During Ingestion
+
+
+
mark_dirty_tx()
Called during every issue/
MR/discussion upsert
+
+
dirty_sources Table
INSERT (source_type, source_id)
ON CONFLICT reset backoff
+
+
+
+
+
2
+
Drain Loop Batch 500, Respects Backoff
+
+
+
Get Dirty Sources
Batch 500, ORDER BY
attempt_count, queued_at
+
+
Dispatch by Type
issue / mr / discussion
→ extract function
+
+
Source Exists?
If deleted: remove doc row
(cascade cleans FTS + embeds)
+
+
+
+
Extract Content
Structured text:
header + metadata + body
+
+
Triple-Hash Check
content_hash + labels_hash
+ paths_hash all match?
+
+
SAVEPOINT Write
Atomic: document row +
labels + paths
+
+
+
✓ On Success
+
+
clear_dirty()
Remove from dirty_sources
+
+
✗ On Error
+
+
record_dirty_error()
Increment attempt_count
Exponential backoff
+
+
≡ Triple-Hash Match (skip)
+
+
Skip Write
All 3 hashes match →
no WAL churn, clear dirty
+
+
+
+
+
Full Mode (--full)
+
    +
  • Seeds ALL entities into dirty_sources via keyset pagination
  • Triple-hash optimization prevents redundant writes even in full mode
  • Runs FTS OPTIMIZE after drain completes
+
+
+ + +
+
+
API (Ollama)
+
Processing
+
Database
+
Decision
+
Error
+
+
+
+
1
+
Change Detection Hash + Config Drift
+
+
+
find_pending_documents()
No metadata row? OR
document_hash mismatch? OR
config drift?
+
+
Keyset Pagination
500 documents per page
ordered by doc ID
+
+
+
+
+
2
+
Chunking Split + Overflow Guard
+
+
+
split_into_chunks()
Split by paragraph boundaries
with configurable overlap
+
+
Overflow Guard
Too many chunks?
Skip to prevent rowid collision
+
+
Build ChunkWork
Assign encoded chunk IDs
per document
+
+
+
+
+
3
+
Ollama Embedding Batched API Calls
+
+
+
Batch Embed
32 chunks per Ollama
API call
+
+
Store Vectors
sqlite-vec embeddings table
+ embedding_metadata
+
+
+
✓ On Success
+
+
SAVEPOINT Commit
Atomic per page:
clear old + write new
+
+
↻ Context-Length Error
+
+
Retry Individually
Re-embed each chunk solo
to isolate oversized one
+
+
✗ Other Error
+
+
Record Error
Store in embedding_metadata
for retry next run
+
+
+
+
+
Full Mode (--full)
+
    +
  • DELETEs all embedding_metadata and embeddings rows first
  • Every document re-processed from scratch
+
+
+
Non-Fatal in Sync
+
    +
  • Stage 4 failures (Ollama down, model missing) are graceful
  • Sync completes successfully; embeddings just won't be updated
  • Semantic search degrades to FTS-only mode
+
+
+ +
+ + +
+
+ + Watermark & Cursor Reference +
+
+ + + + + + + + + + + + +
TableColumn(s)Purpose
sync_cursorsupdated_at_cursor + tie_breaker_idIncremental fetch: "last entity we saw" per project+type
issuesdiscussions_synced_for_updated_atPer-issue discussion watermark
issuesresource_events_synced_for_updated_atPer-issue resource event watermark
merge_requestsdiscussions_synced_for_updated_atPer-MR discussion watermark
merge_requestsresource_events_synced_for_updated_atPer-MR resource event watermark
dirty_sourcesqueued_at + next_attempt_atDocument regeneration queue with backoff
embedding_metadatadocument_hash + chunk_max_bytes + model + dimsEmbedding staleness detection
pending_dependent_fetcheslocked_at + next_retry_at + attemptsResource event job queue with backoff
+
+
+
+ + +
+
+

Node Details

+ +
+
+
+ + + +